<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://debianws.lexgopc.com/wiki143/index.php?action=history&amp;feed=atom&amp;title=Canopy_clustering_algorithm</id>
	<title>Canopy clustering algorithm - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://debianws.lexgopc.com/wiki143/index.php?action=history&amp;feed=atom&amp;title=Canopy_clustering_algorithm"/>
	<link rel="alternate" type="text/html" href="http://debianws.lexgopc.com/wiki143/index.php?title=Canopy_clustering_algorithm&amp;action=history"/>
	<updated>2026-05-05T21:03:00Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.1</generator>
	<entry>
		<id>http://debianws.lexgopc.com/wiki143/index.php?title=Canopy_clustering_algorithm&amp;diff=5867376&amp;oldid=prev</id>
		<title>imported&gt;SimLibrarian: capitalization</title>
		<link rel="alternate" type="text/html" href="http://debianws.lexgopc.com/wiki143/index.php?title=Canopy_clustering_algorithm&amp;diff=5867376&amp;oldid=prev"/>
		<updated>2024-09-06T16:27:54Z</updated>

		<summary type="html">&lt;p&gt;capitalization&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;The &amp;#039;&amp;#039;&amp;#039;canopy clustering algorithm&amp;#039;&amp;#039;&amp;#039; is an unsupervised pre-[[Data clustering|clustering]] algorithm introduced by [[Andrew McCallum]], Kamal Nigam and Lyle Ungar in 2000.&amp;lt;ref name =&amp;quot;original&amp;quot; /&amp;gt; It is often used as a preprocessing step for the [[K-means algorithm]] or the [[hierarchical clustering]] algorithm. It is intended to speed up [[Data clustering|clustering]] operations on large [[data set]]s, where using another algorithm directly may be impractical due to the size of the data set.&lt;br /&gt;
&lt;br /&gt;
==Description==&lt;br /&gt;
The algorithm proceeds as follows, using two thresholds &amp;lt;math&amp;gt;T_1&amp;lt;/math&amp;gt; (the loose distance) and &amp;lt;math&amp;gt;T_2&amp;lt;/math&amp;gt; (the tight distance), where &amp;lt;math&amp;gt;T_1 &amp;gt; T_2&amp;lt;/math&amp;gt;.&amp;lt;ref name =&amp;quot;original&amp;quot;&amp;gt;McCallum, A.; Nigam, K.; and Ungar, L.H. (2000) [http://www.kamalnigam.com/papers/canopy-kdd00.pdf &amp;quot;Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching&amp;quot;], Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 169-178 {{doi|10.1145/347090.347123}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{cite web |url=http://courses.cs.washington.edu/courses/cse590q/04au/slides/DannyMcCallumKDD00.ppt |title=The Canopies Algorithm |website=courses.cs.washington.edu |access-date=2014-09-06}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
# Begin with the set of data points to be clustered.&lt;br /&gt;
# Remove a point from the set, beginning a new &amp;#039;canopy&amp;#039; containing this point.&lt;br /&gt;
# For each point left in the set, assign it to the new canopy if its distance to the first point of the canopy is less than the loose distance &amp;lt;math&amp;gt;T_1&amp;lt;/math&amp;gt;.&lt;br /&gt;
# If that distance is additionally less than the tight distance &amp;lt;math&amp;gt;T_2&amp;lt;/math&amp;gt;, remove the point from the original set.&lt;br /&gt;
# Repeat from step 2 until there are no more data points in the set to cluster.&lt;br /&gt;
# These cheaply formed canopies can then be sub-clustered using a more expensive but more accurate algorithm.&lt;br /&gt;
&lt;br /&gt;
An important note is that individual data points may be part of several canopies. As an additional speed-up, a fast, approximate distance metric can be used for step 3, while a slower, more accurate distance metric can be used for step 4.&lt;br /&gt;
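The steps above can be sketched in Python. This is a minimal illustration, not the authors' reference implementation; the names (canopy_clustering, dist, t1, t2) and the random choice of canopy centers are assumptions made for the sketch.&lt;br /&gt;

```python
import random

def canopy_clustering(points, t1, t2, dist):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold, t2 the tight threshold; t1 must exceed t2.
    dist is a (possibly cheap, approximate) distance function.
    """
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        # Step 2: remove a point from the set; it starts a new canopy.
        center = remaining.pop(random.randrange(len(remaining)))
        canopy = [center]
        survivors = []
        for p in remaining:
            d = dist(center, p)
            # Step 3: within the loose distance, the point joins the canopy.
            if t1 > d:
                canopy.append(p)
            # Step 4: within the tight distance, the point is also removed
            # from the set; otherwise it survives to later rounds.
            if d >= t2:
                survivors.append(p)
        remaining = survivors
        canopies.append(canopy)
    return canopies
```

Because a point is removed only when it falls within the tight distance of some center, a point can join several canopies before it is removed, matching the overlap noted above.&lt;br /&gt;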
&lt;br /&gt;
==Applicability==&lt;br /&gt;
Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability to high-dimensional data is limited by the [[curse of dimensionality]]. Only when a cheap, approximate (low-dimensional) distance function is available will the produced canopies preserve the clusters produced by K-means.&lt;br /&gt;
&lt;br /&gt;
Its benefits include:&lt;br /&gt;
* The number of instances of training data that must be compared at each step is reduced.&lt;br /&gt;
* There is some evidence that the resulting clusters are improved.&amp;lt;ref&amp;gt;[https://mahout.apache.org/docs/latest/algorithms/clustering/canopy/ Mahout description of Canopy-Clustering]. Retrieved 2022-07-02.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Cluster analysis algorithms]]&lt;/div&gt;</summary>
		<author><name>imported&gt;SimLibrarian</name></author>
	</entry>
</feed>