<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://debianws.lexgopc.com/wiki143/index.php?action=history&amp;feed=atom&amp;title=Canopy_clustering_algorithm</id>
	<title>Canopy clustering algorithm - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://debianws.lexgopc.com/wiki143/index.php?action=history&amp;feed=atom&amp;title=Canopy_clustering_algorithm"/>
	<link rel="alternate" type="text/html" href="http://debianws.lexgopc.com/wiki143/index.php?title=Canopy_clustering_algorithm&amp;action=history"/>
	<updated>2026-05-05T21:03:00Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.1</generator>
	<entry>
		<id>http://debianws.lexgopc.com/wiki143/index.php?title=Canopy_clustering_algorithm&amp;diff=5867376&amp;oldid=prev</id>
		<title>imported&gt;SimLibrarian: capitalization</title>
		<link rel="alternate" type="text/html" href="http://debianws.lexgopc.com/wiki143/index.php?title=Canopy_clustering_algorithm&amp;diff=5867376&amp;oldid=prev"/>
		<updated>2024-09-06T16:27:54Z</updated>

		<summary type="html">&lt;p&gt;capitalization&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;The &amp;#039;&amp;#039;&amp;#039;canopy clustering algorithm&amp;#039;&amp;#039;&amp;#039; is an unsupervised pre-[[Data clustering|clustering]] algorithm introduced by [[Andrew McCallum]], Kamal Nigam and Lyle Ungar in 2000.&amp;lt;ref name =&amp;quot;original&amp;quot; /&amp;gt; It is often used as a preprocessing step for the [[K-means algorithm]] or the [[hierarchical clustering]] algorithm. It is intended to speed up [[Data clustering|clustering]] operations on large [[data set]]s, where using another algorithm directly may be impractical due to the size of the data set.&lt;br /&gt;
&lt;br /&gt;
==Description==&lt;br /&gt;
The algorithm proceeds as follows, using two thresholds &amp;lt;math&amp;gt;T_1&amp;lt;/math&amp;gt; (the loose distance) and &amp;lt;math&amp;gt;T_2&amp;lt;/math&amp;gt; (the tight distance), where &amp;lt;math&amp;gt;T_1 &amp;gt; T_2&amp;lt;/math&amp;gt;.&amp;lt;ref name =&amp;quot;original&amp;quot;&amp;gt;McCallum, A.; Nigam, K.; and Ungar, L.H. (2000) [http://www.kamalnigam.com/papers/canopy-kdd00.pdf &amp;quot;Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching&amp;quot;], Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 169-178 {{doi|10.1145/347090.347123}}&amp;lt;/ref&amp;gt;&amp;lt;ref&amp;gt;{{cite web |url=http://courses.cs.washington.edu/courses/cse590q/04au/slides/DannyMcCallumKDD00.ppt |title=The Canopies Algorithm |website=courses.cs.washington.edu |access-date=2014-09-06}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
# Begin with the set of data points to be clustered.&lt;br /&gt;
# Remove a point from the set, beginning a new &amp;#039;canopy&amp;#039; containing this point.&lt;br /&gt;
# For each point left in the set, assign it to the new canopy if its distance to the first point of the canopy is less than the loose distance &amp;lt;math&amp;gt;T_1&amp;lt;/math&amp;gt;.&lt;br /&gt;
# If that distance is additionally less than the tight distance &amp;lt;math&amp;gt;T_2&amp;lt;/math&amp;gt;, remove the point from the original set.&lt;br /&gt;
# Repeat from step 2 until there are no more data points in the set to cluster.&lt;br /&gt;
# These cheaply formed canopies can then be sub-clustered using a more expensive but more accurate algorithm.&lt;br /&gt;
&lt;br /&gt;
An important note is that individual data points may be part of several canopies. As an additional speed-up, a fast, approximate distance metric can be used for step 3, while a slower, more accurate distance metric can be used for step 4.&lt;br /&gt;
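The steps above can be sketched in Python. This is a minimal illustration, not the authors' reference implementation; the names (canopy_clustering, dist, t1, t2) and the random choice of canopy centers are assumptions made for the sketch.&lt;br /&gt;

```python
import random

def canopy_clustering(points, t1, t2, dist):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold, t2 the tight threshold; t1 must exceed t2.
    dist is a (possibly cheap, approximate) distance function.
    """
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        # Step 2: remove a point from the set; it starts a new canopy.
        center = remaining.pop(random.randrange(len(remaining)))
        canopy = [center]
        survivors = []
        for p in remaining:
            d = dist(center, p)
            # Step 3: within the loose distance, the point joins the canopy.
            if t1 > d:
                canopy.append(p)
            # Step 4: within the tight distance, the point is also removed
            # from the set; otherwise it survives to later rounds.
            if d >= t2:
                survivors.append(p)
        remaining = survivors
        canopies.append(canopy)
    return canopies
```

Because a point is removed only when it falls within the tight distance of some center, a point can join several canopies before it is removed, matching the overlap noted above.&lt;br /&gt;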
&lt;br /&gt;
==Applicability==&lt;br /&gt;
Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability to high-dimensional data is limited by the [[curse of dimensionality]]. Only when a cheap, approximate (low-dimensional) distance function is available will the produced canopies preserve the clusters produced by K-means.&lt;br /&gt;
&lt;br /&gt;
Its benefits include:&lt;br /&gt;
* The number of instances of training data that must be compared at each step is reduced.&lt;br /&gt;
* There is some evidence that the resulting clusters are improved.&amp;lt;ref&amp;gt;[https://mahout.apache.org/docs/latest/algorithms/clustering/canopy/ Mahout description of Canopy-Clustering]. Retrieved 2022-07-02.&amp;lt;/ref&amp;gt;&lt;br /&gt;
&lt;br /&gt;
==References==&lt;br /&gt;
{{Reflist}}&lt;br /&gt;
&lt;br /&gt;
[[Category:Cluster analysis algorithms]]&lt;/div&gt;</summary>
		<author><name>imported&gt;SimLibrarian</name></author>
	</entry>
</feed>