<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>http://debianws.lexgopc.com/wiki143/index.php?action=history&amp;feed=atom&amp;title=Kernel_method</id>
	<title>Kernel method - Revision history</title>
	<link rel="self" type="application/atom+xml" href="http://debianws.lexgopc.com/wiki143/index.php?action=history&amp;feed=atom&amp;title=Kernel_method"/>
	<link rel="alternate" type="text/html" href="http://debianws.lexgopc.com/wiki143/index.php?title=Kernel_method&amp;action=history"/>
	<updated>2026-05-05T21:49:42Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.43.1</generator>
	<entry>
		<id>http://debianws.lexgopc.com/wiki143/index.php?title=Kernel_method&amp;diff=2273805&amp;oldid=prev</id>
		<title>imported&gt;Dough34: fix capitalization, remove space</title>
		<link rel="alternate" type="text/html" href="http://debianws.lexgopc.com/wiki143/index.php?title=Kernel_method&amp;diff=2273805&amp;oldid=prev"/>
		<updated>2025-02-13T19:58:41Z</updated>

		<summary type="html">&lt;p&gt;fix capitalization, remove space&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;{{Short description|Class of algorithms for pattern analysis}}&lt;br /&gt;
{{Machine learning bar}}&lt;br /&gt;
&lt;br /&gt;
In [[machine learning]], &amp;#039;&amp;#039;&amp;#039;kernel machines&amp;#039;&amp;#039;&amp;#039; are a class of algorithms for [[pattern analysis]], whose best-known member is the [[support-vector machine]] (SVM). These methods involve using linear classifiers to solve nonlinear problems.&amp;lt;ref&amp;gt;{{Cite web |title=Kernel method |url=https://www.engati.com/glossary/kernel-method |access-date=2023-04-04 |website=Engati |language=en}}&amp;lt;/ref&amp;gt; The general task of [[pattern analysis]] is to find and study general types of relations (for example [[Cluster analysis|clusters]], [[ranking]]s, [[principal components]], [[correlation]]s, [[Statistical classification|classifications]]) in datasets. For many algorithms that solve these tasks, the data in raw representation have to be explicitly transformed into [[feature vector]] representations via a user-specified &amp;#039;&amp;#039;feature map&amp;#039;&amp;#039;; in contrast, kernel methods require only a user-specified &amp;#039;&amp;#039;kernel&amp;#039;&amp;#039;, i.e., a [[similarity function]] over all pairs of data points computed using [[inner products]]. The feature map in kernel machines may be infinite-dimensional, but by the [[representer theorem]] only a finite-dimensional matrix of kernel evaluations between the training points is required. Kernel machines are slow to compute on datasets larger than a few thousand examples without parallel processing.&lt;br /&gt;
&lt;br /&gt;
Kernel methods owe their name to the use of [[Positive-definite kernel|kernel function]]s, which enable them to operate in a high-dimensional, &amp;#039;&amp;#039;implicit&amp;#039;&amp;#039; [[feature space]] without ever computing the coordinates of the data in that space, but rather by simply computing the [[inner product]]s between the [[Image (mathematics)|images]] of all pairs of data in the feature space.  This operation is often computationally cheaper than the explicit computation of the coordinates.  This approach is called the &amp;quot;&amp;#039;&amp;#039;&amp;#039;kernel trick&amp;#039;&amp;#039;&amp;#039;&amp;quot;.&amp;lt;ref&amp;gt;{{Cite book|title=Pattern Recognition|last=Theodoridis|first=Sergios|publisher=Elsevier B.V.|year=2008|isbn=9780080949123|pages=203}}&amp;lt;/ref&amp;gt; Kernel functions have been introduced for sequence data, [[Graph kernel|graphs]], text, and images, as well as vectors.&lt;br /&gt;
&lt;br /&gt;
Algorithms capable of operating with kernels include the [[kernel perceptron]], support-vector machines (SVM), [[Gaussian process]]es, [[principal components analysis]] (PCA), [[canonical correlation analysis]], [[ridge regression]], [[spectral clustering]], [[Adaptive filter|linear adaptive filters]] and many others.&lt;br /&gt;
&lt;br /&gt;
Most kernel algorithms are based on [[convex optimization]] or [[Eigenvalue, eigenvector and eigenspace|eigenproblems]] and are statistically well-founded. Typically, their statistical properties are analyzed using [[statistical learning theory]] (for example, using [[Rademacher complexity]]).&lt;br /&gt;
&lt;br /&gt;
==Motivation and informal explanation==&lt;br /&gt;
Kernel methods can be thought of as [[instance-based learning|instance-based learners]]: rather than learning some fixed set of parameters corresponding to the features of their inputs, they instead &amp;quot;remember&amp;quot; the &amp;lt;math&amp;gt;i&amp;lt;/math&amp;gt;-th training example &amp;lt;math&amp;gt;(\mathbf{x}_i, y_i)&amp;lt;/math&amp;gt; and learn for it a corresponding weight &amp;lt;math&amp;gt;w_i&amp;lt;/math&amp;gt;.  Prediction for unlabeled inputs, i.e., those not in the training set, is handled by applying a [[similarity function]] &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt;, called a &amp;#039;&amp;#039;&amp;#039;kernel&amp;#039;&amp;#039;&amp;#039;, between the unlabeled input &amp;lt;math&amp;gt;\mathbf{x&amp;#039;}&amp;lt;/math&amp;gt; and each of the training inputs &amp;lt;math&amp;gt;\mathbf{x}_i&amp;lt;/math&amp;gt;.  For instance, a kernelized [[binary classifier]] typically computes a weighted sum of similarities&lt;br /&gt;
&amp;lt;math display=&amp;quot;block&amp;quot;&amp;gt;\hat{y} = \sgn \sum_{i=1}^n w_i y_i k(\mathbf{x}_i, \mathbf{x&amp;#039;}),&amp;lt;/math&amp;gt;&lt;br /&gt;
where&lt;br /&gt;
* &amp;lt;math&amp;gt;\hat{y} \in \{-1, +1\}&amp;lt;/math&amp;gt; is the kernelized binary classifier&amp;#039;s predicted label for the unlabeled input &amp;lt;math&amp;gt;\mathbf{x&amp;#039;}&amp;lt;/math&amp;gt; whose hidden true label &amp;lt;math&amp;gt;y&amp;lt;/math&amp;gt; is of interest;&lt;br /&gt;
* &amp;lt;math&amp;gt;k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}&amp;lt;/math&amp;gt; is the kernel function that measures similarity between any pair of inputs &amp;lt;math&amp;gt;\mathbf{x}, \mathbf{x&amp;#039;} \in \mathcal{X}&amp;lt;/math&amp;gt;;&lt;br /&gt;
* the sum ranges over the {{mvar|n}} labeled examples &amp;lt;math&amp;gt;\{(\mathbf{x}_i, y_i)\}_{i=1}^n&amp;lt;/math&amp;gt; in the classifier&amp;#039;s training set, with &amp;lt;math&amp;gt;y_i \in \{-1, +1\}&amp;lt;/math&amp;gt;;&lt;br /&gt;
* the &amp;lt;math&amp;gt;w_i \in \mathbb{R}&amp;lt;/math&amp;gt; are the weights for the training examples, as determined by the learning algorithm;&lt;br /&gt;
* the [[sign function]] &amp;lt;math&amp;gt;\sgn&amp;lt;/math&amp;gt; determines whether the predicted classification &amp;lt;math&amp;gt;\hat{y}&amp;lt;/math&amp;gt; comes out positive or negative.&lt;br /&gt;
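&lt;br /&gt;
For illustration, the following is a minimal sketch of this decision rule, assuming a Gaussian [[radial basis function kernel]] as the similarity function &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt;; the training points, labels, and weights are hypothetical placeholders rather than the output of any particular learning algorithm.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def rbf_kernel(x, x_prime, gamma=1.0):&lt;br /&gt;
    # Gaussian RBF kernel: exp(-gamma * squared Euclidean distance between x and x_prime)&lt;br /&gt;
    return np.exp(-gamma * np.sum((x - x_prime) ** 2))&lt;br /&gt;
&lt;br /&gt;
def predict(x_new, X_train, y_train, weights, gamma=1.0):&lt;br /&gt;
    # Sign of the weighted sum of similarities to the labeled training examples.&lt;br /&gt;
    s = sum(w_i * y_i * rbf_kernel(x_i, x_new, gamma)&lt;br /&gt;
            for x_i, y_i, w_i in zip(X_train, y_train, weights))&lt;br /&gt;
    return 1 if s &amp;gt;= 0 else -1&lt;br /&gt;
&lt;br /&gt;
# Hypothetical training set with labels in {-1, +1} and learned weights w_i.&lt;br /&gt;
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])&lt;br /&gt;
y_train = np.array([-1, +1, +1])&lt;br /&gt;
weights = np.array([0.5, 0.8, 0.3])&lt;br /&gt;
&lt;br /&gt;
print(predict(np.array([1.5, 1.5]), X_train, y_train, weights))  # prints 1&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;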
&lt;br /&gt;
Kernel classifiers were described as early as the 1960s, with the invention of the [[kernel perceptron]].&amp;lt;ref&amp;gt;{{cite journal |last1=Aizerman |first1=M. A. |first2=Emmanuel M. | last2 = Braverman |first3=L. I. |last3=Rozonoer |title=Theoretical foundations of the potential function method in pattern recognition learning |journal=Automation and Remote Control |volume=25 |year=1964 |pages=821–837}} Cited in {{Cite conference |last1=Guyon |first1=Isabelle |first2=B. |last2=Boser |first3=Vladimir |last3=Vapnik |title=Automatic capacity tuning of very large VC-dimension classifiers |conference=Advances in neural information processing systems |year=1993 |citeseerx = 10.1.1.17.7215}}&amp;lt;/ref&amp;gt; They rose to great prominence with the popularity of the [[support-vector machine]] (SVM) in the 1990s, when the SVM was found to be competitive with [[artificial neural network|neural networks]] on tasks such as [[handwriting recognition]].&lt;br /&gt;
&lt;br /&gt;
==Mathematics: the kernel trick==&lt;br /&gt;
[[Image:Kernel trick idea.svg|thumb|500px|SVM with feature map given by &amp;lt;math&amp;gt;\varphi((a, b)) = (a, b, a^2 + b^2)&amp;lt;/math&amp;gt; and thus with the kernel function &amp;lt;math&amp;gt;k(\mathbf{x}, \mathbf{y}) = \mathbf{x} \cdot \mathbf{y} + \left\| \mathbf{x} \right\| ^2  \left\| \mathbf{y} \right\| ^2  &amp;lt;/math&amp;gt;. The training points are mapped to a 3-dimensional space where a separating hyperplane can be easily found.]]&lt;br /&gt;
The kernel trick avoids the explicit mapping that is needed to get linear [[learning algorithms]] to learn a nonlinear function or [[decision boundary]].  For all &amp;lt;math&amp;gt;\mathbf{x}&amp;lt;/math&amp;gt; and &amp;lt;math&amp;gt;\mathbf{x&amp;#039;}&amp;lt;/math&amp;gt; in the input space &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt;, certain functions &amp;lt;math&amp;gt;k(\mathbf{x}, \mathbf{x&amp;#039;})&amp;lt;/math&amp;gt; can be expressed as an [[inner product]] in another space &amp;lt;math&amp;gt;\mathcal{V}&amp;lt;/math&amp;gt;. The function &amp;lt;math&amp;gt;k \colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}&amp;lt;/math&amp;gt; is often referred to as a &amp;#039;&amp;#039;kernel&amp;#039;&amp;#039; or a &amp;#039;&amp;#039;[[kernel function]]&amp;#039;&amp;#039;. The word &amp;quot;kernel&amp;quot; is used in mathematics to denote a weighting function for a weighted sum or [[integral]].&lt;br /&gt;
&lt;br /&gt;
Certain problems in machine learning have more structure than an arbitrary weighting function &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt;.  The computation is made much simpler if the kernel can be written in terms of a &amp;quot;feature map&amp;quot; &amp;lt;math&amp;gt;\varphi\colon \mathcal{X} \to \mathcal{V}&amp;lt;/math&amp;gt; which satisfies&lt;br /&gt;
&amp;lt;math display=&amp;quot;block&amp;quot;&amp;gt;k(\mathbf{x}, \mathbf{x&amp;#039;}) = \langle \varphi(\mathbf{x}), \varphi(\mathbf{x&amp;#039;}) \rangle_\mathcal{V}.&amp;lt;/math&amp;gt;The key restriction is that &amp;lt;math&amp;gt; \langle \cdot, \cdot \rangle_\mathcal{V}&amp;lt;/math&amp;gt; must be a proper inner product. On the other hand, an explicit representation for &amp;lt;math&amp;gt;\varphi&amp;lt;/math&amp;gt; is not necessary, as long as &amp;lt;math&amp;gt;\mathcal{V}&amp;lt;/math&amp;gt; is an [[inner product space]].  The alternative follows from [[Mercer&amp;#039;s theorem]]: an implicitly defined function &amp;lt;math&amp;gt;\varphi&amp;lt;/math&amp;gt; exists whenever the space &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt; can be equipped with a suitable [[Measure (mathematics)|measure]] ensuring the function &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; satisfies [[Mercer&amp;#039;s condition]].&lt;br /&gt;
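&lt;br /&gt;
As a concrete example of such a factorization, the feature map shown in the figure above, &amp;lt;math&amp;gt;\varphi((a, b)) = (a, b, a^2 + b^2)&amp;lt;/math&amp;gt;, reproduces the kernel stated in that caption:&amp;lt;math display=&amp;quot;block&amp;quot;&amp;gt;\langle \varphi(\mathbf{x}), \varphi(\mathbf{x&amp;#039;}) \rangle_\mathcal{V} = x_1 x&amp;#039;_1 + x_2 x&amp;#039;_2 + \left(x_1^2 + x_2^2\right)\left((x&amp;#039;_1)^2 + (x&amp;#039;_2)^2\right) = \mathbf{x} \cdot \mathbf{x&amp;#039;} + \left\| \mathbf{x} \right\|^2 \left\| \mathbf{x&amp;#039;} \right\|^2 = k(\mathbf{x}, \mathbf{x&amp;#039;}),&amp;lt;/math&amp;gt;so this kernel can be evaluated directly in the two-dimensional input space while implicitly operating in the three-dimensional space &amp;lt;math&amp;gt;\mathcal{V} = \mathbb{R}^3&amp;lt;/math&amp;gt;.&lt;br /&gt;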
&lt;br /&gt;
Mercer&amp;#039;s theorem can be seen as a generalization of the result from linear algebra that [[Positive-definite matrix#Other characterizations|associates an inner product to any positive-definite matrix]]. In fact, Mercer&amp;#039;s condition can be reduced to this simpler case. If we choose as our measure the [[counting measure]] &amp;lt;math&amp;gt; \mu(T) = |T| &amp;lt;/math&amp;gt; for all &amp;lt;math&amp;gt; T \subset \mathcal{X} &amp;lt;/math&amp;gt;, which counts the number of points inside the set &amp;lt;math&amp;gt;T&amp;lt;/math&amp;gt;, then the integral in Mercer&amp;#039;s theorem reduces to a summation&amp;lt;math display=&amp;quot;block&amp;quot;&amp;gt; \sum_{i=1}^n\sum_{j=1}^n k(\mathbf{x}_i, \mathbf{x}_j) c_i c_j \geq 0.&amp;lt;/math&amp;gt;If this inequality holds for all finite sequences of points &amp;lt;math&amp;gt;(\mathbf{x}_1, \dotsc, \mathbf{x}_n)&amp;lt;/math&amp;gt; in &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt; and all choices of &amp;lt;math&amp;gt;n&amp;lt;/math&amp;gt; real-valued coefficients &amp;lt;math&amp;gt;(c_1, \dots, c_n)&amp;lt;/math&amp;gt; (cf. [[positive definite kernel]]), then the function &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; satisfies Mercer&amp;#039;s condition.&lt;br /&gt;
&lt;br /&gt;
Some algorithms that depend on arbitrary relationships in the native space &amp;lt;math&amp;gt;\mathcal{X}&amp;lt;/math&amp;gt; would, in fact, have a linear interpretation in a different setting: the range space of &amp;lt;math&amp;gt;\varphi&amp;lt;/math&amp;gt;. This linear interpretation gives insight into the algorithm. Furthermore, &amp;lt;math&amp;gt;\varphi&amp;lt;/math&amp;gt; often does not need to be computed directly, as is the case with [[support-vector machine]]s. Some cite this saving in running time as the primary benefit; researchers also use it to justify the meanings and properties of existing algorithms.&lt;br /&gt;
&lt;br /&gt;
Theoretically, a [[Gram matrix]] &amp;lt;math&amp;gt;\mathbf{K} \in \mathbb{R}^{n \times n}&amp;lt;/math&amp;gt; with respect to &amp;lt;math&amp;gt;\{\mathbf{x}_1, \dotsc, \mathbf{x}_n\}&amp;lt;/math&amp;gt; (sometimes also called a &amp;quot;kernel matrix&amp;quot;&amp;lt;ref&amp;gt;{{cite journal | first1 = Thomas | last1 = Hofmann | first2 = Bernhard | last2 = Schölkopf | first3 = Alexander J. | last3 = Smola | date = 2008 | title = Kernel Methods in Machine Learning | journal = The Annals of Statistics | volume = 36 | issue = 3 | doi = 10.1214/009053607000000677 | s2cid = 88516979 |url=https://projecteuclid.org/download/pdfview_1/euclid.aos/1211819561| doi-access = free | arxiv = math/0701907 }}&amp;lt;/ref&amp;gt;), where &amp;lt;math&amp;gt;K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)&amp;lt;/math&amp;gt;, must be [[Positive-definite matrix|positive semi-definite (PSD)]].&amp;lt;ref&amp;gt;{{Cite Mehryar Afshin Ameet 2012}}&amp;lt;/ref&amp;gt; Empirically, for machine learning heuristics, choices of a function &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; that do not satisfy Mercer&amp;#039;s condition may still perform reasonably if &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; at least approximates the intuitive idea of similarity.&amp;lt;ref&amp;gt;{{cite web|url=http://www.svms.org/mercer/|title=Support Vector Machines: Mercer&amp;#039;s Condition|first=Martin|last=Sewell|publisher=Support Vector Machines|access-date=2014-05-30|archive-date=2018-10-15|archive-url=https://web.archive.org/web/20181015031456/http://www.svms.org/mercer/|url-status=dead}}&amp;lt;/ref&amp;gt; Regardless of whether &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; is a Mercer kernel, &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; may still be referred to as a &amp;quot;kernel&amp;quot;.&lt;br /&gt;
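&lt;br /&gt;
The positive semi-definiteness of a Gram matrix can be checked numerically. The following sketch builds the Gram matrix of a Gaussian RBF kernel on random placeholder data and verifies that its eigenvalues are non-negative up to floating-point round-off; the data, kernel choice, and tolerance are illustrative assumptions.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
def rbf_gram_matrix(X, gamma=1.0):&lt;br /&gt;
    # K[i, j] = exp(-gamma * squared Euclidean distance between x_i and x_j)&lt;br /&gt;
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)&lt;br /&gt;
    return np.exp(-gamma * sq_dists)&lt;br /&gt;
&lt;br /&gt;
X = np.random.default_rng(0).normal(size=(50, 3))  # 50 random points in R^3 (placeholders)&lt;br /&gt;
K = rbf_gram_matrix(X)&lt;br /&gt;
&lt;br /&gt;
eigenvalues = np.linalg.eigvalsh(K)   # K is symmetric, so eigvalsh is appropriate&lt;br /&gt;
print(eigenvalues.min() &amp;gt;= -1e-10)  # True: the Gram matrix is PSD up to round-off&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;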
&lt;br /&gt;
If the kernel function &amp;lt;math&amp;gt;k&amp;lt;/math&amp;gt; is also a [[covariance function]] as used in [[Gaussian processes]], then the Gram matrix &amp;lt;math&amp;gt;\mathbf{K}&amp;lt;/math&amp;gt; can also be called a [[covariance matrix]].&amp;lt;ref&amp;gt;{{cite book | first1 = Carl Edward | last1 = Rasmussen | first2 = Christopher K. I. | last2 = Williams | date = 2006 | title = Gaussian Processes for Machine Learning|publisher=MIT Press|isbn=0-262-18253-X }} {{page needed|date=November 2023}}&amp;lt;/ref&amp;gt;&lt;br /&gt;
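&lt;br /&gt;
Under this interpretation, the Gram matrix can be used directly as the covariance matrix of a zero-mean Gaussian-process prior. The following sketch draws sample functions from such a prior; the RBF covariance, the grid of inputs, and the jitter term are illustrative assumptions.&lt;br /&gt;
&amp;lt;syntaxhighlight lang=&amp;quot;python&amp;quot;&amp;gt;&lt;br /&gt;
import numpy as np&lt;br /&gt;
&lt;br /&gt;
# One-dimensional input grid; the RBF kernel matrix doubles as the GP covariance matrix.&lt;br /&gt;
x = np.linspace(0.0, 5.0, 100)&lt;br /&gt;
K = np.exp(-(x[:, None] - x[None, :]) ** 2)&lt;br /&gt;
&lt;br /&gt;
# Draw three sample functions from N(0, K); the jitter keeps K numerically well-conditioned.&lt;br /&gt;
rng = np.random.default_rng(1)&lt;br /&gt;
samples = rng.multivariate_normal(np.zeros(len(x)), K + 1e-9 * np.eye(len(x)), size=3)&lt;br /&gt;
print(samples.shape)  # (3, 100)&lt;br /&gt;
&amp;lt;/syntaxhighlight&amp;gt;&lt;br /&gt;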
&lt;br /&gt;
==Applications==&lt;br /&gt;
Application areas of kernel methods are diverse and include [[geostatistics]],&amp;lt;ref&amp;gt;{{cite journal | last1 = Honarkhah | first1 = M. | last2 = Caers | first2 = J. | date = 2010 | doi = 10.1007/s11004-010-9276-7 | title = Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling |journal=[[Mathematical Geosciences]] | volume = 42 | issue = 5 | pages = 487–517 | bibcode = 2010MaGeo..42..487H | s2cid = 73657847 }}&amp;lt;/ref&amp;gt; [[kriging]], [[inverse distance weighting]], [[3D reconstruction]], [[bioinformatics]], [[cheminformatics]], [[information extraction]] and [[handwriting recognition]].&lt;br /&gt;
&lt;br /&gt;
==Popular kernels==&lt;br /&gt;
*[[Fisher kernel]]&lt;br /&gt;
*[[Graph kernel]]s&lt;br /&gt;
*[[Kernel smoother]]&lt;br /&gt;
*[[Polynomial kernel]]&lt;br /&gt;
*[[Radial basis function kernel]] (RBF)&lt;br /&gt;
*[[String kernel]]s&lt;br /&gt;
*[[Neural tangent kernel]]&lt;br /&gt;
*[[Neural network Gaussian process]] (NNGP) kernel&lt;br /&gt;
&lt;br /&gt;
==See also==&lt;br /&gt;
*[[Kernel methods for vector output]]&lt;br /&gt;
*[[Kernel density estimation]]&lt;br /&gt;
*[[Representer theorem]]&lt;br /&gt;
*[[Similarity learning]]&lt;br /&gt;
*[[Cover&amp;#039;s theorem]]&lt;br /&gt;
&lt;br /&gt;
== References ==&lt;br /&gt;
{{Reflist|30em}}&lt;br /&gt;
&lt;br /&gt;
== Further reading ==&lt;br /&gt;
* {{cite book | author-link1 = John Shawe-Taylor | first1 = J. | last1 = Shawe-Taylor | author-link2 = Nello Cristianini | first2 = N. | last2 = Cristianini | title = Kernel Methods for Pattern Analysis | publisher = Cambridge University Press | year = 2004|isbn=9780511809682}}&lt;br /&gt;
* {{cite book | first1 = W. | last1 = Liu | first2 = J. | last2 = Principe | first3 = S. | last3 = Haykin | title = Kernel Adaptive Filtering: A Comprehensive Introduction | publisher = Wiley | year = 2010 | isbn = 9781118211212 |url=https://books.google.com/books?id=eWUwB_P5pW0C }}&lt;br /&gt;
* {{cite book |first1=B. |last1=Schölkopf |author-link=Bernhard Schölkopf |first2=A. J. |last2=Smola |first3=F. |last3=Bach |title=Learning with Kernels : Support Vector Machines, Regularization, Optimization, and Beyond |publisher=MIT Press |year=2018 |isbn=978-0-262-53657-8 |url=https://books.google.com/books?id=ZQxiuAEACAAJ }}&lt;br /&gt;
&lt;br /&gt;
==External links==&lt;br /&gt;
* [http://www.kernel-machines.org Kernel-Machines Org]—community website&lt;br /&gt;
* [http://onlineprediction.net/?n=Main.KernelMethods onlineprediction.net Kernel Methods Article]&lt;br /&gt;
&lt;br /&gt;
{{DEFAULTSORT:Kernel Methods}}&lt;br /&gt;
[[Category:Kernel methods for machine learning]]&lt;br /&gt;
[[Category:Geostatistics]]&lt;br /&gt;
[[Category:Classification algorithms]]&lt;/div&gt;</summary>
		<author><name>imported&gt;Dough34</name></author>
	</entry>
</feed>