Kernel Methods

Before I begin, here are the prerequisites to get the most out of this blog: familiarity with linear regression and the gradient descent update rule.

The linear regression algorithm outputs a linear hypothesis for a training set. But what if the data is spread non-linearly? Will the hypothesis be accurate? No. We need another family of models to do this task.

In linear regression we find the hypothesis as y = θᵀx, a linear function of the input features.

But what if I need to model this kind of data?

To keep things simple, view this as a linear model over a set of feature variables. Starting with a motivating example, I'll define a function Ψ(x) : ℝ → ℝ⁵ as Ψ(x) = [1, x, x², x³, x⁴]ᵀ.

Now consider θ ∈ ℝ⁵ as the vector containing the entries θ₀, θ₁, θ₂, θ₃, θ₄. We can then write the hypothesis as y = θᵀΨ(x). When we study kernels, we call x the attributes (the original input) and Ψ(x) the feature map, whose entries are the feature variables.
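As a quick illustration, here is a minimal sketch (in NumPy, with a made-up θ vector) of the quartic feature map above and the hypothesis y = θᵀΨ(x):

```python
import numpy as np

def psi(x: float) -> np.ndarray:
    """Feature map Psi: R -> R^5, mapping x to its monomials up to degree 4."""
    return np.array([1.0, x, x**2, x**3, x**4])

# Hypothetical parameter vector theta = (theta_0, ..., theta_4), for illustration only.
theta = np.array([0.5, -1.0, 2.0, 0.0, 0.3])

x = 1.5
y_hat = theta @ psi(x)   # y = theta^T Psi(x): still a linear model, but in the features
print(y_hat)
```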

If I told you there exists a trick by which you can reduce your computations and effectively map an example to an infinite-dimensional vector, you'd be amazed, right? That is the beauty of, and a simple introduction to, the kernel trick. When the data set cannot be separated in its current dimensions, we add another dimension and check whether we can now make progress. This is also useful for drawing a hypothesis when the data is not linearly separable, i.e., when there exists no hyperplane that can separate the data.

Let us cover together how and why this kernel trick was devised.

Remember the gradient descent update rule in linear regression? Let us modify it with a feature map (by replacing x with Ψ(x)): θ := θ + α Σᵢ (yⁱ − θᵀΨ(xⁱ)) Ψ(xⁱ), where α is the learning rate and the sum runs over the n training examples.

Here Ψ(x) ∈ ℝᵖ (p is potentially infinite).
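Before kernelizing anything, here is a minimal sketch of this update rule with an explicit (finite) feature map; the quartic map, toy data, learning rate, and iteration count are all illustrative assumptions:

```python
import numpy as np

def psi(x: float) -> np.ndarray:
    """Monomial feature map Psi: R -> R^5."""
    return np.array([1.0, x, x**2, x**3, x**4])

def fit_explicit(xs, ys, lr=0.01, iters=2000):
    """Batch gradient descent on theta over the explicit features Psi(x)."""
    Phi = np.stack([psi(x) for x in xs])          # n x p design matrix
    theta = np.zeros(Phi.shape[1])
    for _ in range(iters):
        residual = ys - Phi @ theta               # y_i - theta^T Psi(x_i)
        theta += lr * Phi.T @ residual            # the update rule above
    return theta

xs = np.linspace(-1, 1, 20)
ys = 1 + 2 * xs**2 - xs**3                        # toy non-linear data
print(fit_explicit(xs, ys))
```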

Consider, for example, a feature map Ψ(x) containing all monomial terms of order less than or equal to 4. If x ∈ ℝᵈ, computing with x takes order-d work, but the feature map has on the order of d⁴ entries, so computing with Ψ(x) takes order-d⁴ work.
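To see how fast this blows up: the number of monomials of degree at most 4 in d variables is C(d + 4, 4), which grows like d⁴/24. A quick check:

```python
from math import comb

# Number of monomials of degree <= 4 in d variables: C(d + 4, 4).
for d in (10, 100, 1000):
    print(d, comb(d + 4, 4))
# 10 -> 1001, 100 -> 4598126, 1000 -> ~4.2e10
```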

If θ⁰ = 0 (i.e., θ is initialised to the zero vector), then θᵗ at any stage of gradient descent can be expressed as a linear combination of the feature vectors Ψ(x¹), …, Ψ(xⁿ), i.e., θᵗ = Σᵢ βᵢ Ψ(xⁱ). We can prove this assumption by induction. Substituting this representation into the update rule above gives a recursion directly in the coefficients: βᵢ := βᵢ + α (yⁱ − Σⱼ βⱼ Ψ(xʲ)ᵀΨ(xⁱ)).

The above recursive relation in β can be performed for i = 1, 2, 3, …, n until β converges. Note that Ψ(xʲ)ᵀΨ(xⁱ) does not change between iterations, so these inner products can be pre-computed.
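Here is a minimal sketch of that recursion, assuming the same quartic feature map and an illustrative learning rate, with the inner products Ψ(xʲ)ᵀΨ(xⁱ) pre-computed once as a Gram matrix:

```python
import numpy as np

def psi(x: float) -> np.ndarray:
    return np.array([1.0, x, x**2, x**3, x**4])

def fit_beta(xs, ys, lr=0.01, iters=2000):
    """Run the beta recursion using pre-computed inner products Psi(x_j)^T Psi(x_i)."""
    Phi = np.stack([psi(x) for x in xs])
    G = Phi @ Phi.T                       # G[j, i] = Psi(x_j)^T Psi(x_i), computed once
    beta = np.zeros(len(xs))
    for _ in range(iters):
        beta += lr * (ys - G @ beta)      # beta_i += lr * (y_i - sum_j beta_j * G[j, i])
    return beta

xs = np.linspace(-1, 1, 20)
ys = 1 + 2 * xs**2 - xs**3
print(fit_beta(xs, ys)[:5])
```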

The above summation involves a dot product between two potentially infinite-dimensional vectors. This would require a lot of time as well as storage. Here enters the concept of kernels.

Let us now define a kernel K, which is a function K : χ × χ → ℝ, where χ is the input space (here ℝᵈ), given by K(x, z) = ⟨Ψ(x), Ψ(z)⟩ (⟨·,·⟩ denotes the inner product). This further reduces to K(x, z) = Ψ(x)ᵀΨ(z). Hence this is a computational trick by which the dot product between two high-dimensional feature maps can be compactly expressed.
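As a concrete, easily verified instance of this identity, take (for example) the standard degree-2 polynomial kernel K(x, z) = (xᵀz)², whose feature map consists of all pairwise products xᵢxⱼ; the kernel value and the explicit feature-map dot product agree:

```python
import numpy as np

def psi2(x: np.ndarray) -> np.ndarray:
    """Explicit degree-2 feature map: all d^2 products x_i * x_j."""
    return np.outer(x, x).ravel()

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = psi2(x) @ psi2(z)          # O(d^2) work through the feature map
kernel = (x @ z) ** 2                 # O(d) work with the kernel trick
print(np.isclose(explicit, kernel))   # True
```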

We went straight from order-d⁴ calculations to order-d calculations (⟨x, z⟩ is an order-d computation). Let me illustrate the actual usage of a kernel in layman's terms. Imagine a guy who takes 60 seconds to solve a Rubik's cube with the beginner algorithm. One day, he is amazed that Feliks Zemdegs took only 4.73 seconds to solve this puzzle with its 43 quintillion configurations. He researched the advanced algorithms that minimise solving time and was surprised by how dramatically his own time dropped. The advanced algorithm here is the kernel trick that makes the beginner algorithm efficient.

I guess you've been pondering how and where we'll use this. It can be used in kernelized linear regression (as we do below), in support vector machines (SVMs), and more generally in any learning algorithm whose computations can be written purely in terms of inner products.

Only making you aware won't help; know some math too :)

Steps to kernelize a linear regression algorithm:

1. Pre-compute the kernel values K(xⁱ, xʲ) = ⟨Ψ(xⁱ), Ψ(xʲ)⟩ for every pair of training examples, and initialise β = 0.

2. Update : βᵢ := βᵢ + α (yⁱ − Σⱼ βⱼ K(xⁱ, xʲ)) for i = 1, …, n, repeating until β converges (here ':=' denotes assignment, so there is no separate βᵢᵗ⁺¹ to track).

3. Prediction : From the algorithm above we know β, which is sufficient to make predictions and draw the hypothesis, since θᵀΨ(x) = Σᵢ βᵢ K(xⁱ, x).
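Putting the three steps together, here is a minimal sketch of kernelized linear regression; the degree-4 polynomial kernel, toy data, learning rate, and iteration count are illustrative assumptions, not the only choices:

```python
import numpy as np

def poly_kernel(x, z, c=1.0, k=4):
    """Polynomial kernel K(x, z) = (x^T z + c)^k, an O(d) computation."""
    return (np.dot(x, z) + c) ** k

def fit(xs, ys, kernel, lr=0.01, iters=5000):
    """Step 1: pre-compute the kernel matrix. Step 2: iterate the beta updates."""
    n = len(xs)
    K = np.array([[kernel(xs[i], xs[j]) for j in range(n)] for i in range(n)])
    beta = np.zeros(n)
    for _ in range(iters):
        beta += lr * (ys - K @ beta)
    return beta

def predict(x_new, xs, beta, kernel):
    """Step 3: theta^T Psi(x) = sum_i beta_i K(x_i, x)."""
    return sum(b * kernel(xi, x_new) for b, xi in zip(beta, xs))

xs = np.linspace(-1, 1, 20)
ys = 1 + 2 * xs**2 - xs**3
beta = fit(xs, ys, poly_kernel)
print(predict(0.5, xs, beta, poly_kernel))   # should be close to 1 + 2*0.25 - 0.125 = 1.375
```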

Observations from the above :

1. The train and test iterations are done as stated above, and every computation involves only the kernel values K(xⁱ, xʲ). Did you notice something?


This is where the kernel trick comes into play. We saved ourselves from the high-dimensional computation: even if the feature vector is infinitely long, the kernel can still be evaluated in constant time with respect to that dimension.

2. Recall that in linear regression we feed in the training examples, the gradient descent algorithm updates θ until convergence, and at the end we have "learned" a θ vector. All subsequent operations use that θ vector, and the training set can be discarded. But notice here: in the kernelized algorithm the training examples must be stored, because we need them whenever we evaluate the kernel sums at prediction time.


The following is a polynomial kernel with its feature map: K(x, z) = (xᵀz + c)^k, whose corresponding feature map Ψ(x) contains all monomials of x of degree up to k (appropriately scaled).

We also have a special kernel known as the Gaussian kernel, whose corresponding feature vector is infinite-dimensional. It is given as K(x, z) = exp(−‖x − z‖² / 2σ²).

This is used when K(x, z) must be high when x and z are similar and low when they are not. The parameter σ plays the role of the standard deviation (σ² the variance) of a Gaussian distribution[5] and controls how quickly the similarity decays.
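As a small sketch, a Gaussian kernel function (with an illustrative bandwidth σ = 0.5) drops straight into the same kernelized algorithm from the sketch above:

```python
import numpy as np

def gaussian_kernel(x, z, sigma=0.5):
    """Gaussian (RBF) kernel K(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    diff = np.atleast_1d(x) - np.atleast_1d(z)
    return np.exp(-np.dot(diff, diff) / (2 * sigma**2))

# Similar points give values near 1, dissimilar points give values near 0.
print(gaussian_kernel(0.0, 0.1), gaussian_kernel(0.0, 3.0))

# It can replace poly_kernel in the fit/predict sketch above, e.g.:
# beta = fit(xs, ys, gaussian_kernel)
```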

Let me tell you an interesting thing: when researchers want to find kernel functions, they don't actually start from a feature map and derive the function; they first write down a candidate function and then check whether it is a kernel or not. Even scientists use hit and trial :P

So coming back to the point, a function must satisfy the following conditions for it to be a valid kernel: it must be symmetric, i.e., K(x, z) = K(z, x), and for any finite set of points x¹, …, xⁿ the corresponding kernel (Gram) matrix with entries Kᵢⱼ = K(xⁱ, xʲ) must be positive semi-definite. This necessary and sufficient characterisation is known as Mercer's theorem.

I'm not going into the proof of this theorem; interested readers may refer to 20-notes.pdf in the references section[4].
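It is not a proof, but a quick numerical sanity check is possible: sample a few points, build the Gram matrix of a candidate function, and test symmetry and positive semi-definiteness. The helper `looks_like_kernel` and the candidate functions below are purely illustrative:

```python
import numpy as np

def looks_like_kernel(candidate, points, tol=1e-9):
    """Numerically check symmetry and PSD-ness of the Gram matrix on sample points."""
    n = len(points)
    K = np.array([[candidate(points[i], points[j]) for j in range(n)] for i in range(n)])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)   # eigenvalues of the symmetric Gram matrix
    return symmetric and psd

pts = list(np.random.default_rng(1).normal(size=(8, 3)))
print(looks_like_kernel(lambda x, z: (x @ z + 1) ** 2, pts))   # True: a valid polynomial kernel
print(looks_like_kernel(lambda x, z: x @ z - 1.0, pts))        # typically False: Gram matrix not PSD
```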

Summing up and concluding: kernels have a wide range of applications (for example in SVMs) and are, in short, a "trick" that makes learning algorithms impressively efficient.
