# Supervised Machine Learning I : How could have Darth Vader saved the Death Star?

Let’s see how?

Let’s pick up from where we last left and continue our journey into the realms of Supervised Learning Algorithms – ‘The One Labels’. We will discuss our very first algorithm, called, Linear Regression.

Remember this:

[i]

Consider you are a Commander on the Death Star and you have just received Darth Vader’s command to present him a working system that allows him to choose the optimal battleship for defeating the Rebellion. Everyone else at work, you decide to take it upon yourself to develop this system.

So, let get going!

Machine Learning is all based on data. So, consider the following data set:

 Size of Darth Vader’s Ship(sq. m.) Rounds of Lasers 2500 500 1695 256 2400 450 1500 240 3000 720 2650 520 . . . . . .

The above data can be plotted as follows:

The question: Given data like this, how can we learn to predict the number of laser rounds Darth Vader can fire, as a function of the size of a ship?

Before we tread any further, we need to establish some notations as the math is about to get messy. We’ll use x(i) to denote the “input” variables (Size of Vader’s ship in this example), also called input features, and y(i) to denote the “output” or target variable that we are trying to predict (laser rounds). A pair     (x(i), y(i)) is called a training example, and the dataset that we’ll be using to learn —a list of m training examples {(x(i), y(i)); i = 1, . . . ,m}—is called a training set. Note that the superscript “(i)” in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values (ship size), and Y the space of output values (laser count). In this example, X = Y = R (Real Numbers).

A general overview of how a Supervised Learning Algorithm works can be drawn from this image:

[ii]

Here, training data is the data we have. New data refers to different sizes of ships that can be entered into the trained model and Prediction is the estimated Laser Rounds the ship can carry.

When the target variable(y) that we’re trying to predict is continuous, such as in our example, we call the learning problem a “regression problem”. When y can take on only a small number of discrete values (such as if, given the size of ship, we wanted to predict if it is a Jet or a Battle Cruiser), we call it a classification problem.

# How does it Work?

A regression algorithm fits data by drawing out features from available training example (in our case, the size of ship). It can have more features such as the Fuel Tank size, canon size and many more. It is up to the programmer/researcher/data scientist to choose the number of features.

To perform supervised learning, we must decide how we’re going to represent functions/hypotheses h in a computer. As an initial choice, let us say we decide to approximate y as a linear function of x (keep track of what x and y denote here):

$h_\theta(x) = \theta_0+\theta_1x_1+\theta_2x_2$                       ..(1)

Here, the θi’s are the parameters (also called weights/features) parameterizing the space of linear functions mapping from X to Y. The hypothesis above takes three features into consideration. To simplify our notation, we also introduce the convention of letting x0 = 1 (this is the intercept term), so that the hypothesis can be written as:

$h_\theta(x) = \sum_{i=0}^n \theta_ix_1 = \theta^Tx$                                  ..(2)

Above, equation (1) is condensed into a shorter form. The right-hand part of eq.1 can easily be written as a summation of θ and x. Now you might wonder how the summation gets converted to matrices in equation 2. Here it is: if we put all the values of θ and x in two individual vectors (vectors are 1-dimensional matrices of order 1*n or n*1) and use transpose (T in eq.2 stands for transpose) on one of them, the product of the two will be equal to right-hand side of eq.1(If you have doubts, pick up a pencil and a paper and see for yourself).

Now, our task is to, given the training set, learn the parameters θ that contribute to the size of ship. On proper thought (go on; use those brains) this can be achieved if the hypothesis computed by out algorithm are close to y. So, we need to calculate

For each set of features, how close is the hypothesis to y. Hence, we define, another equation (don’t worry, it’s the last one for today), the Cost Function:

$j(\theta)=\frac{1}{2}\sum_{i=0}^n(h_\theta(x^{(i)})-y^{(i)})$                      ..(3)

To those of you wondering what gibberish this is, this equation is called the Ordinary Least-Square cost function and to the brainiacs out there who are laying waste on their scalp over why not simply use the difference between h and y, the squared function is used so that the difference is always positive. And no, I didn’t just put it here to confuse you. It is all through a very natural and intuitive process that this equation comes into play (the grand design is at work). We can discuss more on this later.

Let’s move on! The boss fight where we get to know how this algorithm improves its outputs.

We need to find θ that minimizes the cost function (the lower the difference between h and y, the better the prediction). For this, we need to consider the Gradient Descent algorithm which is just a simple update for θ :

$\theta_k=\theta_k-\alpha\frac{\partial}{\partial \theta}j(\theta)$            ..(4)

This equation optimizes θ for all the values from (1, 2….k). k is the order of the feature vector θ and  signifies learning rate (another one of those algorithm you don’t need to care about.) The algorithm is very natural in the sense that it repeatedly takes its steps in the direction of steepest descent, i.e., the direction in which θ the most unless the value for  is so large that values overshoot and the function doesn’t find its minima or it is so small that it take very long to converge. You must avoid both of these conditions.

Upon solving the partial differentiation part in eq. 4, we get  which when plugged into eq.4 gives up our final update rule:

Repeat until convergence: {

$\theta_k=\theta_k+\alpha\sum_{i=0}^n(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}$

(for every j)        }

The process is repeated n times for every j. For every value of j, the algorithm looks though all of the training data and then makes an update. This process is called the Batch Gradient Descent.

Another instance of this algorithm that works as follows:

for i=1 to n:

$\theta_k=\theta_k+\alpha(y^{(i)}-h_\theta(x^{(i)}))x_j^{(i)}$

(for every j)

It is called the Stochastic Gradient Descent. This process, too, works fine and if you have large data set to work on, this would be a better choice over Batch Gradient Descent due to its less convergence time.

After all this hard work, what is the outcome:

We find the values for θ0 = 90.30, θ1 = 0.1592, θ2 = −6.738 and the plot comes out as:

Finally, Darth Vader has a system that can tell him which ship size to choose to defeat Luke!

[iii]

Applications:

The applications of Curve-fitting are ubiquitous. From optimizing biological parameters to astronomy and from the stock market analysis to weather data analysis, Linear Regression is widely to estimate results.