Notes are available at the course wiki:

https://share.coursera.org/wiki/index.php/ML:Main

A supplement to these notes is the following page, which goes into more detail on deep learning and belief networks:

http://ufldl.stanford.edu/wiki/index.php/UFLDL_Tutorial

Here I will make notes on the Coursera course on Machine Learning.

Week 1 Lecture 1

Supervised Learning

Let's say you want to predict housing prices. You have a Price vs. square feet X-Y plot. So one learning algorithm can be a straight-line fit to the data. Another can be fitting a quadratic function, which may give you a better price prediction. In later lectures we'll study how to choose a better algorithm. This is an example of supervised learning: you had data and you had to predict the price of a value in between the data points. Regression was the first example.
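As a sketch (not from the course), the two fits above can be done with NumPy's polyfit; the housing numbers below are made up for illustration.

```python
import numpy as np

# Hypothetical housing data: size in square feet vs. price in $1000s (made-up values)
size = np.array([1000.0, 1500.0, 2000.0, 2500.0, 3000.0])
price = np.array([200.0, 280.0, 370.0, 480.0, 610.0])

# Straight-line fit: price ~ b*size + a
line = np.polyfit(size, price, deg=1)

# Quadratic fit: price ~ c*size^2 + b*size + a
quad = np.polyfit(size, price, deg=2)

# Predict the price of a house whose size lies between the training points
predict_line = np.polyval(line, 1750.0)
predict_quad = np.polyval(quad, 1750.0)
print(predict_line, predict_quad)
```

Since this toy data curves upward, the quadratic fit tracks the in-between points a bit more closely than the straight line, which is the point of the comparison in the lecture.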

Let's say we want to predict whether a breast cancer tumor is malignant. On the x axis we have tumor size and on the y axis we have malignancy (0 or 1). Let's say we have a person whose tumor size lies somewhere in between the observed values and we want to estimate the probability that her tumor is malignant. This comes under what are called classification problems. The labels need not be binary: 0-benign, 1-mild, 2- , 3-, 4-, 5-malignant.

Another way to show the same data is to plot it on a single line.

In the above problem we had only one feature, but we may have more than one, like Age and Tumor Size along with the malignant/benign label. For such data the learning algorithm may, say, fit a straight line separating the malignant and benign tumors, so that you can tell whether a new tumor is malignant or benign.

So other features may be

Clump Thickness

Uniformity of cell shape

Uniformity of cell size

Ideally we would want an infinite number of features so that an algorithm can perform best. To handle that, there is a mathematical trick used in support vector machines.
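The "mathematical trick" alluded to here is the kernel trick: a kernel function computes inner products in a very high- (even infinite-) dimensional feature space without ever building the features. A minimal NumPy sketch of the Gaussian (RBF) kernel, whose implicit feature space is infinite-dimensional (the function name and values are my own, not from the course):

```python
import numpy as np

def rbf_kernel(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||u - v||^2).

    This evaluates an inner product in an infinite-dimensional feature
    space without ever constructing those features explicitly -- the
    trick that lets support vector machines use "infinite" features.
    """
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return np.exp(-gamma * np.dot(diff, diff))

# Identical points have kernel value 1; distant points approach 0
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
far = rbf_kernel([0.0, 0.0], [10.0, 10.0])
print(same, far)
```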

Lecture 2 – Unsupervised Learning

In supervised learning, for each example we had labels telling us what the right answer was.

In unsupervised learning we give the algorithm data and ask it to find structure in that data. Let's say we have an X-Y plot; a clustering algorithm may then find the two categories into which the data falls.

Examples are the clustering of articles by Google News, and genomic classification.

So this is unsupervised learning: we don't say that these are people of type one, we just give the data and ask the algorithm to cluster it. It is also used to organize large computer clusters, in social network analysis, and by companies for market segmentation. This is unsupervised learning because we don't tell the algorithm in advance how many groups of people are out there. All of these are examples of clustering.

Another could be

Cocktail party problem: let's say we have 2 speakers, each with their own microphone. Each microphone records both voices plus the common noise. We can take these two recordings and give them to an unsupervised learning algorithm. What the cocktail party algorithm will do is separate out the different sources. For example, English and Japanese speech can be separated, or if we have music and a person talking, the algorithm may separate the two. It may seem that this would be a huge task if done any other way, but using machine learning it can be done with a single line of code:

[W,s,v] = svd((repmat(sum(x.*x,1),size(x,1),1).*x)*x');

It's not as if the problem itself is easy, but it can be expressed in Octave this easily.

svd- singular value decomposition

In C++ or Java this would take a lot of code to implement.
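For comparison, here is a rough NumPy translation of the Octave one-liner above, assuming x holds one recorded signal per row; the toy sample values are made up, and NumPy's broadcasting replaces Octave's repmat.

```python
import numpy as np

# Toy mixed recordings: 2 microphones x 5 samples (made-up values)
x = np.array([[0.2, 1.0, -0.5, 0.3, 0.8],
              [1.1, -0.4, 0.6, -0.2, 0.5]])

# Octave: (repmat(sum(x.*x,1), size(x,1), 1) .* x) * x'
# sum(x.*x,1) sums down each column; broadcasting tiles it over the rows
col_sq = np.sum(x * x, axis=0)   # per-sample sum of squares, shape (5,)
M = (col_sq * x) @ x.T           # 2 x 2 matrix, as in the Octave expression

# svd = singular value decomposition, same as Octave's svd()
W, s, Vt = np.linalg.svd(M)
print(W.shape, s.shape)
```

Note that np.linalg.svd returns the transpose of V (Vt) rather than V itself, unlike Octave; the point stands that in a high-level numeric language this remains a couple of lines.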

Lecture 3 First Model

Let's say we have a data set of house size vs. price and you have to tell how much a house costs. Let's say we do it by linear regression; this is a supervised learning problem. The other type of supervised learning algorithm is a classification algorithm. The data set we learn from is called a training set.

Symbols

m - number of training examples

x - input variables / features

y - output variable / target

(x,y) represents a single training example. (x(i),y(i)) is the i-th training example.
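To pin the notation down, here is a tiny illustrative training set (the house sizes and prices are made-up values, not from the course):

```python
# Tiny illustrative training set (made-up values): house size -> price
x = [2104, 1416, 1534, 852]   # input variable / feature
y = [460, 232, 315, 178]      # output / target variable

m = len(x)                    # m = number of training examples

# (x(i), y(i)) is the i-th training example; with the course's
# 1-based indexing, the 2nd training example is:
i = 2
example = (x[i - 1], y[i - 1])
print(m, example)
```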

We feed the training set to the learning algorithm, and the learning algorithm outputs a prediction function h(x), called the hypothesis. So it maps from sizes of houses to predicted prices.

How do we represent the hypothesis h ?

h(x) = a + bx

What this does is fit the linear equation a + bx to the data. This model is called linear regression; another name is univariate linear regression (linear regression with one variable).

Lecture 4 – Cost Function

This will tell how to best fit a straight line to our data

h(x) = a + bx

a,b are the parameters of the model

With different values of a and b we can give different values of hypothesis function.

Cost function is J(a,b) = (1/2m) x Sigma over i=1..m of (h(x(i)) - y(i))^2

This is the squared-error cost function, and we need to minimize it. If it seems too abstract, plot it as a 3d figure with J on the vertical axis and a, b on the two horizontal axes.
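A small sketch of computing J(a,b) directly from the formula above; the toy data lies exactly on y = 1 + 2x (my choice, for illustration), so J(1,2) should be zero.

```python
def cost(a, b, xs, ys):
    """Squared-error cost J(a, b) = (1/2m) * sum((h(x) - y)^2)
    for the hypothesis h(x) = a + b*x."""
    m = len(xs)
    return sum((a + b * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

# Toy data lying exactly on y = 1 + 2x
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]
j_perfect = cost(1.0, 2.0, xs, ys)   # perfect fit -> cost 0
j_zero = cost(0.0, 0.0, xs, ys)      # h(x) = 0 everywhere -> larger cost
print(j_perfect, j_zero)
```

Minimizing J over a and b picks the line that makes these squared prediction errors as small as possible.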

Lecture 5 Linear Regression – Supervised Learning

Understanding the cost function better

When we have one parameter, the cost function looks like a bowl-shaped (parabolic) curve, and when we have 2 parameters we have the analogous surface in 3d. We are not going to use the 3d plots; rather we'll use contour plots.

We'll need to automate the process of minimizing the value of the cost function J. And when the number of parameters increases, we can't plot it anymore either.

Lecture 6 – Gradient Descent

Let's assume we have some function J(a,b) and we want to minimize it. The idea of gradient descent: we'll start with guesses for a and b, say a=0 and b=1, and keep changing their values until we reach a minimum of the cost function.

Lecture 7- Gradient Descent Intuition

What is the algorithm doing, and how does it work?

Week 2 – Lec 1 – Multivariate Linear Regression

Previously: h(x) = a + bx

Now we may have

h(x) = a0 + a1.x1 + a2.x2 + … + an.xn
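With the convention x0 = 1, the multivariate hypothesis collapses into a single dot product; a short sketch with made-up parameter and feature values (n = 3):

```python
import numpy as np

# Hypothetical parameters and features for n = 3 (made-up values)
a = np.array([4.0, 0.5, -2.0, 3.0])   # a0, a1, a2, a3
x = np.array([1.0, 2.0, 1.5, 0.5])    # x0 = 1 by convention, then x1..x3

# h(x) = a0 + a1*x1 + a2*x2 + a3*x3, written as one dot product
h = a @ x
print(h)
```

Writing the hypothesis this way is what makes the later vectorized implementations (and the matrix form over a whole training set) possible.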

Convolutional Neural Networks CS231n Stanford University

Deep Learning for Natural Language Processing