First Steps with Machine Learning

Navneet Suman
Published in learning-ai · 5 min read · Jun 17, 2016

With a few lines of code…

Machine learning is about learning to do better in the future based on what was experienced in the past.

The learning that is being done is always based on some sort of observations or data, such as examples, direct experience, or instruction. For instance, you might wish to predict how much a user, Bob, will like a movie that he hasn’t seen, based on his ratings of movies that he has seen. This means making informed guesses about some unobserved property of an object, based on observed properties of that object.

The act of teaching a program to react to or recognize patterns.

Supervised learning is a type of machine learning that uses a known dataset (called the training dataset) to make predictions. The training dataset includes input data and response values. It begins with examples of the problem you want to solve.

Tools I will use

Python

Enough of the writing already. Let’s jump in with a problem. Consider the sample data given below.

Training data:

Weight | Texture | Label
140g   | Smooth  | Apple
130g   | Smooth  | Apple
150g   | Bumpy   | Orange
170g   | Bumpy   | Orange

For starters we will take the descriptions of the fruits as input and try to predict whether they are apples or oranges based on features like weight and texture.

A good feature makes it easy to discriminate between different types of fruits.

Each row in our training data describes one piece of fruit. The last column is the label, which identifies the type of fruit. Here we have just two types of fruit, i.e. apple and orange. The whole table is our training data. The more training data you have, the better the classifier you can create.

features = [[140,'smooth'], [130,'smooth'], [150,'bumpy'], [170, 'bumpy']]
labels = ['apple','apple','orange','orange']

The features contain the first two columns and the labels contain the last. We can think of the features as the input we provide and the labels as the output we desire.

Now, for calculation, we are going to represent the features and labels as integers instead of strings.

features = [[140,1], [130,1], [150,0], [170, 0]]
labels = [0,0,1,1]

We are going to start with a decision tree classifier. Decision Trees (DTs) are a non-parametric supervised learning method; the goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. I will discuss it in detail in future posts. For now we can think of the classifier as a box of rules, because there are many types of classifiers but the input and output types are always the same.
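To see what "the input and output types are always the same" buys us, here is a small sketch that swaps the decision tree for a k-nearest-neighbours classifier without touching the rest of the code. KNeighborsClassifier is only used here as an illustration of the shared interface, not as part of the tutorial's method.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]

# Every scikit-learn classifier shares the same fit/predict interface,
# so we can swap one "box of rules" for another without changing the
# surrounding code.
predictions = []
for clf in (DecisionTreeClassifier(), KNeighborsClassifier(n_neighbors=1)):
    clf = clf.fit(features, labels)
    predictions.append(clf.predict([[160, 0]])[0])

print(predictions)  # both classifiers agree: 1 (orange)
```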

Now we can build the classifier.

from sklearn import tree
features = [[140,1], [130,1], [150,0], [170, 0]]
labels = [0,0,1,1]
clf = tree.DecisionTreeClassifier()

At this point it is just an empty box of rules. It does not know anything about apples and oranges yet. To train it we need a learning algorithm. If the classifier is the box of rules, you can think of the learning algorithm as the process that creates them. It does this by finding patterns in your training data. For instance, it may observe that apples are always smooth in texture. Or it may observe that oranges weigh more than apples, so the heavier the fruit is, the more likely it is to be an orange.
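As a rough sketch (not scikit-learn's actual implementation), here is how a learning algorithm might score candidate split points on the weight feature, using Gini impurity to measure how cleanly each threshold separates apples from oranges:

```python
# A toy illustration of split-point search on the weight feature.
weights = [140, 130, 150, 170]   # grams
labels  = [0, 0, 1, 1]           # 0 = apple, 1 = orange

def gini(group):
    """Gini impurity of a list of labels: 0 means perfectly pure."""
    if not group:
        return 0.0
    p = sum(group) / len(group)  # fraction of oranges
    return 2 * p * (1 - p)

best = None
for threshold in sorted(weights):
    left  = [l for w, l in zip(weights, labels) if w <= threshold]
    right = [l for w, l in zip(weights, labels) if w > threshold]
    # Weighted average impurity of the two sides of the split.
    score = (len(left) * gini(left) + len(right) * gini(right)) / len(weights)
    if best is None or score < best[0]:
        best = (score, threshold)

print(best)  # -> (0.0, 140): splitting at 140g separates the fruits perfectly
```

Splitting at "weight <= 140g" puts both apples on one side and both oranges on the other, which is exactly the kind of rule the decision tree will learn.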

In scikit-learn the learning algorithm is included in the classifier object, and it is called fit.

from sklearn import tree
features = [[140,1], [130,1], [150,0], [170, 0]]
labels = [0,0,1,1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

At this point we have a trained classifier and we are ready to test it now. Suppose we want to test a fruit which weighs 160g and is bumpy.

from sklearn import tree
features = [[140,1], [130,1], [150,0], [170, 0]]
labels = [0,0,1,1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
print(clf.predict([[160,0]]))

The output will be 0 if it’s an apple or 1 if it’s an orange.

Looking at the dataset we can say that the fruit is an orange, as it’s bumpy and also relatively heavy. If we run the program we will find it gives the same result. That’s it. We have taken our first step into the machine learning world.
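Since predict() accepts a list of samples, we can also classify several new fruits in one call. The two test fruits below are made-up examples, not part of the original dataset:

```python
from sklearn import tree

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# Classify two fruits at once: a light smooth one and a heavy bumpy one.
print(clf.predict([[135, 1], [165, 0]]))  # -> [0 1]: an apple and an orange
```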

Let’s see how our program predicted the result. I am going to use some graph libraries to generate the decision tree.

from sklearn import tree
from sklearn.externals.six import StringIO
import pydot

features = [[140,1], [130,1], [150,0], [170, 0]]
labels = [0,0,1,1]
target_names = ['Apple', 'Orange']
features_name = ['weight', 'texture']
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)
print(clf.predict([[160,0]]))

# Generating the visualization
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data,
                     feature_names=features_name,
                     class_names=target_names,
                     filled=True, rounded=True,
                     impurity=False)
graph = pydot.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("fruit.pdf")

So you can see the rules our classifier uses to predict the class of the sample data. Our classifier assumes the fruit is an orange if its weight is greater than 145g.
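If you would rather read the learned rule programmatically than from the PDF, the fitted classifier exposes its structure through the tree_ attribute. One caveat, as an aside: with this tiny dataset either feature separates the classes perfectly, so depending on tie-breaking the root split may land on weight (index 0, threshold 145g) or on texture (index 1, threshold 0.5):

```python
from sklearn import tree

features = [[140, 1], [130, 1], [150, 0], [170, 0]]
labels = [0, 0, 1, 1]
clf = tree.DecisionTreeClassifier()
clf = clf.fit(features, labels)

# tree_.feature holds the feature index tested at each node (-2 marks a
# leaf) and tree_.threshold holds the split value at each node. Index 0
# is the root of the tree.
root_feature = clf.tree_.feature[0]
root_threshold = clf.tree_.threshold[0]
print(root_feature, root_threshold)
```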

The decision tree can become complex based on many factors, such as the features and their values.

So finally we have taken the first step into the machine learning world. We saw how we can classify a simple dataset using a decision tree and then use the classifier to classify new values. You can use the above example and the same concepts to try to classify the dataset given below.

Sample training data

In the next post I will discuss the decision tree in more detail, classify the IRIS dataset, generate its decision tree and analyse it.

If you enjoyed reading, please support my work by hitting that little green heart!

Twitter: Find me here.
