In this post, I am going to give a quick tutorial on how to use KNN in sklearn to classify cars by their evaluation class.

The dataset is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/car+evaluation. To follow the tutorial, download the dataset from the data folder link at that location, and also read the dataset description so you understand what each field means. The dataset looks like this in the file:
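There is no header line; each row is seven comma-separated values, and the first few rows should look roughly like this:

vhigh,vhigh,2,2,small,low,unacc
vhigh,vhigh,2,2,small,med,unacc
vhigh,vhigh,2,2,small,high,unacc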

I have downloaded the dataset locally and saved it as car.data, so let's go ahead and start the code. Before you run it, make sure you have scikit-learn, pandas and numpy installed (for example, with pip install scikit-learn pandas numpy).

import sklearn
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import neighbors, metrics
from sklearn.preprocessing import LabelEncoder

# The file has no header row, so tell pandas not to treat the first data row as one
car = pd.read_csv('car.data', sep=',', header=None)
car.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
print(car.head())

In the above code, I imported the sklearn library, pandas, numpy, train_test_split, neighbors, metrics and LabelEncoder. All these modules will be explained as they are used in the code. I used pandas to read the file car.data, specifying that the data is comma separated and has no header row, and then added column headings to the data frame by assigning to car.columns (the column names, and what each field means, are documented on the UCI website). Finally I printed the head of the car data frame so we can see the columns in the data.

The class of the car is what we want to evaluate, so the goal is to predict which class a car belongs to given its features. There are 4 classes: unacc, acc, good and vgood, meaning unacceptable, acceptable, good and very good. Based on the features buying, maintenance, doors, persons, lug_boot and safety, we want to predict the class of the car.

We are going to use the KNN algorithm, which assumes that similar things stay close to each other; it is like the saying "birds of a feather flock together". KNN is a supervised machine learning algorithm, which means it needs labeled data to train on; once trained, given new data it will predict which label to assign to each instance. You can read more on the details of the algorithm itself, but the purpose here is to show how to use it to predict.
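As a quick sanity check before modeling, it is worth looking at how the classes are distributed; this dataset is heavily skewed toward the unacc class:

# Quick look at the class distribution; most cars are unacceptable
print(car['class'].value_counts())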

For this tutorial, I will use 3 features, buying, maint and safety, as the predictors and ignore the rest of the features. So here we go:

X = car[[
    'buying',
    'maint',
    'safety'
]].values
y = car[['class']].copy()  # copy so the later map() does not trigger a SettingWithCopyWarning

print(X)

print(y)

In the code above, I selected those features and put them into a numpy array X, and put the labels into the y data frame. You can see that we have categorical (string) values rather than numerical values in X and y, so we need to convert the data into a numerical representation using the LabelEncoder in sklearn before we can feed it to our algorithm.

Le = LabelEncoder()
for i in range(len(X[0])):
    # Re-fit the encoder on each column and replace the strings with integer codes
    X[:, i] = Le.fit_transform(X[:, i])
label_mapping = {
    'unacc': 0,
    'acc': 1,
    'good': 2,
    'vgood': 3
}
y['class'] = y['class'].map(label_mapping)
y = np.array(y).ravel()  # flatten to a 1D array so sklearn does not warn about the shape
print(X)
print(y)

In the above code, I created an instance of the LabelEncoder called Le, then looped over the columns of the X array and converted each categorical value into a numerical one (fit_transform re-fits the encoder on each column, assigning integer codes in alphabetical order of that column's values). For the y data frame, I created a manual label mapping that assigns a number to each class value, updated the y data frame with that mapping, and then converted y into a flat numpy array.
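One caveat with re-using a single LabelEncoder in a loop is that, after the loop finishes, Le only remembers the mapping for the last column. If you ever need to encode new raw inputs or decode predictions back to strings, one option is to keep a fitted encoder per column; a minimal sketch along these lines (the encoders dictionary here is just an illustrative helper, not part of the code above):

# Alternative to the loop above that keeps one fitted encoder per column,
# so the string-to-code mappings can be reused (or inverted) later
X = car[['buying', 'maint', 'safety']].values
encoders = {}
for i, col in enumerate(['buying', 'maint', 'safety']):
    encoders[col] = LabelEncoder()
    X[:, i] = encoders[col].fit_transform(X[:, i])
print(encoders['safety'].classes_)  # the string categories, in the order of their codes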

Now that the data processing is complete, we can split the dataset into train and test sets, train the model on the training set, and test it on the test set.

knn = neighbors.KNeighborsClassifier(n_neighbors=25, weights='uniform')
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)
accuracy = metrics.accuracy_score(y_test, prediction)
print(accuracy)

In the above, I created a knn instance from KNeighborsClassifier with k = 25 and a uniform weight for each of the data points. I then split the dataset 80/20 and trained the model on the training set. After training, I used the model to predict the labels for the test dataset and found the accuracy of the prediction to be about 72%; since train_test_split picks a random split, your exact number will vary from run to run. This means the class for the test data has been predicted correctly about 72% of the time.
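Accuracy alone can hide how the model does on the rarer classes (good and vgood are scarce in this dataset). If you want a per-class breakdown, sklearn's classification_report is a quick way to get one:

# Per-class precision and recall; useful because the classes are imbalanced
print(metrics.classification_report(y_test, prediction))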
Let's do some extra work to create a data frame so we can look at the test data's original labels next to what the model predicted.
test = pd.DataFrame(y_test, columns=['original'])
pred = pd.DataFrame(prediction, columns=['pred'])
df = pd.merge(test, pred, left_index=True, right_index=True)
df['status'] = np.where(df['original'] == df['pred'], 'passed', 'failed')
print(df.groupby('status').count())
print(df.head())

So I created a test data frame from the y_test data and a pred data frame from the prediction array, then merged the two into the df data frame based on just their indices. I added a status column to df indicating passed or failed, depending on whether the original class in the test data equals the predicted class. Next I counted the failed and passed rows by grouping on status; in my run we see 250 correct predictions and 96 failed ones (the exact counts depend on the random split).
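To see which classes the model tends to confuse with each other, you can also cross-tabulate the original labels against the predictions (both use the 0 to 3 codes from the label mapping):

# Rows = actual class codes, columns = predicted class codes
print(pd.crosstab(df['original'], df['pred']))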
We can also take just one instance of data from our dataset and predict its class.
idx = 20
print('actual value:', y[idx])
print('predicted:', knn.predict(X)[idx])

In the above code, I chose the 21st data instance in the dataset (index 20) and predicted its class. As can be seen, the actual class was 0 and it was correctly predicted as 0; 0 corresponds to unacceptable in our label mapping.
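If you want to classify a genuinely new car rather than a row we already have, its raw string values have to go through the same encoding first. A minimal sketch, assuming you kept the per-column encoders dictionary from the earlier sketch (the sample values below are just an illustration):

# Classify a brand-new observation: encode its raw strings with the fitted encoders first
new_car = [['vhigh', 'low', 'high']]  # illustrative buying, maint, safety values
encoded = [[encoders[col].transform([val])[0]
            for col, val in zip(['buying', 'maint', 'safety'], row)]
           for row in new_car]
print('predicted class code:', knn.predict(encoded)[0])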
That is the end of the tutorial for this topic. Thank you.