In this post, I am going to give a quick tutorial on how to use KNN in sklearn to predict the class of a car.
The dataset is from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/car+evaluation. To follow the tutorial, download the dataset from the Data Folder link on that page, and also read the dataset description so you understand what each field means. Each row in the file is a comma-separated list of a car's attributes followed by its class.
I have downloaded the dataset locally and stored it as car.data, so let's go ahead and start the code. Before you run the code, make sure you have sklearn, pandas, and numpy installed.
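The setup described below might look like the following minimal sketch, assuming the file is saved locally as car.data; the column names are taken from the UCI dataset description.

import numpy as np
import pandas as pd
from sklearn import neighbors, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# the file has no header row, so read it as plain comma-separated values
car = pd.read_csv('car.data', header=None)

# assign column names taken from the UCI dataset description
car.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']
print(car.head())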
In the code above, I imported pandas, numpy, and, from sklearn, train_test_split, neighbors, metrics, and LabelEncoder. All of these modules will be explained as they are used in the code. I used pandas to read the file car.data, specifying that the data is comma separated. The file does not have column headings, so I assigned the column names to the data frame via car.columns. The UCI website explains what each field in the data means. I then printed the head of the car data frame so we can see the columns in the data.

The class of the car is what we are trying to predict from the features. There are four classes: unacc, acc, good, and vgood, which stand for unacceptable, acceptable, good, and very good. So based on the features in the dataset (buying, maint, doors, persons, lug_boot, and safety), we want to predict which class a car belongs to.

We are going to use the KNN algorithm, which assumes that similar things stay close to each other; it's like the saying "birds of a feather flock together." KNN is a supervised machine learning algorithm, which means it needs labeled data to train on; once trained, it can predict which label to assign to new instances. You can read more about the details of the algorithm itself, but the purpose here is to show how to use it for prediction.
For this tutorial, I will use only three features, buying, maint, and safety, as the predictors and ignore the rest. So here we go:
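A minimal sketch of this step, continuing from the snippet above (the car data frame comes from the earlier read_csv call):

X = car[['buying', 'maint', 'safety']].values   # selected features as a NumPy array of strings
y = car[['class']]                              # labels kept as a data frame for now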
print(X)
In the code above, I selected the chosen features and put them into a NumPy array X, and put the labels into the y data frame. You can see that X and y contain categorical (string) values rather than numerical values, so we need to convert the data into a numerical representation using sklearn's LabelEncoder before we can use it in our algorithm.
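A sketch of that conversion, assuming the X array and y data frame from the previous snippet; the exact numbers in the class mapping are an arbitrary illustrative choice.

from sklearn.preprocessing import LabelEncoder
import numpy as np

Le = LabelEncoder()
# encode each feature column of X as integers
for i in range(len(X[0])):
    X[:, i] = Le.fit_transform(X[:, i])

# map each class label to a number manually
label_mapping = {'unacc': 0, 'acc': 1, 'good': 2, 'vgood': 3}
y['class'] = y['class'].map(label_mapping)
y = np.array(y)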
In the code above, I created an instance of the LabelEncoder called Le, then looped over the columns of the X array and converted each categorical value into a numerical value. For the y data frame, I created a manual mapping from each class label to a number, applied that mapping to the y data frame, and then converted y into a NumPy array.
Now that the data processing is complete, we can split the dataset into training and test sets, train the model on the training set, and evaluate it on the test set.
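A sketch of this final step, continuing from the snippets above; the test_size and n_neighbors values here are illustrative choices rather than tuned settings.

from sklearn.model_selection import train_test_split
from sklearn import neighbors, metrics

# hold out 20% of the data for testing (illustrative split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# k = 25 neighbors is an illustrative choice; tune it for your data
knn = neighbors.KNeighborsClassifier(n_neighbors=25, weights='uniform')
knn.fit(X_train, y_train.ravel())   # ravel() flattens y into a 1-D array

prediction = knn.predict(X_test)
accuracy = metrics.accuracy_score(y_test, prediction)
print('predictions:', prediction)
print('accuracy:', accuracy)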