Recognizing Handwritten Digits with scikit-learn
Recognizing handwritten text is a problem that dates back to the first automatic machines that needed to recognize individual characters in handwritten documents. The scikit-learn library for Python provides a good example for understanding this technique, the issues involved, and the possibility of making predictions.
The task here is to predict a numeric value by reading and interpreting an image of a handwritten digit.
Scikit-Learn is a library for Python that contains numerous useful algorithms that can easily be implemented and altered for the purpose of classification and other machine learning tasks.
One of the most fascinating things about the Scikit-Learn library is that it has a 4-step modeling pattern that makes it easy to code a machine learning classifier:
1. Import the model. In Scikit-Learn, all machine learning models are implemented as Python classes.
2. Make an instance of the model.
3. Train the model and store the information learned from the data.
4. Predict the labels of new data using the information the model learned during the training process.
Prerequisites
If you already have Jupyter Notebook and all the necessary Python libraries and packages installed, you are ready to get started.
If not, you can use Google Colab too!
Before moving on to the four steps, first prepare the dataset and split it into training and test sets as described below.
Loading the Dataset
The Scikit-learn library provides numerous datasets, among which we will be using a data set of images called Digits. This data set consists of 1,797 images that are 8x8 pixels in size. Each image is a handwritten digit in grayscale.
Here we load the dataset and print the loaded data.
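A minimal sketch of the loading step, assuming the `load_digits` helper from `sklearn.datasets` is what the article uses (the original code cell is not shown here):

```python
from sklearn import datasets

# Load the Digits dataset that ships with scikit-learn (no download needed).
digits = datasets.load_digits()

# The result is a dictionary-like object; print it and list its keys.
print(digits)
print(digits.keys())
```

The printed object is long, so listing the keys is often the quicker way to see what it contains.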
After loading the dataset, you can analyze its content. First, you can read a lot of information about the dataset through its DESCR attribute.
Analyzing the content
The DESCR text contains a textual description of the dataset, the authors who contributed to its creation, and the references.
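As a short sketch, the description can be printed like this (the variable name `digits` is my own choice):

```python
from sklearn import datasets

digits = datasets.load_digits()

# DESCR is a plain string attribute, so it is read, not called.
print(digits.DESCR)
```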
Here I used the shape and size attributes and the type function to extract basic information about the dataset.
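One way this inspection might look. Which arrays the original inspected is not shown, so checking `images`, `data` and `target` is an assumption:

```python
from sklearn import datasets

digits = datasets.load_digits()

# Every array in the dataset is a NumPy array.
print(type(digits.images))

# images keeps the 8x8 layout; data holds the same pixels flattened to 64 values.
print(digits.images.shape)   # (1797, 8, 8)
print(digits.data.shape)     # (1797, 64)

# target holds one label (0-9) per image.
print(digits.target.shape)   # (1797,)
print(digits.target.size)    # 1797
```

The flattened `data` array is the one passed to the estimator later, since scikit-learn models expect one row of features per sample.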
Visualizing the images and labels in our Dataset
You can visually check the images and their labels using the matplotlib library.
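A possible way to draw the first ten digits with their labels. The 2x5 grid, the color map and the output file name `digits_preview.png` are illustrative choices, not taken from the article:

```python
import matplotlib.pyplot as plt
from sklearn import datasets

digits = datasets.load_digits()

# Draw the first ten images in a 2x5 grid, each titled with its label.
fig, axes = plt.subplots(2, 5, figsize=(10, 4))
for ax, image, label in zip(axes.ravel(), digits.images, digits.target):
    ax.imshow(image, cmap=plt.cm.gray_r, interpolation="nearest")
    ax.set_title(f"Label: {label}")
    ax.axis("off")
fig.tight_layout()

# Save to a file (hypothetical name); in Jupyter the figure also renders inline.
fig.savefig("digits_preview.png")
```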
Train-test split
Now split the dataset into training and test data. Here the training set is 75% of the data and the test set is 25%.
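A sketch of the 75/25 split using `train_test_split`. The `random_state=0` seed is an assumption added only to make the split repeatable; the article does not say which seed, if any, was used:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()

# Hold out 25% of the samples for testing; the seed is an assumed value.
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

print(x_train.shape, x_test.shape)   # (1347, 64) (450, 64)
```

With 1,797 images, this leaves 1,347 for training and 450 for testing.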
The Scikit-Learn 4-Step Modeling Pattern
Step 1. Importing the model we want to use.
An estimator that is useful in this case is sklearn.svm.SVC, which uses the technique of Support Vector Classification (SVC).
Import the svm module of the scikit-learn library.
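The import itself is a single line. The extra checks below are just a quick way to confirm that the model is a Python class exposing the `fit` and `predict` methods used in the later steps:

```python
import inspect

from sklearn import svm

# SVC is an ordinary Python class inside the svm module.
print(inspect.isclass(svm.SVC))

# Like other scikit-learn estimators, it provides fit and predict.
print(hasattr(svm.SVC, "fit"), hasattr(svm.SVC, "predict"))
```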
Step 2. Making an instance of the Model
You can create an estimator of SVC type and then choose an initial setting, assigning generic values to C and gamma. These values can then be adjusted during the course of the analysis.
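A minimal sketch of creating the estimator. The article does not state which values it used, so `C=100.0` and `gamma=0.001` are assumed starting points, not the author's confirmed settings:

```python
from sklearn import svm

# Assumed initial hyperparameters; tune them as the analysis progresses.
svc = svm.SVC(gamma=0.001, C=100.0)

# get_params shows the settings the estimator will use when trained.
params = svc.get_params()
print(params["C"], params["gamma"])
```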
Step 3. Training the Model
Now you can train the svc estimator that you defined by calling its fit method on the training data. After a short time, the trained estimator appears as text output.
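Training might look like the sketch below. The hyperparameters and the `random_state` seed are assumptions carried over for illustration:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)

# Assumed hyperparameters; the article only calls them "generic values".
svc = svm.SVC(gamma=0.001, C=100.0)

# fit learns from the training images and returns the estimator itself.
svc.fit(x_train, y_train)
print(svc)
```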
Step 4. Predicting the labels of new data
Now test your estimator by having it interpret the six digits of the validation set and comparing its predictions with the actual digits.
The svc estimator has learned correctly: it recognizes the handwritten digits and correctly interprets all six digits of the validation set.
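A sketch of the comparison. Taking the first six images of the test split as the "six digits" is my assumption, as are the seed and hyperparameters, so the exact digits printed may differ from the article's:

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)

svc = svm.SVC(gamma=0.001, C=100.0)  # assumed hyperparameters
svc.fit(x_train, y_train)

# Predict the first six test images and print them next to the true labels.
predicted = svc.predict(x_test[:6])
print("Predicted:", predicted)
print("Actual:   ", y_test[:6])
```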
Measuring the performance of our Model
To test the accuracy of our predictions, we can use accuracy_score.
First, assign the predictions for the test set to y_pred with y_pred = svc.predict(x_test).
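Put together, the accuracy check could be sketched as follows. With the assumed seed and hyperparameters used here, the printed value will not necessarily match the figure reported in this article:

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)

svc = svm.SVC(gamma=0.001, C=100.0)  # assumed hyperparameters
svc.fit(x_train, y_train)

# Predict every test image, then compute the fraction predicted correctly.
y_pred = svc.predict(x_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.4f}")
```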
In the above case we got 99.33% accurate predictions, but this may not always be the case. Run at least 3 cases, each with a different range of training and validation sets, and observe the variance.
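One way to run such an experiment is to repeat the split with different random seeds. The helper `evaluate_split` and the seeds 0, 1 and 2 are hypothetical names and values chosen for this sketch:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()


def evaluate_split(seed):
    """Train on one random 75/25 split and return the test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.25, random_state=seed
    )
    svc = svm.SVC(gamma=0.001, C=100.0)  # assumed hyperparameters
    svc.fit(x_train, y_train)
    return accuracy_score(y_test, svc.predict(x_test))


# Three different splits, as suggested above.
scores = [evaluate_split(seed) for seed in (0, 1, 2)]
print("Accuracies:", scores)
print("Variance:  ", np.var(scores))
```

A small variance across splits suggests the reported accuracy does not depend much on which images ended up in the test set.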
Confusion matrix
A confusion matrix is a table that is often used to evaluate the accuracy of a classification model. We can use Seaborn or Matplotlib to plot the confusion matrix. We will be using Seaborn for our confusion matrix.
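A sketch of how the matrix could be computed and drawn with a Seaborn heatmap. The color map, figure size, axis labels and the file name `confusion_matrix.png` are illustrative assumptions, as are the seed and hyperparameters:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import datasets, svm
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

digits = datasets.load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0  # assumed seed
)

svc = svm.SVC(gamma=0.001, C=100.0)  # assumed hyperparameters
svc.fit(x_train, y_train)
y_pred = svc.predict(x_test)

# Rows are actual digits, columns are predicted digits.
cm = confusion_matrix(y_test, y_pred)

fig, ax = plt.subplots(figsize=(8, 8))
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", square=True, ax=ax)
ax.set_xlabel("Predicted label")
ax.set_ylabel("Actual label")

# Save to a file (hypothetical name); in Jupyter the figure also renders inline.
fig.savefig("confusion_matrix.png")
```

Correct predictions fall on the diagonal, so any off-diagonal count shows which digit was mistaken for which.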
The above code displays the confusion matrix.
Conclusion
From this article, we can see that the svc estimator has learned correctly and is able to recognize the handwritten digits using scikit-learn. We also measured the accuracy of our predictions (which in our case is 99.33%). I hope this article helps you with your future endeavors!
Thank you for reading my article!
For the source code, click here.