Handwritten Text Detection using Open CV and CNN

The main aim of the project is Handwritten Text Recognition (HTR). HTR is the task of transcribing images of handwritten text into digital text. In HTR, the text is written by hand and captured by a scanner, and the resulting images are processed as input to return the text they contain. We use OpenCV and a Convolutional Neural Network (CNN) to achieve this task. Our goal is to design a model that transcribes the images to text with high accuracy.


I. INTRODUCTION
Nowadays people use ebooks, which occupy no physical space and allow many copies to be carried comfortably, so there is a need to make more ebooks available. A large number of handwritten texts exist all over the world and need to be safeguarded. By transcribing them we can increase the availability of ebooks. Also, instead of striving to protect old handwritten texts physically, they can be digitized and stored as soft copies with ease.
We apply machine learning techniques to obtain the digital form of handwritten text from its scanned images. We use readymade datasets that contain the pixel values of scanned images as inputs and predict the text in them. This project can also be extended to different languages and writing styles.
Problem Statement: To accurately predict the text from a scanned handwritten text image using Machine Learning algorithms.
For this we assume that all images containing the same letter share the same features, so that an image having those features can be concluded to contain that letter. However, this hypothesis is ideal and may not always hold in practice.

II. DATASETS
Data plays a very important role in machine learning: past data is used to predict future outcomes. The relevant data can be downloaded from the internet.
The data for our project, HTR, consists of pixel values. The data files are in CSV (Comma Separated Values) format. Each row represents one image and contains a label in the first column followed by 784 pixel values, one for each pixel of a 28 X 28 image.
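The row layout described above can be sketched as follows. This is a minimal illustration on two synthetic rows, not the actual Kaggle files; the variable names and the use of NumPy's loadtxt are our assumptions.

```python
import io
import numpy as np

# Two synthetic rows in the described layout: a label, then 784 pixel values.
# (Real rows would come from the dataset CSV files; these values are made up.)
rng = np.random.default_rng(0)
rows = []
for label in (0, 7):
    pixels = rng.integers(0, 256, size=784)
    rows.append(",".join([str(label)] + [str(p) for p in pixels]))
csv_text = "\n".join(rows)

# Parse the CSV text: every row has 1 + 784 = 785 columns.
data = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.int64)

labels = data[:, 0]                        # first column: the label
images = data[:, 1:].reshape(-1, 28, 28)   # remaining 784 columns: one 28 x 28 image

print(data.shape, labels, images.shape)
```

Reshaping each row of 784 values to 28 X 28 recovers the image, which is convenient for plotting a few samples during data visualization.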
The data for our project has thousands of instances and is obtained from Kaggle. Two datasets are used:
✓ One for English alphabets (A_Z)
✓ One for digits (0_9)

DATA VISUALIZATION
For dataset_1 the shape will be (372450, 785) and for dataset_2 the shape will be (42000, 785).

A_Z Handwritten_Data:
The dataset comprises multivariate data of English alphabets. Each instance has a label and pixel values that lie between 0 and 255. There are 372450 instances in total, each with 784 attributes and a label.

0_9 Handwritten_Data:
The dataset comprises multivariate data of digits from 0 to 9. Each instance has a label and pixel values that lie between 0 and 255. There are 42000 instances in total, each with 784 attributes and a label.

ARCHITECTURE
Converting handwritten text, given in the form of pixels, into its digital form is approached in a data-driven way. The data already collected can be used to extract the features of each letter. The availability of more powerful machine learning algorithms offers an efficient and better approach to this problem. The project is divided into two modules.
The first is a Segmentation module, in which an image is taken as input and its letters are detected, bounded, cropped and resized, so that the image is segmented into individual letters. The second is a Training module, where prediction occurs. The output of the segmentation module is the input of the training module.

A. Segmentation Module
The segmentation module is very important for this project, as its output is the input of the other module.

1) Read the image
There are many image-processing libraries, such as Pillow and OpenCV, for performing operations on images. Here OpenCV is used to read and manipulate images. An image is read and then stored in multiple copies so that different operations can be performed on it. After reading, the image is plotted, along with its shape, to make sure it has been read correctly. The image contains letters, each of which needs to be cropped into a 28 X 28 image by the end of the segmentation module.

2) Detecting the letters
Object detection is a computer vision technique that detects certain components in an image or a video. It makes use of Machine Learning and Deep Learning algorithms to yield good results. Detecting letters is the same as detecting objects. We need to apply some standard filters to the input image to achieve this task.
Step 1: Convert the BGR image to a Grayscale image. A BGR image has 3 channels, whereas a Grayscale image consists of a single channel. The channel is the third dimension of an image.
Step 2: Apply Gaussian blur to the Grayscale image. This step removes noise and disturbances from the image. Once the image is blurred, the intensity levels can be recognised more easily. In Computer Vision this kind of blurring is called Gaussian Blur.
Step 3: Otsu Thresholding → A standard step for object detection. It calculates a measure of spread for the pixel levels on each side of a candidate threshold and selects the threshold for which the combined spread is smallest.
Step 5: Storing co-ordinates for rectangular bounding. boundingRect() returns the x, y co-ordinates of the top left point of a detected object together with its width and height, allowing us to draw the detected objects in order. To keep the letters in reading order, we sort by the x co-ordinate of the top left corner. All these lists are stored in a list and sorted on the list[i][0]th elements.

3) Bounding and Cropping
This part of the code is added before cropping in order to add spaces between letters. In the list that contains the bounding details, we add a space string " " if the distance between the x co-ordinates of consecutive entries is greater than 50 pixels, indicating that they must be separated by a space.
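A possible sketch of this space-insertion rule follows. It reads "distance between the x co-ordinates" literally, as the difference between the top left x values of consecutive boxes; the function name and the sample boxes are hypothetical.

```python
SPACE_GAP = 50  # threshold in pixels, as stated in the text


def insert_spaces(boxes, gap=SPACE_GAP):
    """Return a new list with " " placed between boxes that are far apart.

    boxes is a list of [x, y, w, h] lists already sorted by x.
    """
    result = []
    for i, box in enumerate(boxes):
        # Compare the x co-ordinate with that of the previous box.
        if i > 0 and box[0] - boxes[i - 1][0] > gap:
            result.append(" ")
        result.append(box)
    return result


# Hypothetical bounding boxes: two letters, a wide gap, two more letters.
boxes = [[10, 20, 25, 40], [40, 22, 24, 38], [120, 21, 26, 41], [150, 20, 25, 40]]
with_spaces = insert_spaces(boxes)
print(with_spaces)
```

Here the x differences are 30, 80 and 30 pixels, so a single space string is inserted before the third box.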
Bounding and cropping is then performed for the non-space entries, and the cropped images are stored in a list 'img_lst'.
Each element in the list of detected objects is either the boundingRect values of an object detected in the image or a space string. A detected object is considered a letter only if its height is greater than 20 pixels (our assumption). For the objects that are considered letters, we use the boundingRect values to crop the letter from another copy of the original image, stored in a separate variable, and append each cropped image, which is a NumPy array, to a Python list.
'rectangle()' draws a rectangular border around each detected letter; it takes the image, the bounding values, and the colour and width of the border.

4) Resizing
Each cropped image in the list is resized to 28 X 28 pixels. We do so because the output images of this module are the input of the other module, whose training data consists of 28 X 28 pixel images. Each resized image must also be converted to Grayscale to match the requirements of the training model. The images are then sent as input to the training model.
[Figure: subplot of the output of the segmentation module]

B. Training Module
A model is trained using past data and Machine Learning algorithms. It learns from the past data by extracting features and patterns. In this project, Convolutional Neural Networks are used. We split the data into training and testing sets in the ratio 80:20.
The dataframe with the attributes, the dataframe with only the label column, train_size or test_size, and shuffle are the important parameters of train_test_split. shuffle is True by default, which shuffles the data before splitting. Because of this shuffling, the method need not return the same split on every run unless random_state is fixed.
The Scikit-learn library is used to change the form of the data. To train the model, we need to convert the attribute values to the float datatype and the labels to categorical form.
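As an illustration of the 80:20 split, the sketch below calls Scikit-learn's train_test_split on random stand-in data. The array sizes and the fixed random_state (added only to make the split repeatable) are our assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random stand-in data: 100 instances with 784 attributes and a label each.
rng = np.random.default_rng(0)
X = rng.integers(0, 256, size=(100, 784))
y = rng.integers(0, 36, size=100)

# test_size=0.2 gives the 80:20 ratio; shuffle=True is the default.
# random_state fixes the shuffle so that repeated runs give the same split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Leaving out random_state reproduces the behaviour described in the text, where each run may produce a different split.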
The attribute values, which earlier ranged from 0 to 255, are converted to floating point numbers ranging from 0 to 1 by dividing by 255. A pixel value of 0 becomes 0.0000, a pixel value of 255 becomes 1.0000, and a pixel value of 128 becomes approximately 0.5020.
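The scaling can be checked with a few values. This sketch assumes the conversion is a plain division by 255 and that float32 is the target datatype.

```python
import numpy as np

# Three representative pixel values: darkest, mid-range and brightest.
pixels = np.array([0, 128, 255], dtype=np.uint8)

# Convert to float first, then scale into the range 0 to 1.
scaled = pixels.astype("float32") / 255.0

print(scaled)  # 128 / 255 is about 0.502, not exactly 0.5
```

Converting to float before dividing avoids integer arithmetic on the 8-bit pixel values.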
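Categorical form results in a list whose size equals the number of all possible labels, in which an instance with label value i has 1 at index i and 0 in every other element. These values must also be floats.
The categorical form looks like this: with 36 possible labels, an instance with label 2 becomes a list of 36 floats with 1.0 at index 2 and 0.0 everywhere else.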
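A minimal sketch of the categorical (one-hot) form is shown below. It uses plain NumPy as a stand-in for whichever library routine performs the conversion; the helper name to_one_hot is hypothetical.

```python
import numpy as np


def to_one_hot(labels, num_classes):
    """Return a float array with a 1.0 at each label's index and 0.0 elsewhere."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is exactly the one-hot vector for label i.
    return np.eye(num_classes, dtype="float32")[labels]


encoded = to_one_hot([0, 2, 35], num_classes=36)
print(encoded.shape)
print(encoded[1][:5])  # first five entries of the vector for label 2
```

Each row sums to 1.0, which matches the format expected by a categorical cross-entropy loss.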
Now we can proceed to train the model. A sequence of hidden layers is created, each with some number of nodes. The first hidden layer is a 2-dimensional Convolution layer with kernel size 5 X 5 and 32 nodes (filters). The activation function used is 'relu'. Its output is max pooled with a 2 X 2 window, and overfitting is reduced using a dropout of 0.3. Then two more layers are added with 128 and n nodes respectively, where n is the number of possible outputs (here n = 36, that is, 26 letters and 10 digits). We then compile the model with the 'categorical_crossentropy' loss function, the 'adam' optimizer and the 'accuracy' metric.
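The described network could be written in Keras roughly as follows. This is a sketch under several assumptions that the text does not state: the input shape (28, 28, 1), default padding and stride, a Flatten layer before the dense layers, 'relu' on the 128-node layer and 'softmax' on the output layer.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_CLASSES = 36  # n in the text: 26 letters and 10 digits

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),                 # assumed input shape
    layers.Conv2D(32, kernel_size=(5, 5), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.3),
    layers.Flatten(),                               # assumed: needed before Dense
    layers.Dense(128, activation="relu"),           # assumed activation
    layers.Dense(NUM_CLASSES, activation="softmax"),  # assumed activation
])

model.compile(
    loss="categorical_crossentropy",
    optimizer="adam",
    metrics=["accuracy"],
)

model.summary()
```

Training would then call model.fit on the scaled images and their categorical labels; the number of epochs and the batch size are not given in the text.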

VIII. CONCLUSION
A Convolutional Neural Network learns from the data and simplifies the model by reducing the number of parameters, and hence gives considerable accuracy.
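The reduction in parameters can be illustrated with a small calculation. It compares the convolution layer described earlier with a hypothetical fully connected layer that produces the same number of outputs, assuming a 28 X 28 single-channel input, no padding and a stride of 1.

```python
# Convolution layer: 32 filters of size 5 x 5 on a single-channel input.
kernel_h, kernel_w, in_channels, filters = 5, 5, 1, 32
conv_params = kernel_h * kernel_w * in_channels * filters + filters  # weights + biases

# With no padding and stride 1, each filter produces a 24 x 24 output.
out_h = 28 - kernel_h + 1
out_w = 28 - kernel_w + 1
outputs = out_h * out_w * filters

# Hypothetical fully connected layer from all 784 inputs to the same outputs.
inputs = 28 * 28 * in_channels
dense_params = inputs * outputs + outputs  # weights + biases

print(conv_params)   # 832
print(dense_params)  # 14469120
```

Because the same 5 X 5 weights are shared across every position of the image, the convolution layer needs 832 parameters where the fully connected equivalent would need more than 14 million.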

Future Enhancements
We can increase the accuracy:
→ By taking larger datasets
→ By adopting more suitable algorithms
→ By training the model for a larger number of epochs
→ By hyper-parameter tuning (there are a lot of parameters that we can play with)
→ By using deeper architectures
This application can be taken to the next level by:
→ Extending its scope to different writing styles