Gujarati Handwritten Character Recognition Using Hybrid Method Based On Binary Tree-Classifier And K-Nearest Neighbour

Chhaya Patel; Apurva Desai

doi:10.17577/IJERTV2IS60509

Volume 02, Issue 06 (June 2013)


Open Access
Authors : Chhaya Patel, Apurva Desai
Paper ID : IJERTV2IS60509
Volume & Issue : Volume 02, Issue 06 (June 2013)
Published (First Online): 24-06-2013
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Chhaya Patel
MCA Department, Anand Institute of Information Science, Anand, India

Apurva Desai
Department of Computer Science, Veer Narmad South Gujarat University, Surat, India

Abstract

Gujarati is a language used by more than 50 million people worldwide. With the spread of ICT in India, Optical Character Recognition (OCR) for Indian scripts is in demand. Very little OCR research exists for Gujarati script, especially in handwritten form. This paper describes a hybrid approach based on a tree classifier and k-Nearest Neighbor (k-NN) for recognition of handwritten Gujarati characters. A combination of structural and statistical features is used for classification and identification of characters. The features are relatively simple to derive. The structural features were selected by studying the appearance of various handwritten characters. Moment based and centroid based features are combined for the first time for character recognition of Gujarati script. A success rate of 63% is achieved using the proposed method, which is acceptable, as this is one of the few attempts to recognize the whole character set of handwritten Gujarati.

Keywords- Feature representation, k-Nearest Neighbor, Moments, Optical Character Recognition (OCR), Tree-classifier

Introduction

Gujarati script is derived from the Devanagari script. The Gujarati language is used by more than 50 million people, mainly in the state of Gujarat in India and also worldwide, as Gujarati people are settled in many countries. Work on Optical Character Recognition (OCR) for Gujarati script is very limited. Only a few attempts exist, addressing either one or two stages of the OCR process [10] or OCR for a limited set of characters [9,11] of Gujarati script. The majority of the work for this language is also for printed rather than handwritten form. A rich cultural heritage is available in handwritten form for this script. As Gujarati is the official language of the state of Gujarat, most correspondence within government departments and other institutes is carried out in Gujarati, either handwritten or printed. Many OCR solutions are available for other languages of Indian origin such as Bangla, Devanagari and Gurmukhi, but no OCR solution is available for handwritten Gujarati. Character recognition for Gujarati script will help in developing a full-fledged OCR system for Gujarati.

This paper describes an approach to identify numerals and characters of Gujarati script. The suggested approach is based on structural and statistical features. The structural features are derived from the characteristics of the characters. These features are used to generate a tree classifier that classifies the whole set of characters into subsets. Additional structural and statistical features are then used with k-NN to recognize an individual character within each subset. An overall recognition accuracy of 63.1% is achieved for a writer-independent data set of characters. The data set was generated by collecting samples from more than 200 writers of different age groups and genders.
Characteristics of Gujarati script and challenges for Gujarati OCR

The lack of OCR activities for Gujarati characters may be due to many reasons. The Gujarati script has a wide range of characters: 35 consonants, 13 vowels and 6 signs, 13 dependent vowel signs, 4 additional vowels for Sanskrit, 9 digits and 1 currency sign, which is almost double the number of letters in the English alphabet. Fig. 1 shows the most frequently used consonants, vowels and numerals.

Fig. 1 – The most frequently used consonants, vowels and numerals of Gujarati Script

Many characters look quite similar to each other. Some letters resemble numerals, which may cause confusion during identification, e.g. the numeral [glyph] and the letter [glyph], the numeral [glyph] and the letter [glyph], the numeral [glyph] and the letter [glyph], and the numeral [glyph] and the letters [glyph] and [glyph]. The numeral [glyph] can also easily be confused with part of the letter [glyph].

Some characters are easily confused, especially in the presence of noise, which may lead to classification errors; even a human may need context to identify them correctly. The characters [glyphs] form one such group. Other such pairs are [glyph] and [glyph] (in handwritten form), [glyph] and [glyph], and [glyph] and [glyph].

The numeral [glyph] and the characters [glyphs] are formed from more than one object or part. Such characters can easily be misinterpreted as sub-parts of other characters or as conjunct characters, which are formed by combining more than one basic character. These characteristics of Gujarati script are among the reasons for the slow progress of Gujarati OCR.
OCR activities for handwritten documents in Indian Languages

Publications on OCR for Indian scripts are fewer than those for other languages such as English, Japanese and Chinese. Indian scripts received attention somewhat later than these languages. For any language of Indian origin, very few publications address handwritten OCR. Except for Hindi, Bangla and a few south Indian languages, OCR activity for handwritten text is negligible.

Off-line recognition of pre-segmented Malayalam handwritten characters based on a Kolmogorov-Smirnov statistical classifier and a k-NN classifier is described in [1]. For off-line Devanagari handwritten characters, features based on the directional chain code information of the contour points of the characters are suggested in [2]; accuracies of 98.86% and 80.36% are reported for Devanagari numerals and characters respectively. Fuzzy model based recognition of handwritten Hindi numerals is proposed in [3]. The recognition is based on a modified exponential membership function fitted to the fuzzy sets derived from features consisting of normalized distances.

A zone and distance metric based feature extraction system is described in [4] for classification and recognition of Kannada and Telugu numerals using a centroid based approach. A recognition rate of 98% for Kannada and 86% for Telugu numerals is achieved. Recognition of isolated Bangla alphanumeric handwritten characters using a two stage feed forward neural network trained by the back-propagation algorithm is reported in [5]. Another neural network based approach for the recognition of Bangla handwritten numerals is described in [6]. An attempt to recognize Bangla characters is reported in [7].

A system for off-line recognition of unconstrained Oriya handwritten numerals is presented in [8]. Histograms of the direction chain code of the contour points of the numerals are used as features. A neural network based classifier achieves an accuracy of 94.81%.

An OCR system for handwritten Gujarati numerals is discussed in [9]. A multi layered feed forward neural network is proposed for classification of Gujarati handwritten digits. The features of the digits are abstracted from four different profiles of each digit. Thinning and skew correction are applied as preprocessing before classification. This work achieved a success rate of approximately 82% for Gujarati handwritten digit identification. Another approach by the same author, based on a hybrid feature extraction technique, is suggested in [11]. Structural and statistical features are used for identification of the numerals. An overall accuracy of 96.99% for handwritten Gujarati numerals is achieved using k-NN as the classifier.
Data set generation for Gujarati handwritten characters

No ready-to-use data set exists for Gujarati OCR, as this is one of the first attempts to recognize handwritten Gujarati characters. The authors collected samples of all characters of Gujarati script from more than 200 writers of different age groups and genders to form the data set. To make the data set stable and usable, some of the samples were collected from writers who did not know Gujarati script. Each collected data form was scanned at a resolution of 200 to 300 dpi using a flatbed scanner. Individual characters were separated from the input form image to build the collection. These separated characters were preprocessed at the next stage.
Preprocessing

An individual handwritten character differs from the other samples of the same character, as they are written by different people using different pens with different ink and tip sizes. Fig. 2 shows scanned images of the Gujarati numeral 6. One can observe the variations in thickness and size of each sample. Slant due to writing style is frequently found in handwriting, as can be seen in Fig. 2. As the handwriting of any two persons is hardly ever the same, the data set contains quite different images for the same character.

Fig. 2 Sample images for Gujarati numeral 6

Another such case is shown in Fig. 3 for a Gujarati character ka written by different writers. One can observe more variations in size, thickness, slant and formation of the character. It is also noticeable that the letter ka and the numeral 6, as shown in Fig. 2 have sufficient similarities to cause confusion while recognizing.

Fig. 3 Sample images for Gujarati character ka

Preprocessing is required to convert an input character image into a binary form that is noise free, smooth, and thin. Such images preserve the shape information with minimum storage requirements and help to improve the accuracy of the classification and recognition phase.

During the preprocessing phase, the image contrast is first adjusted to remove the effect of different colored inks and of thin pointed tips, which sometimes cause a character to appear light. The image intensities are adjusted using the adaptive histogram equalization algorithm. The noise introduced during digitization is removed using a two-dimensional adaptive Wiener filter. The Wiener filter is a low-pass filter that uses a pixel-wise adaptive Wiener method based on statistics estimated from a local neighborhood of each pixel: it estimates the local mean and variance around each pixel. In our case a 3 X 3 neighborhood of each pixel is used for filtering. The image is then converted into binary form using a threshold value.
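The adaptive Wiener step described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the function name `adaptive_wiener`, the edge-replicating padding and the use of the mean local variance as the noise estimate are assumptions, since the text only specifies a pixel-wise adaptive Wiener filter over a 3 X 3 neighborhood.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def adaptive_wiener(img, size=3):
    """Pixel-wise adaptive Wiener filter over a size x size neighborhood."""
    img = np.asarray(img, dtype=float)
    pad = size // 2
    # Edge replication (an assumption) keeps the output the same shape as the input.
    padded = np.pad(img, pad, mode="edge")
    windows = sliding_window_view(padded, (size, size))
    local_mean = windows.mean(axis=(-2, -1))
    local_var = windows.var(axis=(-2, -1))
    # The noise power is not given in the text; estimate it as the mean local variance.
    noise = local_var.mean()
    # Pixels whose local variance is near the noise level are pulled to the local mean.
    denom = np.maximum(np.maximum(local_var, noise), 1e-12)
    gain = np.maximum(local_var - noise, 0.0) / denom
    return local_mean + gain * (img - local_mean)
```

Flat regions are smoothed strongly, while regions with high local variance (such as stroke edges) are left closer to the input, which is why this kind of filter suits character images.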

Otsu's method is used to determine the global image threshold. This global threshold (level) is used to convert the intensity image to a binary image that can be used for feature extraction and recognition. The binary image is cropped to remove unwanted pixels surrounding the character. Fig. 4 shows the original input character image, the binary form of the image and the cropped image.
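As an illustration of the thresholding and cropping steps, the sketch below computes an Otsu threshold from the 8-bit histogram and crops the binary image to the bounding box of the ON pixels. The function names, the 8-bit input and the dark-ink-on-light-paper convention (`ink_is_dark`) are assumptions made for the example.

```python
import numpy as np


def otsu_threshold(gray):
    """Return the 8-bit level that maximizes the between-class variance."""
    pixels = np.asarray(gray, dtype=np.uint8).ravel()
    hist = np.bincount(pixels, minlength=256).astype(float)
    prob = hist / hist.sum()
    levels = np.arange(256)
    w0 = np.cumsum(prob)            # weight of the class "<= t"
    m0 = np.cumsum(prob * levels)   # cumulative (unnormalized) mean up to t
    total_mean = m0[-1]
    w1 = 1.0 - w0
    # Between-class variance for every candidate t; empty classes score 0.
    with np.errstate(divide="ignore", invalid="ignore"):
        between = (total_mean * w0 - m0) ** 2 / (w0 * w1)
    between = np.nan_to_num(between, nan=0.0, posinf=0.0, neginf=0.0)
    return int(np.argmax(between))


def binarize_and_crop(gray, ink_is_dark=True):
    """Threshold with Otsu's level, then crop to the bounding box of ON pixels."""
    gray = np.asarray(gray)
    t = otsu_threshold(gray)
    on = gray <= t if ink_is_dark else gray > t
    rows = np.flatnonzero(on.any(axis=1))
    cols = np.flatnonzero(on.any(axis=0))
    if rows.size == 0:
        return on  # nothing to crop in an empty image
    return on[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

For a scan with dark strokes on a light page, `binarize_and_crop(scan)` returns a boolean array in which `True` marks stroke pixels.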

Fig. 4 The input character and the cropped binary image for the character

All character images need to be converted to a standard size for further processing. Normalization converts each character image to a size of 40 X 40 pixels.
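A possible way to perform this normalization is shown below. The text does not state the interpolation method or whether the aspect ratio is preserved, so nearest-neighbor sampling that stretches the cropped image to the full square is an assumption, as is the function name `normalize_size`.

```python
import numpy as np


def normalize_size(binary, size=40):
    """Nearest-neighbor resize of a 2D binary array to size x size."""
    binary = np.asarray(binary)
    h, w = binary.shape
    # Map each output row and column back to a source index.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return binary[np.ix_(rows, cols)]
```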
Feature extraction

The features are divided into two categories: primary features and secondary features. The primary features are selected based on the characteristics of Gujarati script. These features are reasonably invariant to shape variations caused by different writing styles, size independent, easy to implement and fast to generate. They are structural features, used primarily to divide the set of basic characters into smaller manageable subsets using a binary tree classifier and secondarily to recognize a character at later stages.
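The structural features that enter the final feature vector are listed later in the paper as the number of objects in the character and in its left, right, lower and upper halves. A minimal sketch of how such counts could be obtained is given below; the 8-connectivity, the half boundaries at the image midpoint and the function names are assumptions, and the decision rules of the tree classifier itself are not reproduced here.

```python
from collections import deque

import numpy as np


def count_objects(binary):
    """Count 8-connected groups of ON pixels with a breadth-first flood fill."""
    binary = np.asarray(binary, dtype=bool)
    seen = np.zeros_like(binary)
    h, w = binary.shape
    count = 0
    for r in range(h):
        for c in range(w):
            if not binary[r, c] or seen[r, c]:
                continue
            count += 1
            seen[r, c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        inside = 0 <= ny < h and 0 <= nx < w
                        if inside and binary[ny, nx] and not seen[ny, nx]:
                            seen[ny, nx] = True
                            queue.append((ny, nx))
    return count


def structural_features(binary):
    """Object counts: whole image, left, right, lower and upper halves."""
    binary = np.asarray(binary, dtype=bool)
    h, w = binary.shape
    return [
        count_objects(binary),
        count_objects(binary[:, : w // 2]),  # left half
        count_objects(binary[:, w // 2:]),   # right half
        count_objects(binary[h // 2:, :]),   # lower half
        count_objects(binary[: h // 2, :]),  # upper half
    ]
```

A tree node could then test one of these counts (for example, whether the whole character consists of more than one object) to send a sample to one subset or another.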

Binary tree classifiers are among the popular classifiers. They were introduced nearly twenty five years ago and have been used by many researchers for classification of characters. Tree classifiers have been used in several OCR systems for scripts of Indian origin. Gurmukhi script recognition is described in [18] and [22]. A tree classifier is used for printed Oriya script in [19]. Bangla and Devanagari classifiers are described in [20] and [21]. The features used for recognition of Gujarati characters are described below.
Keeping in view the advantages and disadvantages of the binary classifier, a combination of a binary classifier and k-Nearest Neighbor is suggested. Once a character is classified into a particular subset, additional (secondary) features are derived, on which the final identification is based. The secondary features used in this study are described below.

i) The averaging feature: This feature is derived by applying the averaging principle to the binary image. The original image of size 40 X 40 is subdivided into 4 X 4 blocks, and the average of the ON pixel values in each block is used to form a feature. The image blocks are extracted in row major order, as shown in Fig. 6.

ii) The moment based features: Region moment representations interpret a normalized gray-level image function as a probability density of a 2D random variable. Assuming that non-zero pixel values represent regions, moments can be used for binary or gray-level transformations. For a digital image the central moments can be expressed as:

$$\mu_{pq} = \sum_{x} \sum_{y} (x - \bar{x})^{p} (y - \bar{y})^{q} f(x, y)$$

where $\bar{x}$, $\bar{y}$ are the co-ordinates of the region's center of gravity (centroid). These can be obtained using the following equations:

$$\bar{x} = \frac{m_{10}}{m_{00}} \quad \text{and} \quad \bar{y} = \frac{m_{01}}{m_{00}}$$

where $m_{pq} = \sum_{x} \sum_{y} x^{p} y^{q} f(x, y)$ are the ordinary (raw) moments. The central moments of up to order 3 can be obtained from the above equation by choosing $p, q = 0, 1, 2, 3$ such that $p + q \le 3$.

The normalized central moments, denoted by $\eta_{pq}$, are defined as

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}$$

where $\gamma = (p + q)/2 + 1$ for $p + q = 2, 3, \ldots$
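The definitions of the central and normalized central moments translate directly into code. The sketch below is illustrative only: the function names are hypothetical, x is taken as the column index and y as the row index, and the image is assumed to contain at least one ON pixel so that the zero-order moment is non-zero.

```python
import numpy as np


def central_moment(img, p, q):
    """mu_pq of a 2D array f(x, y), with x as column index and y as row index."""
    f = np.asarray(img, dtype=float)
    y, x = np.mgrid[: f.shape[0], : f.shape[1]]
    m00 = f.sum()
    x_bar = (x * f).sum() / m00  # m10 / m00
    y_bar = (y * f).sum() / m00  # m01 / m00
    return ((x - x_bar) ** p * (y - y_bar) ** q * f).sum()


def normalized_central_moment(img, p, q):
    """eta_pq = mu_pq / mu_00 ** gamma, with gamma = (p + q) / 2 + 1."""
    gamma = (p + q) / 2 + 1
    return central_moment(img, p, q) / central_moment(img, 0, 0) ** gamma
```

Because the centroid is subtracted before the powers are taken, moving a shape inside the image leaves `central_moment` unchanged, and dividing by a power of the zero-order moment makes `normalized_central_moment` approximately independent of the shape's size.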

Rotation invariance can be achieved if the coordinate system is chosen such that $\mu_{11} = 0$. Seven rotation, translation, and scale invariant moment characteristics can be derived from the second and third order moments. The moment values are calculated using the following set of equations.

$$M_1 = \eta_{20} + \eta_{02}$$

$$M_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$

$$M_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$$

$$M_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$$

$$M_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]$$

$$M_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})$$

$$M_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]$$

Here $M_1$ and $M_2$ are second-order moments, as $p + q = 2$ for them. The remaining moments involve third-order terms, for which $p + q = 3$. The moment $M_7$ is a skew invariant moment and is used to distinguish mirror images. The values of the seven moments $M_1, M_2, \ldots, M_7$ are used to create a feature vector for a given character image. The character image of size 40 X 40 is segmented into four equal sub-images of size 20 X 20 each, and the moment features are calculated for each sub-image. This process produces a total of 7 X 4 = 28 different values for each character.

iii) The centroid distance based features: The fourth feature is derived using the centroid distance function. The thinned, resized and cropped binary image of the character is segmented into 64 equal sub-images. For all ON pixels in each sub-image, the distance from the centroid of the original character image is computed. From these distances an average distance is computed for each sub-image, i.e. each sub-image has one average distance value if at least one of its pixels is ON. Thus a total of 64 average values are computed per character image. This collection of 64 average values is used as one of the features for character recognition. If a zone is empty, i.e. it has no ON pixels, its value in the feature vector is zero.

These secondary features are used to recognize an individual character using k-NN.
Character Recognition

To eliminate the limitations of the binary tree classifier, especially for noisy characters, the tree based approach is used only to derive subsets of characters and not to recognize a character, thus reducing the probability of incorrect recognition. A hybrid approach based on subdivision of the character set into subsets using a tree classifier and recognition using k-NN is developed for Gujarati handwritten OCR.

The feature vector that is used to recognize a character is a combination of the various features discussed above. It is formed using 5 elements based on structural features (namely the objects in the character and in the left half, right half, lower half and upper half of the character), 28 elements based on invariant moments, 25 elements based on average pixels and 64 elements based on centroid distance. Thus a feature string of 122 elements in total is used to recognize an individual character.
The proposed method has provided an accuracy of 63.1% for the given data set.
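The zone based parts of the feature vector and the final k-NN step can be sketched as follows. Several details are assumptions rather than facts from the paper: blocks of 8 X 8 pixels are used for the averaging feature because that yields the 25 values counted in the feature vector (the text itself mentions 4 X 4 blocks), the k-NN distance is Euclidean with k = 3, and the hypothetical `candidates` argument stands for the subset of labels handed over by the tree classifier.

```python
import numpy as np


def block_average_features(img, block=8):
    """Mean ON-pixel value of each block x block tile, in row-major order."""
    f = np.asarray(img, dtype=float)
    h, w = f.shape
    tiles = f.reshape(h // block, block, w // block, block)
    return tiles.mean(axis=(1, 3)).ravel()


def centroid_distance_features(img, zones=8):
    """Average distance of the ON pixels in each zone from the image centroid."""
    on = np.asarray(img, dtype=bool)
    h, w = on.shape
    ys, xs = np.nonzero(on)
    feats = np.zeros(zones * zones)  # empty zones keep the value 0
    if ys.size == 0:
        return feats
    dist = np.hypot(ys - ys.mean(), xs - xs.mean())
    zone_id = (ys * zones // h) * zones + (xs * zones // w)
    sums = np.bincount(zone_id, weights=dist, minlength=zones * zones)
    counts = np.bincount(zone_id, minlength=zones * zones)
    np.divide(sums, counts, out=feats, where=counts > 0)
    return feats


def knn_predict(train_x, train_y, query, k=3, candidates=None):
    """Majority vote among the k nearest training vectors (Euclidean distance)."""
    train_x = np.asarray(train_x, dtype=float)
    train_y = np.asarray(train_y)
    if candidates is not None:
        # Restrict the search to the subset chosen by the tree classifier.
        keep = np.isin(train_y, list(candidates))
        train_x, train_y = train_x[keep], train_y[keep]
    d = np.linalg.norm(train_x - np.asarray(query, dtype=float), axis=1)
    nearest = train_y[np.argsort(d, kind="stable")[:k]]
    labels, votes = np.unique(nearest, return_counts=True)
    return labels[np.argmax(votes)]
```

In a full pipeline the two zone based vectors would be concatenated with the 5 structural counts and the 28 moment values to give the 122-element vector passed to `knn_predict`.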
Results

The results are shown in Table 2. As this is the first attempt to identify the handwritten basic characters of the Gujarati script, the accuracy is acceptable. One reason for the low accuracy is the similarity of some characters, such as the digit 5 and the basic character pa, the digit 2 and the character ra, the characters i and i_bar, the characters dha and gha, and the digit 4 and the character ja. These confusing characters reduce the recognition accuracy.

Table 2 Results of k-NN for handwritten Gujarati characters recognition
Conclusions

This is one of the few attempts to address Optical Character Recognition for Gujarati handwritten characters. The structural features selected for recognition are easy to obtain and are used for the first time for Gujarati script. The other statistical features are also applied to Gujarati script for the first time. In designing the algorithms, simplicity, ease of implementation and speed were considered in order to obtain a fast and accurate solution to the OCR problem. A hybrid approach using a binary tree and k-NN is also suggested for the first time for Gujarati handwritten characters. An overall recognition accuracy of 63.1% is obtained. The results are satisfactory given that, compared with other Indian scripts in printed and handwritten form, this research area is only just beginning to be explored. The work will hopefully be a stepping stone for future research on Gujarati or similar Indian languages.
References
[1] V.L. Lajish, T.T.K. Suneesh and N.K. Narayanan, Recognition of Isolated Handwritten Character Images using Kolmogorov-Smirnov Statistical Classifier and K-nearest Neighbour Classifier, Proc. of the International Conference on Cognition and Recognition, pp. 526-531, 2005.
[2] N. Sharma, U. Pal, F. Kimura and S. Pal, Recognition of Off-Line Handwritten Devnagari Characters Using Quadratic Classifier, in P. Kalra and S. Peleg (Eds.), ICVGIP 2006, LNCS 4338, Springer-Verlag Berlin Heidelberg, pp. 805-816, 2006.
[3] M. Hanmandlu and O.V. Ramana Murthy, Fuzzy Model Based Recognition of Handwritten Hindi Numerals, Science Direct, Pattern Recognition Volume 40, Issue 6, Proc. of the International Conference on Cognition and Recognition, pp. 1840-1854, June 2007.
[4] S.V. Rajashekararadhya and P.V. Ranjan, Neural network based handwritten numeral recognition of Kannada and Telugu scripts, TENCON 2008, IEEE Region 10 Conference, pp. 1-5, Nov. 2008.
[5] A. Dutta and S. Chaudhury, Bengali alpha-numeric character recognition using curvature features, Pattern Recognition, Vol. 26(12), pp. 1757-1770, 1993.
[6] U. Bhattacharya, T. K. Das, A. Datta, S. K. Parui and B. Chaudhuri, A hybrid scheme for handprinted numeral recognition based on a self-organizing network and MLP classifiers, International Journal on Pattern Recognition and Artificial Intelligence, Vol. 16(7), pp. 845-864, 2002.
[7] U. Pal and S. Datta, Segmentation of Bangla unconstrained handwritten text, Proceedings of 7th ICDAR, pp. 1128-1132, 2003.
[8] K. Roy, T. Pal, U. Pal and F. Kimura, Oriya handwritten numeral recognition system, Proceedings of ICDAR, pp. 770-774, 2005.
[9] Apurva A. Desai, Gujarati handwritten numeral optical character reorganization through neural network, Pattern Recognition Volume 43, Issue 7, pp. 2582-2589, July 2010.
[10] Jignesh Dholakia, Atul Negi and S. Rama Mohan, Zone Identification in the Printed Gujarati Text, Proceedings of Eighth International Conference on Document Analysis and Recognition (ICDAR05), pp. 272-276, 2005.
[11] Apurva A. Desai, Handwritten Gujarati Numeral Optical Character Recognition using Hybrid Feature Extraction Technique, Proceedings of International Conference on Image Processing, Computer Vision & Pattern Recognition, IPCV10, pp. 733-739, 2010.
[12] B. V. Dasarathy, Nearest neighbor pattern classification techniques, IEEE Computer Society Press, New York, 1991.
[13] Anilkumar N. Holambe and Ravinder C. Thool, Printed and Handwritten Character & Number Recognition of Devanagari Script using SVM and KNN, International Journal of Recent Trends in Engineering and Technology, Vol. 3, No. 2, pp. 163-166, 2010.
[14] Sanghamitra Mohanty and Himadri Nandini Das Bebartta, Performance Comparison of SVM and K-NN for Oriya Character Recognition, International Journal of Advanced Computer Science and Applications (IJACSA), Special Issue on Image Processing and Analysis, pp. 112-116, 2011.
[15] B.V. Dhandra, Mallikarjun Hangarge and Gururaj Mukarambi, Spatial Features for Handwritten Kannada and English Character Recognition, International Journal of Computer Applications, Special Issue on Recent Trends in Image Processing and Pattern Recognition, pp. 146-151, 2010.
[16] Cheng-Lin Liu, Kazuki Nakashima, Hiroshi Sako and Hiromichi Fujisawa, Handwritten digit recognition: benchmarking of state-of-the-art techniques, Pattern Recognition Vol. 36, pp. 2271-2285, 2003.
[17] Dinesh Acharya U., N. V. Subba Reddy and Krishnamoorthi Makkithaya, Multilevel Classifiers in Recognition of Handwritten Kannada Numerals, World Academy of Science, Engineering and Technology Vol. 42, pp. 278-283, 2008.
[18] G. S. Lehal and Chandan Singh, A Gurmukhi Script Recognition System, Proceedings of 15th ICPR, Vol. 2, pp. 557-560, 2000.
[19] B. B. Chaudhuri, U. Pal and M. Mitra, Automatic recognition of printed Oriya script, Sādhanā Vol. 27, Part 1, pp. 23-34, February 2002.
[20] B. B. Chaudhuri and U. Pal, A Complete Printed Bangla OCR System, Pattern Recognition, Vol. 31, pp. 531-549, 1998.
[21] B. B. Chaudhuri and U. Pal, An OCR System To Read Two Indian Language Scripts: Bangla And Devnagari (Hindi), Proceedings of Fourth International Conference on Document Analysis and Recognition, IEEE Computer Society Press, pp. 1011-1016, 1997.
[22] G. S. Lehal and Chandan Singh, A Complete OCR System For Gurmukhi Script, Proceedings of SSPR 2002, Lecture Notes in Computer Science, Vol. 2248, Springer-Verlag, Germany, pp. 344-352, 2002.
[23] M. K. Hu, Visual Pattern Recognition by Moment Invariants, IRE Transactions on Information Theory, IT-8, pp. 179-187, 1962.
[24] R. J. Prokop and A. P. Reeves, A Survey of Moment-Based Techniques for Unoccluded Object Representation and Recognition, CVGIP: Graphical Models and Image Processing, 54(5), pp. 438-460, 1992.
