Chinese Handwriting Recognition — Stephanie Xie



     A vast amount of information is locked inside historical manuscripts. A typed document is easy to search for specific phrases, but a handwritten one is not: it is typically stored as scanned images, and producing full transcripts of those images is both time-consuming and costly. An approach called word spotting lets the user retrieve the desired information by identifying occurrences of a query word within a collection. One of the underlying techniques in word spotting is part-structured inkball models, which are the basis of Professor Nicholas Howe’s current handwriting research.

     However, Prof. Howe's method, while effective for languages written in the Roman alphabet, is less efficient for Chinese because a single Chinese character contains too many components. Instead, I use the responses from character-radical matches as input to a machine learning algorithm such as boosting or support vector machines.

STRIDE 2015-2016

     Chinese characters are combinations of basic components called radicals, so the first thing I did was find a complete list of Chinese radicals; that way, the program only has to recognize a fixed set of radicals rather than thousands of distinct characters. Because Chinese writing has two forms, traditional and simplified, I downloaded the database of handwritten Chinese characters to be used (CASIA-HWDB) and verified that its characters were simplified. I then wrote out all 275 radicals and scanned them so that they could be used to build models.

We then decided to train a multiclass SVM classifier.
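
As a rough illustration of how radical match responses could feed such a classifier, the sketch below builds one feature vector per character by taking the strongest response of each radical in a window around the character's root point. The response maps here are random stand-ins, and the names responses, roots, halfWin, and the max-pooling rule itself are assumptions made for illustration; the project's own feature construction may differ.

```matlab
% Hypothetical sketch: pool radical match responses into feature vectors.
nRadicals = 275;        % number of radical models (from the text)
pageSize  = [60 90];    % assumed page height and width in pixels
halfWin   = 10;         % assumed half-width of the pooling window

% Stand-in response maps: one H-by-W score map per radical.
responses = rand([pageSize nRadicals]);

% Assumed root points, one [row col] pair per character on the page.
roots  = [20 20; 20 60; 45 80];
nChars = size(roots, 1);

% One column per character, one row per radical.
fv = zeros(nRadicals, nChars);
for k = 1:nChars
    % Clip the window to the page boundaries.
    r = max(1, roots(k,1) - halfWin) : min(pageSize(1), roots(k,1) + halfWin);
    c = max(1, roots(k,2) - halfWin) : min(pageSize(2), roots(k,2) + halfWin);
    window = responses(r, c, :);
    % Strongest response of each radical inside the window.
    fv(:, k) = reshape(max(max(window, [], 1), [], 2), nRadicals, 1);
end

% Transpose so that rows are observations, the layout fitcecoc and predict expect.
fm = fv';
```

With this layout, each row of fm describes one character and each column corresponds to one radical, so a character's row can be read as "how strongly did each radical match near this character".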

Functions Written

  • buildRadicals.m: This function builds part-structured inkball models for all 275 radicals.
@params foldername The name of the folder containing the written radical images
@returns Part-structured models of all the radicals
  • radicalSearch8.m: This function searches the image for every instance of each radical, one radical at a time, for all 275 radicals. Radicals with fewer strokes produce more matches, since they may also form part of another radical (e.g. 一).
@params foldername The name of the folder containing the training data
@params path The path to the specified folder
@returns Character-radical matches
Figure: Page of handwritten Chinese characters
Figure: Matches for radical 一
Figure: Matches for radical 廴
  • boundingBox.m: This function calculates the root point of each character on the page from the character's bounding box. Both the bounding boxes and the character labels come from the CASIA database. The labels are encoded in GB2312.
@params foldername The name of the folder containing the training data
@params path The path to the specified folder
@returns roots The root points of each character
@returns cvl The labels of all the characters
GB Code Table
  • CreateChinesePsmFeatureVectors.m: This function creates feature vectors for the training data from the radical match responses, then converts them into a matrix of predictor data. The same function is used to create feature vectors for the test data, which are passed to the trained model for prediction.
@params foldername The name of the folder containing the training/test data
@params path The path to the specified folder
@returns fv Feature vectors
@returns fm Training matrix
@returns roots Root of each character
@returns cvl3 Character labels concatenated into 1 string
@returns C Unique character labels from cvl (no repeats)
@returns ic For each character, the index of its label in C
To check for correctness: Choose a random row of ic (e.g. row 15, which holds 162). That value is the row of C that holds the character's label (e.g. row 162 -> 203181). Now look at the same row of cvl3 that was chosen for ic; it should contain the label found in C (e.g. row 15 -> 203181).
  • Creating a multiclass model: model = fitcecoc(fm, cvl3);
  • Predicting the test data: Run CreateChinesePsmFeatureVectors.m on the test data, then use its (transposed) feature vectors to predict.
label = predict(model, fvTest{1}'); % predicts the first page of the test data
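
The manual correctness check on C and ic described in this list can also be run over every row at once. The sketch below assumes that C and ic are produced by MATLAB's unique function, which their descriptions suggest but the page does not state. The label values are made up, apart from the example code 203181, and the variable cvl stands in for whatever label vector the real function builds.

```matlab
% Hypothetical labels for eight characters on a page (numeric codes).
cvl = [203181; 176161; 203181; 190253; 176161; 203181; 190253; 176161];

% C holds each distinct label once, in sorted order;
% ic maps every character to the row of its label in C.
[C, ~, ic] = unique(cvl);

% Spot check, as in the manual procedure: pick one row and follow the index.
row = 3;
labelFromC = C(ic(row));
assert(labelFromC == cvl(row));

% Exhaustive check: indexing C by ic must reproduce the full label list.
assert(isequal(C(ic), cvl));
```

If either assertion fails on real data, the labels and the index were most likely built from different orderings of the characters, which would also misalign the rows of the training matrix with their labels.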

SURF 2016

Background Information

  • Part-structured inkball models
    • Used for photographic object recognition
    • Detects parts (inkballs) arranged in an approximate spatial configuration
    • Model = closely spaced inkballs forming a curve (tree structure)
  • Radicals - character components that indicate meaning or sound