CSC400: Chinese Handwriting Detection/Project Progress


The first month of work consisted mainly of background research to get a sense of what has been accomplished in the area of Chinese handwriting detection and recognition.

Week 00: 09.08-09.15

I chose the subject of the project and began reading articles describing related research while waiting for a workstation to be set up.

Week 01: 09.16-09.22

I continued the background research, still waiting for the computer to be set up.

Week 02: 09.23-09.29

I began reading related code to be utilized in the project, continued reading relevant articles, and began exploring the databases of Chinese handwriting to be used.


The second month of the project was highly exploratory and still in the preparation stage. I processed the databases and performed informal experiments in order to devise a formal experiment plan.

Week 03: 09.30-10.06

I created the wiki, explored existing code, and continued to examine the CASIA databases' contents. After meeting, I began writing code for handling CASIA database contents.

Week 04: 10.07-10.13

-Fall Break-

I continued writing code to extract the images from the CASIA databases.

The CASIA 1.0-1.2 Offline Handwriting databases use the *.gnt file format, so I had to write a MATLAB program to access the images within the files. In a *.gnt file, each image of an individual handwritten character is stored together with header information giving the image height and width, as well as the GB character code of the Chinese character that the image represents.
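The record structure described above can be sketched as a small reader. This is a minimal Python sketch, not the MATLAB program written for the project. The exact byte layout is an assumption based on how *.gnt files are commonly described (a 10-byte little-endian header holding the record size, the two GB code bytes, the width, and the height, followed by one grayscale byte per pixel); the name read_gnt is hypothetical.

```python
import io
import struct

# Assumed header layout: 4-byte record size, 2-byte GB code,
# 2-byte width, 2-byte height (all little-endian).
HEADER_SIZE = 10


def read_gnt(stream):
    """Yield (character, width, height, pixels) for each record in a *.gnt stream."""
    while True:
        header = stream.read(HEADER_SIZE)
        if not header:
            return  # clean end of file
        if len(header) < HEADER_SIZE:
            raise ValueError("truncated record header")
        (record_size,) = struct.unpack("<I", header[:4])
        width, height = struct.unpack("<HH", header[6:10])
        if record_size != HEADER_SIZE + width * height:
            raise ValueError("record size does not match image dimensions")
        # The two code bytes are decoded directly as a GB2312 character.
        character = header[4:6].decode("gb2312", errors="replace")
        # Assumed: one grayscale byte per pixel, stored row by row.
        pixels = stream.read(width * height)
        if len(pixels) < width * height:
            raise ValueError("truncated image data")
        yield character, width, height, pixels
```

With a real file, the generator would be driven by something like `for character, width, height, pixels in read_gnt(open(path, "rb"))`, where `path` points at one of the *.gnt files.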

Week 05: 10.14-10.20

I extracted 419 single-character handwriting libraries (CASIA 1.0). I then reevaluated the next steps for the project and decided to work on whole passages at a time, rather than single characters. Nick helped write code to extract whole passages from the *.dgr files in the CASIA database, and I extracted images from these sets.

Trouble files:

  • 1.0: 409
  • 1.2: 767
  • 2.0/Train: 385 (P20)

I also began rereading the ICDAR articles with the aim of devising an experimental method based on those described in the papers.

Week 06: 10.21-10.27

I made some minor changes to the CASIA 2.0-2.2 database information saved after discussing optimal conditions with Nick (changed image file format, saved *.mat files for data).

I then began informal experimentation, mainly looking for character matches on whole pages from the CASIA database. Through informal experiments I familiarized myself with time constraints and issues with potential approaches we may take in experimentation (e.g. binarization methods, character size differences, file size constraints). Using knowledge from this exploration and from other readings of similar experimental approaches, I will formulate a formal experimentation procedure.


Week 07: 10.28-11.03

I continued the tasks that I began at the end of the previous week. I also developed a function that, given a character image and a page *.mat file, highlights occurrences of the same character on top of the PSM match results.


Things still to do (want to discuss for suggestions of ways to do these):

  • Refine normalization method and compare PSM results
  • Develop a way of aggregating PSM page match results to yield a reasonable confusion matrix

Week 08: 11.04-11.10

Work was interrupted this week due to illness and a kidney stone!

Week 09: 11.11-11.17

This week, I developed a reverse match system for finding a single character within a whole page. First, the program performs a regular page match as above and keeps this as the one-way page match. Then, for each character on the page (with locations given in the CASIA data), the program takes the image of the page character and matches it against the single character. Currently, the program simply takes the best reverse match value and averages it over the entire character location in the one-way page match results. This is a very primitive method that can only be used when the character locations are known, so it is not helpful in a real application. However, it is useful for testing the basic performance of the recognition model.

Notice the significant difference in the same results as above after adding the simple reverse match:


The two way page match also calculates the precision and recall of the results, given a threshold for the match.
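As a minimal sketch of this calculation (in Python rather than the project's MATLAB), precision and recall at one threshold can be computed as below. The function name and arguments are hypothetical, and the sketch assumes that match values behave like costs, so a value at or below the threshold counts as a predicted match; the page does not state the direction of the values.

```python
def precision_recall(values, is_same_char, threshold):
    """Return (precision, recall) for one threshold.

    values        -- match value for each character location on the page
    is_same_char  -- True where the page character really is the query character
    threshold     -- a location is a predicted match when value <= threshold
                     (assumption: lower values mean better matches)
    """
    true_pos = false_pos = false_neg = 0
    for value, same in zip(values, is_same_char):
        predicted = value <= threshold
        if predicted and same:
            true_pos += 1
        elif predicted:
            false_pos += 1
        elif same:
            false_neg += 1
    # Convention used here: precision is 1.0 when nothing is predicted,
    # and recall is 0.0 when the page contains no true matches.
    predicted_total = true_pos + false_pos
    actual_total = true_pos + false_neg
    precision = true_pos / predicted_total if predicted_total else 1.0
    recall = true_pos / actual_total if actual_total else 0.0
    return precision, recall
```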

Week 10: 11.18-11.24

After meeting with Nick, I learned that the best way to aggregate the results of the two directions of the match is to take the maximum of the two minimum values for the matches. After this, I made slight changes to code organization and added this new way of aggregating results. The two way match function now returns two arrays, one with the match values for positive matches and the other with the match values for negative matches. From this data, the recall/precision curve can be extracted. I also added parallelization to the page match so that 6 lines of each page are processed at a time for the reverse match.
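The aggregation rule and the curve extraction can be sketched roughly as follows. This Python sketch is one reading of the rule, not the project's MATLAB code: it assumes each direction produces a set of match values where lower is better, takes the minimum (best) value per direction, and keeps the larger (worse) of the two, so a location scores well only if both directions agree. The function names are hypothetical.

```python
def combine_two_way(forward_values, reverse_values):
    """Maximum of the two per-direction minimum match values (assumed reading)."""
    return max(min(forward_values), min(reverse_values))


def recall_precision_curve(positive_values, negative_values):
    """Return a list of (recall, precision) points, one per ranked match value.

    positive_values -- match values for locations that truly hold the character
    negative_values -- match values for all other locations
    Values are ranked from best to worst, assuming lower is better.
    """
    if not positive_values:
        raise ValueError("at least one positive match value is required")
    ranked = sorted(
        [(value, True) for value in positive_values]
        + [(value, False) for value in negative_values],
        key=lambda pair: pair[0],
    )
    curve = []
    true_pos = false_pos = 0
    for _, is_positive in ranked:
        if is_positive:
            true_pos += 1
        else:
            false_pos += 1
        recall = true_pos / len(positive_values)
        precision = true_pos / (true_pos + false_pos)
        curve.append((recall, precision))
    return curve
```

Each point corresponds to placing the threshold just after one more ranked value, which is why no explicit list of thresholds is needed.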

Just before leaving for Thanksgiving break, I ran the matching for 8 very common Chinese characters for all 5 pages of writing of 420 writers (CASIA 1.0/2.0 data) with the same writer for character and page in the match. I stored the match results in *.mat files for analysis after break.

-Thanksgiving Break-

Week 11: 11.25-12.01

First, I reviewed the performance of the tests that ran over break. It turned out that 1274 runs of the matching did not yield results. Analyzing the errors, I found that most of the failures occurred in the reverse match when the page characters were too small.

I wrote a program to extract the overall rankings of the positive matches from the tests run over break, for use with a rank>precision curve program Nick wrote. I also went back and added a condition so that reverse matches are performed on page characters only if they are larger than 15x15 pixels. By the end of the week, I had precision/recall curve data for 7 characters with all 420 writers in the 1.0/2.0 CASIA handwriting database.
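The two steps in this paragraph can be illustrated with a short Python sketch (the project code itself was in MATLAB). The helper names are hypothetical, "larger than 15x15" is read here as both sides strictly exceeding 15 pixels, and the ranking again assumes that lower match values are better.

```python
MIN_SIDE = 15  # page characters must exceed this in both dimensions


def large_enough(width, height, min_side=MIN_SIDE):
    """True when a page character is big enough for a reverse match."""
    return width > min_side and height > min_side


def positive_ranks(positive_values, negative_values):
    """Return the 1-based ranks of the positive matches among all match values.

    Values are sorted from best to worst, assuming lower is better.
    """
    ranked = sorted(
        [(value, True) for value in positive_values]
        + [(value, False) for value in negative_values],
        key=lambda pair: pair[0],
    )
    return [rank for rank, (_, is_positive) in enumerate(ranked, start=1) if is_positive]
```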


Week 12: 12.02-12.08

This week, I extracted recall/precision curves from the single-character match data. I averaged the recall/precision results writer-wise, character-wise, and overall. Below is the average recall/precision curve for PSM same-writer single-character matching over 2,401 data points using the CASIA 1.0/2.0 handwriting database.
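The page does not say how the curves were averaged, so the following Python sketch shows only one common approach, stated here as an assumption: each curve is resampled at fixed recall levels using interpolated precision (the best precision reached at that recall or higher), and the resampled curves are then averaged per group. The record layout (writer, character, curve) and all function names are hypothetical.

```python
from collections import defaultdict

RECALL_LEVELS = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0


def interpolate(curve, levels=RECALL_LEVELS):
    """Resample a list of (recall, precision) points at fixed recall levels."""
    resampled = []
    for level in levels:
        candidates = [precision for recall, precision in curve if recall >= level]
        # If the curve never reaches this recall level, record precision 0.0.
        resampled.append(max(candidates) if candidates else 0.0)
    return resampled


def average_curves(resampled_curves):
    """Average several resampled curves level by level."""
    count = len(resampled_curves)
    return [sum(column) / count for column in zip(*resampled_curves)]


def average_by(results, key_index, levels=RECALL_LEVELS):
    """Group (writer, character, curve) records by one field and average each group.

    key_index -- 0 to average writer-wise, 1 to average character-wise
    """
    groups = defaultdict(list)
    for record in results:
        groups[record[key_index]].append(interpolate(record[2], levels))
    return {key: average_curves(curves) for key, curves in groups.items()}
```

The overall average is then `average_curves` applied to every resampled curve at once, without grouping.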

Total RP curve.png

Here are the recall/precision curves for the individual character matches (hover over the image for alt-text containing the Chinese character the recall/precision curve pertains to):


Week 13: 12.09-12.15

I did the final wrap-up and testing before finals. During the last two weeks of the semester (the last week of classes and finals week), I researched ROC curves and how to interpret them. This is a subject with which I am not very familiar, so the new information at the end of the semester was a little difficult to internalize quickly. For the finals period, I produced a presentation-style poster giving an overview of the semester's work, an attempt at results analysis, and a self-reflection.