Syriac Genesis Character Registration

From CSclasswiki

-By Faith Kim, Wenqin Chen, Minyue Dai, Mulangma "Isabella" Zhu

Friday, February 24th, 2017

This is our second meeting of the semester. Our goals for the next meeting are to familiarize ourselves with MATLAB, to understand the research better by reading the two assigned papers, and to get to know the Syriac code by running it. So far, we have successfully loaded the images of the Syriac lines in the initGenesis.m file. We also learned about a useful function called bwdist: bwdist(cellarray{i}) computes, for every pixel of a binary image, the distance to the nearest nonzero pixel, and displaying the result with the colorbar command gives a relative sense of the distance between the page and the words, and between the separate letters themselves. Faith was a little confused about the difference between a matrix and a cell array; Minyue explained it, and we concluded that cell arrays are generally more flexible than matrices.
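
To make the idea behind bwdist concrete, here is a minimal brute-force sketch of a Euclidean distance transform. It is written in Python rather than MATLAB and is not the MATLAB implementation; the function name distance_transform and the tiny test image are made up for illustration.

```python
import math

def distance_transform(bw):
    """For every pixel, return the Euclidean distance to the nearest nonzero pixel.

    bw is a list of rows of 0/1 values. Brute force: fine for tiny images only.
    """
    # Coordinates of every nonzero pixel in the image.
    nonzero = [(r, c) for r, row in enumerate(bw) for c, v in enumerate(row) if v]
    return [
        [
            # default=inf covers the case of an image with no nonzero pixel
            min((math.hypot(r - nr, c - nc) for nr, nc in nonzero), default=math.inf)
            for c in range(len(row))
        ]
        for r, row in enumerate(bw)
    ]

# Hypothetical 3x3 image with a single nonzero pixel in the center.
bw = [[0, 0, 0],
      [0, 1, 0],
      [0, 0, 0]]
dist = distance_transform(bw)
print(dist[1][1], dist[0][1])  # 0.0 1.0
```

Pixels on a stroke get distance 0, and the values grow as you move away from the nearest stroke, which is what the colorbar display makes visible.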

One of the biggest challenges we are trying to overcome is updating the file paths that are written into our MATLAB code. Files have been moved and rearranged, so a lot of data appears to be missing. We are currently trying to run the GenesisWork.m file, but we get an error because the variable "vwords" does not exist. We spent a solid 30 minutes looking for the 30-page raw Syriac manuscript (.png files), and we finally located it on a different computer in the lab under "Julia." Previous students may have moved the data around, so it is no longer saved on the computer we are working on. The lesson for us is to keep all of our data in neat, consistent file paths so that future students who work on this project can easily find what they need.

We are looking forward to the rest of the semester :)

Wednesday March 1st, 2017

We resolved the problems we had last week. We had to use Set Path in MATLAB to add more directories so that functions we were missing before, such as cropautopage, could be found. We have run the GenesisWork.m file, and we can now see figures comparing the raw manuscript with its binarized form. We believe that GenesisWork may also save all the binarized images in some other directory. Next, we should find and read the functions that chop the pages into words and figure out whether we can reverse the procedure to recover the page and coordinates of each given word image. The end goal is to create training and testing data for reading a whole page rather than individual Syriac words. On Friday, we will discuss this possible procedure with Professor Howe.

Monday March 3rd, 2017

Today we settled on our goal: to chop more lines into words and pair them with the correct transcripts. We found that wimg2 is the data file for both words and sentences. To find the index that divides the two, we calculated the difference in the number of columns between each pair of consecutive images. Starting at 3574, all the differences are zero, which means that all the sentences have equal length and that the sentences start at index 3574. The chopped words have some problems: they are not correctly matched to the transcripts, which we should fix in the future. We should also learn how each Syriac letter maps to a Latin letter. We are also still confused about the order of the transcripts.
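
The width-difference check can be sketched as follows. This is a Python illustration under stated assumptions, not our MATLAB code: the name find_division_index and the widths in the example are made up, and the index is reported 1-based to match MATLAB.

```python
def find_division_index(widths):
    """Return the 1-based index where the trailing run of equal widths begins.

    widths[k] is the number of columns of image k+1. The sentence images all
    share one width, so every consecutive difference inside that run is zero.
    """
    # diffs[k] compares image k+1 with image k+2 (1-based numbering).
    diffs = [b - a for a, b in zip(widths, widths[1:])]
    start = len(widths)
    # Walk backwards while the preceding difference is zero.
    while start > 1 and diffs[start - 2] == 0:
        start -= 1
    return start

# Hypothetical widths: four word images, then four full-width line images.
widths = [40, 55, 31, 62, 300, 300, 300, 300]
print(find_division_index(widths))  # 5
```

Applied to the column counts of the real images, the same check is what gave us 3574.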

Friday March 24th, 2017

-Minyue, Isabella, Wenqin, Faith

Today, we decided to split up the work of looking through each wimg to see whether its transcript matches the word. If it does not match, we will index into wtag and write in the new, more accurate transcript. We are currently looking through the visualization of wimg and wtag in GenesisWork, trying to find the index at which the transcript starts to be mismatched with the wimg. Some of the Syriac letters were hard to distinguish, such as the ALAPH, and sometimes the letters transcribed as Latin H versus E; they can look very similar in the images. We will mark such transcript/wimg pairs with a "*" so that we can come back to them and give them the correct Latin wtag.

We have decided to use Minyue's newly written choplines function, which automatically chops the lines into wimg (words) according to the spaces in the lines; we will manually check that the lines have been chopped accurately. If a line is not chopped accurately, we will use the default choplines provided to us so that we can chop the line into words by hand.

Each person will be responsible for checking the transcripts of at least 80 words.

Wednesday April 19th 2017

-Minyue, Wenqin, Faith

This week, we found the index at which the words and tags become mismatched, and concluded that the way previous students chopped the lines is the problem. We therefore decided to chop the lines automatically based on statistical analysis, such as a histogram, and we checked that the automatic slicing of lines into words works better. However, we still need to chop some lines/words by hand, because the spacing threshold varies and is sometimes so small that the automatic chopping function cannot chop the line correctly. We therefore developed two main functions: matchindex and handChop. matchindex takes a sequence of tags written by the user and tries to locate that sequence in the transcript, so we can use it to find the correct position of the tags we need for the words. handChop takes an array of words and the index of a long word that needs to be chopped, lets us chop that word by hand, and returns a new array of words with the hand-chopped words in it.

As a result, our procedure for finding the tags is as follows. First, we take a given number of pre-processed words, read them, and pick a short sequence of words for which we are sure our own transcription is correct. Then, we use this sequence to locate the correct position of the tags in the variable wtag. Finally, we find the long words among these words and chop them by hand, based on the tags.
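
The tag-locating step can be sketched roughly like this. It is a simplified Python illustration, not the actual matchindex.m: the tags in the example are invented, positions are 0-based with an exclusive end, and the bound parameter of the real function is left out.

```python
def match_index(mytrans, trans, ista=0, iend=None, precision=1.0):
    """Return every start position in trans[ista:iend] where mytrans fits.

    A position counts as a match when the fraction of tags that agree,
    compared element by element, is at least `precision` (1.0 = exact match).
    """
    if iend is None:
        iend = len(trans)
    n = len(mytrans)
    starts = []
    if n == 0:
        return starts
    for s in range(ista, iend - n + 1):
        # Count tags that agree at the same relative position.
        agree = sum(1 for a, b in zip(mytrans, trans[s:s + n]) if a == b)
        if agree / n >= precision:
            starts.append(s)
    return starts

# Invented tags standing in for the real transcript in wtag.
transcript = ["AB", "CD", "EF", "GH", "IJ", "KL", "MN"]
print(match_index(["EF", "GH", "IJ"], transcript))                 # [2]
print(match_index(["EF", "GH", "IX"], transcript))                 # []
print(match_index(["EF", "GH", "IX"], transcript, precision=0.6))  # [2]
```

Lowering the precision lets a sequence with one misread tag still find its place, at the cost of possibly returning more than one candidate position.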

Friday April 21st 2017

-Minyue, Wenqin, Faith

Today, we worked on the first 100 word images and aligned all of them with their tags. We examined all of the long words, matched them against the existing tags, and used the handChop function to chop them manually, after which they were automatically aligned with the tags. We ran into one small problem: there was one extra tag. Instead of deleting that tag, we used a placeholder function Minyue had written, which creates a black image for such cases.

Next, we will split up the workload and repeat the same procedure as explained above.
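
The placeholder idea mentioned in this entry can be sketched in a few lines. This is a Python illustration, not Minyue's placehold.m; it assumes images are lists of rows, that 0 means black, and that the placeholder takes the size of the image it follows. All three are assumptions made for the example.

```python
def placehold(imgarray, index):
    """Return a new list with an all-black image inserted after position index.

    index is 0-based here. The original list is left untouched.
    """
    ref = imgarray[index]
    # Assumption: a black image is all zeros, same size as the reference image.
    blank = [[0] * len(ref[0]) for _ in ref]
    return imgarray[:index + 1] + [blank] + imgarray[index + 1:]

# Three tiny made-up word images.
words = [[[1, 0]], [[1, 1], [0, 1]], [[1]]]
padded = placehold(words, 1)
print(len(words), len(padded))  # 3 4
print(padded[2])                # [[0, 0], [0, 0]]
```

Keeping the extra tag and padding the images keeps every later image aligned with its tag without editing the transcript.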

Friday May 5th 2017

-Minyue, Wenqin, Faith


All files in folders named "*Data" are *.mat files, which are MATLAB data files. All other files are *.m files, which contain code (scripts or functions).

    autoSlice:      functions for automatically slicing lines into words
    autoSliceData:  data for autoSlice
    matchTag:       functions for visually checking tags against images, matching them, and correcting errors
    matchTagData:   data for matchTag


    • slice.m Takes a line image and a threshold as input and slices the line into words wherever the space between two areas of ink is larger than the threshold. Returns a cell array of sliced words (in image order, which is the reverse of the reading order of Syriac)
@param img a binarized image of one line of Syriac as a matrix
@param threshold an integer that defines the minimum allowed space between characters
@return wordsinline a cell array of sliced words from this line
    • slice_img.m Takes a line image and a threshold as input and returns all the important pivots for slicing this line
@param img a binarized image of one line of Syriac as a matrix
@param threshold an integer that defines the minimum allowed space between characters
@return pivot a vector of important pivots (integers) for slicing the line
    • showwords.m Takes a cell array of words from a single line (the output of slice.m) and visualizes all the words and the original line
@param line a cell array of words from a single line
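
As a rough illustration of the gap-threshold idea behind slice.m, here is a minimal Python sketch. It is not the MATLAB code: it assumes the line image is a list of rows in which 1 marks ink, and the name slice_line and the example image are made up.

```python
def slice_line(img, threshold):
    """Cut a line image into words wherever the blank gap exceeds threshold.

    Returns the word images in image order (left to right).
    """
    ncols = len(img[0])
    # True for every column that contains at least one ink pixel.
    has_ink = [any(row[c] for row in img) for c in range(ncols)]
    words = []
    start = None     # first ink column of the word being built
    last_ink = None  # most recent ink column seen
    for c in range(ncols):
        if not has_ink[c]:
            continue
        if start is None:
            start = c
        elif c - last_ink - 1 > threshold:
            # The blank run before this column is wide enough: close the word.
            words.append([row[start:last_ink + 1] for row in img])
            start = c
        last_ink = c
    if start is not None:
        words.append([row[start:last_ink + 1] for row in img])
    return words

# Made-up two-row line: a 1-column gap inside a word, then a 3-column gap.
line = [[1, 1, 0, 1, 0, 0, 0, 1, 1],
        [0, 1, 0, 1, 0, 0, 0, 1, 0]]
print(len(slice_line(line, 2)))  # 2
print(len(slice_line(line, 0)))  # 3
```

A threshold that is too small splits a word at the gaps between its letters, and one that is too large merges neighboring words, which is why some lines still need to be chopped by hand.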


    • Syriac.mat Data file with the original working tags and images, and all unsliced lines from the images
@varia line a cell array of all unsliced lines
@varia wimg2 a cell array holding the original working images
@varia wtag2 a cell array holding the original working tags


    • chopLines.m A slicing function from previous work on Syriac. It takes a line image as input and lets the user slice it with a GUI. When the user presses ESC, the function returns a vector of pivots for slicing.
@param lines a binarized matrix of the line image
@return seps a vector of pivots for slicing
    • handChop.m Takes a cell array of words and the index of the word image to reslice. It lets the user slice the given image by hand and returns a corrected cell array of all words.
@param array a cell array of words
@param index index of the word in array to reslice
@return corrected a corrected cell array of all words after reslicing
    • matchindex.m Takes a user-defined array of tags and a target array of tags and finds all possible locations of the user-defined array in the target array.
@param mytrans a cell array of tags whose position is being searched for
@param trans a target cell array in which to find possible positions
@param ista the index of the start position for the search in the target array
@param iend the index of the end position for the search in the target array
@param precision a double from 0 to 1 that specifies the accuracy of matching. 1 means an exact match and 0 means a trivial solution
@param bound a non-negative integer that specifies the bounding freedom for matching. 0 means no freedom: the relative positions of words in the user-defined array and the target array must be exactly the same
@return start a vector of all possible matching indices in trans for the first element of mytrans
    • placehold.m Takes a cell array of images and an index and inserts a black placeholder image after this index
@param imgarray a cell array of images
@param index the index after which the placeholder is inserted
@return newarray a new cell array with the placeholder
    • first100example.m An example of matching tags for the first 100 words from newallwords in newallwords.mat. The long line that displays the images and tags is line 460 in SyriacGenesis.m
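
The array bookkeeping that handChop.m performs after the user has picked the cut positions can be sketched like this. It is a Python illustration of the non-interactive part only; the GUI is left out, and the name rechop, the seps argument, and the 0-based column positions are assumptions made for the example.

```python
def rechop(array, index, seps):
    """Replace array[index] with the pieces obtained by cutting it at seps.

    seps holds 0-based column positions; each piece runs up to, but does not
    include, the next cut. Returns a new list and leaves the input unchanged.
    """
    word = array[index]
    width = len(word[0])
    bounds = [0] + sorted(seps) + [width]
    pieces = [
        [row[a:b] for row in word]
        for a, b in zip(bounds, bounds[1:])
        if b > a  # skip empty pieces from duplicate or edge cuts
    ]
    return array[:index] + pieces + array[index + 1:]

# Made-up one-row images; the middle one is a "long word" to cut twice.
words = [[[1, 1]], [[1, 0, 0, 1, 1, 0, 1]], [[1]]]
fixed = rechop(words, 1, [3, 5])
print(len(fixed))  # 5
print(fixed[1:4])  # [[[1, 0, 0]], [[1, 1]], [[0, 1]]]
```

Because the pieces are spliced in at the position of the original word, every later word shifts by the number of extra pieces, which is what brings the images back in step with the tags.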


    • SyriacData.mat Data file containing about 700 newly sliced and matched word images and the wtag
@varia Syriac3820 a cell array of words; the index of the first image in wtag is 3820.
@varia diff100 the difference, which in this file is 3820.
@varia wtag a cell array of tags for the words.

    • first#.mat Data file for the matched words #-99 to # from newallwords
@varia first# a cell array of words that exactly match parts of wtag from GenesisSync.mat
@varia diff# the difference between the matched position in wtag and the position of the image in first100. How to use it is shown in first100example.m
    • newallwords.mat Data file that saves all automatically sliced words
@varia newallwords A cell array of all automatically sliced words
@varia newalllength A vector of the pixel widths of all automatically sliced words
@varia taglength A vector of the character lengths of all tags
    • GenesisSync.mat Data file that saves all original images and tags
@varia wimg A cell array of all images
@varia wtag A cell array of all tags for matching index
    • week#.mat Data file of unhandled word images for the #th week
@varia first# Words #-99 to # from newallwords