Digital Humanities Grant — Adriane Gan
- 10/28/13: Binarized selected manuscript image files (all files currently available to me) from VatSyr and BorSyr collections
- 10/29/13: Checked annotation consistency between Java annotation client output and MATLAB read-in of that output
- 02/14/14: Congealed letters are now saved to an array
- 02/24/14: Congealed letters are used to create image masks, using selectLetters tool
- 03/26/14: Congealing process now saves transformed source samples; image masks created using congealed images are applied to transformed images.
Next step: congeal masked transformed images; compute parts and cavities.
Documenting Emma's work
Relevant files: EmmaRevised.m and KeepGoodPatches.m
1. Scale images [text lines 19 pixels high]
2. Smooth with \sigma = 0.5 (gsmooth)
3. Binarize (binarizeImage2b)
4. Extract samples from a square of radius 24 pixels
5. For a given letter type:
   a. Congeal all samples
   b. Threshold congealed image
   c. Throw out connected components that don't overlap the thresholded congealed image (chaff)
   d. Recongeal samples
   e. Divide ideal image into parts and cavities, throwing out unstable ones
   f. Congeal each part/cavity sample
   g. Assemble feature vector for each sample
6. Vote for affinity:
   a. Simple voting: each character sample votes once for its favorite related document
   b. Rank voting: each character sample gives votes to related documents based on 1/rank (Borda/Nauru)
   c. Weighted rank voting: some character samples' votes count more
7. Evaluation: 19 documents with 4 pages from each. Compute precision at various recall levels.
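The evaluation in step 7 can be sketched as follows. This is an illustration only, written in Python rather than the project's MATLAB, and it is not taken from Emma's code. The function name precision_at_recall is hypothetical, and the notes do not say how relevance is defined; the sketch assumes that for a query page the other pages are ranked by affinity, and a retrieved page counts as relevant when it comes from the same document as the query.

```python
def precision_at_recall(ranked_relevance, recall_levels):
    """Precision at the first rank where recall reaches each requested level.

    ranked_relevance: list of booleans, best match first; True means the
        retrieved page is relevant (assumed here: same document as the query).
    recall_levels: iterable of recall values in (0, 1].
    Returns a dict mapping each recall level to a precision value.
    """
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        raise ValueError("no relevant items in the ranking")

    # (recall, precision) after each position in the ranking.
    points = []
    hits = 0
    for k, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
        points.append((hits / total_relevant, hits / k))

    results = {}
    for level in recall_levels:
        for recall, precision in points:
            if recall >= level:
                results[level] = precision
                break
    return results
```

With 4 pages per document, a query page has 3 relevant pages among the remaining 75, so the natural recall levels under this assumption would be 1/3, 2/3, and 1.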
The files Emma was working with are in the folder From Emma -- June 2010 Update. The list of files below gives the sequence of commands used for processing. It includes some false starts, however, which are overridden by later code.
- matchSearch.m was an attempt to recover the location annotations Emma used, which she had not recorded.
- Image rescaling happens in rescaleImages.m.
- EmmaRevised.m up to line 93 does some binarization.
- After this, KeepGoodPatches.m extracts image patches.
- The remainder of EmmaRevised.m follows this.
I've gone through rescaleImage.m and pulled the sequence of processing steps into a more coherent form, now called EmmaSyriacProcess.m
Next steps in the Emma process:
1. Extract patches according to their bounding boxes (KeepGoodPatches.m lines 84-103)
2. Congeal patches to get canonical form (KeepGoodPatches.m lines 111-125)
   (note 1: Current congeal works with pre-isolated patches. It would be better to revise the congealing so that patches are extracted from the source images freshly every time the transformation changes.)
   (note 2: Normally, patches are congealed with each other. Alternately, they could be congealed to a predetermined mean image. The latter course is necessary for an established reference system.)
2a. Remove extraneous marks from patches. An extraneous mark is defined as a connected component in the image that does not touch a connected component of the congealed image. (KeepGoodPatches.m lines 126-143)
2b. Recongeal patches to get canonical form. (KeepGoodPatches.m lines 144-156)
3. Determine canonical parts and cavities. (EmmaRevised.m lines 103-118)
4. Compute parts and cavities for all samples, and relate them to the canonical set. Then congeal all representatives of each distinct part/cavity together. (EmmaRevised.m lines 119-139)
5. Assemble congealing transform parameters for parts/cavities into feature vector descriptive of each individual sample (EmmaRevised.m lines 153-164; note lines 140-152 seek to exclude bad data from final analysis)
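The extraneous-mark rule in step 2a can be sketched as follows. This is a minimal illustration in Python, not a transcription of KeepGoodPatches.m. The function names are hypothetical, images are assumed to be binary lists of rows (1 = ink), components are found with 8-connectivity, and "touch" is read as sharing at least one foreground pixel with the thresholded congealed image. All of these are assumptions.

```python
from collections import deque


def connected_components(image):
    """Return the foreground components of a binary image as sets of (row, col)."""
    rows, cols = len(image), len(image[0])
    seen = set()
    components = []
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or (r, c) in seen:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            component = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                component.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            components.append(component)
    return components


def remove_extraneous_marks(patch, congealed_mask):
    """Keep only the components of patch that overlap the congealed mask."""
    reference = {(r, c)
                 for r, row in enumerate(congealed_mask)
                 for c, value in enumerate(row) if value}
    cleaned = [[0] * len(patch[0]) for _ in patch]
    for component in connected_components(patch):
        if component & reference:  # shares at least one pixel with the mask
            for r, c in component:
                cleaned[r][c] = 1
    return cleaned
```

A looser reading of "touch" (adjacency rather than overlap) would only require dilating the reference set by one pixel before the intersection test.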
To compare documents:
1. Simple vote of all samples from one document: each sample gets one vote; which other document gets the most votes?
2. Rank vote of all samples from one document: each sample votes for all other samples with strength (1/rank); which other document gets the most total strength?
3. Weighted rank vote: as above, but strength of vote by each character is multiplied by a factor that favors some characters over others.
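The three voting schemes can be sketched as follows. This is illustrative Python, not the code in looEmma.m or its siblings. The names tally_votes, best_match, and letter_weights are hypothetical. The sketch also assumes that each sample's ranking has already been collapsed into a ranked list of candidate documents (best first), which simplifies the sample-to-sample voting described above.

```python
from collections import defaultdict


def tally_votes(sample_rankings, scheme="simple", letter_weights=None):
    """Accumulate vote strength per candidate document.

    sample_rankings: list of (letter, ranked_docs) pairs, one per character
        sample of the query document; ranked_docs is ordered best first.
    scheme: "simple", "rank", or "weighted".
    letter_weights: per-letter multipliers for the "weighted" scheme;
        letters without an entry default to 1.0.
    """
    if scheme not in ("simple", "rank", "weighted"):
        raise ValueError("unknown scheme: " + scheme)
    totals = defaultdict(float)
    for letter, ranked_docs in sample_rankings:
        if scheme == "simple":
            # One vote for the top-ranked document only.
            if ranked_docs:
                totals[ranked_docs[0]] += 1.0
            continue
        weight = 1.0
        if scheme == "weighted":
            weight = (letter_weights or {}).get(letter, 1.0)
        # Every ranked document receives strength 1/rank, scaled by the weight.
        for rank, doc in enumerate(ranked_docs, start=1):
            totals[doc] += weight / rank
    return dict(totals)


def best_match(totals):
    """Return the document with the largest total (first one wins a tie)."""
    return max(totals, key=totals.get)
```

For example, with two alaph samples ranking documents B then C and one beth sample ranking C then B, simple and rank voting both pick B, while a beth weight of 3.0 flips the weighted result to C.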
It looks like the experiments that made it into the paper are actually in looEmma.m, cvEmma.m, and greedyCvEmma.m.