Common Schedule


Alice Yang & Rui Huang


Week 1: 5/12 - 5/16

- Set up computer; install software
- Learn Matlab (no previous experience)
- Create Matlab exercise Q/A
- Become familiar with Adriane's progress and Emma's thesis, "Automated Writer Identification for Syriac Scribes"

Week 2: 5/19 - 5/23

In Adriane's function processSyriacAnnotations.m, we saved the two structure arrays lftr and ldoc. Each has 14 fields, which correspond to different letters in the Syriac alphabet. Each field is a cell array of varying length. In lftr, each cell holds the feature vector describing one sample of that letter found in a manuscript. Within each feature vector, the first two entries give the character's global skew, and the remaining entries, in groups of 7, give the 7 transformation parameters (translation, rotation, elongation, and shear) of each feature/cavity. In ldoc, the cell at the same index as a sample holds a string naming the source manuscript the sample came from.
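
The feature-vector layout described above can be illustrated with a short Python sketch (the project code itself is MATLAB). The function name split_feature_vector and the sample values are hypothetical; only the layout, 2 skew values followed by groups of 7 parameters, comes from the description.

```python
def split_feature_vector(vec, params_per_feature=7):
    """Split one sample's vector into (global_skew, per-feature parameter groups)."""
    skew, rest = vec[:2], vec[2:]
    if len(rest) % params_per_feature != 0:
        raise ValueError("expected 2 skew values followed by groups of %d"
                         % params_per_feature)
    groups = [rest[i:i + params_per_feature]
              for i in range(0, len(rest), params_per_feature)]
    return skew, groups


# Hypothetical sample: 2 skew values + 2 features x 7 parameters = 16 numbers.
sample = [0.1, -0.2] + list(range(1, 15))
skew, groups = split_feature_vector(sample)
print(skew)         # the two global-skew entries
print(len(groups))  # number of features/cavities in this sample
```

A vector whose length is not 2 plus a multiple of 7 is rejected, which is a cheap way to catch samples that were stored with a different layout.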

Here are screenshots from Matlab:

  • lftr


Lftr.png

  • lftr.Alaph


Lftr.Alaph.png


  • lftr.Alaph{1,1}


Lftr.Alaphcell.png




Functions written:

  • runSyriacStyleComparison.m compares test features against sample features and ranks all sample sources by proximity to each test source. It takes 4 required parameters (sampFtr, sampSrc, testFtr, testSrc) and 3 optional parameters (method, weights, letters). The four required parameters and the output rank are all structure arrays. The method parameter selects one of three voting functions: simple vote, rank vote, or weighted rank vote. Users can choose which method, which set of weights, and which letters the program uses.
  • filterManuscripts.m filters a set of sample features and their sources and returns a subset of features and sources corresponding to the given subset of sources.
  • identifyRepresentativeFeatureSample.m takes a cell array of sample features and returns the most representative sample feature and its index in the given array.
  • identifyRepresentativeSample.m takes a cell array of samples and returns the most representative sample and its index in the given array.
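
As a rough Python illustration of what "most representative" could mean (the MATLAB functions may use a different criterion), the sketch below picks the sample with the smallest summed distance to all other samples. The Euclidean distance, the function names, and the example vectors are assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def most_representative(features, dist=euclidean):
    """Return (feature, index) of the sample closest, in total, to all others.

    Assumption: 'representative' means minimal summed distance (a medoid).
    """
    if not features:
        raise ValueError("no samples given")
    totals = [sum(dist(f, g) for g in features) for f in features]
    best = min(range(len(features)), key=totals.__getitem__)
    return features[best], best


# Hypothetical 2-D feature vectors; real vectors would be much longer.
feats = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]]
rep, idx = most_representative(feats)
print(rep, idx)
```

The returned index is 0-based here, whereas the MATLAB functions would return a 1-based index into the cell array.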

Functions in progress:

  • cleanSamples.m cleans character samples passed in as a cell array. The cleaning method may be specified as ccTouch, selectLetters, predefined, or predefinedFromFile. With the selectLetters method, users can save the generated masks to a given directory (the default, an empty string, saves nothing) so that the predefined method, which uses the most recently saved masks, can reuse them later. Default settings: method = selectLetters; mask = ' ' (no mask); saveMasks = ' ' (don't save)

Week 3 & 4: 5/26 - 6/6

Please refer to Rui's page

Week 5: 6/9 - 6/13

  • Display raw manuscripts on the Representative Samples webpage
  • Functions written/revised:
    • processSyriacAnnotations.m: This function processes Syriac annotations by creating a struct array of all given samples with their source information and images. The user can add padding to the extracted samples and select the desired image formats ('raw', 'binarized', 'grayscale', 'blackOnWhite') by passing true or false after each format name.
@param dbFile a database text file containing annotations
@param varargin user can specify the directory of manuscripts, the padding in pixels added to the image coordinates, and which image formats (raw, binarized, grayscale, blackOnWhite) to store as fields
ex: 'padding', 3, 'raw', true
defaults: raw = true,
binarized = false,
grayscale = false,
blackOnWhite = false,
padding = 0,
directory = 'C:\MATLAB\Handwriting\Summer2014\ImageDirectory'
@returns a struct array of character sample source information and images with fields: source, page, url, letter, coordinates, and one field per requested image format
ex. sample(1).source = 'VatSyr1'
ex. sample(1).page = '01'
ex. sample(1).url = 'http://.../VatSyr1-01.png'
ex. sample(1).letter = 'Alaph'
ex. sample(1).coordinates = [515 65 80 105]
ex. sample(1).raw = %raw image
ex. sample(1).binarized = %binarized image
    • getSampleSources.m: This function returns a struct (samples) of the data passed in from a database file.
@param dbFile The database file containing info about the character samples
ex. 'C:\MATLAB\Handwriting\Summer2014\FullDB-2014-06-09.txt'
@returns a struct array of character sample source information with fields: source, page, url, letter, coordinates
ex. sample(1).source = 'VatSyr1'
ex. sample(1).page = '01'
ex. sample(1).url = 'http://.../VatSyr1-01.png'
ex. sample(1).letter = 'Alaph'
ex. sample(1).coordinates = [515 65 80 105]
    • getSamples.m: This function takes in a struct containing source information about character samples, saves each manuscript image file into the specified local directory, retrieves each character sample, and stores the image in the original struct. If no manuscript is found or the sample dimension requirements are not met, nothing is saved into the character sample field(s) for that source.
@param samples the struct containing source info about character samples; the struct must have fields: source, url, letter, coordinates
@param directory the local directory where manuscript image files are saved
@param varargin user can specify the padding in pixels added to the image coordinates, the output format (raw, binarized, grayscale, blackOnWhite) of the images, and the directory where manuscripts will be downloaded and retrieved
ex: 'padding', 3, 'raw', true
defaults: raw = true,
binarized = false,
grayscale = false,
blackOnWhite = false,
padding = 0,
directory = 'C:\MATLAB\Handwriting\Summer2014\ImageDirectory'
@returns the passed in struct with new field(s) containing the corresponding character sample(s)
    • retrieveManuscriptImage.m: This function retrieves the desired manuscript(s)/image(s) specified in the given scalar struct recursively and saves the manuscript(s) into a local directory. If no manuscript is found online, returns NaN. The user may specify the output format: raw or binarized.
@param sample a scalar struct containing only the info about the desired sample image. The struct must have fields: letter, source, page, url, coordinates
@param varargin user can specify the output format (raw or binarized) and the local directory where the image files will be saved
default: outputFormat = raw
directory = 'C:\MATLAB\Handwriting\Summer2014\ImageDirectory'
@returns desired manuscript image, if none is found, returns NaN
    • extractSampleFromManuscript.m: This function extracts a character/subimage from a larger manuscript/image. A sample whose coordinates yield an image larger than the specified maximum pixels is reduced in size, and a sample whose coordinates yield an image smaller than the specified minimum pixels is not returned.
@param sampleInfo One layer of the original sample struct containing only the info about the desired sample image. The struct must have fields: letter, source, page, url, coordinates
ex. sample(3)
@param manuscript the image file of the desired manuscript
@param varargin user can specify the padding added to the image coordinates, the minimum and maximum pixels of the sample, and whether or not the manuscript is binarized, which determines the fill color used if the sample exceeds the boundaries of the manuscript page
ex. 'padding',3,'binarized',true,'minpixels',50,'maxpixels',100
@returns the extracted sample image; if the sample coordinates yield an image that is smaller than the minimum pixels or is all one color, returns NaN
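
The extraction step described for extractSampleFromManuscript.m can be sketched in Python as a padded crop with a boundary fill. This is an illustration under stated assumptions, not the MATLAB implementation: coordinates are taken to be [x y width height] with a 0-based top-left origin, the minimum size is checked per side after padding, None stands in for NaN, and the maximum-size reduction is left out. The name extract_sample and the tiny test page are hypothetical.

```python
def extract_sample(page, coords, padding=0, min_pixels=1, fill=255):
    """Crop coords = [x, y, width, height] from page, a list of pixel rows.

    Pixels that fall outside the page are set to fill. Returns None when the
    crop is smaller than min_pixels on either side or is a single color.
    """
    x, y, w, h = coords
    x0, y0 = x - padding, y - padding
    x1, y1 = x + w + padding, y + h + padding
    if x1 - x0 < min_pixels or y1 - y0 < min_pixels:
        return None
    rows, cols = len(page), len(page[0])
    crop = [
        [page[r][c] if 0 <= r < rows and 0 <= c < cols else fill
         for c in range(x0, x1)]
        for r in range(y0, y1)
    ]
    # A crop containing only one distinct value carries no character shape.
    if len({v for row in crop for v in row}) <= 1:
        return None
    return crop


# Hypothetical 4x4 grayscale page with pixel values 0..15.
page = [[r * 4 + c for c in range(4)] for r in range(4)]
inner = extract_sample(page, [1, 1, 2, 2])
padded = extract_sample(page, [1, 1, 2, 2], padding=2)
print(inner)
print(len(padded), len(padded[0]))
```

With padding=2 the crop extends one pixel past every edge of the 4x4 page, so its outer ring is filled with the fill value; a binarized page would presumably use a different fill color.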

Week 6: 6/16 - 6/20

  • Run processSyriacAnnotations.m to generate a struct array of samples with 7 fields: letter, source, page, url, coordinates, raw, and binarized.


Samples2.png

  • Functions written:
    • compareSamples.m: This function takes two samples/struct of samples and returns a number/matrix of how similar they are.
@param sample1 a struct containing the first character or set of characters to compare
@param sample2 a struct containing the second character or set of characters to compare
@param method the method used to compare the samples: chamferDistance, congealFeatures, or inkballDifference
@param varargin
if user selects a method involving congealing features, user can specify which distance calculation to use
options: 'Euclidean', 'Manhattan', or 'EarthMovers'
default: congealMethod = 'Euclidean'
ex: 'congealMethod', 'Manhattan'
user can pass in cells of sources along rows and columns for current letter
default: 'rowSources' = cell(1); 'colSources' = cell(1);
ex: 'rowSources',letterRowsources,'colSources',letterColsources
@returns
distances a number/matrix of the similarity between two sets of samples
rowSources updated sources along the row after deleting samples with empty feature vectors
colSources updated sources along the column after deleting samples with empty feature vectors
    • voteOnSimilarity.m: This function performs a vote on a matrix of distances/difference scores.
@param simMatrix a matrix of difference scores
@param method the desired voting method: 'simple', 'ranked' or 'weighted'
@param rowSource a cell array specifying the manuscript of the sample in the corresponding row index of the passed in matrix
@param colSource a cell array specifying the manuscript of the sample in the corresponding column index of the passed in matrix
@param m1Sources a cell array of all of the manuscripts in the first set of samples to compare
@param m2Sources a cell array of all of the manuscripts in the second set of samples to compare
@param varargin user can specify the weight value of the current letter to be applied in the weighted rank vote
ex: 'weight', .35
default: weight = 0.5
@return a matrix of votes indicating the similarity between two sets of manuscripts; the higher the vote, the more similar the manuscripts
    • compareManuscripts.m: This function compares two sets of manuscripts and rates their similarity relative to the other manuscripts in each set, and returns a value/matrix describing their similarity. The higher the score, the more similar the two manuscripts are.
@param mSet1 a struct array containing samples from a manuscript, or a cell array of struct arrays containing samples from manuscripts; each index in the cell holds samples from a different manuscript
@param mSet2 a struct array containing samples from a manuscript, or a cell array of struct arrays containing samples from manuscripts; each index in the cell holds samples from a different manuscript
@param varargin
user can specify the method to calculate the difference between characters
options: 'chamferDistance', 'congealFeatures' or 'inkballDifference'
default: weight = 0.5
ex: 'diffMethod', 'chamferDistance'
if user selects a method involving congealing features, user can specify which distance calculation to use
options: 'Euclidean', 'Manhattan' or 'EarthMovers'
default: congealMethod = 'Euclidean'
ex: 'congealMethod', 'Manhattan'
user can specify the voting method for calculating which manuscripts are most similar
options: 'simple' 'ranked' or 'weighted'
default: votingMethod = 'ranked'
ex: 'votingMethod', 'simple'
@returns a value/matrix describing the similarities between the two manuscripts/two sets of manuscripts; the higher the number, the more similar the two manuscripts relative to the other manuscripts in their sets
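
To make the voting step concrete, here is a Python sketch of one plausible reading of the 'simple' method in voteOnSimilarity.m: every row sample casts one vote for the manuscript of its nearest column sample. The actual MATLAB rule may differ, and the manuscript names 'A', 'B', 'X', 'Y' and the distance values are made up for the example.

```python
def simple_vote(dist, row_sources, col_sources, m1_sources, m2_sources):
    """Tally nearest-neighbor votes between two sets of manuscripts.

    dist[i][j] is the difference score between row sample i and column
    sample j (lower means more similar). Returns a len(m1_sources) x
    len(m2_sources) matrix of vote counts (higher means more similar).
    """
    votes = [[0] * len(m2_sources) for _ in m1_sources]
    for i, row in enumerate(dist):
        nearest = min(range(len(row)), key=row.__getitem__)
        r = m1_sources.index(row_sources[i])
        c = m2_sources.index(col_sources[nearest])
        votes[r][c] += 1
    return votes


# Hypothetical data: 3 row samples from manuscripts A/B, 3 column samples from X/Y.
dist = [[0.2, 0.9, 0.5],
        [0.8, 0.1, 0.7],
        [0.6, 0.4, 0.3]]
votes = simple_vote(dist, ['A', 'A', 'B'], ['X', 'Y', 'Y'], ['A', 'B'], ['X', 'Y'])
print(votes)
```

A ranked or weighted variant would replace the single vote per row with points that depend on each column sample's rank and on the letter's weight.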

Week 7: 6/23 - 6/27

FUNCTIONS WRITTEN/REVISED

    • filterManuscripts.m: This function filters samples in one of 3 ways: 'by manuscript', 'by letters', or 'by sources'. It returns the subset of samples corresponding to the given subset of manuscripts, letters, or manuscript pages.
Example inputs: filterManuscripts(samples,'Alaph','by letters')
filterManuscripts(samples,{'VatSyr001','VatSyr002'},'by sources')
@param samples the original struct containing information about the characters. It must contain fields: source, page, and letter.
@param filters a string or cell array of strings containing the desired filters
@param method one of 3 strings: 'by manuscript' filters the struct by the specified manuscript(s), 'by letters' filters the struct by the specified letter(s), and 'by sources' filters the struct by the specified manuscript page(s)
@returns subSamples a substruct array of the original struct, filtered
indices indices of the filtered samples in the original struct
    • cleanAllLetterSamples.m: This function congeals all samples, requests the letter masks all at once and applies masks.
@param samples a struct array of samples to be congealed
@returns samples the passed-in struct of samples with one more new field 'congealed' storing their congealed format
    • resizeSamples.m: This function resizes samples passed in as either a struct array or a cell array of structs and returns these structs with one more field 'standardized'.
@param samps samples that need to be resized/put in a stack, either a struct array or a cell array of structs
@returns samps the passed in structs with one more field 'standardized' storing resized samples
    • deleteBadSamples.m: This function deletes all empty/bad samples from the struct array: empty images and all black/white images
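
The filtering behavior of filterManuscripts.m can be approximated in Python as follows. This is a sketch, not a port: samples are modeled as a list of dictionaries, only the 'by manuscript' and 'by letters' methods are covered (the page-identifier format used by 'by sources' is not specified here), and the sample values other than 'VatSyr1' and 'Alaph' are hypothetical.

```python
def filter_samples(samples, filters, method):
    """Return (sub_samples, indices) for samples matching the given filters."""
    if isinstance(filters, str):
        filters = [filters]  # accept a single string or a list of strings
    field = {"by manuscript": "source", "by letters": "letter"}.get(method)
    if field is None:
        raise ValueError("unsupported method: %r" % (method,))
    wanted = set(filters)
    indices = [i for i, s in enumerate(samples) if s[field] in wanted]
    return [samples[i] for i in indices], indices


samples = [
    {"source": "VatSyr1", "page": "01", "letter": "Alaph"},
    {"source": "VatSyr1", "page": "02", "letter": "Beth"},
    {"source": "VatSyr2", "page": "01", "letter": "Alaph"},
]
alaphs, alaph_idx = filter_samples(samples, "Alaph", "by letters")
vat1, vat1_idx = filter_samples(samples, ["VatSyr1"], "by manuscript")
print(alaph_idx, vat1_idx)
```

Returning the indices alongside the subset mirrors the two outputs of the MATLAB function, although these indices are 0-based.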


    • runExtractJpgAndCrop.py: Takes in command line arguments and calls extractJpg and/or cropJpg from extractJpgAndCrop.py with passed in values. Arguments not specified will take on default values. The directory storing the JPG files will be created if it does not exist.
params: PDFdirectory    the directory containing all of the manuscript files
                        as PDFs   
        JPGdirectory    the directory where the JPG images of the manuscript
                        will be saved
        x               the x coordinate of the manuscript's top left corner
                        to be cropped
        y               the y coordinate of the manuscript's top left corner
                        to be cropped
        w               the width of the manuscript's cropped image
        h               the height of the manuscript's cropped image
        extract         True/False indicating whether to extract JPGs from PDFs
        crop            True/False indicating whether to crop JPGs
    • extractJpgAndCrop.py: Contains functions
extractJpg: extracts jpg files from a folder of PDFs and saves them to a given directory
cropJpg: crops all jpg files in a given directory with given dimensions/coordinates

The Python scripts require the Python library "PIL" and Python 2.7 (PIL does not support Python 3)
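
A command-line front end like runExtractJpgAndCrop.py could be organized roughly as below. This is a sketch using the standard argparse module: the parameter names follow the list above, but the "--" flag syntax, every default value, and the helper names str2bool, build_parser, and ensure_dir are assumptions, and the calls into extractJpgAndCrop.py are omitted.

```python
import argparse
import os


def str2bool(text):
    """Parse 'True'/'False' style command-line values into booleans."""
    value = text.strip().lower()
    if value in ("true", "1", "yes"):
        return True
    if value in ("false", "0", "no"):
        return False
    raise argparse.ArgumentTypeError("expected True or False, got %r" % text)


def build_parser():
    p = argparse.ArgumentParser(description="Extract JPGs from PDFs and/or crop them.")
    # All default values below are placeholders, not the script's real defaults.
    p.add_argument("--PDFdirectory", default="pdfs")
    p.add_argument("--JPGdirectory", default="jpgs")
    p.add_argument("--x", type=int, default=0)
    p.add_argument("--y", type=int, default=0)
    p.add_argument("--w", type=int, default=100)
    p.add_argument("--h", type=int, default=100)
    p.add_argument("--extract", type=str2bool, default=True)
    p.add_argument("--crop", type=str2bool, default=True)
    return p


def ensure_dir(path):
    """Create the JPG output directory if it does not exist yet."""
    if not os.path.isdir(path):
        os.makedirs(path)


# Arguments not given on the command line keep their default values.
args = build_parser().parse_args(["--JPGdirectory", "out", "--w", "800", "--crop", "False"])
print(args.JPGdirectory, args.w, args.extract, args.crop)
```

The sketch avoids Python 3-only syntax, so it should also run under Python 2.7, which the scripts require.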


PROGRAMS EXECUTED

  • Run createWebpagewithRepresentativeSamples.m to create webpages in the revised format for samples with higher/lower resolution. Samples are extracted using square dimensions and displayed in the same size as the highest resolution sample.


HighResoWebpage.png
LowResoWebpage.png

Week 8: 7/7 - 7/11

  • Add chamferDistances as one method to identify the most representative sample in identifyRepresentativeSample.m
  • Revise functions compareSamples.m and compareManuscripts.m
  • Write samplesToRelativeSyriacIdFeatures.m based on samplesToSyriacIdFeatures to compute relative sample features using binaryCongealWithRecord.m and binaryCongealToAnchor.m

Week 9: 7/14 - 7/18