Medieval Paleographic Scale Experiments

From CSclasswiki

Our goal so far has been to reproduce the results of the paper "Image-based historical manuscript dating using contour and stroke fragments", so that the method can later be tested on Syriac data.

Source Paper: [1]

The Medieval Paleographical Scale (MPS) Data Set

To reproduce the results of the paper, we tested the method on the MPS data set, which contains images of documents from the Medieval Dutch language area written between 1300 and 1550 CE. The data set is divided into 11 classes, one for each 25-year step in this range (1300, 1325, and so on up to 1550). Each class contains between 100 and 600 documents.
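The class layout described above can be written down as a short sketch. The names KEY_YEARS and class_index are hypothetical and used only for illustration; they are not part of the MPS data set or the kCF code.

```python
# Key years of the 11 MPS classes: 1300, 1325, ..., 1550.
FIRST_YEAR, LAST_YEAR, STEP = 1300, 1550, 25
KEY_YEARS = list(range(FIRST_YEAR, LAST_YEAR + 1, STEP))


def class_index(key_year):
    """Map a key year (e.g. 1325) to a 0-based class index (e.g. 1)."""
    if key_year not in KEY_YEARS:
        raise ValueError(f"{key_year} is not an MPS key year")
    return (key_year - FIRST_YEAR) // STEP


print(len(KEY_YEARS))     # number of classes
print(class_index(1325))  # index of the 1325 class
```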

kCF Method

The k contour fragments (kCF) method uses written contours to identify handwriting style. Each contour is a collection of contour fragments, which are pieces of handwriting segmented at the midpoints between two points of high or low curvature. To test this method, we requested the kCF code from the authors of the paper and built it into a MEX file, mexTestkCF.mex, that can be called from MATLAB. The input to this code is an image file. The output is a matrix that describes all contour fragments in the image: each row represents one contour fragment, described by 100 points.

Running the kCF Method

Because of the large number of files in the data set, and the large size of the output matrices, we decided to use only 20% of the contour fragments in the output matrices. The results of running the kCF code on all files in the MPS data set can be found in May7.mat.
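The 20% reduction can be sketched as follows. This is only an illustration: it assumes the fragments are picked uniformly at random with a fixed seed, which the text above does not specify, and the function name subsample_fragments is hypothetical. Each fragment is a plain list of numbers standing in for one row of the output matrix.

```python
import random


def subsample_fragments(fragments, fraction=0.2, seed=0):
    """Return a random subset of rows, kept in their original order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    if not fragments:
        return []
    keep = max(1, round(len(fragments) * fraction))
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    chosen = sorted(rng.sample(range(len(fragments)), keep))
    return [fragments[i] for i in chosen]


# Toy stand-in for one output matrix: 50 fragments of 4 values each.
fragments = [[float(i)] * 4 for i in range(50)]
subset = subsample_fragments(fragments)
print(len(subset))
```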

Building a Training/Testing Set

Before this information can be used for dating, we must reduce the output of the kCF code to a more compact form. We do this with a clustering method. The paper uses a 2D self-organizing map (SOM), but for simplicity we used the kmeans method.
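A minimal sketch of the clustering step, written in plain Python rather than MATLAB. It implements basic k-means (Lloyd's algorithm) and is not the code we ran. The final step, which turns cluster assignments into a normalized histogram, is an assumption about how a single vector could be derived from the clusters; the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(rows, k, iterations=50, seed=0):
    """Basic Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(iterations):
        new_labels = [nearest(r, centroids) for r in rows]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, labels


def cluster_histogram(labels, k):
    """Assumed summary: fraction of fragments in each cluster."""
    counts = [0] * k
    for lab in labels:
        counts[lab] += 1
    return [c / len(labels) for c in counts]


# Toy fragments: two well-separated groups of 2-D points.
rows = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centroids, labels = kmeans(rows, k=2)
histogram = cluster_histogram(labels, 2)
print(labels, histogram)
```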

Previous Attempt

In previous attempts, we ran kmeans on each document separately, producing one vector per document. However, this approach may be flawed: each document gets its own clustering, so the resulting vectors are not directly comparable, even between documents of the same class. We also tested only two pairs of classes: the 1300 data set versus the 1325 data set, and the 1300 data set versus the 1550 data set.

Future Steps

Future students should extend the experiment to all 11 classes. In addition, they should run kmeans on all documents in one class together, to produce one vector per class rather than one vector per document.
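Pooling the fragments of all documents in a class before clustering could look like the sketch below. The function name pool_by_class and the (key year, fragment rows) input layout are assumptions made for illustration; the pooled rows of each class would then be passed to the clustering step.

```python
def pool_by_class(documents):
    """Group fragment rows by class.

    documents: iterable of (key_year, fragment_rows) pairs, one per document.
    Returns a dict mapping each key year to the rows of all its documents.
    """
    pooled = {}
    for key_year, rows in documents:
        pooled.setdefault(key_year, []).extend(rows)
    return pooled


# Toy input: two documents from 1300 and one from 1325.
documents = [
    (1300, [[0.0, 0.1], [0.2, 0.3]]),
    (1300, [[0.4, 0.5]]),
    (1325, [[0.6, 0.7], [0.8, 0.9], [1.0, 1.1]]),
]
pooled = pool_by_class(documents)
for key_year, rows in sorted(pooled.items()):
    print(key_year, len(rows))
```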

Dating using the kCF Method

The next step is to train a linear SVM, which MATLAB supports, on the output of kmeans. We used 70% of the data for training and 30% for testing. Once the SVM is trained, we can use it to predict the class of each document in the test group.
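The split, training, and scoring steps can be sketched in plain Python. This is not the MATLAB code we ran: the linear SVM here is a small hand-written stand-in trained by subgradient descent on the regularized hinge loss, and the function names, learning rate, and regularization value are all assumptions. It handles two classes only, labeled +1 and -1.

```python
import random


def train_test_split(samples, labels, train_fraction=0.7, seed=0):
    """Shuffle once, then cut into a training part and a testing part."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)
    cut = round(train_fraction * len(order))

    def pick(indices):
        return [samples[i] for i in indices], [labels[i] for i in indices]

    return pick(order[:cut]), pick(order[cut:])


def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(samples, labels, epochs=200, lr=0.1, lam=0.01):
    """Full-batch subgradient descent on the L2-regularized hinge loss."""
    n, dim = len(samples), len(samples[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wi for wi in w], 0.0
        for x, y in zip(samples, labels):
            if y * decision(w, b, x) < 1:  # sample violates the margin
                for d in range(dim):
                    grad_w[d] -= y * x[d] / n
                grad_b -= y / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    return 1 if decision(w, b, x) >= 0 else -1


def success_rate(w, b, samples, labels):
    hits = sum(predict(w, b, x) == y for x, y in zip(samples, labels))
    return hits / len(labels)


# Toy vectors for two classes (+1 and -1), clearly separable.
vectors = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [3.0, 2.0],
           [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0], [-3.0, -2.0]]
years = [1, 1, 1, 1, -1, -1, -1, -1]
(train_x, train_y), (test_x, test_y) = train_test_split(vectors, years)
w, b = train_linear_svm(train_x, train_y)
print(success_rate(w, b, test_x, test_y))
```

The success rate is the fraction of test documents whose predicted class matches the true class, which is how the percentages in this section are meant to be read.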

Previous Attempt

In our previous attempt, we trained one SVM on the 1300 and 1325 data and another SVM on the 1300 and 1550 data. The 1300 v 1325 test had a 62% success rate, and the 1300 v 1550 test had an 88% success rate. The latter result is close to the results the paper reports for the kCF method. However, because the clustering may be erroneous (see "Building a Training/Testing Set"), these results may not be trustworthy.