Document Image Binarization

A handwritten manuscript (top) and its binarization (bottom)

If the configuration of the ink markings in a document can be recovered accurately, then further processing steps may be simplified. However, scanned document images normally show variations in color, imperfections and stains, fading, bleed-through, and other artifacts that in many cases make it difficult to determine exactly which pixels are ink and which are background. Binarization is the procedure that addresses this basic problem in document analysis.
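To make the task concrete, here is a minimal sketch of the simplest kind of binarization: a single global threshold chosen by Otsu's classical criterion. This is a textbook baseline, not the method described on this page, and it tends to fail on exactly the stains, fading, and bleed-through mentioned above. The function names `otsu_threshold` and `binarize` are hypothetical, and the input is assumed to be an 8-bit grayscale image stored as rows of integers with dark ink on a light background.

```python
def otsu_threshold(pixels):
    """Return the 8-bit gray level that maximizes between-class variance."""
    # Histogram of gray levels 0..255.
    hist = [0] * 256
    for v in pixels:
        hist[v] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # no dark class yet
        w1 = total - w0
        if w1 == 0:
            break             # no bright class left
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map a grayscale image (rows of 0..255 ints) to 1 = ink, 0 = background.

    Assumes dark ink on a lighter background.
    """
    flat = [v for row in image for v in row]
    t = otsu_threshold(flat)
    return [[1 if v <= t else 0 for v in row] for row in image]


if __name__ == "__main__":
    # Tiny synthetic example: three dark "ink" pixels on a bright page.
    page = [
        [200, 210, 40],
        [205, 30, 220],
        [35, 215, 208],
    ]
    for row in binarize(page):
        print(row)
```

Because one threshold is applied to the whole page, a stain that is darker than faded ink elsewhere will be misclassified, which is why more elaborate, locally adaptive methods are the subject of ongoing research.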

Because binarization is such a fundamental task, much research attention has focused on making it as accurate as possible. Six international competitions, held annually, have assessed the state of the art. The method introduced in my 2012 IJDAR paper won the H-DIBCO 2012 contest, beating 23 other entries from around the world. Further entries based upon improved versions of the method placed third and second in the 2013 and 2014 contests, respectively. (The winner in 2014 also used a variant of my technique, submitted by researchers at another institution.)

See a list of papers on document image binarization.

Download MATLAB reference code for binarization.