CSC352 Emily's Project 2013

From CSclasswiki

Here is a brief description of my planned project, which is subject to change somewhat as I work on it. This will be replaced with an abstract once I have made more progress on the project and have results.

I plan to do a project on downloading and filtering Wikipedia images. The eventual goal of the class project is to resize and pack Wikipedia images based on their "weights". I will download access statistics from the Wikimedia dumps, filter them, and store them in a MySQL database. I will then use these access statistics to determine the N top-weighted images and process them.
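As a rough illustration of the last step, here is a minimal sketch of pulling the N top-weighted images out of a statistics table. It is only an assumption about how the query might look: Python's built-in sqlite3 module stands in for MySQL so the example runs anywhere, and the table name `image_stats`, its columns, and the sample rows are all hypothetical.

```python
import sqlite3


def top_n_images(rows, n):
    """Return the n most-viewed (name, views) pairs, highest first.

    rows is an iterable of (image_name, view_count) pairs. An in-memory
    SQLite database stands in for the MySQL database (assumption).
    """
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per image with its access count.
    conn.execute(
        "CREATE TABLE image_stats (name TEXT PRIMARY KEY, views INTEGER)"
    )
    conn.executemany("INSERT INTO image_stats VALUES (?, ?)", rows)
    # Sort by weight, break ties by name so the result is deterministic.
    cursor = conn.execute(
        "SELECT name, views FROM image_stats "
        "ORDER BY views DESC, name ASC LIMIT ?",
        (n,),
    )
    result = cursor.fetchall()
    conn.close()
    return result


# Made-up sample data, not real access statistics.
sample = [("Cat.jpg", 120), ("Dog.png", 300), ("Map.svg", 45), ("Sun.gif", 300)]
top_two = top_n_images(sample, 2)
```

Letting the database do the sorting keeps the client code short. Whether that remains practical at the scale of the full dumps is one of the things the project would have to measure.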

The key aspects of my project will be the parallelization involved in populating the MySQL database with image and access-statistics information, and in setting up a parallel framework to identify the N top images that we can process (see below). I will do this in MPI. My project will not focus on devising a weighting scheme for the images; rather, it will focus on setting up a system in which images can be examined under any given weighting scheme. I will start by using the Wikipedia image view statistics (we looked at these in HW4) as weights, and then, if time permits, I will explore other weighting schemes (number of page views, how linked a page is to others, how often the page is edited, etc.) using the statistics available on media-wiki.
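One way to keep the system independent of the weighting scheme is to pass the scheme in as a function. The sketch below shows that idea only; it is serial rather than MPI-based, and the field names `image_views` and `page_views`, the function names, and the sample numbers are hypothetical.

```python
import heapq


def by_image_views(stats):
    """Weight an image by how often the image itself was viewed."""
    return stats["image_views"]


def by_page_views(stats):
    """Weight an image by the views of the page it appears on."""
    return stats["page_views"]


def top_n(stats_by_image, n, weight):
    """Return the names of the n highest-weighted images, best first.

    stats_by_image maps an image name to a dict of statistics, and
    weight is any function that turns one such dict into a number.
    """
    return heapq.nlargest(
        n, stats_by_image, key=lambda name: weight(stats_by_image[name])
    )


# Made-up statistics for three images.
stats = {
    "A.jpg": {"image_views": 10, "page_views": 500},
    "B.png": {"image_views": 40, "page_views": 20},
    "C.gif": {"image_views": 25, "page_views": 90},
}
top_by_image = top_n(stats, 2, by_image_views)
top_by_page = top_n(stats, 2, by_page_views)
```

Swapping `by_image_views` for `by_page_views` changes the ranking without touching `top_n`, which is the separation the paragraph above aims for.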

My project is based on two main assumptions: 1) we will be using ImageMagick to resize the images, and 2) we are only interested in determining and resizing the top N weighted images for now (perhaps we will do something slightly different with the rest of the images). There are many images in Wikipedia that cannot be processed by ImageMagick, as we saw in HW5. Some of these images cannot be processed because they are not of a supported file type, and others because they are "corrupted". Because of these processing issues, I want to revise the goal from examining the "top N images" to examining the "top N processable images."

I will devise a manager/worker program that runs ImageMagick identify on the top N weighted images (identified from the MySQL database), stores the result of identify in the same MySQL database, and finds failure cases. There will be a two-way buffer system between the manager and workers, in which workers send information back to the manager whenever ImageMagick identify fails. I will try to classify and fix as many of these failures as possible; however, there will still be many images we cannot process.

As a result, the program will take an additional parameter, "L", the leeway parameter, which is the number of extra image names to keep track of. When identifying the N top weighted images, the program will actually identify the top N+L images, and then process these images in order until N images have been processed successfully. The value of L will be tuned so that it is large enough that the N+L image names almost always include at least N processable images, but not so large that we sort many more images than necessary. Other parameters that will be examined are the number of workers and the amount of information per block, for both the database-building and image-processing steps.
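The leeway idea can be sketched in a few lines. This is a serial illustration under stated assumptions, not the planned manager/worker program: `select_processable` and its arguments are hypothetical names, and the `is_processable` callback stands in for running ImageMagick identify on an image and checking whether it succeeded.

```python
def select_processable(ranked_names, n, leeway, is_processable):
    """Walk the top n + leeway names and stop after n successes.

    ranked_names must already be sorted by weight, best first.
    is_processable is a stand-in (assumption) for running ImageMagick
    identify on the image and reporting whether it succeeded.
    Returns (successes, failures); successes holds fewer than n names
    when the leeway was too small.
    """
    candidates = ranked_names[: n + leeway]
    successes, failures = [], []
    for name in candidates:
        if len(successes) == n:
            break  # enough good images, the remaining leeway is unused
        if is_processable(name):
            successes.append(name)
        else:
            failures.append(name)  # kept so failures can be classified later
    return successes, failures


def toy_check(name):
    """Pretend only these extensions are processable (made-up rule)."""
    return name.endswith((".jpg", ".png", ".gif"))


ranked = ["a.jpg", "b.xcf", "c.png", "d.bad", "e.gif", "f.jpg"]
enough = select_processable(ranked, 3, 2, toy_check)
too_small = select_processable(ranked, 3, 1, toy_check)
```

With N = 3 and L = 2 the walk finds three good images among the first five names. With L = 1 it runs out of candidates after two successes, which is the situation that tuning L is meant to make rare.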