CSC430 Elizabeth Do Thesis Proposal
- Department of Computer Science, Smith College
- Dominique Thiebaut
- Eitan Mendelowitz
The main focus of my thesis is research in the field of analytics: extracting patterns from large data sets and creating artistic visualizations of selected patterns in the data.
Analytics is an emerging discipline driven by the growing volume of data generated every day in many fields, including theoretical physics, economics, and social networks, of which Twitter is one of the most notable. Ours is the time of “big data,” and the ability to sift through vast amounts of electronic data, recognize patterns, and generate new information that highlights these patterns is becoming a worthwhile endeavor. Google, for example, has benefited from the use of analytics to improve the quality of its engines. My goal is to research, document, and generate methods for processing data sets in innovative ways.
I plan to explore statistical analysis and various visualization techniques for multi-dimensional data. The resulting visualizations will be animations driven by both aesthetic and scientific goals. I will apply these techniques to two different data sets. The first is the set of Wikipedia editions written in different languages, which I will compare in order to generate still or animated visualizations of the cultural differences present among Wikipedia contributors. The second is historical data on the student population and departments at Smith, which I hope to access with permission from the Smith College Admissions and Registrar’s offices. The purpose of analyzing this second data set is to explore variation in the Smith population. Hans Rosling’s Gapminder provides a possible starting point for my exploration.
I plan to learn and use several different tools during my thesis work. They include Hadoop, an open-source implementation of the MapReduce model that Google introduced for processing large data sets; R, a programming language and environment for statistical computing; MySQL databases; Processing, a programming language developed specifically for artists; and a mixture of image-processing software and programs, some of which I will need to develop. I also plan to consult with sociology professors in the Smith community for advice and guidance on cross-cultural analysis of Wikipedia content.
Documentation and progress on my thesis will be maintained on the Computer Science Wiki (http://cs.smith.edu/classwiki/index.php/CSC430_Page_(2010)).
Big Data: Technologies & Techniques for Large-Scale Data. Perf. Roger Magoulas. R2.oreilly.com. O'Reilly Radar, 22 Mar. 2009. Web. 13 Sept. 2010. http://www.youtube.com/watch?v=acimvXoKwhc&feature=player_embedded
DJ Patil on How Big Data Impacts Analytics. Perf. DJ Patil. O'Reilly Radar, 27 Apr. 2009. Web. 13 Sept. 2010. http://www.youtube.com/watch?v=dRrkgvr9V_s&feature=player_embedded
Gapminder, Unveiling the Beauty of Statistics for a Fact Based World View. Gapminder Foundation. Web. 13 Sept. 2010. http://www.gapminder.org/
"Introduction to Scientific Programming and Simulation Using R, Owen Jones, Book." Barnes & Noble.com. Web. 14 Sept. 2010. http://search.barnesandnoble.com/Introduction-to-Scientific-Programming-and-Simulation-Using-R
Reas, Casey, and Ben Fry. Processing: a Programming Handbook for Visual Designers and Artists. Cambridge, MA: MIT, 2007. Print.
White, Tom. Hadoop: the Definitive Guide. Sebastopol, CA: O'Reilly, 2009. Print.