CSC400 Thesis Proposal


Back To Christine Grascia's Thesis Page

3-D Traversing of Wikipedia: Proposal

D. Thiebaut, Thesis Adviser

The main purpose of this thesis is to continue Allie Bellew's recent independent study, and to research the current state of the art for displaying and navigating 3-D graphs representing information stored in databases, in particular Wikipedia.

Wikipedia is a free online encyclopedia which allows users to change or alter information under strict circumstances. It has grown exponentially in size over recent years, with eleven languages besides English containing over one hundred thousand articles each, and over 50 more languages containing over ten thousand articles each [3]. Because Wikipedia is so large and editable, its contents have been accused of lacking credibility and accuracy. The databases that store this information also contain important data such as the identity of the people who have edited pages, the IP address of the computer they used to carry out their edit, the exact extent of the edit, the date and time of each edit, and even the number of times pages have been edited or viewed.

This wealth of information is unfortunately not easily accessible, and few visualization tools appear to exist for mapping Wikipedia’s database. Moreover, most of these tools offer only basic viewing capabilities. The initial part of this thesis will be a survey of visualization tools that work on similar databases.

Ideally, we would like to be able to find visualization tools that make it easy to answer questions such as:

  1. How often has the page for Senator X been edited, and how do these edits correlate with the dates of major political events?
  2. Who are the ten most active contributors to Wikipedia Page Y, and what other pages have they most contributed to?
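Questions of this kind map naturally onto SQL aggregate queries. The following is only a minimal sketch, run through Python's built-in sqlite3 module so that it is self-contained. The `revision` table, its column names, and the sample rows are hypothetical simplifications for illustration and do not reproduce Wikipedia's actual database schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a simplified, hypothetical revision table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE revision ("
        " rev_id INTEGER PRIMARY KEY,"
        " page_title TEXT,"
        " contributor TEXT,"
        " rev_date TEXT)"
    )
    # Invented sample edits: (page, contributor, date of edit).
    rows = [
        ("Page_Y", "alice", "2008-01-03"),
        ("Page_Y", "alice", "2008-01-03"),
        ("Page_Y", "bob", "2008-01-04"),
        ("Page_Y", "alice", "2008-02-10"),
        ("Page_Z", "alice", "2008-02-11"),
        ("Page_Z", "carol", "2008-02-11"),
    ]
    conn.executemany(
        "INSERT INTO revision (page_title, contributor, rev_date) VALUES (?, ?, ?)",
        rows,
    )
    return conn


def edits_per_day(conn, title):
    """Question 1: number of edits to one page on each date, oldest first."""
    cur = conn.execute(
        "SELECT rev_date, COUNT(*) FROM revision"
        " WHERE page_title = ? GROUP BY rev_date ORDER BY rev_date",
        (title,),
    )
    return cur.fetchall()


def top_contributors(conn, title, limit=10):
    """Question 2, first half: the most active contributors to one page."""
    cur = conn.execute(
        "SELECT contributor, COUNT(*) AS n FROM revision"
        " WHERE page_title = ? GROUP BY contributor"
        " ORDER BY n DESC, contributor LIMIT ?",
        (title, limit),
    )
    return cur.fetchall()


def other_pages(conn, contributor, title):
    """Question 2, second half: other pages a contributor has edited, most edited first."""
    cur = conn.execute(
        "SELECT page_title, COUNT(*) AS n FROM revision"
        " WHERE contributor = ? AND page_title <> ?"
        " GROUP BY page_title ORDER BY n DESC, page_title",
        (contributor, title),
    )
    return cur.fetchall()
```

The per-day counts returned by `edits_per_day` could then be lined up against a separately compiled list of event dates; that correlation step is not shown here.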

In order to understand correlations, trends, or other information that Wikipedia offers, we must first understand how information is coded and organized in Wikipedia. Unfortunately, Wikipedia’s database is not easily legible, with “categories having complex names reflecting human classification and organizing instances” [2].

One important goal of the initial research is to understand what these categories mean, what information they contain, and how they are linked with each other. The primary keys that distinguish the database tables will need to be identified and checked to see whether they are linked with any other keys.
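As a rough illustration of what identifying keys and their links could look like in practice, the sketch below lists each table's primary key and the columns that reference other tables. It assumes a SQLite copy of the data and uses SQLite's `PRAGMA table_info` and `PRAGMA foreign_key_list` statements; a different database engine would need its own catalog queries. The two-table schema in the usage example is hypothetical and far simpler than Wikipedia's.

```python
import sqlite3


def describe_keys(conn):
    """Map each table name to its primary-key columns and its declared references."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    result = {}
    for table in tables:
        # table_info rows: (cid, name, type, notnull, dflt_value, pk);
        # pk is 0 for non-key columns, otherwise the 1-based position in the key.
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        key_rows = sorted((r for r in info if r[5] > 0), key=lambda r: r[5])
        primary_key = [r[1] for r in key_rows]
        # foreign_key_list rows: (id, seq, table, from, to, on_update, on_delete, match)
        fks = conn.execute(f'PRAGMA foreign_key_list("{table}")').fetchall()
        references = [(r[3], r[2], r[4]) for r in fks]
        result[table] = {"primary_key": primary_key, "references": references}
    return result


def build_demo_schema():
    """Hypothetical two-table schema used only to exercise describe_keys."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE page (
            page_id INTEGER PRIMARY KEY,
            page_title TEXT
        );
        CREATE TABLE revision (
            rev_id INTEGER PRIMARY KEY,
            rev_page INTEGER REFERENCES page(page_id),
            rev_user_text TEXT
        );
        """
    )
    return conn
```

This only finds links that are declared in the schema. Links that exist by naming convention alone would still have to be identified by reading the tables and their documentation.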

We will create a methodology to decode these categories and use Nastase and Strube's “Decoding Wikipedia Categories for Knowledge Acquisition” [2] as a map to explore the database in more depth.

Furthermore, we plan to research various programming languages for this purpose, and to create or adapt a Web-based visualization tool that will allow a user to explore, in a graphical environment, the connections existing between Wikipedia pages, Wikipedia concepts, and Wikipedia contributors.

By monitoring the information stored in Wikipedia’s database, we can further analyze the validity of articles based on what has been changed and by whom. This is very helpful in tracking down falsified or altered information, as in the case of an Electronic Arts Redwood City employee who was caught removing several references to Trip Hawkins’ legacy at Electronic Arts (EA) from its Wikipedia page [1].

The various aspects of the visualization will be selected based on our interests and on the types of connections and relationships we feel yield important information. We currently plan on using the Processing language. The display will show a topic of choice in the center of the Processing graph, with contributors depicted as nodes surrounding the topic. The amount each contributor has contributed will also be represented visually. The graph will be explored interactively, by clicking on pages or contributors and following a path through the many connections that exist.
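One possible way to compute such a display is sketched below: contributors are spaced evenly on a circle around the topic, and each node's size grows with that contributor's edit count. This is an illustrative assumption about the layout, not a settled design. It is written in Python rather than Processing only to keep the example easy to run, and the function name, parameters, and default sizes are all hypothetical.

```python
import math


def radial_layout(contributions, center=(0.0, 0.0), radius=200.0,
                  min_size=5.0, max_size=40.0):
    """Place contributors on a circle around the topic node.

    contributions maps a contributor name to an edit count. The most active
    contributor comes first and gets the largest node; sizes scale linearly
    with edit count between min_size and max_size.
    """
    if not contributions:
        return []
    # Fall back to 1 so that all-zero counts do not cause a division by zero.
    largest = max(contributions.values()) or 1
    # Sort by descending edit count, breaking ties alphabetically.
    names = sorted(contributions, key=lambda n: (-contributions[n], n))
    nodes = []
    for i, name in enumerate(names):
        angle = 2 * math.pi * i / len(names)
        x = center[0] + radius * math.cos(angle)
        y = center[1] + radius * math.sin(angle)
        size = min_size + (max_size - min_size) * contributions[name] / largest
        nodes.append({"name": name, "x": x, "y": y, "size": size})
    return nodes
```

A drawing loop would then render the topic at `center` and one circle per returned node. Clicking a node would make that contributor or page the new center, and the layout would be recomputed from the corresponding query results.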

A large portion of the time will be devoted to writing this visualization software. Database retrieval will be done using SQL.