CG Weekly Tasks


Back to Christine Grascia's page.



Weekly Tasks

Sept 24

Tasks

  • Sept 24: To do for next week --Thiebaut 19:35, 25 September 2008 (UTC)
    • Go to Mediawiki web site, locate wikipedia visualizers and research projects based on wikipedia that use visualization tools
    • Report them on the wiki
    • Explore Visio
    • Start organizing the resources on your wiki page (this set of pages)
    • Add images for each software tool found. Add a description for each one as well (which can be simply copied/pasted from the site associated with the tool).
    • Explore Prefuse
    • Locate the Web site of Tamara Munzner, and look at the different tools developed by her group
    • Locate the Google video of the talk given by Munzner at Google. Add the link to your resource section and summarize the 1-hour talk.
    • Install Ubuntu on JF's PC (partition first, and keep Windows on 1 partition)
      • The different pieces you will need on your Ubuntu server are:
        1. Apache
        2. MySQL
        3. PHP 5
        4. MediaWiki (http://download.wikimedia.org/mediawiki/1.13/mediawiki-1.13.1.tar.gz)
      • Make sure you turn the firewall on, as well as the sshd daemon.
      • We'll want to set the firewall to let only requests from 131.229.xxx.xxx go through. This will protect the server from outside attacks.
      • Keep track of the passwords and accounts you create on the server. You need two accounts, basically: one for the root, and one for you, as regular user.


Report: Week 2, More details on visualization tools and articles

The MediaWiki site, although MediaWiki was originally written for Wikipedia, does not seem to contain any articles on Wikipedia visualizations, or at least I couldn't find any. Fortunately, Google always does the trick.

I found a wonderful site, and the site name is just brilliant! It's called “The Best Tools for Visualization”. It can't get any better than this! This site contains a lot of visualizers for a bunch of online tools such as YouTube and Digg. It also has visualizers for your computer desktop, gaming, widgets, etc.

This is an article I encountered from within the link above. It's a great article called “Visualizing the 'Power Struggle' in Wikipedia”. The actual visualization is a very large, high-resolution map that is also printable, although it does print out rather large. The creators, Todd Holloway and Bruce Herr, describe their large and confusing, yet accurate, data map. Each node is its own article, and only the most popular (in terms of most viewed and edited) are either titled or given an image. Nodes are displayed in yellow, and the visualization itself contains some 650,000 nodes (articles!).

To the left of this article is a visualization column with related articles.

To get me started on what specifically I want to program, and how to do it, this is a good article to start me off:

It explains why visualizing data is a difficult process. One really needs to have a set plan for encoding data. For example, you can encode the relationship between a certain datum and the total number of times it has been viewed by, let’s say, how large the node is. So the larger the node, the more it has been viewed. But you may not necessarily be able to use that same node size to also show how frequently it is viewed per day (or hour, minute, year, etc.). So you start running into all sorts of constraints, such as what type of data you would most likely want displayed, and how specifically you want it displayed.

This is why creating code in a dynamic way (something I love doing!) that allows for user-friendly interaction is a good way of going about things. The best way to make something dynamic is to allow the user to pick from choices or even input what they want displayed (like a search engine), except that the user doesn't have to do all the hard coding. The queries are generated by inserting the user's choices into the appropriate places of the SQL (or whichever language is preferred), and out comes a visualization for the user.
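
The idea of generating a query from the user's choices could be sketched roughly as follows. This is only an illustration in Python: the table name `pages` and the field names in `ALLOWED_SIZE_FIELDS` are hypothetical, and the `?` placeholder follows the sqlite3 convention (MySQL drivers typically use `%s` instead).

```python
# Hypothetical fields that the user may map to node size.
ALLOWED_SIZE_FIELDS = {"total_views", "views_per_day", "edit_count"}


def build_query(size_field, min_value):
    """Build a parameterized query from the user's display choices."""
    if size_field not in ALLOWED_SIZE_FIELDS:
        raise ValueError("unsupported field: %s" % size_field)
    # Column names cannot be bound as parameters, so they are checked against
    # a whitelist; the user-supplied value is passed as a bound parameter.
    sql = "SELECT title, %s AS node_size FROM pages WHERE %s >= ?" % (
        size_field,
        size_field,
    )
    return sql, (min_value,)
```

The user never types SQL: a menu choice such as "size nodes by total views" becomes `build_query("total_views", 100)`, and the resulting pair can be handed to a database cursor.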

This article discusses open source tools that could be used to produce visualizations, such as Processing, Flash ActionScript, and Prefuse.


“15 Views of a Node Link Graph: An Information Visualization Portfolio”

This video is of a talk at Google given by Tamara Munzner, who graduated from Stanford. Munzner goes into some detail about design spaces and about visualizing data in particular ways, such as grouping, coloring, and sizing. Drawing the ‘pictures’ or data displays in specific ways can help the user find data easily. Unfortunately, the more nodes you work with, the harder the display is to read, or even to draw. So minimizing and maximizing are necessary in most cases: for example, minimize relationships, linking, and the space used, and maximize resolution (the ability of a user to see the data and distinguish it). She explains that an easy way to avoid confusion is to develop a sort of multi-level visualization.

Another problem that arises is that data may be dynamic, meaning it is constantly changing. So it is necessary to keep the display as similar as possible across every change in the data. Moving things around in polar coordinates rather than rectangular coordinates makes the graph easier to see and interpret. Munzner mentions that it is also better to have the visualization transition smoothly as the data changes; however, this becomes more limited as the amount of visible data grows.

A traditional metric that Munzner mentions is the number of crossovers, which should be kept as low as possible. The reason is ambiguity: crossovers can misrepresent data relationships. Munzner goes into some detail about treemaps, and about manipulating shading to avoid color overuse. Treemaps are good at showing topological structure, but they make it more difficult to determine data relationships (how the items link). Another interesting topic is displaying the data in multiple views, much as Adobe Photoshop has a main viewing area and a smaller display which shows the image in its entirety. This leads into SpaceTree, a neat piece of software which allows a datum to open up once clicked, showing a second layer of data directly linked to it.

Munzner presents many visualization tools and explains the strengths and weaknesses of each. By knowing the good and the bad of each, you can pick and choose which is best suited to your data.


Prefuseexample.gif

DataMountain Demo by Jeffrey Heer.

Prefuse is an information visualization toolkit that has many different ways of displaying data. It is written in the Java programming language and can be used in many types of applications (applets, widgets, standalone graphs, etc.). The nice thing about Prefuse is that it supports panning and zooming, making large, complicated data more legible. It also works with SQL for writing queries.


Microsoft Visio 2007

MicrosoftVisio.jpg

Microsoft Visio 2007 is a neat tool for turning complex IT-related information into easy, legible diagrams. Fortunately I own the product, so I could possibly use this if needed.

Oct. 2

Tasks

  • Oct. 2: To do for next week --Thiebaut 18:03, 2 October 2008 (UTC)
    • It looks like prefuse and visio have potential for us, although the demos I have seen of visio have not struck me as a possible way to do what we want. You need to take a look at all the other visualizers listed in the various links we have so far and do a crude binary selection so that we have a short list of candidates:
      • The visualizers that are not good for what we want to do
      • Those that may be possible candidates
    • Create a table of the visualizers you have found (include prefuse and visio), and list their features:
      • Name
      • company
      • URL
      • open source or not
      • programming language
      • interface with mysql
      • limitations
      • can be integrated into Web applications
      • whatever else you feel is relevant
    • Create a table of the visualizers that were rejected and the reason why

Report: Week 3, Deciding on which visualizer to stick with, gathering more information on them

Name Company URL Open Source or not Programming Language Interface with MySql Limitations Can be integrated into Web Other Information Rejected? Why











Visio 2007
  • Company: Microsoft
  • URL:
    • 3D Visioner: http://www.shareit.com/product.html?productid=212962&languageid=1&affiliateid=74685
    • Microsoft homepage: http://office.microsoft.com/en-us/visio/default.aspx
  • Open Source or not: No
  • Programming Language: .NET Framework
  • Interface with MySql: Yes
  • Limitations: Visio has a visualization limitation: it does not allow 3D unless paired with 3D Visioner, which is also non-free software.
  • Can be integrated into Web: Yes
  • Other Information: I do like .NET and its ease at communicating with the SQL language, but it's too expensive!
  • Rejected? Yes
  • Why: Although we could get a license for Visio funded by the college, we probably wouldn't be able to get funding for 3D Visioner, which is what we actually need.

Dang
  • Company: Developed by Kinemage, a group from the Biochemistry department at Duke
  • URL: Dang: http://kinemage.biochem.duke.edu/software/dang.php
  • Open Source or not: Yes
  • Programming Language: C, C++
  • Interface with MySql: Does not seem to do so
  • Limitations: Doesn’t take input from a database, but can be used for other reasons; see Other Information.
  • Can be integrated into Web: Not sure
  • Other Information: Is mostly used as a geometric measurement tool for finding angles of a 3D shape (such as molecules). Runs on Linux.
  • Rejected? No
  • Why: I think this can be looked into for programming purposes only. It probably has some dynamic way of figuring out how to equally space out relationships from a common node in a 3D environment.

Mage
  • Company: Developed by Kinemage, a group from the Biochemistry department at Duke
  • URL: Mage: http://kinemage.biochem.duke.edu/software/mage.php
  • Open Source or not: Yes
  • Programming Language: C, C++
  • Interface with MySql: Possibly
  • Limitations: Although it comes pre-compiled, it needs to be run through a terminal. It also requires an emulator, the Xterm emulator, to handle the graphic display.
  • Can be integrated into Web: Not sure
  • Other Information: Shows 3D displays and 3D relationships between data "in an interactive environment which facilitates both open-ended exploration and structured presentation." Runs on Linux.
  • Rejected? No
  • Why: Even though it is written in a language that requires compiling every time, it’s a good starter program to grab code out of.

JavaMage
  • Company: Developed by Kinemage, a group from the Biochemistry department at Duke
  • URL: JavaMage: http://kinemage.biochem.duke.edu/software/javamage.php
  • Open Source or not: Yes
  • Programming Language: Java
  • Interface with MySql: Possibly
  • Limitations: Graphics aren't as great, and very little detail is shown.
  • Can be integrated into Web: Yes
  • Other Information: Displays graphics resembling database structures. Shows links and nodes, and is in 3D, although with really bad graphics.
  • Rejected? Yes / No
  • Why: Yes if we decide not to use Java as the language to code in. No if we need Java resources.

KiNG
  • Company: Developed by Kinemage, a group from the Biochemistry department at Duke
  • URL: KiNG: http://kinemage.biochem.duke.edu/software/king.php
  • Open Source or not: Yes
  • Programming Language: Java
  • Interface with MySql: Possibly
  • Limitations: Again, no database support is involved; it is simply for viewing molecular structures.
  • Can be integrated into Web: Yes
  • Other Information: The program is written in Java, so it can create a web applet. It shows a better 3D image than JavaMage and a bit more detail.
  • Rejected? Yes / No
  • Why: Yes if we decide not to use Java as the language to code in. No if we need Java resources.

3D Data Visualizer 1.0.3
  • Company: Optunis
  • URL: 3D Data Visualizer 1.0.3: http://www.versiontracker.com/dyn/moreinfo/macosx/30720#screenshots
  • Open Source or not: Freeware
  • Programming Language: OpenGL
  • Interface with MySql: Only mentions table-based data, so possibly
  • Limitations: Some bugs that people have mentioned: does not save, doesn’t have the capability to restore, no background color specification.
  • Can be integrated into Web: Not sure
  • Other Information: This visualizer creates 3D lines, surfaces, and scatter plots from table-based data.
  • Rejected? Yes
  • Why: It's freeware only, not open source. Too many bugs; there is a Pro (pay) version though. Doesn't seem like it is capable of doing the things we are looking for.

Prefuse
  • Company: Sourceforge.net
  • URL: Prefuse: http://prefuse.org/
  • Open Source or not: Yes
  • Programming Language: Java
  • Interface with MySql: Yes
  • Limitations: Sometimes the information can be too cluttered; there isn't the sort of base we are looking for. Limited options for the user to break down data and to view it.
  • Can be integrated into Web: Yes
  • Other Information: Many interesting ways of displaying data. Good graphics.
  • Rejected? No
  • Why: This is a good starting point. There are plenty of starting options as to how data can be displayed.

last.forward
  • Company: Last.fm
  • URL: Last.fm: http://build.last.fm/item/42
  • Open Source or not: Yes
  • Programming Language: Java
  • Interface with MySql: Possibly
  • Limitations: I'm assuming that since the site itself is in German, the documentation is probably also in German. The source is probably limited to social networks, galleries, and desktop applications.
  • Can be integrated into Web: Yes
  • Other Information: Great graphics! Open source, downloadable, and probably easy to alter the code to take in other data.
  • Rejected? No
  • Why: Just like Prefuse, this can also be useful as a starting point.

Digg Radar
  • Company: Brian Shaler
  • URL: Digg.com: http://brian.shaler.name/digg/radar/
  • Open Source or not: Not sure
  • Programming Language: Flash
  • Interface with MySql: Most likely
  • Limitations: Very difficult to read, too much information on the screen. The visualizer is not moveable.
  • Can be integrated into Web: Yes
  • Other Information: Good idea. Works in real time! When mousing over the small "diggs" you can see the person that has just "dugg" a story.
  • Rejected? No
  • Why: If we plan on doing a project that also takes in information in real time, this could be a good help. Not sure how easy it is to get the actual source though.


Digg HeatMap
  • Company: Brian Shaler
  • URL: Digg.com: http://brian.shaler.name/digg/heatmap/
  • Open Source or not: Yes
  • Programming Language: Flash
  • Interface with MySql: Most likely
  • Limitations: Difficult to read, not much information being shown. Too cluttered.
  • Can be integrated into Web: Yes
  • Other Information: Another nice idea. This includes a search function which pinpoints where on the map the username you chose appears.
  • Rejected? Yes / No
  • Why: This has a good search function that we could use, but a search may be easy to do with SQL. So it may just be a backup.

Opte
  • Company: The Opte Project
  • URL:
    • The Opte Project: http://www.opte.org/
    • A really nice preview: http://opte.org/maps/mpeg/movie4.mpeg
  • Open Source or not: Yes
  • Programming Language: PHP, Perl, C++
  • Interface with MySql: Yes
  • Limitations: The code is still in development, and so it's buggy. Graphics aren't that great either.
  • Can be integrated into Web: Not sure
  • Other Information: Great visualizer. Allows you to zoom in from what looks like a ball of clumped elastic bands. A closer look gives more detailed information.
  • Rejected? No
  • Why: This is neat in allowing one to zoom in and out. It's essentially what we are looking for, except we want the user to choose their data display more freely.

AND FINALLY.... A Photoshop mock-up of the visualizer I had in mind after discussing some possibilities with Professor Thiebaut:

Examplewikivisualizer.jpg

Oct 9. '08

--Thiebaut 20:35, 9 October 2008 (UTC)

  • Great table!
  • Looking at the different visualizers, I found this site, which is great and that we should investigate more (It's based on Prefuse): http://www.intsysr.com/nearword.htm . Type in a word and search for it. Double click on other words and see the graph change.
    Great movie of Prefuse features at http://prefuse.org/media/prefuse.wmv
    • Study prefuse deeper. In particular see how to access the code and edit it.
  • Server installed: (wikithesis.xxxxxx.xxx)
  • Things to do on the server:
    • initialize ddclient
    • find a way to start a GUI software installer (besides using apt-get)
    • install emacs
    • see if mysql is installed.
    • setup mysql root password as same as server.
    • make mysql start automatically when server starts (service)
    • see if apache installed. If not install it. Make it start automatically with server.

--Thiebaut 14:53, 15 October 2008 (UTC)

  • Some tiny problems to fix in near future:
    • the apache server does not serve php pages: need to edit the config file for apache to allow php
    • I cannot connect to the mysql server from my PC: we very likely need to open port 3306 in the firewall.

Oct 16 '08

Latex

  • Install LaTeX (MiKTeX) and WinEdt on your Windows machine following the steps listed here

Paper

  • Start reading this paper:
Cognitive costs of zooming versus using multiple windows, M. Plumlee and C. Ware, ACM Transactions on Applied Perception. 13(2) 1-31. 2006.
  • Do not get lost in the model. It's a description of the math used to measure key parameters. The more important information for us is in the conclusions, the heuristics, and, to some degree, the way the experiments were conducted.
  • Summarize what methods of displaying information are covered, what their different advantages and disadvantages are.

Oct 23 '08

--Thiebaut 13:55, 23 October 2008 (UTC)

1. The table of visualizers above has replicated information. Please remove the duplicates. You may want to add GraphViz to the table, even though we won't use it. It is a graph visualizer that displays static images (in SVG format, for example), where each node or vertex can be associated with a hypertext link. It would require more work on the computer part than something like prefuse, but it should be in the table for completeness.

2. Prefuse seems to be the way to go for us for now.

3. Perform a library search on any research papers talking about wikipedia and visualization of its data. Create a list of the abstracts of the most interesting ones and post them here (in a separate wiki page).

4. Start playing with Latex!

5. Remove anonymous login from mysql server (I added an account for me after our meeting).

Oct 30 '08

--Thiebaut 19:53, 30 October 2008 (UTC)

  • Figure out which file on the wikipedia download site is the one we need/want. Don't hesitate to look at wikipedias in other languages, too, because they may have more information about their contents than the English version.
  • Figure out what size the 147GB file will unzip to
  • Find out how to write a python program that reads a bzip2 file without unzipping it to disk first. We also want to make sure that the unzipper library won't create too large a footprint in the memory (and virtual memory).

    Python has a library called bz2 that should handle this. It should have been automatically installed along with Python on your server. If Python is not installed there, make sure you install it!

    You will find examples of how to use the bz2 module in python here.
  • Find out the format of the XML dumps of the wikipedia databases.
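
As a minimal sketch of the bz2 approach described above, the following reads a compressed file line by line without unzipping it to disk. It assumes Python 3 (where `bz2.open` is available); the helper names `iter_lines` and `count_pages` are illustrative only, and counting `<page>` lines is just a stand-in for real processing.

```python
import bz2


def iter_lines(path):
    """Yield decoded lines from a bzip2 file, decompressing as we go."""
    # bz2.open streams the data: only a small buffer is held in memory,
    # and nothing is written back to disk.
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_pages(path):
    """Count lines consisting of a <page> opening tag, one line at a time."""
    return sum(1 for line in iter_lines(path) if line.strip() == "<page>")
```

Because `iter_lines` is a generator, the memory footprint should stay roughly constant regardless of how large the uncompressed file is.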

Note added by --Thiebaut 20:31, 6 November 2008 (UTC)

Some thoughts about the limitations we are facing because of the size of the Wikipedia file.

  • We need to first work on a small scale. No need for millions of records in a database while you are setting up the visualizer.
  • All you need to do is to figure out a way to store a few thousand pages in a database, along with their contributors, and whatever else we want. For this you have two options:
    • create an organization of the data yourself
    • adopt the organization used by mediawiki
The first one is easier, but needs you to filter the XML dump of wikipedia to create the database
The second one is harder, but can run "live" on top of wikipedia
  • You may even create a "fake" list of pages, with a fake list of contributors first, so that you can debug the visualizer on a small scale.
  • Then, as you make progress on a small scale, we can run programs to extract more significant amounts of information from the xml file and store that in MySQL.
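
The "fake" small-scale database suggested above might look something like this. It is only a sketch: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the table names, column names, and the `build_toy_db` helper are all assumptions made for illustration.

```python
import random
import sqlite3


def build_toy_db(num_pages=20, num_contributors=8, seed=0):
    """Create an in-memory database of fake pages and fake contributors."""
    rng = random.Random(seed)
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE contributors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE page_contributors (
            page_id INTEGER, contributor_id INTEGER,
            PRIMARY KEY (page_id, contributor_id));
        """
    )
    db.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        [(i, "FakePage%d" % i) for i in range(1, num_pages + 1)],
    )
    db.executemany(
        "INSERT INTO contributors VALUES (?, ?)",
        [(i, "fake_user%d" % i) for i in range(1, num_contributors + 1)],
    )
    most = min(3, num_contributors)
    for page_id in range(1, num_pages + 1):
        # Each fake page gets between 1 and 3 distinct fake contributors.
        chosen = rng.sample(range(1, num_contributors + 1), rng.randint(1, most))
        db.executemany(
            "INSERT INTO page_contributors VALUES (?, ?)",
            [(page_id, c) for c in chosen],
        )
    db.commit()
    return db
```

A fixed seed keeps the fake data identical between runs, which makes it easier to debug the visualizer against a known graph.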

Note added by --Thiebaut 15:16, 12 November 2008 (UTC)
(copies of email messages)

SkyRails.jpg
  • Papers read
    • A data model and architecture for hypermedia database visualization, Robert Steven Owor, Web3D '02: Proceedings of the seventh international conference on 3D Web technology, 2002. (a bit old)
    • Information visualization, Tiziana Catarci (Università di Roma, Italy), Isabel F. Cruz (Tufts University), SIGMOD 1996
      (too old for us)
    • 3D geographic network displays, Kenneth C. Cox Bell Laboratories, Stephen G. Eick Bell Laboratories, Taosong He Bell Laboratories, ACM SIGMOD, 1996 (Too old for us)
  • I think we need to check SkyRails closer. I don't want to take you away from what you are doing, but I think at some point, in the month to come, we should figure out if it can be useful for us.
  • Books/Articles to get from the library
    • Applied Security Visualization. (Paperback) by Raffael Marty (Author) Amazon link
    • Security Data Visualization: Graphical Techniques for Network Analysis. [ILLUSTRATED] (Paperback) by Greg Conti (Author) Amazon link
    • Information Visualization. Palgrave/Macmillan link

Nov 13 '08

  • Status Report
    • I got a Latex version of the first chapter
    • We debugged a python program that reads pages from a compressed wikipedia file and stores their title in the database
  • To do for this week
    • Look at the notes I have added above and keep on top of the information in them
    • Continue developing the python program
    • Think of a way for the python program to store the list of IDs of the pages that are linked to a given page, rather than a list of their names. The program will have to have two phases, or better yet, you can write 2 different programs, one that generates the raw data, one that filters it.
    • Do a library search on the latest articles (2007-2008) (use the ACM Portal, for example) written by Munzner, or that have Munzner in one of the listed references.
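
The two-phase idea for storing linked page IDs rather than names could be sketched as below. One plausible reason for needing two phases is that a page may link to a page that appears later in the file, whose ID is not yet known on the first pass. The input format (tuples of id, title, linked titles) and both function names are assumptions for illustration.

```python
def collect_raw(pages):
    """Phase 1: record each page's id, title, and linked titles as found."""
    return [(page_id, title, list(links)) for page_id, title, links in pages]


def resolve_links(raw):
    """Phase 2: replace linked titles with page ids, dropping unknown titles."""
    title_to_id = {title: page_id for page_id, title, _ in raw}
    return {
        page_id: [title_to_id[t] for t in links if t in title_to_id]
        for page_id, _, links in raw
    }
```

In practice the two phases could live in two separate programs, with the raw data written to a file or table in between.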

Jan 26 '09

  • DT will read new chapters
  • CG will play with prefuse and will
    • install on PC
    • investigate how to set it up for public Web access
    • investigate several options for displaying star-shaped graphs that allow
      • display of short text associated with each node
      • display of longer text in side window, or pop-up window
    • investigate type of coding required for special features.
    • contact 2nd reader for thesis

Feb 2 '09

  • CG will
    • play with Prefuse and take a simple example java code, and "import" it in the prefuse package
    • Make example display simple graph
    • figure out if actions can be defined, and how (click on mouse, make node display text)
    • click on window brings in a new graph
    • display square and round nodes

Update, 2/3/09 11:00 p.m.

  • DT played with the server a bit and was able to get prefuse to work.
  • Changed the Java JDK that eclipse uses to compile the Java program. DT installed java-6-openjdk, and made it the default:


sudo update-java-alternatives -s java-6-openjdk


  • also set the JAVA_HOME environment variable in CG's .bashrc file, as well as in /etc/bash.bashrc, to be:


export JAVA_HOME=/usr/lib/jvm/java-6-openjdk


  • Finally, DT modified the Eclipse icon that is on your desktop so that the launcher command is now:


/usr/bin/eclipse -vm /usr/lib/jvm/java-6-openjdk/bin/java


which forces eclipse to use the new java jdk to do its compiling.

To Do for 2/23/09

Good demo today.

Continue working on the "toy" tables in the database, and make the program behave as follows

  • User clicks on a node and program executes 1 of 2 queries
    • get contributors of the page that was clicked on
    • get pages edited by contributor that was clicked on

This will allow a quick exploration of pages
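
The click behaviour above amounts to choosing one of two queries based on the kind of node clicked. A rough sketch of that dispatch follows, in Python with sqlite3 standing in for the Java/MySQL setup; the table layout, the sample rows in `make_demo_db`, and the function names are hypothetical.

```python
import sqlite3

QUERIES = {
    # Clicked a page: fetch the contributors who edited it.
    "page": """SELECT c.id, c.name FROM contributors c
               JOIN page_contributors pc ON pc.contributor_id = c.id
               WHERE pc.page_id = ? ORDER BY c.id""",
    # Clicked a contributor: fetch the pages they edited.
    "contributor": """SELECT p.id, p.title FROM pages p
                      JOIN page_contributors pc ON pc.page_id = p.id
                      WHERE pc.contributor_id = ? ORDER BY p.id""",
}


def neighbors_of(db, node_type, node_id):
    """Return the (id, label) rows that should surround the clicked node."""
    return db.execute(QUERIES[node_type], (node_id,)).fetchall()


def make_demo_db():
    """Tiny made-up data set: two pages, two contributors."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT);
        CREATE TABLE contributors (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE page_contributors (page_id INTEGER, contributor_id INTEGER);
        INSERT INTO pages VALUES (1, 'Alpha'), (2, 'Beta');
        INSERT INTO contributors VALUES (10, 'ann'), (11, 'bob');
        INSERT INTO page_contributors VALUES (1, 10), (1, 11), (2, 10);
    """)
    return db
```

Each result row becomes a neighbour node of the opposite type, so alternating clicks walk from page to contributor to page.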

  • Make the font scale with the graph when zooming in or out
  • Look into coloring nodes depending on
    • # of contributors a page has (red for large number, pale blue for small number)
    • # of pages edited by a given contributor
    • color differently depending on whether the page was last edited this past week or not
  • Modify the toy table for pages, and make the list a list of Ids, rather than a list of names.
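
One simple way to get the "red for large number, pale blue for small number" coloring is a linear blend between two RGB values. The sketch below is in Python for brevity; the exact RGB triples and the choice of a linear scale are assumptions, and the resulting triple would still need to be converted to whatever color type the toolkit expects.

```python
PALE_BLUE = (173, 216, 230)
RED = (255, 0, 0)


def blend(low, high, fraction):
    """Linearly interpolate between two RGB triples."""
    return tuple(round(a + (b - a) * fraction) for a, b in zip(low, high))


def node_color(count, max_count):
    """Map a count onto a pale-blue-to-red scale."""
    if max_count <= 0:
        return PALE_BLUE
    # Clamp so that counts above max_count still map to pure red.
    fraction = min(max(count / max_count, 0.0), 1.0)
    return blend(PALE_BLUE, RED, fraction)
```

The same function works for both cases in the list: number of contributors per page, or number of pages per contributor. If a few nodes have very large counts, a logarithmic scale may spread the colors more evenly.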

To Do for 3/2/09

Another good week of progress!

For next week:

  • Add code that allows the MySql query time to finish before responding to mouse events. It might be that having a boolean set to false before the query executes, and reset to true as soon as the graph is recreated is sufficient. Then mouse events can check this boolean, and not act on the event if it is false. The information in the prefuse documentation at http://prefuse.org/doc/manual/data/db/ might also contain the solution for our problem.
  • Set the colors so that they are associated with users and pages and do not change.
  • Figure out a way to display more information on the page when the user right-clicks or hovers over a node. If it's a user, then we could display all the information known; if it's a page, we could display something like the date of last revision, the total number of contributors, the number of links to other pages... What is important right now is to be able to display something taken from the page table in the database.
  • Reorganize the database so that it is indexed by Ids that are integers.
    • The contributor table should contain records of the form { Id, name, Ip, email }, the page table should contain records of the form { Id, title, [list of contributor Ids], misc. info}.
    • Figure out how to nest mysql queries to get the information required. For example, a query of this type would work fine, given the Id (say 20) of a page:
   select * from contributors where Id in ( select contributorIdList from pages where Id = 20);
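
A runnable variant of the nested query is sketched below, using sqlite3 in place of MySQL. Note one assumption that differs from the record layout above: `IN (subquery)` matches against one Id per row, so this sketch stores the page-to-contributor relationship in a separate, hypothetical link table (`pageContributors`) rather than as a list packed into a single field of the pages table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE contributors (Id INTEGER PRIMARY KEY, name TEXT, Ip TEXT, email TEXT);
    CREATE TABLE pages (Id INTEGER PRIMARY KEY, title TEXT);
    -- Hypothetical link table: one row per (page, contributor) pair.
    CREATE TABLE pageContributors (pageId INTEGER, contributorId INTEGER);
    INSERT INTO contributors VALUES (1, 'ann', NULL, NULL), (2, 'bob', NULL, NULL),
                                    (3, 'cat', NULL, NULL);
    INSERT INTO pages VALUES (20, 'Alpha'), (21, 'Beta');
    INSERT INTO pageContributors VALUES (20, 1), (20, 3), (21, 2);
""")


def contributors_of(page_id):
    """Nested query: the inner SELECT yields Ids, the outer fetches the rows."""
    rows = db.execute(
        """SELECT Id, name FROM contributors
           WHERE Id IN (SELECT contributorId FROM pageContributors WHERE pageId = ?)
           ORDER BY Id""",
        (page_id,),
    )
    return rows.fetchall()
```

With the sample rows above, `contributors_of(20)` returns the two contributors linked to page 20.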

To Do for 3/9/09

  • The main discovery of the meeting is that the application in its form demonstrated on 3/3/09 has a memory leak. As we click on different nodes, the memory is increasing. This seems to indicate that the graph data structures, possibly the visual graph as well, and maybe some visualization-related data are not destroyed properly when a new graph is created. ==> investigate this leak, and see if having a member variable for the graph and the visual graph help in fixing the leak.
  • Download a new copy of the english wikipedia with history to wikithesis.dyndns.org. The 1TB disk on your server is accessible as /media/wikidata. You can store and expand the wiki with history there (create a new directory, say enwiki).
  • GUI things to work on
    • the color should be applied to the nodes themselves, not to the cluster around them
    • see if you can make a pop-up toolbox appear when mouse is over a node, or over an edge. The text displayed should be something we add to the graph table when we read the graph from the database.
    • start organizing your display so that there's a panel on the right-hand side where we can later display information we have about the nodes that are clicked on.

To Do for 3/23/09

  • Start parsing the xml enwiki file in /media/wikidata/wikipedia_dumps/. This file was downloaded as follows:
cd
cd /media/wikidata/wikipedia_dumps
wget http://download.wikimedia.org/enwiki/20081008/enwiki-20081008-stub-meta-history.xml.gz
ls -l
-rw-rw-r-- 1 thiebaut thiebaut  9047045780 2009-03-08 22:28 enwiki-20081008-stub-meta-history.xml.gz
  • The file in XML starts as follows:
<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.3/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.mediawiki.org/xml/export-0.3/ http://www.mediawiki.org/xml/export-0.3.xsd" version="0.3" xml:lang="en">
  <siteinfo>
    <sitename>Wikipedia</sitename>
    <base>http://en.wikipedia.org/wiki/Main_Page</base>
    <generator>MediaWiki 1.14alpha</generator>
    <case>first-letter</case>
      <namespaces>
      <namespace key="-2">Media</namespace>
      <namespace key="-1">Special</namespace>
      <namespace key="0" />
      <namespace key="1">Talk</namespace>
      <namespace key="2">User</namespace>
      <namespace key="3">User talk</namespace>
      <namespace key="4">Wikipedia</namespace>
      <namespace key="5">Wikipedia talk</namespace>
      <namespace key="6">Image</namespace>
      <namespace key="7">Image talk</namespace>
      <namespace key="8">MediaWiki</namespace>
      <namespace key="9">MediaWiki talk</namespace>
      <namespace key="10">Template</namespace>
      <namespace key="11">Template talk</namespace>
      <namespace key="12">Help</namespace>
      <namespace key="13">Help talk</namespace>
      <namespace key="14">Category</namespace>
      <namespace key="15">Category talk</namespace>
      <namespace key="100">Portal</namespace>
      <namespace key="101">Portal talk</namespace>
    </namespaces>
  </siteinfo>
  <page>
    <title>AmericanSamoa</title>
    <id>6</id>
    <revision>
      <id>233188</id>
      <timestamp>2001-01-19T01:12:51Z</timestamp>
      <contributor>
        <ip>office.bomis.com</ip>
      </contributor>
      <comment>*</comment>
      ...

and store the information into several mysql tables.

I would recommend having:

  • one table containing the Id of the pages, and their titles.
  • one table containing the Id of the contributors, the name of the contributors, and the IP (we may have one, two, or three of these for a revision).
  • one table containing the Id of a page, and a list of the Ids of the contributors to that page. For right now, we are interested in whether somebody contributed to a page, not necessarily how often or when (although this would be interesting in the future.)
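
A possible starting point for the parsing step is sketched below. It streams `<page>` elements with `xml.etree.ElementTree.iterparse` and collects three in-memory structures that mirror the three recommended tables. The namespace string is taken from the file header shown above; the assumption that registered contributors appear with `<username>` and `<id>` children (only the `<ip>` form is visible in the excerpt) and the function name `parse_pages` are mine.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.3/}"


def parse_pages(stream):
    """Stream <page> elements and collect data for the three tables."""
    pages = {}              # page id -> title
    contributors = {}       # contributor key -> user name or IP
    page_contributors = {}  # page id -> set of contributor keys
    for _, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != NS + "page":
            continue
        page_id = int(elem.findtext(NS + "id"))
        pages[page_id] = elem.findtext(NS + "title")
        editors = page_contributors.setdefault(page_id, set())
        for contrib in elem.iter(NS + "contributor"):
            # Assumed layout: registered users carry <username> and <id>,
            # anonymous edits carry only <ip>.
            key = contrib.findtext(NS + "id") or contrib.findtext(NS + "ip")
            if key is None:
                continue
            contributors[key] = contrib.findtext(NS + "username") or key
            editors.add(key)
        elem.clear()  # drop the finished page's children to limit memory use
    return pages, contributors, page_contributors
```

The `stream` argument can be any binary file-like object, so the parser could be fed directly from a decompressing reader instead of an unzipped file. The returned dictionaries would then be written to the database in batches.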

To Do for 3/30/09

MySql

Organize 3 tables in mysql db to allow user to switch back and forth between contributor --> page <-- contributor star-network to page-->contributor<-- page network.

The challenges:

  • 9 million pages
  • 370 million edits to pages
  • 9 million contributors

When one clicks on a contributor node, prefuse should

  • quickly find all the pages that contributor touched, and when.
  • generate the name of all the pages
  • create a graph with the contributor at the center of a star with pages all around.

The 9 million contributors require that the information that is gathered by prefuse when clicking on a node can be quickly found in the database. If this information is a name, then there should be a table in the database where one column (field) contains this name only once. This column should be organized as a primary key, meaning the contents can be indexed and organized in a fast searchable data structure (hash table or tree).

Similarly, when one clicks on a page node, prefuse should

  • quickly find all the contributors who have edited this page, how many times, (and optionally, when the last time was)
  • generate the name of all the contributors
  • create a graph with the page at the center of a star with contributors all around.

Same comment. Whatever is retrieved by prefuse should be a name or an Id number that belongs to an indexed field in the table of pages.
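
To make the indexing point concrete, here is a small sketch of an edits table with one index per click direction, plus the two star-graph queries. It uses sqlite3 as a stand-in for MySQL; the table names, column names, and sample rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE contributors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE edits (page_id INTEGER, contributor_id INTEGER, edited_at TEXT);
    -- One index per click direction, so neither lookup scans all edits.
    CREATE INDEX edits_by_page ON edits (page_id);
    CREATE INDEX edits_by_contributor ON edits (contributor_id);
    INSERT INTO pages VALUES (1, 'Alpha'), (2, 'Beta');
    INSERT INTO contributors VALUES (10, 'ann'), (11, 'bob');
    INSERT INTO edits VALUES (1, 10, '2009-03-01'), (1, 10, '2009-03-20'),
                             (1, 11, '2009-02-11'), (2, 10, '2009-01-05');
""")


def page_star(page_id):
    """Contributors around a page: (name, number of edits, last edit)."""
    return db.execute(
        """SELECT c.name, COUNT(*), MAX(e.edited_at)
           FROM edits e JOIN contributors c ON c.id = e.contributor_id
           WHERE e.page_id = ? GROUP BY c.id ORDER BY c.name""",
        (page_id,),
    ).fetchall()


def contributor_star(contributor_id):
    """Pages around a contributor: (title, last time touched)."""
    return db.execute(
        """SELECT p.title, MAX(e.edited_at)
           FROM edits e JOIN pages p ON p.id = e.page_id
           WHERE e.contributor_id = ? GROUP BY p.id ORDER BY p.title""",
        (contributor_id,),
    ).fetchall()
```

With hundreds of millions of edit rows, the two indexes are what should keep each click from turning into a full table scan; the primary keys on pages and contributors serve the same purpose for the name lookups.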

Python

Filter the file on the server and process only 10,000 pages, with all the edits and all the contributors corresponding to these 10,000 pages. This information should then be stored in the database.

Prefuse

Adapt the prefuse program to the newly created database

Back to Python

Start filtering the rest of the XML file on the 8-core macpro in DT's office, and create a new database for all the information in the XML file.

To Do for 4/17

Add a chapter or a section on performance

  • time
    • How long does it take to process 1000 pages of wikipedia?
    • Is the time O(N) where N is the number of pages?
    • Is the time also proportional to E, the number of edits?
    • What about the P Contributors? Is the time O(P) as well?
  • space
    • Same questions for space as for time
  • Extrapolation for 9 M pages and 9 M contributors, and 300 M edits to figure out how much time it will take to process the totality of wikipedia, and how much space the database will take.
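
The timing and extrapolation questions above could be approached with a small harness like the following. It is a sketch under the assumption that run time grows linearly with the number of pages (which is exactly what the measurements are meant to check); `process` is a placeholder for the real filtering program, and all function names are illustrative.

```python
import time


def time_run(process, num_pages):
    """Return the wall-clock seconds taken by process(num_pages)."""
    start = time.perf_counter()
    process(num_pages)
    return time.perf_counter() - start


def fit_line(samples):
    """Least-squares fit of seconds = slope * pages + intercept."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def extrapolate(samples, target_pages):
    """Predict the run time for target_pages, assuming O(N) behaviour."""
    slope, intercept = fit_line(samples)
    return slope * target_pages + intercept
```

Here `samples` is a list of (number of pages, seconds) pairs measured at several sizes, for example 1,000, 2,000, and 4,000 pages. If the measured points do not fall close to the fitted line, the O(N) assumption does not hold and the extrapolation should not be trusted. The same fit can be repeated with database size in place of seconds for the space estimate.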

Some future directions/possible modifications you might want to list (not implement) in your conclusion chapter.

  • change from a JPanel to a text area or a text editor
  • how to display the total number of edits and the total number of pages when listing information for a node that was just clicked
  • how to get the top contributors, rather than 25 random ones
  • how to make the application distributed on the Web (applet or web start)
  • how to allow the user to decide on the # of pages/contributors listed
  • how to highlight some users or some pages of interest while the graph is being shown
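
For the "top contributors rather than 25 random ones" item, one possible direction is to rank contributors by edit count before picking the ones to display. The sketch below does this in memory with `collections.Counter`; the input format (an iterable of (page, contributor) pairs) and the function name are assumptions, and on the full data set the ranking would more likely be done in the database itself.

```python
from collections import Counter


def top_contributors(edits, limit=25):
    """Return the `limit` most active contributors as (name, edit_count)."""
    counts = Counter(contributor for _page, contributor in edits)
    return counts.most_common(limit)
```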