CSC352 Project: Image Repository 2013 & 2015



2015 Update


Index of /other/mediacounts/daily/2015/

../
mediacounts.2015-01-01.v00.tsv.bz2                 08-Mar-2015 16:09           379930501
mediacounts.2015-01-02.v00.tsv.bz2                 08-Mar-2015 16:39           400451562
mediacounts.2015-01-03.v00.tsv.bz2                 08-Mar-2015 17:09           405933049
mediacounts.2015-01-04.v00.tsv.bz2                 08-Mar-2015 17:39           412597050
mediacounts.2015-01-05.v00.tsv.bz2                 08-Mar-2015 18:12           414549164
mediacounts.2015-01-06.v00.tsv.bz2                 08-Mar-2015 18:45           412489923
...


Processing mediacounts files


  • Get the mediacounts file with wget.
  • Unzip it with bunzip2.
  • Filter out all entries not in the English Wikipedia and save the result in a new file:
grep "wikipedia/en" mediacounts.2015-01-01.v00.tsv | sed -e 's#/wikipedia/en/##' \
           > wikipedia_en_mediacounts.2015-01-01.v00.tsv

  • Example of an entry in the data file:
0/00/102.7_The_Peak.png 935844  15      12      0       -       -       3       0       3       \ (line broken)
             0       0       0       0       -       -       0       0       0       0       -       -       6       0       9

  • Explanation: the 15 is the response (bytes) figure and the 12 the number of transfers; the transfer count is the number we are interested in (see the parsing sketch below).
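A minimal Python sketch of how such a filtered line could be parsed; the column positions follow the example and explanation above, and the exact field meanings should be checked against the mediacounts documentation:

# parseMediacounts.py  (sketch)
# Split one line of a filtered wiki_..._mediacounts TSV file and pull out
# the file name and the two count columns discussed above.
# Missing values in the file appear as "-".

def parseLine( line ):
    fields = line.rstrip( "\n" ).split( "\t" )
    name      = fields[0]          # e.g. 0/00/102.7_The_Peak.png
    responses = int( fields[2] )   # the 15 in the example entry
    transfers = int( fields[3] )   # the 12 in the example entry
    return name, responses, transfers

if __name__ == "__main__":
    sample = "0/00/102.7_The_Peak.png\t935844\t15\t12\t0\t-\t-\t3\t0\t3"
    print( parseLine( sample ) )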


Processing top1000 files


  • The same directory that holds the mediacounts files also holds top1000 files, named as follows: mediacounts.top1000.2015-03-19.v00.csv.zip.
  • They have the same format as the other files, except that they are comma-separated values (CSV) rather than tab-separated values.
  • They can be processed by running a shell script of this type (easily adjusted for other dates):


#! /bin/bash
# download one day's top1000 file, keep only the English Wikipedia entries,
# and concatenate the filtered results into a single csv file

date=2015-03-19
file=mediacounts.top1000.${date}.v00.csv

wget http://dumps.wikimedia.org/other/mediacounts/daily/2015/${file}.zip
unzip ${file}.zip
rm ${file}.zip
for f in mediacounts.top1000.2015*.csv ; do 
   grep "wikipedia/en" $f | sed -e "s#/wikipedia/en/##" > wiki_${f}
   rm $f
   echo $f done
done
cat wiki_* > wiki_mediacounts.top1000.${date}.all.csv




2013

Downloading All Wikipedia Images

--Thiebaut 22:10, 9 October 2013 (EDT)

Where are images and uploaded files

Images and other uploaded media are available from mirrors in addition to being served directly from Wikimedia servers. Bulk download is currently (as of September 2012) available from mirrors but not offered directly from Wikimedia servers. See the list of current mirrors.

Unlike most article text, images are not necessarily licensed under the GFDL & CC-BY-SA-3.0. They may be under one of many free licenses, in the public domain, believed to be fair use, or even copyright infringements (which should be deleted). In particular, use of fair use images outside the context of Wikipedia or similar works may be illegal. Images under most licenses require a credit, and possibly other attached copyright information. This information is included in image description pages, which are part of the text dumps available from dumps.wikimedia.org. In conclusion, download these images at your own risk (Legal)

wikimedia.wansec.com/other/pagecounts-raw/

Tarballs are generated on a server provided by Your.org and made available from that mirror. The rsynced copy of the media itself and an rsynced copy of the above files (image/imagelinks/redirs info) are used as input to createmediatarballs.py to create two series of tarballs per wiki, one containing all locally uploaded media and the other containing all media uploaded to commons and used on the wiki.
Together, these two series of tarballs (with names looking like, e.g., enwiki-20120430-remote-media-1.tar, enwiki-20120430-remote-media-2.tar, and so on for remote media, and enwiki-20120430-local-media-1.tar, enwiki-20120430-local-media-2.tar, and so on for local media) should contain all media for a given project. The media are bundled into tarballs of 100k files each for the convenience of the downloader.
enwiki-20121201-local-media-2.tar	22.5 GB	12/6/12 12:00:00 AM
enwiki-20121201-local-media-3.tar	25.6 GB	12/6/12 12:00:00 AM
enwiki-20121201-local-media-4.tar	21.5 GB	12/6/12 12:00:00 AM
enwiki-20121201-local-media-5.tar	20.7 GB	12/6/12 12:00:00 AM
enwiki-20121201-local-media-6.tar	22.4 GB	12/6/12 12:00:00 AM
enwiki-20121201-local-media-7.tar	18.2 GB	12/6/12 12:00:00 AM
enwiki-20121201-local-media-8.tar	24.4 GB	12/6/12 12:00:00 AM
enwiki-20121201-local-media-9.tar	1.3 GB	12/6/12 12:00:00 AM
enwiki-20121201-remote-media-1.tar	89.9 GB	12/6/12 12:00:00 AM
enwiki-20121201-remote-media-10.tar	90.5 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-11.tar	88.2 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-12.tar	88.4 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-13.tar	89.6 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-14.tar	88.6 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-15.tar	91.2 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-16.tar	91.3 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-17.tar	89.4 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-18.tar	90.0 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-19.tar	90.0 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-2.tar	90.5 GB	12/6/12 12:00:00 AM
enwiki-20121201-remote-media-20.tar	90.1 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-21.tar	91.2 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-22.tar	89.3 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-23.tar	91.0 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-24.tar	44.3 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-24.tar.bz2	42.6 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-3.tar	88.6 GB	12/6/12 12:00:00 AM
enwiki-20121201-remote-media-4.tar	90.0 GB	12/6/12 12:00:00 AM
enwiki-20121201-remote-media-5.tar	90.9 GB	12/6/12 12:00:00 AM
enwiki-20121201-remote-media-6.tar	88.3 GB	12/6/12 12:00:00 AM
enwiki-20121201-remote-media-7.tar	89.6 GB	12/6/12 12:00:00 AM
enwiki-20121201-remote-media-8.tar	90.4 GB	12/7/12 12:00:00 AM
enwiki-20121201-remote-media-9.tar	89.7 GB	12/7/12 12:00:00 AM

  • To get them, store the list above in a text file (listOfTarArchives.txt) and use wget:
for i in `cat listOfTarArchives.txt | cut -f 1 | grep -v bz2`; do 
     echo $i
     wget ftp://ftpmirror.your.org/pub/wikimedia/imagedumps/tarballs/fulls/20121201/$i
done
  • Total size should be 2.310 TB.
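Once downloaded, the tarballs can be inspected with Python's tarfile module before deciding what to extract. A minimal sketch, assuming one of the tarballs listed above is in the current directory:

# listTarball.py  (sketch)
# Print the name and size of every file stored in one media tarball,
# without extracting anything to disk.
import tarfile

tarName = "enwiki-20121201-local-media-2.tar"    # any tarball from the list above

with tarfile.open( tarName ) as tar:
    for member in tar:
        if member.isfile():
            print( "%s\t%d" % ( member.name, member.size ) )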

Download the page statistics

Links of Interest

#! /bin/bash
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-000000.gz
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-010000.gz
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/pagecounts-20130101-020001.gz
...
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-210000
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-220000
wget http://dumps.wikimedia.org/other/pagecounts-raw/2013/2013-01/projectcounts-20130131-230000

Explanations

This is taken directly from dumps.wikimedia.org/other/pagecounts-raw/.

Each request of a page, whether for editing or reading, whether a "special page" such as a log of actions generated on the fly, or an article from Wikipedia or one of the other projects, reaches one of our squid caching hosts and the request is sent via udp to a filter which tosses requests from our internal hosts, as well as requests for wikis that aren't among our general projects. This filter writes out the project name, the size of the page requested, and the title of the page requested.

Here are a few sample lines from one file:


     fr.b Special:Recherche/Achille_Baraguey_d%5C%27Hilliers 1 624
     fr.b Special:Recherche/Acteurs_et_actrices_N 1 739
     fr.b Special:Recherche/Agrippa_d/%27Aubign%C3%A9 1 743
     fr.b Special:Recherche/All_Mixed_Up 1 730
     fr.b Special:Recherche/Andr%C3%A9_Gazut.html 1 737
   

In the above, the first column "fr.b" is the project name. The following abbreviations are used:

wikibooks: ".b"
wiktionary: ".d"
wikimedia: ".m"
wikipedia mobile: ".mw"
wikinews: ".n"
wikiquote: ".q"
wikisource: ".s"
wikiversity: ".v"
mediawiki: ".w"

Projects without a period and a following character are wikipedia projects. The second column is the title of the page retrieved, the third column is the number of requests, and the fourth column is the size of the content returned.

These are hourly statistics, so in the line

     en Main_Page 242332 4737756101
   

we see that the main page of the English language Wikipedia was requested over 240 thousand times during the specific hour. These are not unique visits. In some directories you will see files which have names starting with "projectcount". These are total views per hour per project, generated by summing up the entries in the pagecount files. The first entry in a line is the project name, the second is the number of non-unique views, and the third is the total number of bytes transferred.
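A short Python sketch that splits one pagecounts line into its four fields, using the project suffixes listed above; lines whose project name has no suffix are treated as Wikipedia projects:

# parsePagecounts.py  (sketch)
# Split one line of an (uncompressed) pagecounts file into
# project language, project family, page title, request count, and bytes.

SUFFIXES = { "b": "wikibooks",  "d": "wiktionary", "m": "wikimedia",
             "mw": "wikipedia mobile", "n": "wikinews", "q": "wikiquote",
             "s": "wikisource", "v": "wikiversity", "w": "mediawiki" }

def parseLine( line ):
    project, title, requests, size = line.split()
    lang, dot, suffix = project.partition( "." )
    family = SUFFIXES.get( suffix, "wikipedia" )   # no suffix: a wikipedia project
    return lang, family, title, int( requests ), int( size )

if __name__ == "__main__":
    print( parseLine( "en Main_Page 242332 4737756101" ) )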

Organization of the directories containing the images

cd /media/dominique/3TB/mediawiki/images/wikipedia/en
ls
0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f
cd 0
ls
00  01  02  03  04  05  06  07  08  09  0a  0b  0c  0d  0e  0f
cd 0a
ls 
00-capleton-the_people_dem.jpg
04_-_Seducción.jpg
101MiniGolfWorld_CoverArt.jpg
1070TheFanLogo.jpg
...
Zaoksky_Adventist_University_logo.png
Zappa_The_Yellow_Shark.jpg
ZhejiangLucheng.png
Zongo_Comics_logo.jpg


Md5 Hash


  • The directory in which a file is stored is derived from the md5 hash of the file name, using the first two characters of the resulting hexadecimal string.
  • Example (in Python 2.7)
>>> import hashlib
>>> m = hashlib.md5()
>>> m.update("Zongo_Comics_logo.jpg")
>>> print m.hexdigest()
0aa887f8bfbf2f7e6c5c1fe00072f163
  • The output starts with "0a", so the file is stored in folder 0, and then in sub-folder 0a (a small helper that computes this path is sketched below).
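A small helper (sketch) that reproduces this scheme and returns the relative storage path of a file name:

# imagePath.py  (sketch)
# md5 the file name and use the first hex digit as the folder and the
# first two hex digits as the sub-folder, as in the example above.
import hashlib

def imagePath( fileName ):
    h = hashlib.md5( fileName.encode( "utf-8" ) ).hexdigest()
    return "%s/%s/%s" % ( h[0], h[:2], fileName )

print( imagePath( "Zongo_Comics_logo.jpg" ) )    # 0/0a/Zongo_Comics_logo.jpg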


Number of files

/media/dominique/3TB/mediawiki/images/wikipedia/en$ find . -type f | wc -l
804398
/media/dominique/3TB/mediawiki/images/wikipedia$ find . -type f | wc -l
3153469

MySQL Database

--Thiebaut 22:23, 9 October 2013 (EDT)

  • Not sure we need a database with all the file names, but just in case...
  • Created mysql database on Hadoop0
  • Database enwiki_images
  • Table images
  • Fields
    • Id, int, autoincrement
    • name, varchar(300), for file name
    • path, varchar(50), for path to file name
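A sketch of how the table could be created with MySQLdb; the columns match the fields listed above, plus the width, height, and scale1 columns that the insert code below writes. The exact column types are assumptions.

# createImagesTable.py  (sketch; column types are assumptions)
import MySQLdb

db = MySQLdb.connect( db="enwiki_images", passwd="xxxxxx", user="root" )
c  = db.cursor()
c.execute( """CREATE TABLE IF NOT EXISTS images (
                 id     INT AUTO_INCREMENT PRIMARY KEY,
                 name   VARCHAR(300),
                 path   VARCHAR(50),
                 width  INT,
                 height INT,
                 scale1 FLOAT )""" )
db.commit()
db.close()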

Python Programs

  • Two Python programs walk the image directory structure and enter the information into the MySQL database.

imageDataBase.py


# imageDataBase.py
# D. Thiebaut
# Helper functions to open the enwiki_images MySQL database and to insert,
# read, and truncate records in the images table.
import MySQLdb


def readOneRecord( db, c ):
    # fetch and display the first record of the images table
    c.execute( "SELECT * FROM images" )
    line = c.fetchone()
    print line


def readAllRecords( db, c ):
    # fetch and display all the records of the images table
    c.execute( "SELECT * FROM images" )
    rows = c.fetchall()
    for row in rows:
        print "row = ", row

def insertNewRecord( db, c,  name, path, width, height, scale ):
    # insert one image record; the parameterized query lets MySQLdb
    # quote the file name and path safely
    with db:
        c.execute( "INSERT INTO images (name, path, width, height, scale1) "
                   "VALUES ( %s, %s, %s, %s, %s )",
                   ( name, path, width, height, scale ) )

def truncateTable( db, c ):
    c.execute( "TRUNCATE images" )

def openDB():
    db = MySQLdb.connect( db="enwiki_images", passwd="xxxxxx", user="root" )
    return db

def closeDB( db ):
    db.close()

def main():
    db = openDB()
    c = db.cursor()
    truncateTable( db, c )
    print "2"
    name = "file.txt"
    path = "path/to/somewhere/"
    insertNewRecord( db, c, name, path, 10, 20, 2 )
    print "3"
    readAllRecords( db, c )
    print "4"
    closeDB( db )
    print "5"

if __name__=="__main__":
    main()


putFilePathsInDB.py

  • The other program walks the directories and inserts the file names and paths into the database.


# putFilePathsInDB.py
# D. Thiebaut
# Walks the image directory tree, reads each image's dimensions with PIL,
# and inserts the file name, short path, width, and height into the MySQL
# database through the imageDataBase module.

from PIL import Image
import imageDataBase
import os
import time

# where the images are stored...
currentDir = "mediawiki/images/"


def main():
    db = imageDataBase.openDB()
    cur = db.cursor()

    # start from an empty table
    imageDataBase.truncateTable( db, cur )

    count = 0
    startTime = time.time()
    for root, dirs, files in os.walk( currentDir, topdown=False):
        
        for name in files:
            path = root
            # keep only the last two directory levels, e.g. "0/0a"
            parts = path.split( "/" )
            shortPath = ""
            if len( parts ) > 2 :
                shortPath = parts[-2] + "/" + parts[-1]
            count += 1
            fullPath = path + "/" + name
            #size = os.path.getsize( fullPath )

            try:
                im = Image.open( fullPath )
                width, height = im.size
            except Exception:
                # skip files PIL cannot open (non-images or corrupted files)
                continue

            if  count % 1000 == 0 :
                # progress report every 1000 files
                print count, count/(time.time()-startTime), "inserts/sec: ", name, shortPath, "%dx%d" % (width, height)

            imageDataBase.insertNewRecord( db, cur, name, shortPath, width, height, 1 )

            #if ( count > 100 ):
            #    break
        #if ( count > 100 ):
        #    break

    imageDataBase.closeDB( db )

if __name__=="__main__":
    main()


Running the Program

 cd /media/dominique/3TB
 python2.7 putFilePathsInDB.py

Stats

A sample line of output:

BornToRunHartford100207.jpg 1/14 2304x1536 239000 13.675435462 inserts/sec

Only about 13 insertions per second into the database! At that rate, inserting the 3 million images will take roughly 3,153,469 files ÷ 13.7 inserts/sec ≈ 230,000 seconds, or about 64 hours.
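One possible way to reduce this bottleneck (not used in the programs above) would be to batch the inserts with MySQLdb's executemany() instead of committing one row at a time; a hypothetical sketch:

# batchInsert.py  (sketch of a possible speed-up; not part of the original code)
import MySQLdb

def insertMany( db, c, rows ):
    # rows is a list of (name, path, width, height, scale1) tuples,
    # accumulated by the directory walker and flushed every few thousand files
    c.executemany( "INSERT INTO images (name, path, width, height, scale1) "
                   "VALUES ( %s, %s, %s, %s, %s )", rows )
    db.commit()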