XGrid: Running a Perl program on 10,000 files


Back to the XGrid programming page


--Thiebaut 01:48, 4 October 2008 (UTC)

The Problem

Running a Perl program on 10,000 files

The goal is to prepare a simple example for the students taking the CSC334 Bio-Math seminar, taught for the first time in Fall 08. The programming language in this seminar is Perl.

The main program is written in Perl. It takes the description of a protein in PDB format, counts the occurrences of each amino acid the protein contains, and prints out these counts. It is a very simple filter program: it reads lines of text, isolates the lines starting with SEQRES, and counts the 3-letter amino-acid codes it finds.

The lines selected and parsed by this code look like this:

SEQRES   1 A  154  MET VAL LEU SER GLU GLY GLU TRP GLN LEU VAL LEU HIS          
SEQRES   2 A  154  VAL TRP ALA LYS VAL GLU ALA ASP VAL ALA GLY HIS GLY          
SEQRES   3 A  154  GLN ASP ILE LEU ILE ARG LEU PHE LYS SER HIS PRO GLU    

The Perl program is shown below.

#! /usr/bin/perl -w
# getStats.pl
# D. Thiebaut
# A simple perl program that displays the number of occurrences 
# of amino acids in a protein's pdb file.
#
# Syntax:
#         chmod +x getStats.pl
#         ./getStats.pl  1a5b3.pdb.gz
#

#---------------------------------------------------------------------------
# readFile: reads a text file, even if it is compressed with
#               compress, gzip or zip.
#               Returns all the lines in an array.
# Note: this code could be made more efficient by having it stop
# reading lines as soon as all the SEQRES lines have been found.
#---------------------------------------------------------------------------
sub readFile {
    my ( $logfile ) = @_;
    if    ($logfile =~ /\.Z$/)   { open(ACCESS, "uncompress -c $logfile |"); }
    elsif ($logfile =~ /\.gz$/)  { open(ACCESS, "gzip -d -c $logfile |"); }
    elsif ($logfile =~ /\.zip$/) { open(ACCESS, "unzip -p $logfile |"); }
    else                         { open(ACCESS, $logfile); }

    my @lines = <ACCESS>;
    close( ACCESS );
    return @lines;
}

#---------------------------------------------------------------------------
# main: parses the lines and computes the # of amino acids found.
#---------------------------------------------------------------------------
sub main {
    #--- get the name of the file from the command line ---
    my $argc = $#ARGV + 1;

    if ( $argc < 1 ) {
	print "Syntax getStats.pl filename\n\n";
	exit(1);
    }

    #--- get the lines of the file ---
    my @lines = readFile( $ARGV[0] );
    #print @lines;

    my $line;
    my %aminos = ();

    #--- parse each line ---
    foreach $line ( @lines ) {
	#print "--", $line;

	#--- process only SEQRES lines ---
	if ( $line =~ /^SEQRES/ ) {
	    chomp $line;    # remove the trailing newline
	    my @words = split( " ", $line );  # split into words
	    shift( @words );   # remove the first 4 fields: the SEQRES
	    shift( @words );   # tag, the record serial number, the
	    shift( @words );   # chain identifier, and the number of
	    shift( @words );   # residues, keeping only the amino acids

	    #--- count amino acids ---
	    my $amino;
	    foreach $amino ( @words ) {
		if ( exists( $aminos{ $amino } ) ) {
		    $aminos{ $amino } = $aminos{ $amino } + 1;
		}
		else {
		    $aminos{ $amino } = 1;
		}
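		# (note: the if/else above is equivalent to the single
		#  idiomatic statement  $aminos{ $amino }++;  which
		#  treats a missing key as 0 before incrementing)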
	    }
	}
    }
    
    #--- display filename followed by amino acids and # occurrences ---
    print "--" . $ARGV[0] ."\n";
    while ( my ( $key, $value ) = each( %aminos ) ){
	print $key." ".$value. "\n";
    }

}

main();

A sample execution of the program looks like this:


./getStats.pl pdb1ab5.ent.gz 

--pdb1ab5.ent.gz
ASP 14
PRO 6
ILE 12
LYS 20
TRP 2
GLY 20
PHE 10
GLN 4
SER 8
ASN 18
LEU 30
VAL 18
GLU 22
TYR 4
ARG 8
THR 12
MET 12
ALA 30

The 10,000 Protein Files

The 10,000 files are gzipped text files that I downloaded from the PDB database (ftp://ftp.wwpdb.org/pub/pdb/data/structures/all/pdb/) with a Perl script. Only the first 10,000 files in the database were downloaded. The files reside on a client MacPro, in /Users/thiebaut/XGrid/pdbFiles.

The script is available here.
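For reference, here is a minimal sketch of what such a download script might look like, written with Perl's core Net::FTP module. This is an illustration under stated assumptions (module choice, file pattern, error handling), not the original script:

#! /usr/bin/perl -w
# downloadPdb.pl -- illustrative sketch only, NOT the original script.
# Fetches the first N gzipped pdb files from the wwPDB ftp server.
use strict;
use Net::FTP;

my $N = 10000;

my $ftp = Net::FTP->new( "ftp.wwpdb.org" )
    or die "cannot connect to ftp.wwpdb.org: $@";
$ftp->login( "anonymous", '-anonymous@' )
    or die "cannot login: " . $ftp->message;
$ftp->cwd( "/pub/pdb/data/structures/all/pdb" )
    or die "cannot change directory: " . $ftp->message;
$ftp->binary();    # the .gz files must be transferred in binary mode

#--- list the directory, keep the *.ent.gz files, fetch the first N ---
my @files = sort grep { /\.ent\.gz$/ } $ftp->ls();
my $n     = @files < $N ? scalar @files : $N;
foreach my $file ( @files[ 0 .. $n-1 ] ) {
    print "fetching $file\n";
    $ftp->get( $file )
        or warn "could not get $file: " . $ftp->message;
}
$ftp->quit();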

One can also download the PDB files in bulk using the rsync command. Go to the directory where you want the files copied and issue the following command [1]:

rsync -a --port=33444 ftp.wwpdb.org::ftp_data/structures/divided/pdb/  .

Processing one file only on the XGrid

Here again we use the getXGridOutput.py program we created for the N-Queens C program. It gathers the job IDs returned by the xgrid command and grabs all the lines of output as soon as the jobs finish.
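
getXGridOutput.py itself lives on the N-Queens page; as a rough sketch of what it does, here is a hypothetical Perl rendition (the real program is written in Python, and the exact polling logic here is an assumption):

#! /usr/bin/perl -w
# Hypothetical Perl rendition of what getXGridOutput.py does (the real
# program is written in Python).  It reads the "{ jobIdentifier = NNN; }"
# blocks printed by "xgrid -job submit" on stdin, waits for each job to
# finish, prints the job's output, and removes the job from the controller.
use strict;

my @ids;
while ( my $line = <STDIN> ) {
    push( @ids, $1 ) if $line =~ /jobIdentifier\s*=\s*(\d+)/;
}

foreach my $id ( @ids ) {
    #--- poll the controller until the job reports Finished ---
    while ( `xgrid -job attributes -id $id` !~ /jobStatus\s*=\s*Finished/ ) {
        sleep( 1 );
    }
    print `xgrid -job results -id $id`;      # the job's stdout
    system( "xgrid -job delete -id $id" );   # clean up the controller
}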

Here's the output of submitting the Perl program to the XGrid:

xgrid -job submit   ./getStats.pl pdb100d.ent.gz | getXGridOutput.py 
Job 1415 stopped: Execution time: 0.000000 seconds
--Users/thiebaut/XGrid/perl/pdb100d.ent.gz 
DG  8
C 2
G 2 
DC 8

Total execution time: 0.000000 seconds

Processing multiple files

That's where the parallelism comes in. The grain of parallelism in this case is the serial processing of one PDB file. By scheduling the processing of 10,000 files at once on the grid, each agent in the grid is given the serial Perl program and one file, and does the parsing and counting. 10,000 jobs are created for the 10,000 files, and the controller of the grid is in charge of distributing them to the agents.

We will proceed as follows:

  • Create a file containing the names of all 10,000 PDB files.
  • Create a shell script that calls the "xgrid -job submit" command for each of the file names listed in this file.
  • Create a Perl script that takes the 10,000 lists of amino acids spurted out by getXGridOutput.py and merges them into one big list.

The list of 10,000 file names

ls -1 pdb*.gz > filelist.txt

(that's "minus one", and not "minus ell"). This will create the required list. We can also create a list with fewer files, for testing purposes:

ls -1 pdb*.gz | head -100 > filelist100.txt

will create a list of 100 file names.

The shell script that spawns all the jobs

#! /bin/bash
# runStats.sh
# D. Thiebaut
# Run the getStats.pl script on 10,000 gzipped pdb files
#
# To run the script type:
# 
#     ./runStats.sh  filelist
# 
# where filelist is a text file containing the list of all
# the pdb files to process 
#
if [ "$#" -eq "0" ]
  then
   echo "Syntax:  runStats.sh filelist"
   exit 1
fi

for file in `cat $1`; do
    xgrid -job submit getStats.pl $file
done

Make it executable before running it:

chmod +x runStats.sh

The Perl script that merges the results

The script is available here.
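
For readers who cannot reach the link, here is a minimal sketch of what mergeResults.pl needs to do, given the output format shown earlier. Treat it as an illustration, not necessarily the original:

#! /usr/bin/perl -w
# mergeResults.pl -- illustrative sketch; the original is linked above.
# Reads the concatenated getStats.pl outputs on stdin and sums the
# per-file amino-acid counts into one overall tally.
use strict;

my %totals;
while ( my $line = <STDIN> ) {
    chomp $line;
    next if $line =~ /^--/;               # skip the "--filename" headers
    next if $line =~ /execution time/i;   # skip getXGridOutput.py timings
    my ( $amino, $count ) = split( ' ', $line );
    next unless defined( $count ) and $count =~ /^\d+$/;
    $totals{ $amino } += $count;
}

foreach my $amino ( sort keys %totals ) {
    print "$amino $totals{$amino}\n";
}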

Putting everything together to process 10,000 files in parallel

The command is simple:

runStats.sh filelist.txt | getXGridOutput.py  | mergeResults.pl

Some parallel execution times

Unlike the N-Queens problem, where there was no communication and no data stored in files, here the computation requires a lot of I/O, as all the PDB files have to migrate from the client MacPro to the agents in the grid.

The 10,000 zipped files occupy 1.3 GB of disk space, and all of those bytes must travel from the local client to the grid. It might be interesting to see how the execution times vary if we store the files on the controller, or on one of the agents (if this is even possible).

The Mac documentation indicates that the current directory is copied to all the agents when jobs are submitted in batch mode, as is done here. That's a lot of copying!

The execution times clearly plateau as the number of files increases. A deeper analysis is required to identify the source of the bottleneck...

[Figure: XGrid speedup perl.png — execution time as a function of the number of files processed]

  Number of files    Execution time
  ---------------    --------------------------------
  10                 1 second
  100                8 seconds
  1,000              162 seconds
  2,000              1,755 seconds
  3,000              2,959 seconds (49 min 19 sec)
  5,000              3,595 seconds (59 min 55 sec)
  10,000             3,732 seconds (1 hour 2 min 12 sec)

The output of the run for 10,000 files can be found here for those interested.