Faster Access to Files on XGrid

From CSclasswiki

Back to XGrid Programming


--Thiebaut 20:57, 21 October 2008 (UTC)

Improving on Input/Output Performance

Context

In a previous installment we showed how to run a Perl script on 10,000 PDB files to extract some simple statistics on amino acids.

The approach worked, but was very heavy in I/O operations: the whole group of 10,000 files had to be copied over to each of the agents before the program could run, even though an agent might not work on all of the files.

A simpler approach, which we introduced in the XGrid pipeline example, is to put all 10,000 files on a Web server close to the XGrid and have the Perl program "fetch" a PDB file from the server whenever it needs one. This way there is no need to copy all 10,000 files multiple times (11 times, I believe, since we have 11 computers in the cluster, each with 8 processors).

Accessing Web files with Perl

Grabbing a file over the Web in Perl is fairly simple and well documented:

use LWP::Simple;

my $URL = "http://xgridmac.dyndns.org/~thiebaut/pdbFiles/";

my $fileUrl = $URL . $pdbfile;    # $pdbfile holds the name of the PDB file

my $content;
unless ( defined( $content = get( $fileUrl ) ) ) {
    die "Could not get $fileUrl\n";
}

my @lines = split( /\n/, $content );
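For readers who prefer to see the same fetch-and-split pattern outside of Perl, here is a minimal sketch in Python. The function names `fetch_lines` and `split_into_lines` are made up for illustration; the URL is the server used throughout this article:

```python
from urllib.request import urlopen
from urllib.error import URLError

# Server holding the PDB files (the one used in this article).
URL = "http://xgridmac.dyndns.org/~thiebaut/pdbFiles/"

def split_into_lines(content):
    """Split the downloaded text into individual lines."""
    return content.split("\n")

def fetch_lines(pdb_file):
    """Download URL + pdb_file and return its lines, like the Perl snippet."""
    try:
        content = urlopen(URL + pdb_file).read().decode()
    except URLError as err:
        raise SystemExit(f"Could not get {URL}{pdb_file}: {err}")
    return split_into_lines(content)
```

As in the Perl version, a failed fetch aborts the run with a message naming the URL that could not be retrieved.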

Improved Perl Program

Here is the new version of the Perl program we used in the previous installment of this problem.

The main differences are the inclusion of the LWP::Simple module, and the new function readWebFile().

#! /usr/bin/perl -w
# getStats.pl
# D. Thiebaut
# A simple perl program that displays the number of occurrences 
# of amino acids in a protein's pdb file.
#
# Syntax:
#         chmod +x getStats.pl
#         ./getStats.pl  1a5b3.pdb
#

use LWP::Simple;
my $URL = "http://xgridmac.dyndns.org/~thiebaut/pdbFiles/";

#---------------------------------------------------------------------------
# readWebFile: reads a file from a URL.  The server's address must
# be in $URL, and is used as a prefix to the file whose name is passed
# as a parameter.  Returns an array of the lines found in the file.
#---------------------------------------------------------------------------
sub readWebFile {
    my ( $pdbFile ) = @_;

    my $url = $URL . $pdbFile;
    my $content = get( $url );
    unless ( defined( $content ) ) {
        die "Could not get $url\n";
    }
    my @lines = split( /\n/, $content );
    return @lines;
}

#---------------------------------------------------------------------------
# main program
#---------------------------------------------------------------------------
sub main {
    my $argc = $#ARGV + 1;

    if ( $argc < 1 ) {
	print "Syntax: getStats.pl filename\n\n";
	exit(1);
    }

    #--- get the lines of the file ---
    my @lines = readWebFile( $ARGV[0] );

    my $line;
    my %aminos = ();

    #--- parse each line ---
    foreach $line ( @lines ) {
	#print "--", $line;
	#--- process only SEQRES lines ---
	if ( $line =~ /^SEQRES/ ) {
	    chomp $line;    # remove the trailing newline
	    my @words = split( " ", $line );  # split into words
	    shift( @words );   # remove first 4 words
	    shift( @words );
	    shift( @words );
	    shift( @words );

	    #--- count amino acids ---
	    my $amino;
	    foreach $amino ( @words ) {
		if ( exists( $aminos{ $amino } ) ) {
		    $aminos{ $amino } = $aminos{ $amino } + 1;
		}
		else {
		    $aminos{ $amino } = 1;
		}
	    }
	}

	#last if ( $count++>10 );
    }
    
    #--- display filename followed by amino acids and # occurrences ---
    print "--" . $ARGV[0] ."\n";
    while ( my ( $key, $value ) = each( %aminos ) ){
	print $key." ".$value. "\n";
    }

}

main();
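To make the SEQRES parsing concrete, the same counting logic can be sketched in Python on a couple of sample records. The two SEQRES lines below are hand-made illustrations, not taken from any particular PDB file; the fields are tag, serial number, chain ID, residue count, then the residues themselves:

```python
from collections import Counter

# Two illustrative SEQRES records (hand-made sample data).
sample = [
    "SEQRES   1 A    8  VAL LEU SER PRO ALA ASP LYS VAL",
    "SEQRES   2 A    3  VAL GLY ALA",
]

aminos = Counter()
for line in sample:
    if line.startswith("SEQRES"):
        words = line.split()
        # Dropping words[0:4] matches the four shift() calls in the Perl code.
        aminos.update(words[4:])

print(dict(aminos))
```

The four shift() calls in the Perl version correspond to the `words[4:]` slice here: both discard the record tag, serial number, chain ID, and residue count before tallying residues.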

Performance

Below is a table comparing the execution times of the first method to those of the improved one, shown in the right-most column. For the larger runs, this minor change yields execution times roughly 4 to 6.7 times faster than the original approach. Not bad!

# files         First method (Copy Files)               Improved method (Fetch Web Files)

10 files        1 second                                1 second
100 files       8 seconds                               8 seconds
1,000 files     162 seconds                             114 seconds
2,000 files     1,755 seconds                           271 seconds
3,000 files     2,959 seconds ( 49 min 19 sec )         443 seconds
5,000 files     3,595 seconds ( 59 min 55 sec )         868 seconds ( 14 min 28 sec )
10,000 files    3,732 seconds ( 1 hour 2 min 12 sec )
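The speedup factors quoted above can be checked directly from the table; the short sketch below recomputes them from the measured times:

```python
# Times in seconds taken from the table above:
# file count -> (copy-files method, fetch-web-files method)
times = {
    1000: (162, 114),
    2000: (1755, 271),
    3000: (2959, 443),
    5000: (3595, 868),
}

for n, (copy_t, fetch_t) in sorted(times.items()):
    print(f"{n:>5} files: {copy_t / fetch_t:.1f}x faster")
```

Note that the 1,000-file run gains only about 1.4x; the large payoff appears once the copying overhead dominates, from 2,000 files onward.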