Faster Batch Execution on XGrid

Back to XGrid Programming


--Thiebaut 23:12, 21 October 2008 (UTC)

Improving Processor Utilisation with Batch Scheduling

A batch file for one task

This is the third installment on the processing of 10,000 PDB files. In the first installment we copied 10,000 different files to the grid and started 10,000 separate tasks which, because the amount of work per task is small, ran more or less one after the other, using only between 1 and 4 processors.

In the second installment we put the files on a Web server so that we did not need to copy them, and the tasks could start right away and grab the files they needed. Here again the processor utilisation was fairly low, between 1 and 4 processors.

In this installment we create a single batch file containing one file (getStats.pl) and N tasks, one for each pdb file we want to process. Here again the tasks will grab the pdb files from the Web server, so we only need to include the names of the pdb files in the batch file.

We create the batch file using a simple trick: submit an asynchronous job first, then read back its PList-formatted specification:

 xgrid -job submit getStats.pl pdb100d.ent
 {
     jobIdentifier = 44123;
 }
 xgrid -job specification -id 44123 > batch.plist

Here's the batch file batch.plist for running getStats.pl on only one file:

{
    jobSpecification =     {
        applicationIdentifier = "com.apple.xgrid.cli";
        inputFiles =  { "getStats.pl" =  {
                fileData = <2321202f 7573722f 62696e2f 7065726c 202d770a 23206765 74537461 74732e70   
                                 ...
                                 2020207d 0a0a7d0a 0a6d6169 6e28293b 0a>;
                isExecutable = YES; };
        };
        name = "getStats.pl";
        schedulerHints =  {
            0 = mathgrid5;
        };
        submissionIdentifier = abc;
        taskSpecifications =         {
            0 = { arguments = ( "pdb100d.ent" );  command = "getStats.pl"; };
        };
    };
}

We've cut part of the hexadecimal data to make it more readable.

Note that the line of interest for us is this one:

0 = { arguments = ( "pdb100d.ent" );  command = "getStats.pl"; };

This line defines a task, labeled 0, which runs the program getStats.pl on the pdb file pdb100d.ent.

We can run this batch job on XGrid as follows:

 xgrid -job batch batch.plist
 {
     jobIdentifier = 45000;
 }
 xgrid -job results -id 45000

A batch file for 100, 1,000, or 10,000 tasks

The trick now is to create a PList batch file, like the one shown above, but with 100, 1,000, or 10,000 tasks, each one assigned a different pdb file.

One way to do this is to have a text file that contains the names of all the pdb files, and to use a script to merge them into the batch file. We use a Python program, which we call makeBatch.py, to perform this job; a sketch of the idea is shown after the example below.

For example, assuming that we have a list of 10 pdb files in a text file called filelist10.txt, we run the command:

   makeBatch.py filelist10.txt > batch10.plist
  
   cat batch10.plist

  {
    jobSpecification =     {
        applicationIdentifier = "com.apple.xgrid.cli";
        inputFiles =         {
            "getStats.pl" =             {
                fileData = <2321202f 7573722f 62696e2f 7065726c 202d770a 
                           ...
                   223b0a20 2020207d 0a0a7d0a 0a6d6169 6e28293b 0a>;
                isExecutable = YES;
            };
        };
        name = "getStats.pl";
        schedulerHints =         {
            0 = mathgrid5;
        };
        submissionIdentifier = abc;
        taskSpecifications =         {
            0 = { arguments = ( "pdb100d.ent" );  command = "getStats.pl"; };
            1 = { arguments = ( "pdb101d.ent" );  command = "getStats.pl"; };
            2 = { arguments = ( "pdb101m.ent" );  command = "getStats.pl"; };
            3 = { arguments = ( "pdb102d.ent" );  command = "getStats.pl"; };
            4 = { arguments = ( "pdb102l.ent" );  command = "getStats.pl"; };
            5 = { arguments = ( "pdb102m.ent" );  command = "getStats.pl"; };
            6 = { arguments = ( "pdb103d.ent" );  command = "getStats.pl"; };
            7 = { arguments = ( "pdb103l.ent" );  command = "getStats.pl"; };
            8 = { arguments = ( "pdb103m.ent" );  command = "getStats.pl"; };
            9 = { arguments = ( "pdb104d.ent" );  command = "getStats.pl"; };
        };
    };
  }
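
The exact makeBatch.py used for these runs is not reproduced on this page. As a rough sketch of the idea (illustrative only, not the original code), the script can simply take the one-task batch.plist generated earlier as a template and replace its single task entry with one entry per pdb file:

 #!/usr/bin/env python
 # Sketch of the idea behind makeBatch.py (not the original script):
 # use the one-task batch.plist as a template and replace its single
 # task entry with one entry per pdb file listed in the text file
 # given on the command line.
 import sys

 TASK = '            %d = { arguments = ( "%s" );  command = "getStats.pl"; };'

 def main():
     fileList = sys.argv[1]                               # e.g. filelist10.txt
     pdbFiles = [line.strip() for line in open(fileList) if line.strip()]

     for line in open("batch.plist"):
         if "arguments" in line and "command" in line:
             # the single task line of the template: emit N task lines instead
             for i, pdb in enumerate(pdbFiles):
                 print(TASK % (i, pdb))
         else:
             print(line.rstrip("\n"))

 main()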


Submitting the Batch Job

Once we have all the components, we simply submit one batch job of 10, 100, 1,000, or 10,000 tasks, depending on how we created the batch file:

 xgrid -job batch batch10000.plist | getXGridOutput.py | mergeResults.pl

There is a long setup time while all the tasks are spawned on the XGrid; then the computation starts, utilizing close to all 88 processors.
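
getXGridOutput.py and mergeResults.pl are the helper scripts introduced in the previous installments and are not repeated here. As a rough illustration of the role the first one plays in this pipeline (a sketch only, not the original script), it reads the job identifier that xgrid prints on submission, waits for the job to finish, and sends the results to stdout for mergeResults.pl to process:

 #!/usr/bin/env python
 # Illustration of the role getXGridOutput.py plays in the pipeline
 # (a sketch, not the original script).
 import os
 import re
 import sys

 def main():
     # xgrid -job batch prints something like:  { jobIdentifier = 45000; }
     text = sys.stdin.read()
     match = re.search(r"jobIdentifier\s*=\s*(\d+)", text)
     if match is None:
         sys.exit("no job identifier found in: " + text)
     jobId = match.group(1)
     os.system("xgrid -job wait -id %s" % jobId)      # block until the job is done
     os.system("xgrid -job results -id %s" % jobId)   # results go to stdout

 main()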

Execution Times

The table below shows the execution times for different groups of pdb files. The left columns show the times obtained with the approaches of the two previous installments.

We found that batch jobs with more than 3,000 tasks would regularly crash on the XGrid, so we divided up the work into several batch jobs of 2,500 tasks each. Each batch job processes one pdb file per task.

Here is an example of how the 10,000 files were processed in this manner:

   makeBatchN.py filelist.txt  0 2499 > batch2500a.plist
   makeBatchN.py filelist.txt  2500 4999 > batch2500b.plist
   makeBatchN.py filelist.txt  5000 7499 > batch2500c.plist
   makeBatchN.py filelist.txt  7500 10000 > batch2500d.plist
   xgrid -job batch batch2500a.plist | getXGridOutput.py | mergeResults.pl &
   xgrid -job batch batch2500c.plist | getXGridOutput.py | mergeResults.pl &
   xgrid -job batch batch2500d.plist | getXGridOutput.py | mergeResults.pl &
   xgrid -job batch batch2500b.plist | getXGridOutput.py | mergeResults.pl &

We use a new version of makeBatch.py, called makeBatchN.py, which keeps only the chunk of the file list whose indices range from a low index to a high index, both provided on the command line. MakeBatchN.py is available here.
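
Since makeBatchN.py itself is not reproduced on this page, here is a sketch of the range-selection step it adds (illustrative only, not the original code); wrapping the resulting task lines into a complete plist then proceeds exactly as in the makeBatch.py sketch above:

 #!/usr/bin/env python
 # Sketch of the extra step in makeBatchN.py (not the original script):
 # keep only the pdb files whose indices fall between low and high,
 # inclusive, and emit one task entry per file.
 import sys

 fileList, low, high = sys.argv[1], int(sys.argv[2]), int(sys.argv[3])
 pdbFiles = [line.strip() for line in open(fileList) if line.strip()]
 chunk = pdbFiles[low:high + 1]          # e.g. indices 0..2499 for the first batch

 for i, pdb in enumerate(chunk):
     print('            %d = { arguments = ( "%s" );  command = "getStats.pl"; };'
           % (i, pdb))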


 # PDB files     First method         Improved method       Batch job
                 Copy Files           Fetch Web Files       multiple tasks
                 (seconds)            (seconds)             (seconds)

 10 files        1                    1                     1
 100 files       8                    8                     4
 1,000 files     162                  114                   55
 2,000 files     1,755                271                   157
 3,000 files     2,959 (49m 19s)      443                   305
 5,000 files     3,595 (59m 55s)      868 (14m 28s)         500 (Note1)
 10,000 files    3,732 (1h 2m 12s)    Job fails             Job fails

Note1: the job was submitted as two concurrent batch jobs of 2,500 tasks each. The first terminated after 242 seconds, the second after 500 seconds.