XGrid Faster Batch Execution on XGrid
--Thiebaut 23:12, 21 October 2008 (UTC)
Improving Processor Utilisation with Batch Scheduling
Batch file for 1 task
This is the third installment on the processing of 10,000 PDB files. In the first installment we copied 10,000 different files to the grid and started 10,000 separate tasks which, because the amount of work per task is low, ran more or less one after the other, using only between 1 and 4 processors.
In the second installment we put the files on the Web server so that we did not need to copy them, and the tasks could start right away and grab the files they needed. Here again the processor utilisation was fairly low, between 1 and 4 processors.
In this installment we create a single batch file containing one file (getStats.pl) and N tasks, one for each pdb file we want to process. Here again the tasks will grab the pdb files from the Web server, so we only need to include the names of the pdb files in the batch file.
We create the batch file using a simple trick. Submit an asynchronous job first, then read its PList-formatted description file:
xgrid -job submit getStats.pl pdb100d.ent
{ jobIdentifier = 44123; }
xgrid -job specification -id 44123 > batch.plist
Here's the batch file batch.plist for running getStats.pl on only one file:
{
    jobSpecification = {
        applicationIdentifier = "com.apple.xgrid.cli";
        inputFiles = {
            "getStats.pl" = {
                fileData = <2321202f 7573722f 62696e2f 7065726c 202d770a 23206765 74537461 74732e70
                ...
                2020207d 0a0a7d0a 0a6d6169 6e28293b 0a>;
                isExecutable = YES;
            };
        };
        name = "getStats.pl";
        schedulerHints = {
            0 = mathgrid5;
        };
        submissionIdentifier = abc;
        taskSpecifications = {
            0 = { arguments = ( "pdb100d.ent" ); command = "getStats.pl"; };
        };
    };
}
We've cut part of the hexadecimal data to make it more readable.
Note that the line of interest for us is this one:
0 = { arguments = ( "pdb100d.ent" ); command = "getStats.pl"; };
It defines a task, labeled 0, which runs the program getStats.pl on the pdb file pdb100d.ent.
We can run this batch job on XGrid as follows:
xgrid -job batch batch.plist
{ jobIdentifier = 45000; }
xgrid -job results -id 45000
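To script this two-step sequence, the numeric identifier has to be pulled out of the reply that xgrid prints after submission. The following is a minimal sketch, assuming the reply has the brace-delimited form shown above; parse_job_id is a hypothetical helper and is not part of xgrid.

```python
import re

def parse_job_id(reply):
    """Extract the numeric job identifier from the reply printed after submission.

    Assumes the reply looks like '{ <key> = <number>; }' as shown in the text.
    """
    match = re.search(r"=\s*(\d+)\s*;", reply)
    if match is None:
        raise ValueError("no job identifier found in reply: %r" % reply)
    return int(match.group(1))

# Build the follow-up command from a sample reply.
job_id = parse_job_id("{ jobIdentifier = 45000; }")
print("xgrid -job results -id %d" % job_id)
```

Matching only on the "= number ;" pattern keeps the helper independent of the exact spelling of the key in the reply.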
A batch file for 100, 1,000, or 10,000 tasks
The trick now is to create a PList batch file, like the one shown above, with 100, 1,000, or 10,000 tasks, each one assigned a different pdb file.
One way to do this is to keep the names of all the pdb files in a text file and to use a script to merge them into the batch file. We use a Python program, makeBatch.py, for this job.
For example, assuming that we have a list of 10 pdb files in a text file called filelist10.txt, we run the command:
makeBatch.py filelist10.txt > batch10.plist
cat batch10.plist
{
    jobSpecification = {
        applicationIdentifier = "com.apple.xgrid.cli";
        inputFiles = {
            "getStats.pl" = {
                fileData = <2321202f 7573722f 62696e2f 7065726c 202d770a
                ...
                223b0a20 2020207d 0a0a7d0a 0a6d6169 6e28293b 0a>;
                isExecutable = YES;
            };
        };
        name = "getStats.pl";
        schedulerHints = {
            0 = mathgrid5;
        };
        submissionIdentifier = abc;
        taskSpecifications = {
            0 = { arguments = ( "pdb100d.ent" ); command = "getStats.pl"; };
            1 = { arguments = ( "pdb101d.ent" ); command = "getStats.pl"; };
            2 = { arguments = ( "pdb101m.ent" ); command = "getStats.pl"; };
            3 = { arguments = ( "pdb102d.ent" ); command = "getStats.pl"; };
            4 = { arguments = ( "pdb102l.ent" ); command = "getStats.pl"; };
            5 = { arguments = ( "pdb102m.ent" ); command = "getStats.pl"; };
            6 = { arguments = ( "pdb103d.ent" ); command = "getStats.pl"; };
            7 = { arguments = ( "pdb103l.ent" ); command = "getStats.pl"; };
            8 = { arguments = ( "pdb103m.ent" ); command = "getStats.pl"; };
            9 = { arguments = ( "pdb104d.ent" ); command = "getStats.pl"; };
        };
    };
}
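The source of makeBatch.py is not reproduced on this page. The following is a minimal sketch of how such a script could generate the taskSpecifications entries, assuming the list file holds one pdb file name per line. The names read_names and task_lines are hypothetical, and the sketch emits only the task block, not the full plist with the hex-encoded getStats.pl.

```python
def read_names(lines):
    """Return the non-blank entries of an iterable of text lines, stripped."""
    return [line.strip() for line in lines if line.strip()]

def task_lines(names, command="getStats.pl"):
    """Return one taskSpecifications entry per pdb file name, numbered from 0."""
    return [
        '%d = { arguments = ( "%s" ); command = "%s"; };' % (index, name, command)
        for index, name in enumerate(names)
    ]

# Demonstration with an in-memory list standing in for filelist10.txt.
sample = ["pdb100d.ent\n", "pdb101d.ent\n", "\n", "pdb101m.ent\n"]
print("taskSpecifications = {")
for entry in task_lines(read_names(sample)):
    print("    " + entry)
print("};")
```

A real script would wrap this block in the rest of the job specification, reusing the header obtained from xgrid -job specification.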
Submitting the Batch Job
Once we have all the components, we simply submit one batch job of 10, 100, 1,000, or 10,000 tasks, depending on how we created the batch file:
xgrid -job batch batch10000.plist | getXGridOutput.py | mergeResults.pl
There is a long setup time while all the tasks are spawned on the XGrid; the computation then starts, utilizing close to all 88 processors.
Execution Times
The table below shows the execution times for different groups of pdb files. The left columns show the times obtained with previous versions of the getStats.pl program.
We found that batch jobs with more than 3,000 tasks would regularly crash on the XGrid, so we divided up the work into several batch jobs of 2,500 tasks each. Each batch job processes one pdb file per task.
Here is an example of how the 10,000 files were processed in this manner:
makeBatchN.py filelist.txt 0 2499 > batch2500a.plist
makeBatchN.py filelist.txt 2500 4999 > batch2500b.plist
makeBatchN.py filelist.txt 5000 7499 > batch2500c.plist
makeBatchN.py filelist.txt 7500 10000 > batch2500d.plist
xgrid -job batch batch2500a.plist | getXGridOutput.py | mergeResults.pl &
xgrid -job batch batch2500c.plist | getXGridOutput.py | mergeResults.pl &
xgrid -job batch batch2500d.plist | getXGridOutput.py | mergeResults.pl &
xgrid -job batch batch2500b.plist | getXGridOutput.py | mergeResults.pl &
We use a new version of makeBatch.py, called makeBatchN.py, which keeps only the pdb files whose indices in the list fall between a low index and a high index, both given on the command line. MakeBatchN.py is available here.
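The index selection can be sketched as follows. This is an illustration only, assuming that both indices are inclusive and zero-based, as the 0 2499 and 2500 4999 arguments suggest; select_range and chunk_ranges are hypothetical names, not functions taken from makeBatchN.py.

```python
def select_range(names, low, high):
    """Return the entries with indices low..high, both ends inclusive.

    A high index past the end of the list is tolerated: slicing simply
    stops at the last entry.
    """
    return names[low:high + 1]

def chunk_ranges(total, size):
    """Yield inclusive (low, high) index pairs covering 0..total-1."""
    for low in range(0, total, size):
        yield low, min(low + size, total) - 1

# Index pairs for 10,000 files split into batch jobs of 2,500 tasks.
for low, high in chunk_ranges(10000, 2500):
    print("makeBatchN.py filelist.txt %d %d" % (low, high))
```

Computing the pairs rather than typing them by hand avoids gaps or overlaps between consecutive batch files.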
| # PDB files | First method: copy files (seconds) | Improved method: fetch Web files (seconds) | Batch job, multiple tasks (seconds) |
|---|---|---|---|
| 10 files | 1 | 1 | 1 |
| 100 files | 8 | 8 | 4 |
| 1,000 files | 162 | 114 | 55 |
| 2,000 files | 1,755 | 271 | 157 |
| 3,000 files | 2,959 | 443 | 305 |
| 5,000 files | 3,595 | 868 | 500 |
| 10,000 files | 3,732 | Job fails | Job fails |
Note 1 (5,000 files, batch job): the job was submitted as two concurrent batch jobs of 2,500 tasks each. The first terminated after 242 seconds, the second after 500 seconds.
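To make the comparison between the three methods easier to read, the ratios between the columns can be computed directly. The sketch below uses only the times listed in the table for 1,000 to 5,000 files; the speedup helper is a hypothetical name introduced for illustration.

```python
# Times in seconds copied from the table:
# (copy files, fetch Web files, batch job with multiple tasks).
times = {
    1000: (162, 114, 55),
    2000: (1755, 271, 157),
    3000: (2959, 443, 305),
    5000: (3595, 868, 500),
}

def speedup(slow, fast):
    """Return how many times faster the second timing is than the first."""
    return slow / fast

for count, (copy, web, batch) in sorted(times.items()):
    print("%5d files: batch is %.1fx faster than copying, %.1fx faster than Web fetch"
          % (count, speedup(copy, batch), speedup(web, batch)))
```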