Debugging an MPI application is sometimes like finding a needle in a haystack
As usual, I am running some tests before releasing a new version of Ray. This time, the release will be Ray 2.3.0.
My job 10446216 on colosse (colossus in English) -- the supercomputer of Laval University operated by Calcul Québec (Compute Quebec in English) -- failed, and I did not know why.
All the jobs that run on colosse are automatically profiled by a formidable array of tools. And the nice thing is that I don't need to do anything fancy to get these runtime profiles.
All I need is my job identifier. With it, I go to https://www.clumeq.ca/users/common/report/myjobs/10446216/0/#
The runtime report that Calcul Québec generated automatically for me is shown below. The job used 512 cores, spread over 64 machines with 8 cores each.
Just from the first figure, one can see that one of the machines crashed (the blue line). Such an event may be caused by a bug in the software I used (in this case, a development version of Ray).
Basically, for each machine, I get 4 metrics sampled throughout the computation:
- CPU usage,
- Memory usage,
- Input/Output usage, and
- Communication.
We can see that something strange happened to r101-n57 (# 19).
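One way to narrow the search is to list the MPI ranks that were running on the suspicious machine. The following is only a minimal sketch, not something taken from the Calcul Québec report. It assumes a Torque-style node file with one line per core slot (as `$PBS_NODEFILE` usually provides), and it assumes that ranks were placed in node-file order, which depends on the mapping policy of the MPI launcher. The helper name `ranks_on_host` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print the 0-based MPI ranks whose slot in the node
# file belongs to the given host.
# Assumption: rank i runs on the host named on line i+1 of the node file
# (block placement in node-file order). This is NOT guaranteed; it depends
# on how the MPI launcher maps processes to slots.
ranks_on_host() {
    local host="$1" nodefile="$2"
    # NR is the 1-based line number, so NR - 1 is the assumed rank.
    awk -v h="$host" '$1 == h { print NR - 1 }' "$nodefile"
}
```

Inside a job, it could be called as `ranks_on_host r101-n57 "$PBS_NODEFILE"`. The ranks it prints are the ones whose log output is worth reading first.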
Usage status overview
Information
Job ID: 10446216
Task ID: 0
Project identifier (RAPI): nne-790-ac
Number of cores: 512
Queue: med
Wallclock: 11h 19m 24s
Submit time: 2013-09-04 16:06:50
Start time: 2013-09-04 20:36:48
End time: 2013-09-05 07:56:12
Submit script:
#PBS -W x=ENVREQUESTED:TRUE
#PBS -q med
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-1
#PBS -o HiSeq-2500-NA12878-demo-2x150-1.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-1.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00
#PBS -l nodes=64:ppn=8
cd $PBS_O_WORKDIR
module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/Ray/2.3.0-devel-b3e6b07764f71318408de5fbe632a41ae29c2105-1
mpiexec -n 512 \
Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-1 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \
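The script requests `nodes=64:ppn=8` and launches `mpiexec -n 512`, so the two have to agree (64 × 8 = 512). A check of that kind could be added before the `mpiexec` line. This is only a sketch under assumptions: it expects a Torque-style node file with one line per core slot, and the helper names `count_slots`, `count_hosts` and `check_allocation` are hypothetical, not part of PBS or Ray.

```shell
#!/bin/bash
# Hypothetical helpers for a Torque-style node file (one line per core slot).

# Count the non-empty lines, i.e. the number of core slots.
count_slots() {
    awk 'NF { n++ } END { print n + 0 }' "$1"
}

# Count the distinct host names in the first column.
count_hosts() {
    awk 'NF { seen[$1] = 1 } END { n = 0; for (h in seen) n++; print n }' "$1"
}

# Compare the slot count with the expected number of MPI ranks.
# Prints a summary on success; prints an error to stderr and returns 1
# on a mismatch.
check_allocation() {
    local nodefile="$1" expected="$2"
    local slots
    slots=$(count_slots "$nodefile")
    if [ "$slots" -ne "$expected" ]; then
        echo "expected $expected slots, found $slots" >&2
        return 1
    fi
    echo "$slots slots on $(count_hosts "$nodefile") hosts"
}
```

In the job above, the call would be `check_allocation "$PBS_NODEFILE" 512 || exit 1`. A check like this only catches a wrong allocation at start-up; it does not detect a machine that crashes during the run.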