Debugging an MPI application is sometimes like finding a needle in a haystack

I am running the usual tests before releasing a new version of Ray. This time, it is Ray 2.3.0.

My job 10446216 on colosse (colossus in English) -- the supercomputer of Laval University operated by Calcul Québec (Compute Quebec in English) -- failed, and I did not know why.

All the jobs that run on colosse are automatically profiled by a formidable array of tools. And the nice thing is that I don't need to do anything fancy to get these runtime profiles.

All I need is my job identifier, which I use to open https://www.clumeq.ca/users/common/report/myjobs/10446216/0/#

Below is the runtime report that Calcul Québec generated automatically for me. The job used 512 cores, spread over 64 8-core machines.
Just from the first figure, one can see that one of the machines crashed (the blue line). Such an event may be caused by a bug in the software I used (in this case, a development version of Ray).

Basically, for each machine, I get 4 metrics sampled throughout the computation:

  • CPU usage, 
  • Memory usage, 
  • Input/Output usage, and 
  • Communication.

We can see that something strange happened to r101-n57 (# 19).
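Spotting the odd machine by eye works for one job, but the same idea can be sketched in a few lines of Python. This is only an illustration: the function names, the thresholds, and the sample values are all made up, and the real report does not necessarily expose its data in this form. The sketch assumes each machine comes with a list of CPU usage samples (in percent) in time order, and flags any machine that went idle well before the others.

```python
from statistics import median


def last_active_index(samples, idle_threshold):
    """Index of the last sample above idle_threshold, or -1 if there is none."""
    last = -1
    for i, value in enumerate(samples):
        if value > idle_threshold:
            last = i
    return last


def find_suspect_nodes(cpu_by_node, idle_threshold=5.0, tolerance=1):
    """Return the nodes that stopped working noticeably earlier than the typical node."""
    last_active = {
        node: last_active_index(samples, idle_threshold)
        for node, samples in cpu_by_node.items()
    }
    typical = median(last_active.values())
    return sorted(
        node for node, index in last_active.items() if index < typical - tolerance
    )


# Invented samples for three hypothetical machines (not taken from the report).
cpu_by_node = {
    "node-a": [98.0, 97.5, 99.0, 98.2, 97.9, 98.8],
    "node-b": [97.1, 98.4, 0.3, 0.0, 0.0, 0.0],  # goes idle after the second sample
    "node-c": [99.2, 98.9, 98.1, 97.7, 98.5, 99.0],
}

print(find_suspect_nodes(cpu_by_node))  # ['node-b']
```

Comparing against the median rather than a fixed time keeps the check independent of how long the job ran.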

Usage status overview

Job status and usage overview


Job ID: 10446216
Task ID: 0
Project identifier (RAPI): nne-790-ac
Number of cores: 512
Queue: med
Wallclock: 11h 19m 24s
Submit time: 2013-09-04 16:06:50
Start time: 2013-09-04 20:36:48
End time: 2013-09-05 07:56:12
Submit script:
#PBS -q med
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-1
#PBS -o HiSeq-2500-NA12878-demo-2x150-1.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-1.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00

#PBS -l nodes=64:ppn=8

module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/Ray/2.3.0-devel-b3e6b07764f71318408de5fbe632a41ae29c2105-1

mpiexec -n 512 \
Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-1 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150
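Before blaming the software, it is worth checking that the report is consistent with what was requested. The following Python sketch does two sanity checks using only numbers shown above: that nodes × ppn matches the number of MPI ranks given to mpiexec, and that the start and end times give the reported wallclock. The helper names are hypothetical, and the regular expressions only handle lines written exactly as in this submit script.

```python
import re
from datetime import datetime


def requested_cores(script):
    """Cores requested by a '#PBS -l nodes=N:ppn=M' line (nodes times cores per node)."""
    match = re.search(r"^#PBS -l nodes=(\d+):ppn=(\d+)\s*$", script, re.MULTILINE)
    if match is None:
        raise ValueError("no nodes=N:ppn=M request found")
    nodes, ppn = (int(group) for group in match.groups())
    return nodes * ppn


def launched_ranks(script):
    """Number of MPI ranks passed to 'mpiexec -n'."""
    match = re.search(r"^mpiexec -n (\d+)", script, re.MULTILINE)
    if match is None:
        raise ValueError("no 'mpiexec -n N' line found")
    return int(match.group(1))


def wallclock(start, end):
    """Elapsed time between two 'YYYY-MM-DD HH:MM:SS' timestamps."""
    fmt = "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(end, fmt) - datetime.strptime(start, fmt)


# Relevant lines copied from the submit script above.
script = """#PBS -l walltime=02:00:00:00
#PBS -l nodes=64:ppn=8
mpiexec -n 512 Ray -k 31
"""

print(requested_cores(script), launched_ranks(script))  # 512 512

elapsed = wallclock("2013-09-04 20:36:48", "2013-09-05 07:56:12")
print(elapsed)  # 11:19:24
```

Both checks pass here: 64 × 8 = 512 ranks, and the elapsed time matches the 11h 19m 24s wallclock, so the job ran well inside its requested walltime.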
