Debugging an MPI application is sometimes like finding a needle in a haystack

I am running some tests, as usual, before releasing a new version of Ray. This time, the release will be Ray 2.3.0.

My job 10446216 on colosse (colossus in English) -- the supercomputer of Laval University operated by Calcul Québec (Compute Quebec in English) -- failed, and I did not know why.

All the jobs that run on colosse are automatically profiled by a formidable array of tools, and the nice thing is that I don't need to do anything fancy to get these runtime profiles.

All I need is my job identifier. With it, I go to https://www.clumeq.ca/users/common/report/myjobs/10446216/0/#

The runtime report that Calcul Québec automatically generated for me is shown below. The job used 512 cores, running on 64 machines with 8 cores each.
Just from the first figure, one can see that one of the machines crashed (the blue line). Such an event may be caused by a bug in the software I used (in this case, a development version of Ray).
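
The core count can be cross-checked against the submit script reproduced further down: the request nodes=64:ppn=8 asks for 64 x 8 = 512 cores, and mpiexec -n 512 starts 512 MPI processes, presumably one per core. Here is a minimal Python sketch of that sanity check. The helper names requested_cores and mpi_ranks are made up for illustration; they are not part of Ray, PBS, or the Calcul Québec tools.

```python
import re


def requested_cores(pbs_line: str) -> int:
    """Return nodes * ppn from a '#PBS -l nodes=N:ppn=P' directive."""
    match = re.search(r"nodes=(\d+):ppn=(\d+)", pbs_line)
    if match is None:
        raise ValueError(f"no nodes/ppn request found in: {pbs_line!r}")
    nodes, ppn = (int(group) for group in match.groups())
    return nodes * ppn


def mpi_ranks(command: str) -> int:
    """Return the process count given to mpiexec with -n."""
    match = re.search(r"mpiexec\s+-n\s+(\d+)", command)
    if match is None:
        raise ValueError(f"no 'mpiexec -n <count>' found in: {command!r}")
    return int(match.group(1))


# Both strings are copied from the submit script of job 10446216.
cores = requested_cores("#PBS -l nodes=64:ppn=8")
ranks = mpi_ranks("mpiexec -n 512 Ray -k 31")
print(cores, ranks, cores == ranks)
```

If the two numbers disagree, the job either leaves cores idle or oversubscribes them, which is worth ruling out before blaming the application.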

Basically, for each machine, I get four metrics sampled throughout the computation:

  • CPU usage,
  • Memory usage,
  • Input/Output usage, and
  • Communication.

We can see that something strange happened to r101-n57 (# 19).
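
Spotting the odd machine by eye works with 64 curves, but the same idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the raw data behind the report is not available here, so the (node, minute, cpu_percent) sample format, the function name nodes_that_went_quiet, the 1% idle threshold, and every number in the example are invented. Only the node name r101-n57 comes from the report.

```python
def nodes_that_went_quiet(samples, idle_threshold=1.0):
    """Return nodes whose last busy sample precedes the last sample of the job.

    samples: iterable of (node, minute, cpu_percent) tuples (assumed format).
    A node counts as busy when cpu_percent is above idle_threshold.
    """
    last_busy = {}   # node -> minute of its most recent busy sample, or None
    job_end = None   # latest minute seen on any node
    for node, minute, cpu_percent in samples:
        last_busy.setdefault(node, None)
        if job_end is None or minute > job_end:
            job_end = minute
        if cpu_percent > idle_threshold:
            previous = last_busy[node]
            if previous is None or minute > previous:
                last_busy[node] = minute
    return sorted(
        node
        for node, minute in last_busy.items()
        if minute is None or minute < job_end
    )


# Invented numbers: one node keeps computing, r101-n57 drops to 0% CPU.
samples = [
    ("other-node-1", 0, 95.0), ("other-node-1", 10, 96.0), ("other-node-1", 20, 97.0),
    ("r101-n57", 0, 94.0), ("r101-n57", 10, 0.0), ("r101-n57", 20, 0.0),
]
print(nodes_that_went_quiet(samples))
```

A node is also flagged when it simply stops reporting, since its last busy sample is then older than the end of the job. The check is crude: if every node is idle in the final sample, all of them are flagged, so it is a first filter and not a diagnosis.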

Usage status overview

Job status and usage overview

Information

Job ID: 10446216
Task ID: 0
Project identifier (RAPI): nne-790-ac
Number of cores: 512
Queue: med
Wallclock: 11h 19m 24s
Submit time: 2013-09-04 16:06:50
Start time: 2013-09-04 20:36:48
End time: 2013-09-05 07:56:12
Submit script:
#PBS -W x=ENVREQUESTED:TRUE
#PBS -q med
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-1
#PBS -o HiSeq-2500-NA12878-demo-2x150-1.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-1.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00

#PBS -l nodes=64:ppn=8
 
cd $PBS_O_WORKDIR

module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/Ray/2.3.0-devel-b3e6b07764f71318408de5fbe632a41ae29c2105-1


mpiexec -n 512 \
Ray -k 31 \
-o HiSeq-2500-NA12878-demo-2x150-1 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150 \




