2013-02-13

A catalog of IBM Blue Gene/Q errors (for science)

Once in a while, I get to run things on an IBM Blue Gene/Q. And when that happens, some of my jobs crash with seemingly random errors.

For science, here are some of them.

Update 2013-02-27, with an MPI I/O error:

This requires fcntl(2) to be implemented. As of 8/25/2011 it is not. Generic MPICH Message: File locking failed in ADIOI_Set_lock(fd 4,cmd F_SETLKW/E,type F_WRLCK/1,whence 0) with return value FFFFFFFF and errno 23.
- If the file system is NFS, you need to use NFS version 3, ensure that the lockd daemon is running on all the machines, and mount the directory with the 'noac' option (no attribute caching).
- If the file system is LUSTRE, ensure that the directory is mounted with the 'flock' option.
ADIOI_Set_lock:: Resource deadlock avoided
ADIOI_Set_lock:offset 25625204815, length 6282003
Abort(1) on node 2501 (rank 2501 in comm 1140850688): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 2501


SRA056234-Picea-glauca-2013-02-12-1

2013-02-13 04:56:26.627 (FATAL) [0x40001138a50] :1555:ibm.runjob.client.Job: terminated due to: killing the job timed out
2013-02-13 04:56:26.628 (FATAL) [0x40001138a50] :1555:ibm.runjob.client.Job: abnormal termination by signal 35 from rank 2712 due to RAS event with record ID 279083. END_JOB control action heartbeat timed out after 60 seconds
2013-02-13 04:56:26.628 (FATAL) [0x40001138a50] :1555:ibm.runjob.client.Job: 937 RAS events
2013-02-13 04:56:26.628 (FATAL) [0x40001138a50] :1555:ibm.runjob.client.Job: most recent RAS event text: CFAM Machine Check. Message=REASON: Core4 failed (uncorrectable error).  DETAILS: CFAM_Status=0xc0000000, MachineCheck [Core 4 Chiplet chkstp reg=0x8400000000000000: , Summary bit for xfir_lt, Chkstp from FIR1 [Core 4 PCB FIR1=0x0000000020000000: , A2-L2 UE]]!, DrillDown=CFAM_Status=0xc0000000, MachineCheck [Core 4 Chiplet chkstp reg=0x8400000000000000: , Summary bit for xfir_lt, Chkstp from FIR1 [Core 4 PCB FIR1=0x0000000020000000: , A2-L2 UE]]!


SRA056234-Picea-glauca-2012-12-22-13

2012-12-22 21:40:47.545 (FATAL) [0x40000ee8a50] :23842:ibm.runjob.client.Job: could not start job: block is unavailable due to a previous failure
2012-12-22 21:40:47.546 (FATAL) [0x40000ee8a50] :23842:ibm.runjob.client.Job: node R00-M0-N00-J00 is not available: Software Failure


SRA056234-Picea-glauca-2013-01-18-15

2013-01-19 13:49:22.860 (FATAL) [0x40001138a50] :1474:ibm.runjob.client.Job: terminated due to: killing the job timed out
2013-01-19 13:49:22.861 (FATAL) [0x40001138a50] :1474:ibm.runjob.client.Job: abnormal termination by signal 35 from rank 3345 due to RAS event with record ID 259250. END_JOB control action heartbeat timed out after 120 seconds
2013-01-19 13:49:22.861 (FATAL) [0x40001138a50] :1474:ibm.runjob.client.Job: 189 RAS events
2013-01-19 13:49:22.861 (FATAL) [0x40001138a50] :1474:ibm.runjob.client.Job: most recent RAS event text: A BQL double bit error threshold was exceeded for Switch 2 Group EVEN and ODD


SRA056234-Picea-glauca-2013-02-10-1

2013-02-11 16:10:37.815 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: received signal 15
2013-02-11 16:10:37.815 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: signal sent from USER
2013-02-11 16:10:37.815 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: sent from pid 12709
2013-02-11 16:10:37.816 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: could not read /proc/12709/exe
2013-02-11 16:10:37.816 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: Permission denied
2013-02-11 16:10:37.817 (WARN ) [0x40001138a50] :ibm.runjob.LogSignalInfo: sent from uid 0 (root)
2013-02-11 16:10:42.677 (WARN ) [0x40001138a50] :1553:ibm.runjob.client.Job: terminated by signal 9
2013-02-11 16:10:42.677 (WARN ) [0x40001138a50] :1553:ibm.runjob.client.Job: abnormal termination by signal 9 from rank 10513
2013-02-11 16:10:42.677 (WARN ) [0x40001138a50] :1553:ibm.runjob.client.Job: 139 RAS events
2013-02-11 16:10:42.677 (WARN ) [0x40001138a50] :1553:ibm.runjob.client.Job: most recent RAS event text: DDR Correctable Error Summary : count=1 MCFIR error status:  [MEMORY_CE] This bit is set when a memory CE is detected on a non-maintenance memory read op;


Comments:

Rob said...

To avoid the MPI-IO locking error, try prefixing your file name with "bglockless:".
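
As a rough sketch of that suggestion in C: the prefix is placed at the front of the file name handed to MPI_File_open, which (as far as I understand) tells the MPI-IO layer on Blue Gene to pick a driver that skips fcntl() file locking. The helper name with_bglockless_prefix and the file name output.dat are hypothetical, made up for illustration. The MPI call itself is only shown in a comment so that the snippet compiles without an MPI installation.

```c
#include <stdio.h>
#include <string.h>

/* File name prefix from the comment above. */
#define BGLOCKLESS_PREFIX "bglockless:"

/*
 * Hypothetical helper (not part of MPI or ROMIO):
 * writes "bglockless:" followed by path into out.
 * Returns 0 on success, -1 if an argument is invalid
 * or out is too small to hold the result.
 */
int with_bglockless_prefix(const char *path, char *out, size_t out_size)
{
    if (path == NULL || out == NULL || out_size == 0)
        return -1;

    int n = snprintf(out, out_size, "%s%s", BGLOCKLESS_PREFIX, path);

    /* snprintf returns the length it wanted to write; reject truncation. */
    if (n < 0 || (size_t)n >= out_size)
        return -1;

    return 0;
}

/*
 * Usage inside an MPI program (not compiled here, needs <mpi.h>):
 *
 *   char name[4096];
 *   MPI_File fh;
 *   if (with_bglockless_prefix("output.dat", name, sizeof name) == 0)
 *       MPI_File_open(MPI_COMM_WORLD, name,
 *                     MPI_MODE_CREATE | MPI_MODE_WRONLY,
 *                     MPI_INFO_NULL, &fh);
 */
```

Since no locks are taken with this prefix, it is probably only safe when the ranks write to non-overlapping regions of the file.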
