The polytope router and human genomes

- October 07, 2013

My test job for a human genome (dataset: HiSeq-2500-NA12878-demo-2x150) just completed on colosse.calculquebec.ca. The dataset is gzip-compressed and takes up 145 GiB on disk.

18G   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz
19G   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz
19G   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz
19G   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz
18G   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz
18G   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz
19G   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz
19G   HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz

145G   HiSeq-2500-NA12878-demo-2x150/

It took 46 hours to assemble 1,171,357,300 short sequences into a human genome. The longest step is still "Merging of redundant paths". The table below gives the running time of each step.


Step	Date	Elapsed time	Since Beginning
Network testing	2013-10-03T11:33:31	2 seconds	2 seconds
Counting sequences to assemble	2013-10-03T11:41:49	8 minutes, 18 seconds	8 minutes, 20 seconds
Sequence loading	2013-10-03T12:52:06	1 hours, 10 minutes, 17 seconds	1 hours, 18 minutes, 37 seconds
K-mer counting	2013-10-03T13:18:58	26 minutes, 52 seconds	1 hours, 45 minutes, 29 seconds
Coverage distribution analysis	2013-10-03T13:19:13	15 seconds	1 hours, 45 minutes, 44 seconds
Graph construction	2013-10-03T13:54:47	35 minutes, 34 seconds	2 hours, 21 minutes, 18 seconds
Null edge purging	2013-10-03T14:01:09	6 minutes, 22 seconds	2 hours, 27 minutes, 40 seconds
Selection of optimal read markers	2013-10-03T14:51:05	49 minutes, 56 seconds	3 hours, 17 minutes, 36 seconds
Detection of assembly seeds	2013-10-03T17:43:04	2 hours, 51 minutes, 59 seconds	6 hours, 9 minutes, 35 seconds
Estimation of outer distances for paired reads	2013-10-03T17:59:38	16 minutes, 34 seconds	6 hours, 26 minutes, 9 seconds
Bidirectional extension of seeds	2013-10-04T00:14:34	6 hours, 14 minutes, 56 seconds	12 hours, 41 minutes, 5 seconds
Merging of redundant paths	2013-10-05T01:43:40	1 days, 1 hours, 29 minutes, 6 seconds	1 days, 14 hours, 10 minutes, 11 seconds
Generation of contigs	2013-10-05T02:12:52	29 minutes, 12 seconds	1 days, 14 hours, 39 minutes, 23 seconds
Scaffolding of contigs	2013-10-05T09:27:55	7 hours, 15 minutes, 3 seconds	1 days, 21 hours, 54 minutes, 26 seconds
Counting sequences to search	2013-10-05T09:27:55	0 seconds	1 days, 21 hours, 54 minutes, 26 seconds
Graph coloring	2013-10-05T09:28:09	14 seconds	1 days, 21 hours, 54 minutes, 40 seconds
Counting contig biological abundances	2013-10-05T09:34:05	5 minutes, 56 seconds	1 days, 22 hours, 36 seconds
Counting sequence biological abundances	2013-10-05T09:34:05	0 seconds	1 days, 22 hours, 36 seconds
Loading taxons	2013-10-05T09:34:17	12 seconds	1 days, 22 hours, 48 seconds
Loading tree	2013-10-05T09:34:31	14 seconds	1 days, 22 hours, 1 minutes, 2 seconds
Processing gene ontologies	2013-10-05T09:34:55	24 seconds	1 days, 22 hours, 1 minutes, 26 seconds
Computing neighbourhoods	2013-10-05T09:34:55	0 seconds	1 days, 22 hours, 1 minutes, 26 seconds

In short, the distributed merger combines identical paths to remove redundancy from the assembly. This is necessary because 512 ranks explore the same distributed graph at the same time, and they sometimes run into each other.

My director Jacques Corbeil wants to assemble a human genome in 1 hour (using titan.ccs.ornl.gov). To reach that goal, the step of merging redundant paths must be improved.

Here is the submission script. Note that the new -detect-sequence-files option of "Smart Ray" automatically works out the list of required options.

$ cat HiSeq-2500-NA12878-demo-2x150-11.sh
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-11
#PBS -o HiSeq-2500-NA12878-demo-2x150-11.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-11.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00
#PBS -l nodes=64:ppn=8
#PBS -l gattr=ckpt

cd $PBS_O_WORKDIR

module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/seb-devtools/1.0.0

mpiexec -n 512 \
-output-filename HiSeq-2500-NA12878-demo-2x150-11 \
apps/ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-o HiSeq-2500-NA12878-demo-2x150-11 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150

The -route-messages option activates the polytope software message router. This reduces latency on low-end supercomputers.

As usual, Ray reports a summary of what it did:

Scaffolds >= 500 nt
Number: 460241
Total length: 2690216650
Average: 5845
N50: 10495
Median: 3356
Largest: 160698
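
For readers unfamiliar with the N50 line in this summary, here is a minimal Python sketch of the usual definition: the length L such that scaffolds of length L or more account for at least half of the total length. This is the textbook definition, not a copy of Ray's own code, and Ray may handle ties or rounding differently. The toy list of lengths is hypothetical; only the two totals in the last check come from the summary above.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (textbook definition)."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy scaffolds (not from the real assembly): total is 30,
# and 10 + 6 = 16 is the first running sum to reach half of it.
toy_lengths = [2, 3, 4, 5, 6, 10]
print(n50(toy_lengths))  # 6

# Sanity check on the reported numbers: total length / number of scaffolds.
print(2690216650 // 460241)  # 5845, matching "Average: 5845"
```

An N50 of 10495 therefore means that half of the 2690216650 assembled nucleotides sit in scaffolds of at least 10495 nt.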

Another interesting feature is the MessageRouter final report on the polytope values:

[MessageRouter] Rank 0 will die in 16 seconds, will not route anything after that point.
[Polytope] Load values:
AlphabetSize: 8
WordLength: 3
Self: 0,0,0
0,0,0 (0) -> 1,0,0 (1) Load: 69520331
0,0,0 (0) -> 2,0,0 (2) Load: 67191666
0,0,0 (0) -> 3,0,0 (3) Load: 68039340
0,0,0 (0) -> 4,0,0 (4) Load: 68091189
0,0,0 (0) -> 5,0,0 (5) Load: 68122849
0,0,0 (0) -> 6,0,0 (6) Load: 67318393
0,0,0 (0) -> 7,0,0 (7) Load: 70024675
0,0,0 (0) -> 0,1,0 (8) Load: 67981950
0,0,0 (0) -> 0,2,0 (16) Load: 66576237
0,0,0 (0) -> 0,3,0 (24) Load: 67868556
0,0,0 (0) -> 0,4,0 (32) Load: 67772883
0,0,0 (0) -> 0,5,0 (40) Load: 68159708
0,0,0 (0) -> 0,6,0 (48) Load: 67893735
0,0,0 (0) -> 0,7,0 (56) Load: 67753182
0,0,0 (0) -> 0,0,1 (64) Load: 66892153
0,0,0 (0) -> 0,0,2 (128) Load: 73895416
0,0,0 (0) -> 0,0,3 (192) Load: 75511892
0,0,0 (0) -> 0,0,4 (256) Load: 71191911
0,0,0 (0) -> 0,0,5 (320) Load: 73344055
0,0,0 (0) -> 0,0,6 (384) Load: 73852478
0,0,0 (0) -> 0,0,7 (448) Load: 68183524
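
The report above can be read as follows: with AlphabetSize 8 and WordLength 3, each of the 8^3 = 512 ranks gets a 3-symbol word, and rank 0 has a direct link to every rank whose word differs from its own in exactly one position, which gives 3 x 7 = 21 links. The Python sketch below reproduces that addressing. The digit order (first symbol is the least significant) is inferred from lines such as "0,0,1 (64)". The route() function is an assumption made for illustration: it fixes differing symbols from left to right, which is not necessarily the order Ray's router uses.

```python
ALPHABET_SIZE = 8  # "AlphabetSize: 8" in the report
WORD_LENGTH = 3    # "WordLength: 3" in the report


def rank_to_word(rank):
    """Convert a rank to its word, least significant symbol first."""
    symbols = []
    for _ in range(WORD_LENGTH):
        symbols.append(rank % ALPHABET_SIZE)
        rank //= ALPHABET_SIZE
    return tuple(symbols)


def word_to_rank(word):
    """Inverse of rank_to_word."""
    return sum(s * ALPHABET_SIZE ** i for i, s in enumerate(word))


def neighbours(rank):
    """Ranks whose word differs from this rank's word in exactly one position."""
    word = rank_to_word(rank)
    result = []
    for position in range(WORD_LENGTH):
        for symbol in range(ALPHABET_SIZE):
            if symbol != word[position]:
                other = list(word)
                other[position] = symbol
                result.append(word_to_rank(other))
    return result


def route(source, destination):
    """Hypothetical route: fix differing symbols one position at a time."""
    current = list(rank_to_word(source))
    target = rank_to_word(destination)
    path = [source]
    for position in range(WORD_LENGTH):
        if current[position] != target[position]:
            current[position] = target[position]
            path.append(word_to_rank(current))
    return path


print(ALPHABET_SIZE ** WORD_LENGTH)  # 512 ranks in total
print(rank_to_word(64))              # (0, 0, 1)
print(len(neighbours(0)))            # 21 direct links, as in the report
print(route(0, 511))                 # [0, 7, 63, 511]
```

Under these assumptions, any message reaches its destination in at most 3 hops, and each rank only talks directly to 21 peers instead of 511.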

To conclude, this dataset requires 46 hours on 512 Xeon cores. Of those 46 hours, about 25 are consumed by the merger and about 7 by the scaffolder.
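
These shares can be recomputed from the timing table. The Python sketch below parses the duration strings exactly as Ray prints them and reports how much of the run the two slowest steps take. The helper name parse_duration is made up for this example; the three duration strings are copied from the table (merger, scaffolder, and the final "Since Beginning" value).

```python
import re

UNIT_SECONDS = {"days": 86400, "hours": 3600, "minutes": 60, "seconds": 1}


def parse_duration(text):
    """Convert a string like '1 days, 1 hours, 29 minutes, 6 seconds' to seconds."""
    total = 0
    for value, unit in re.findall(r"(\d+) (days|hours|minutes|seconds)", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


# Values copied from the timing table.
merger = parse_duration("1 days, 1 hours, 29 minutes, 6 seconds")
scaffolder = parse_duration("7 hours, 15 minutes, 3 seconds")
whole_run = parse_duration("1 days, 22 hours, 1 minutes, 26 seconds")

print(f"merger:     {merger / whole_run:.1%} of the run")
print(f"scaffolder: {scaffolder / whole_run:.1%} of the run")
print(f"everything else: {(whole_run - merger - scaffolder) / 3600:.1f} hours")
```

The merger alone accounts for a bit more than half of the run (roughly 55%), and the scaffolder for roughly 16%, which is why these two steps are the first candidates for optimization.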
