2013-10-07

The polytope router and human genomes

My test job for a human genome (dataset: HiSeq-2500-NA12878-demo-2x150) just completed on colosse.calculquebec.ca. This dataset is compressed and its size is 145 GiB.

18G    HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_001.fastq.gz 
19G    HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R1_002.fastq.gz
19G    HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_001.fastq.gz
19G    HiSeq-2500-NA12878-demo-2x150/sorted_S1_L001_R2_002.fastq.gz
18G    HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_001.fastq.gz
18G    HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R1_002.fastq.gz
19G    HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_001.fastq.gz
19G    HiSeq-2500-NA12878-demo-2x150/sorted_S1_L002_R2_002.fastq.gz


145G    HiSeq-2500-NA12878-demo-2x150/


It took 46 hours to assemble 1 171 357 300 short sequences into a human genome. The longest step is still "Merging of redundant paths". Here is a table with the running time of each step.

Step | Date | Elapsed time | Since beginning
Network testing | 2013-10-03T11:33:31 | 2 seconds | 2 seconds
Counting sequences to assemble | 2013-10-03T11:41:49 | 8 minutes, 18 seconds | 8 minutes, 20 seconds
Sequence loading | 2013-10-03T12:52:06 | 1 hours, 10 minutes, 17 seconds | 1 hours, 18 minutes, 37 seconds
K-mer counting | 2013-10-03T13:18:58 | 26 minutes, 52 seconds | 1 hours, 45 minutes, 29 seconds
Coverage distribution analysis | 2013-10-03T13:19:13 | 15 seconds | 1 hours, 45 minutes, 44 seconds
Graph construction | 2013-10-03T13:54:47 | 35 minutes, 34 seconds | 2 hours, 21 minutes, 18 seconds
Null edge purging | 2013-10-03T14:01:09 | 6 minutes, 22 seconds | 2 hours, 27 minutes, 40 seconds
Selection of optimal read markers | 2013-10-03T14:51:05 | 49 minutes, 56 seconds | 3 hours, 17 minutes, 36 seconds
Detection of assembly seeds | 2013-10-03T17:43:04 | 2 hours, 51 minutes, 59 seconds | 6 hours, 9 minutes, 35 seconds
Estimation of outer distances for paired reads | 2013-10-03T17:59:38 | 16 minutes, 34 seconds | 6 hours, 26 minutes, 9 seconds
Bidirectional extension of seeds | 2013-10-04T00:14:34 | 6 hours, 14 minutes, 56 seconds | 12 hours, 41 minutes, 5 seconds
Merging of redundant paths | 2013-10-05T01:43:40 | 1 days, 1 hours, 29 minutes, 6 seconds | 1 days, 14 hours, 10 minutes, 11 seconds
Generation of contigs | 2013-10-05T02:12:52 | 29 minutes, 12 seconds | 1 days, 14 hours, 39 minutes, 23 seconds
Scaffolding of contigs | 2013-10-05T09:27:55 | 7 hours, 15 minutes, 3 seconds | 1 days, 21 hours, 54 minutes, 26 seconds
Counting sequences to search | 2013-10-05T09:27:55 | 0 seconds | 1 days, 21 hours, 54 minutes, 26 seconds
Graph coloring | 2013-10-05T09:28:09 | 14 seconds | 1 days, 21 hours, 54 minutes, 40 seconds
Counting contig biological abundances | 2013-10-05T09:34:05 | 5 minutes, 56 seconds | 1 days, 22 hours, 36 seconds
Counting sequence biological abundances | 2013-10-05T09:34:05 | 0 seconds | 1 days, 22 hours, 36 seconds
Loading taxons | 2013-10-05T09:34:17 | 12 seconds | 1 days, 22 hours, 48 seconds
Loading tree | 2013-10-05T09:34:31 | 14 seconds | 1 days, 22 hours, 1 minutes, 2 seconds
Processing gene ontologies | 2013-10-05T09:34:55 | 24 seconds | 1 days, 22 hours, 1 minutes, 26 seconds
Computing neighbourhoods | 2013-10-05T09:34:55 | 0 seconds | 1 days, 22 hours, 1 minutes, 26 seconds

Basically, the distributed merger puts identical things together to remove redundancy in the assembly. This is necessary because there are 512 ranks exploring the same distributed graph and they sometimes meet each other.
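To give a feel for what "putting identical things together" means, here is a toy sketch in Python. It is not Ray's distributed merger: it runs in a single process, treats each path as a plain DNA string, and only collapses paths that are identical or reverse complements of each other. The function names (reverse_complement, canonical, merge_redundant_paths) are made up for this illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def canonical(seq):
    """Pick one representative for a sequence and its reverse complement."""
    return min(seq, reverse_complement(seq))


def merge_redundant_paths(paths):
    """Keep the first occurrence of each path, ignoring strand (toy version)."""
    seen = set()
    unique = []
    for path in paths:
        key = canonical(path)
        if key not in seen:
            seen.add(key)
            unique.append(path)
    return unique


# Made-up paths: the 2nd is the reverse complement of the 1st,
# the 3rd is an exact duplicate of the 1st.
paths = ["AAACCC", "GGGTTT", "AAACCC", "ACGTAA"]
print(merge_redundant_paths(paths))
```

The real problem is much harder, because the paths are long, they may only overlap partially, and they are spread over 512 ranks.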

My director Jacques Corbeil wants to assemble a human genome in 1 hour (using titan.ccs.ornl.gov). To reach that goal, the step of merging redundant paths must be improved.

Here is the submission script. For starters, the new -detect-sequence-files option of "Smart Ray" automatically works out the list of input options required, so the sequence files do not have to be listed by hand.

$ cat HiSeq-2500-NA12878-demo-2x150-11.sh
#PBS -S /bin/bash
#PBS -N HiSeq-2500-NA12878-demo-2x150-11
#PBS -o HiSeq-2500-NA12878-demo-2x150-11.stdout
#PBS -e HiSeq-2500-NA12878-demo-2x150-11.stderr
#PBS -A nne-790-ac
#PBS -l walltime=02:00:00:00
#PBS -l nodes=64:ppn=8
#PBS -l gattr=ckpt

cd $PBS_O_WORKDIR

module use /rap/nne-790-ab/modulefiles
module load nne-790-ab/seb-devtools/1.0.0

mpiexec -n 512 \
-output-filename HiSeq-2500-NA12878-demo-2x150-11 \
apps/ray/616d2a26cc1e39f59325a0e632af46262edaa12c-1/Ray \
-k 31 \
-o HiSeq-2500-NA12878-demo-2x150-11 \
-read-write-checkpoints HiSeq-2500-NA12878-demo-2x150.SavedState \
-route-messages \
-detect-sequence-files HiSeq-2500-NA12878-demo-2x150

The -route-messages option activates the polytope software message router. This reduces the latency on low-end supercomputers.
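The idea can be sketched in a few lines of Python. This assumes the 512 ranks are written as 3-digit words in base 8 (the AlphabetSize and WordLength values that the router reports further down), that two ranks are directly connected when their words differ in exactly one digit, and that a message is forwarded by fixing one differing digit per hop. The order in which digits are fixed is an assumption of this sketch; the real router may choose differently, for instance based on load.

```python
ALPHABET_SIZE = 8  # from the router report in this post
WORD_LENGTH = 3    # from the router report in this post


def to_word(rank):
    """Write a rank as WORD_LENGTH digits in base ALPHABET_SIZE, lowest digit first."""
    digits = []
    for _ in range(WORD_LENGTH):
        digits.append(rank % ALPHABET_SIZE)
        rank //= ALPHABET_SIZE
    return tuple(digits)


def to_rank(word):
    """Inverse of to_word."""
    return sum(digit * ALPHABET_SIZE ** i for i, digit in enumerate(word))


def route(source, destination):
    """List the ranks visited when fixing one differing digit per hop (assumed scheme)."""
    current = list(to_word(source))
    target = to_word(destination)
    path = [source]
    for i in range(WORD_LENGTH):
        if current[i] != target[i]:
            current[i] = target[i]
            path.append(to_rank(current))
    return path


print(route(0, 511))  # [0, 7, 63, 511]
# Direct connections per rank: 3 * 7 = 21 instead of 511 for all-to-all.
print(WORD_LENGTH * (ALPHABET_SIZE - 1))
```

Under these assumptions, any message needs at most 3 hops, and each rank only talks directly to 21 other ranks.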

As usual, Ray reports a summary of what it did:

Scaffolds >= 500 nt
 Number: 460241
 Total length: 2690216650
 Average: 5845
 N50: 10495
 Median: 3356
 Largest: 160698
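For readers not familiar with N50: it is the length L such that scaffolds of length L or more contain at least half of the total length. Here is a minimal Python sketch of that definition, using a made-up list of lengths (not the real scaffolds); it is not necessarily how Ray computes its summary. The last line checks that the reported average is consistent with the reported number and total length.

```python
def n50(lengths):
    """Smallest length L such that items of length >= L cover half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Made-up example lengths, for illustration only.
lengths = [500, 800, 1200, 3000, 4500]
print(n50(lengths))  # 3000

# Consistency check on the summary above: total length / number, rounded down.
print(2690216650 // 460241)  # 5845
```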



Another interesting feature is the MessageRouter final report on the polytope values:

[MessageRouter] Rank 0 will die in 16 seconds, will not route anything after that point.
[Polytope] Load values:
AlphabetSize: 8
WordLength: 3
Self: 0,0,0
  0,0,0 (0) -> 1,0,0 (1) Load: 69520331
  0,0,0 (0) -> 2,0,0 (2) Load: 67191666
  0,0,0 (0) -> 3,0,0 (3) Load: 68039340
  0,0,0 (0) -> 4,0,0 (4) Load: 68091189
  0,0,0 (0) -> 5,0,0 (5) Load: 68122849
  0,0,0 (0) -> 6,0,0 (6) Load: 67318393
  0,0,0 (0) -> 7,0,0 (7) Load: 70024675
  0,0,0 (0) -> 0,1,0 (8) Load: 67981950
  0,0,0 (0) -> 0,2,0 (16) Load: 66576237
  0,0,0 (0) -> 0,3,0 (24) Load: 67868556
  0,0,0 (0) -> 0,4,0 (32) Load: 67772883
  0,0,0 (0) -> 0,5,0 (40) Load: 68159708
  0,0,0 (0) -> 0,6,0 (48) Load: 67893735
  0,0,0 (0) -> 0,7,0 (56) Load: 67753182
  0,0,0 (0) -> 0,0,1 (64) Load: 66892153
  0,0,0 (0) -> 0,0,2 (128) Load: 73895416
  0,0,0 (0) -> 0,0,3 (192) Load: 75511892
  0,0,0 (0) -> 0,0,4 (256) Load: 71191911
  0,0,0 (0) -> 0,0,5 (320) Load: 73344055
  0,0,0 (0) -> 0,0,6 (384) Load: 73852478
  0,0,0 (0) -> 0,0,7 (448) Load: 68183524
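The report can be read as follows: each rank is a word of 3 digits in base 8, lowest digit first, and rank 0 has one edge to every rank whose word differs in exactly one digit. The Python sketch below rebuilds that neighbour list and checks it against the ranks in the report, then looks at how evenly the load is spread. The load values are copied from the report; what exactly one unit of load counts is not stated there, so the sketch only compares the numbers with each other.

```python
ALPHABET_SIZE = 8
WORD_LENGTH = 3


def neighbours(rank):
    """Ranks whose base-8 word differs from this rank's word in exactly one digit."""
    result = []
    for position in range(WORD_LENGTH):
        weight = ALPHABET_SIZE ** position
        digit = (rank // weight) % ALPHABET_SIZE
        for other in range(ALPHABET_SIZE):
            if other != digit:
                result.append(rank + (other - digit) * weight)
    return result


# Loads of the edges leaving rank 0, keyed by neighbour rank (from the report).
loads = {
    1: 69520331, 2: 67191666, 3: 68039340, 4: 68091189,
    5: 68122849, 6: 67318393, 7: 70024675,
    8: 67981950, 16: 66576237, 24: 67868556, 32: 67772883,
    40: 68159708, 48: 67893735, 56: 67753182,
    64: 66892153, 128: 73895416, 192: 75511892, 256: 71191911,
    320: 73344055, 384: 73852478, 448: 68183524,
}

# The 21 ranks in the report are exactly the neighbours of rank 0.
assert sorted(loads) == sorted(neighbours(0))

busiest = max(loads, key=loads.get)
quietest = min(loads, key=loads.get)
print("busiest edge: 0 ->", busiest, "load", loads[busiest])
print("quietest edge: 0 ->", quietest, "load", loads[quietest])
print("max/min ratio:", round(loads[busiest] / loads[quietest], 3))
```

The busiest edge carries only about 13% more than the quietest one, so the load on rank 0 is fairly well balanced.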



So, to conclude, this dataset requires 46 hours with 512 (Xeon) cores. Of those 46 hours, about 25.5 hours are consumed by the merger and about 7 hours by the scaffolder.
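These shares can be checked from the elapsed times in the table. The small Python sketch below converts the durations, written the way they appear in the table, into seconds and computes the fraction of the run spent in the merger and in the scaffolder. The helper to_seconds is a made-up name for this post.

```python
import re

UNIT_SECONDS = {"day": 86400, "hour": 3600, "minute": 60, "second": 1}


def to_seconds(text):
    """Convert a duration like '1 days, 1 hours, 29 minutes, 6 seconds' to seconds."""
    total = 0
    for amount, unit in re.findall(r"(\d+) (day|hour|minute|second)s?", text):
        total += int(amount) * UNIT_SECONDS[unit]
    return total


# Values copied from the table of running times.
merging = to_seconds("1 days, 1 hours, 29 minutes, 6 seconds")
scaffolding = to_seconds("7 hours, 15 minutes, 3 seconds")
total = to_seconds("1 days, 22 hours, 1 minutes, 26 seconds")

print("merger: %.1f hours, %.1f%% of the run" % (merging / 3600, 100 * merging / total))
print("scaffolder: %.1f hours, %.1f%% of the run" % (scaffolding / 3600, 100 * scaffolding / total))
```

So more than half of the run is spent merging redundant paths.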
