Plant genome (white spruce), Illumina HiSeq 2000, IBM Blue Gene/Q, and Ray

The SRA056234 dataset contains reads for Picea glauca (white spruce), obtained with the Illumina HiSeq 2000. Uncompressed, the fastq files take 2.8 TiB.

$ du -sh blocks/
2.8T    blocks/

We are using Ray on an IBM Blue Gene/Q. In particular, we are using 1024 nodes, each with 1 IBM PowerPC A2 processor and 16 GiB of DDR3 memory. Each processor has 16 cores, and each core has 4 threads.

Because of the memory limit per core, we are only using 16 MPI ranks per node for the time being. Therefore, we are running 16384 Ray processes on the Blue Gene/Q, each with a maximum of 1 GiB of memory, for a total of 16 TiB of distributed memory.
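
As a quick sanity check on these figures, the arithmetic can be written out in a few lines of Python. This is only an illustration: the variable names are mine, and the values are the ones quoted above.

```python
# Figures quoted in the text above.
ranks = 16384          # Ray processes (MPI ranks)
gib_per_rank = 1       # memory ceiling per rank, in GiB
ranks_per_node = 16    # MPI ranks per node

total_tib = ranks * gib_per_rank / 1024        # 1 TiB = 1024 GiB
gib_per_node = ranks_per_node * gib_per_rank   # memory claimed on each node

print(f"{total_tib:g} TiB of distributed memory, {gib_per_node} GiB per node")
```

With 16 ranks of 1 GiB each, the 16 GiB on a node is fully spoken for, which is why adding ranks per node is not an option without lowering the per-rank ceiling.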

First, we can list the Ray plugins we are using.

$ ls -1 SRA056234-Picea-glauca-2012-12-18-4/Plugins/
plugin_Amos.txt
plugin_CoverageGatherer.txt
plugin_DummySun.txt
plugin_EdgePurger.txt
plugin_FusionData.txt
plugin_FusionTaskCreator.txt
plugin_GeneOntology.txt
plugin_GenomeNeighbourhood.txt
plugin_JoinerTaskCreator.txt
plugin_KmerAcademyBuilder.txt
plugin_Library.txt
plugin_MachineHelper.txt
plugin_MessageProcessor.txt
plugin_Mock.txt
plugin_NetworkTest.txt
plugin_Partitioner.txt
plugin_PhylogenyViewer.txt
plugin_Scaffolder.txt
plugin_Searcher.txt
plugin_SeedExtender.txt
plugin_SeedingData.txt
plugin_SequencesIndexer.txt
plugin_SequencesLoader.txt
plugin_SwitchMan.txt
plugin_VerticesExtractor.txt


The file partition has a nice layout. On the Blue Gene/Q, I had to split my input into 4288 fastq files with a maximum of 2000000 sequences each, because I/O operations are offloaded to I/O drawers. Aside from that, Ray works as is.

$ head SRA056234-Picea-glauca-2012-12-18-4/FilePartition.txt
#File   Name    FirstSequence   LastSequence    NumberOfSequences
0       blocks/SRR525188_1-block-0.fastq        0       1999999 2000000
1       blocks/SRR525188_2-block-0.fastq        2000000 3999999 2000000
2       blocks/SRR525188_1-block-1.fastq        4000000 5999999 2000000
3       blocks/SRR525188_2-block-1.fastq        6000000 7999999 2000000
4       blocks/SRR525188_1-block-10.fastq       8000000 9999999 2000000
5       blocks/SRR525188_2-block-10.fastq       10000000        11999999        2000000
6       blocks/SRR525188_1-block-11.fastq       12000000        13999999        2000000
7       blocks/SRR525188_2-block-11.fastq       14000000        15999999        2000000
8       blocks/SRR525188_1-block-12.fastq       16000000        17999999        2000000
 

$ tail SRA056234-Picea-glauca-2012-12-18-4/FilePartition.txt
4278    blocks/SRR525214_1-block-98.fastq       8498092212      8500092211      2000000
4279    blocks/SRR525214_2-block-98.fastq       8500092212      8502092211      2000000
4280    blocks/SRR525214_1-block-99.fastq       8502092212      8504092211      2000000
4281    blocks/SRR525214_2-block-99.fastq       8504092212      8506092211      2000000
4282    blocks/SRR525215_1-block-0.fastq        8506092212      8508092211      2000000
4283    blocks/SRR525215_2-block-0.fastq        8508092212      8510092211      2000000
4284    blocks/SRR525215_1-block-1.fastq        8510092212      8512092211      2000000
4285    blocks/SRR525215_2-block-1.fastq        8512092212      8514092211      2000000
4286    blocks/SRR525215_1-block-2.fastq        8514092212      8515006210      913999
4287    blocks/SRR525215_2-block-2.fastq        8515006211      8515920209      913999
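
The splitting step mentioned above could be done in many ways. Here is a minimal Python sketch, not the tool I actually used. It assumes unwrapped fastq files with exactly 4 lines per sequence, and it reuses the `-block-N.fastq` naming seen in the listing. The function name `split_fastq` and its parameters are made up for this example.

```python
import os
from itertools import islice

LINES_PER_RECORD = 4  # assumption: unwrapped FASTQ, 4 lines per sequence


def split_fastq(path, out_dir, max_sequences=2_000_000):
    """Split one FASTQ file into blocks of at most max_sequences records.

    Returns a list of (block_path, number_of_sequences) pairs.
    """
    stem = os.path.basename(path)
    if stem.endswith(".fastq"):
        stem = stem[: -len(".fastq")]
    os.makedirs(out_dir, exist_ok=True)

    blocks = []
    with open(path) as reads:
        while True:
            # Take the next block worth of lines without loading them all.
            chunk = islice(reads, LINES_PER_RECORD * max_sequences)
            first_line = next(chunk, None)
            if first_line is None:
                break  # input exhausted, no empty block is written
            block_path = os.path.join(out_dir, f"{stem}-block-{len(blocks)}.fastq")
            line_count = 1
            with open(block_path, "w") as out:
                out.write(first_line)
                for line in chunk:
                    out.write(line)
                    line_count += 1
            blocks.append((block_path, line_count // LINES_PER_RECORD))
    return blocks
```

A call such as `split_fastq("SRR525188_1.fastq", "blocks")` (the input file name here is hypothetical) would write `blocks/SRR525188_1-block-0.fastq`, `blocks/SRR525188_1-block-1.fastq`, and so on. Running it on both files of a pair with the same block size keeps the `_1` and `_2` blocks aligned.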


The 8515920210 input sequences are distributed uniformly across the 16384 MPI ranks. Each rank gets 519770 sequences, except the last one, which also takes the 8530 leftover sequences for a total of 528300.


$ head SRA056234-Picea-glauca-2012-12-18-4/SequencePartition.txt
#Rank   FirstSequence   LastSequence    NumberOfSequences
0       0       519769  519770
1       519770  1039539 519770
2       1039540 1559309 519770
3       1559310 2079079 519770
4       2079080 2598849 519770
5       2598850 3118619 519770
6       3118620 3638389 519770
7       3638390 4158159 519770
8       4158160 4677929 519770
 

$ tail SRA056234-Picea-glauca-2012-12-18-4/SequencePartition.txt
16374   8510713980      8511233749      519770
16375   8511233750      8511753519      519770
16376   8511753520      8512273289      519770
16377   8512273290      8512793059      519770
16378   8512793060      8513312829      519770
16379   8513312830      8513832599      519770
16380   8513832600      8514352369      519770
16381   8514352370      8514872139      519770
16382   8514872140      8515391909      519770
16383   8515391910      8515920209      528300
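
The numbers in SequencePartition.txt are consistent with a simple scheme: integer division, with the last rank absorbing the remainder. The sketch below reproduces the rows shown above. It is my own reconstruction from the listing, not code taken from Ray, and the function name is hypothetical.

```python
def sequence_partition(total_sequences, ranks):
    """Yield (rank, first, last, count); the last rank absorbs the remainder."""
    per_rank = total_sequences // ranks
    for rank in range(ranks):
        first = rank * per_rank
        if rank == ranks - 1:
            last = total_sequences - 1   # everything that is left
        else:
            last = first + per_rank - 1
        yield rank, first, last, last - first + 1


rows = list(sequence_partition(8515920210, 16384))
print(rows[0])    # (0, 0, 519769, 519770)
print(rows[-1])   # (16383, 8515391910, 8515920209, 528300)
```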


Finally, the 42831057656 k-mers are distributed uniformly across the 16384 MPI ranks, with around 2614200 k-mers per MPI rank.

$ head SRA056234-Picea-glauca-2012-12-18-4/GraphPartition.txt   
#Rank   NumberOfKmers   IdealNumberOfKmers      Difference      RelativeDifference
#TotalKmers: 42831057656
#Ranks: 16384
#IdealNumberOfKmers: 2614200
0       2611430 2614200 -2770   -0.10596%
1       2612276 2614200 -1924   -0.073598%
2       2613476 2614200 -724    -0.0276949%
3       2611618 2614200 -2582   -0.0987683%
4       2616320 2614200 2120    0.0810956%
5       2615454 2614200 1254    0.0479688%
 

$ tail SRA056234-Picea-glauca-2012-12-18-4/GraphPartition.txt
16374   2612820 2614200 -1380   -0.0527886%
16375   2615682 2614200 1482    0.0566904%
16376   2610428 2614200 -3772   -0.144289%
16377   2617236 2614200 3036    0.116135%
16378   2615978 2614200 1778    0.0680132%
16379   2614000 2614200 -200    -0.00765052%
16380   2619520 2614200 5320    0.203504%
16381   2614508 2614200 308     0.0117818%
16382   2614830 2614200 630     0.0240992%
16383   2614080 2614200 -120    -0.00459031%


In the rows shown, the RelativeDifference column stays within about 0.2% of the ideal count, which indicates really good automated load balancing.
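
For reference, the Difference and RelativeDifference columns can be recomputed from the per-rank counts. The sketch below assumes the ideal count is the total divided by the number of ranks, rounded down, which matches the #IdealNumberOfKmers header. The function name is hypothetical, and the sample rows are copied from the listing above.

```python
def relative_differences(kmers_per_rank, total_kmers, ranks):
    """Return (rank, count, ideal, difference, relative difference in %)."""
    ideal = total_kmers // ranks
    rows = []
    for rank, count in kmers_per_rank:
        difference = count - ideal
        rows.append((rank, count, ideal, difference, 100.0 * difference / ideal))
    return rows


# A few rows copied from GraphPartition.txt above.
sample = [(0, 2611430), (16380, 2619520), (16383, 2614080)]
table = relative_differences(sample, 42831057656, 16384)
for rank, count, ideal, difference, relative in table:
    print(f"{rank}\t{count}\t{ideal}\t{difference}\t{relative:.6g}%")

# Largest deviation from the ideal among the sampled ranks, in percent.
worst = max(abs(row[4]) for row in table)
```

Running the same function over every row of the file, instead of a sample, would give the worst-case imbalance for the whole run.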

Ray is currently running the slave mode RAY_SLAVE_MODE_INDEX_SEQUENCES.
