Pulling next-generation sequences from the European Nucleotide Archive (ENA)

Today, I read the abstract (and the methods) of a PNAS (Proceedings of the National Academy of Sciences) paper entitled Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. In this paper, researchers sequenced the DNA of a large number of Staphylococcus aureus isolates from patients over a certain period of time. This is science.

So this is nice and all, but after reading some bits of it, I wanted the data to do my own tests.

Getting sequence data is actually very easy, thanks to the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC has 3 members (alphabetical order):
  • DNA Data Bank of Japan (DDBJ)
  • European Nucleotide Archive (ENA)
  • GenBank (raw sequence data is stored in the Sequence Read Archive (SRA))
In a nutshell, everything submitted to one of these gets mirrored to the others. The Metadata Model is also quite nice.

SRA is unusable because I have to download data files in the SRA format. This SRA format -- which is described in the Sequence Read Archive Handbook -- is only a source of impediments and provides no benefits to anyone (this is my opinion and not necessarily a fact).

I mostly just use ENA. ENA's user interface is way simpler and it supports a rich set of HTTP queries that return HTML (Hyper Text Markup Language), XML (Extensible Markup Language), or TSV (tab-separated values). These features are very useful to search with keywords, to gather metadata, or to fetch raw data. Furthermore, ENA is quite beautiful with all the green (my last name is Boisvert, Bois means wood and vert means green). So I like green.

DDBJ is really cool too, but there is no way of getting metadata in XML (like in ENA). The nice thing about DDBJ is that they compress their data with bzip2 instead of gzip (used by ENA).

Here are a bunch of query examples with the accession ERA087387 that I found in the PNAS paper above.

Accession ERA087387:
http://www.ebi.ac.uk/ena/data/view/ERA087387 (HTML)
http://www.ebi.ac.uk/ena/data/view/ERA087387&display=xml (XML)

Samples from ERS093118 to ERS093127:
http://www.ebi.ac.uk/ena/data/view/ERS093118-ERS093127 (HTML)
http://www.ebi.ac.uk/ena/data/view/ERS093118-ERS093127&display=xml (XML)

File list for sample ERS093118:
http://www.ebi.ac.uk/ena/data/view/ERS093118 (HTML)
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS093118&result=read_run&fields=scientific_name,instrument_model,fastq_md5,fastq_ftp (TSV)
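
The TSV query above is easy to script. Here is a minimal sketch in Python that rebuilds the same URL and parses a report once it has been downloaded. The function names are mine, and I am assuming that a run with several FASTQ files lists them separated by ';' in the fastq_ftp and fastq_md5 columns.

```python
import csv
import io
from urllib.parse import urlencode

# Endpoint copied from the query above.
FILEREPORT_URL = "http://www.ebi.ac.uk/ena/data/warehouse/filereport"


def build_filereport_url(accession, fields, result="read_run"):
    """Build the file report URL for one accession (hypothetical helper)."""
    query = urlencode(
        {"accession": accession, "result": result, "fields": ",".join(fields)},
        safe=",",  # keep the commas between field names readable
    )
    return f"{FILEREPORT_URL}?{query}"


def parse_filereport(tsv_text):
    """Turn the TSV text of a file report into a list of dictionaries.

    Assumption: multi-file runs separate their entries with ';'.
    """
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        for column in ("fastq_ftp", "fastq_md5"):
            value = row.get(column) or ""
            row[column] = [entry for entry in value.split(";") if entry]
        rows.append(row)
    return rows


if __name__ == "__main__":
    fields = ["scientific_name", "instrument_model", "fastq_md5", "fastq_ftp"]
    print(build_filereport_url("ERS093118", fields))
```

To actually fetch the report, the URL can be passed to urllib.request.urlopen and the decoded body handed to parse_filereport.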

Finally, I think that all the huge projects like the Human Microbiome Project, 1000 Genomes, ENCODE, and so on should use Globus Online to allow people to pull all the data with GridFTP (which is faster).

Edit 2013-11-05: Jonathan Trow from SRA kindly pointed out that NCBI also supports XML export.

To get sample metadata in XML for an accession, simply download this:


Get a CSV sample file:



Nice public microbiome datasets for metagenomics

Edit 2013-10-29: added humongous dataset from Qin et al. 2012

There are at least 3 huge datasets for metagenomics. One of the challenges is to analyze all the data simultaneously in an integrative fashion.

That's what Ray Surveyor will do when ready!

Ray Surveyor will generate a Gramian Matrix, and also a Pairwise Distance Matrix. All of this is based on reference-free algorithms.
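
To make the two matrices concrete, here is a toy sketch in Python. It is not Ray Surveyor's code. I am assuming that each sample is represented by the set of k-mers seen in its reads, so that entry (i, j) of the Gramian matrix is the number of k-mers shared by samples i and j. The distance is then the Euclidean distance derived from the Gramian: d(i, j)^2 = G[i][i] + G[j][j] - 2 G[i][j]. The function names and the tiny input are made up.

```python
import math


def kmers(sequence, k):
    """Return the set of k-mers of one sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def gramian_matrix(samples, k):
    """Compute G[i][j] = number of k-mers shared by samples i and j.

    `samples` maps a sample name to a list of reads. No reference genome
    is involved: only the reads themselves are compared.
    """
    names = sorted(samples)
    kmer_sets = [
        set().union(*(kmers(read, k) for read in samples[name]))
        for name in names
    ]
    matrix = [[len(a & b) for b in kmer_sets] for a in kmer_sets]
    return names, matrix


def distance_matrix(gramian):
    """Derive Euclidean distances from a Gramian matrix."""
    size = len(gramian)
    return [
        [
            math.sqrt(max(0, gramian[i][i] + gramian[j][j] - 2 * gramian[i][j]))
            for j in range(size)
        ]
        for i in range(size)
    ]


if __name__ == "__main__":
    toy = {"sampleA": ["ACGT"], "sampleB": ["ACGA"]}  # made-up reads
    names, gramian = gramian_matrix(toy, k=2)
    print(names, gramian, distance_matrix(gramian))
```

With binary presence vectors like these, the squared distance is simply the number of k-mers found in one sample but not in the other.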

Here are some huge datasets:

Dataset #1: A human gut microbial gene catalog established by deep metagenomic sequencing (ERA000116)
Paper: A human gut microbial gene catalogue established by metagenomic sequencing
Size: 406 GiB (compressed with gzip)
Accession: ERA000116
Samples: 124

Dataset #2: HMP Studies and samples provided by HMP program and staged at NCBI (SRA012041)

Paper: Structure, function and diversity of the healthy human microbiome
Size: 4.1 TiB (compressed with gzip)
Accession: SRA012041
Samples: 764

Dataset #3: Type 2 Diabetes gut metagenome (microbiome) data from 368 Chinese samples and updated metagenome gene catalog (SRA045646 + SRA050230)

Paper: A metagenome-wide association study of gut microbiota in type 2 diabetes
Size: 373 Gbases (stage I: SRA045646) + 902 Gbases (stage II: SRA050230)
Accession: http://dx.doi.org/10.5524/100036

Samples: 368 (stage I: 145, stage II: 200, validation: 23)


Samples for dataset ERA000116

3.8G    ERS006485
1.6G    ERS006486
4.2G    ERS006487
3.8G    ERS006488
3.9G    ERS006489
4.0G    ERS006490
4.3G    ERS006491
3.6G    ERS006492
3.8G    ERS006493
14G    ERS006494
3.3G    ERS006495
3.9G    ERS006496
11G    ERS006497
1.5G    ERS006498
1.6G    ERS006499
1.8G    ERS006500
1.9G    ERS006501
2.9G    ERS006502
1.5G    ERS006503
3.5G    ERS006504
1.4G    ERS006505
3.5G    ERS006506
3.9G    ERS006507
2.7G    ERS006508
3.6G    ERS006509
2.7G    ERS006510
3.9G    ERS006511
2.9G    ERS006512
3.9G    ERS006513
3.8G    ERS006514
2.9G    ERS006515
2.9G    ERS006516
3.7G    ERS006517
3.6G    ERS006518
3.8G    ERS006519
1.9G    ERS006520
2.8G    ERS006521
3.4G    ERS006522
3.5G    ERS006523
3.8G    ERS006524
3.5G    ERS006525
3.9G    ERS006526
3.7G    ERS006527
2.8G    ERS006528
1.9G    ERS006529
3.6G    ERS006530
3.8G    ERS006531
3.6G    ERS006532
3.8G    ERS006533
1.9G    ERS006534
3.6G    ERS006535
3.6G    ERS006536
1.6G    ERS006537
3.2G    ERS006538
3.7G    ERS006539
4.3G    ERS006540
4.5G    ERS006541
3.8G    ERS006542
3.4G    ERS006543
4.0G    ERS006544
4.0G    ERS006545
3.3G    ERS006546
3.4G    ERS006547
1.2G    ERS006548
3.6G    ERS006549
3.7G    ERS006550
3.2G    ERS006551
1.5G    ERS006552
3.0G    ERS006553
3.9G    ERS006554
3.9G    ERS006555
3.9G    ERS006556
3.8G    ERS006557
1.5G    ERS006558
3.2G    ERS006559
1.6G    ERS006560
5.2G    ERS006561
3.1G    ERS006562
2.9G    ERS006563
3.9G    ERS006564
3.4G    ERS006565
4.3G    ERS006566
2.7G    ERS006567
3.6G    ERS006568
1.9G    ERS006569
1.6G    ERS006570
2.7G    ERS006571
3.0G    ERS006572
1.7G    ERS006573
3.4G    ERS006574
3.2G    ERS006575
3.2G    ERS006576
3.3G    ERS006577
1.8G    ERS006578
3.8G    ERS006579
3.5G    ERS006580
3.1G    ERS006581
2.9G    ERS006582
2.9G    ERS006583
3.6G    ERS006584
3.9G    ERS006585
3.5G    ERS006586
3.9G    ERS006587
3.7G    ERS006588
1.6G    ERS006589
3.7G    ERS006590
3.2G    ERS006591
1.7G    ERS006592
4.3G    ERS006593
3.6G    ERS006594
3.5G    ERS006595
1.2G    ERS006596
3.6G    ERS006597
2.9G    ERS006598
3.5G    ERS006599
1.4G    ERS006600
1.4G    ERS006601
4.1G    ERS006602
3.9G    ERS006603
3.3G    ERS006604
3.5G    ERS006605
4.0G    ERS006606
4.0G    ERS006607
3.4G    ERS006608

Samples for dataset SRA012041

4.8M    SRS011061
23G    SRS011084
8.9G    SRS011086
3.0M    SRS011090
11G    SRS011098
1.1G    SRS011105
13G    SRS011111
17G    SRS011115
13G    SRS011126
1.7G    SRS011132
13G    SRS011134
8.0G    SRS011140
1.8G    SRS011144
7.9G    SRS011152
13G    SRS011239
9.0G    SRS011243
2.1G    SRS011247
5.6G    SRS011255
1.7G    SRS011263
2.1G    SRS011269
12G    SRS011271
6.6M    SRS011302
12G    SRS011306
5.2G    SRS011310
5.8G    SRS011343
1.6G    SRS011355
1.4M    SRS011397
12G    SRS011405
11G    SRS011452
12G    SRS011529
1.1G    SRS011584
12G    SRS011586
11G    SRS012273
11G    SRS012279
1.6G    SRS012281
8.0G    SRS012285
1.4G    SRS012291
1.5G    SRS012294
3.2G    SRS012663
9.3G    SRS012902
1.4G    SRS013155
12G    SRS013158
11G    SRS013164
7.2G    SRS013170
11G    SRS013215
8.2G    SRS013234
2.6G    SRS013239
10G    SRS013252
4.9G    SRS013258
7.6G    SRS013261
1.4G    SRS013269
11G    SRS013476
5.2G    SRS013502
3.3G    SRS013506
9.1G    SRS013521
3.9G    SRS013533
968M    SRS013542
2.4G    SRS013637
844K    SRS013687
872K    SRS013705
2.9G    SRS013711
9.7G    SRS013723
11G    SRS013800
6.5G    SRS013818
2.4G    SRS013825
5.2G    SRS013836
4.0K    SRS013876
8.8G    SRS013879
1.8G    SRS013881
1013M    SRS013942
3.6G    SRS013945
4.0K    SRS013946
3.6G    SRS013947
7.3G    SRS013948
7.1G    SRS013949
2.3G    SRS013950
7.7G    SRS013951
1.5G    SRS013956
4.5G    SRS014107
8.7G    SRS014124
2.3G    SRS014126
8.6G    SRS014235
3.9G    SRS014271
7.3G    SRS014287
9.1G    SRS014313
1.5M    SRS014343
6.9G    SRS014459
1016M    SRS014464
1.1G    SRS014465
4.0K    SRS014466
1.4G    SRS014468
7.8G    SRS014470
1.3G    SRS014472
5.8G    SRS014473
1.5G    SRS014474
591M    SRS014475
5.1G    SRS014476
1.1G    SRS014477
1.3G    SRS014494
4.0K    SRS014573
1.4G    SRS014575
9.0G    SRS014578
9.1G    SRS014613
1.2G    SRS014629
4.5G    SRS014682
7.1G    SRS014683
11G    SRS014684
862M    SRS014686
8.9G    SRS014687
3.0G    SRS014689
2.9G    SRS014690
3.6G    SRS014691
11G    SRS014692
7.5G    SRS014888
1.6G    SRS014890
6.1G    SRS014894
1.5G    SRS014901
9.7G    SRS014923
8.8G    SRS014979
3.8G    SRS015038
1.9G    SRS015040
4.0K    SRS015044
1.2G    SRS015051
1.6G    SRS015054
978M    SRS015055
11G    SRS015057
1.8G    SRS015059
4.0K    SRS015060
1.5G    SRS015061
1.2G    SRS015062
6.8G    SRS015063
4.0K    SRS015064
7.8G    SRS015065
4.0K    SRS015071
1.2G    SRS015072
1.2G    SRS015073
8.9G    SRS015133
4.8G    SRS015154
7.2G    SRS015158
4.0K    SRS015168
11G    SRS015174
12G    SRS015190
11G    SRS015209
2.6G    SRS015215
8.7G    SRS015217
1.2G    SRS015225
5.3G    SRS015264
931M    SRS015269
6.6G    SRS015272
1.6G    SRS015274
8.0G    SRS015278
16G    SRS015369
3.7G    SRS015374
3.5G    SRS015378
5.6G    SRS015381
6.6G    SRS015395
1.8G    SRS015425
911M    SRS015430
36G    SRS015431
8.9G    SRS015434
3.2G    SRS015436
11G    SRS015440
1.2G    SRS015450
8.5M    SRS015470
6.3G    SRS015537
2.2G    SRS015540
5.8G    SRS015574
9.9G    SRS015578
18G    SRS015640
7.0G    SRS015644
2.5G    SRS015646
2.0G    SRS015650
9.0G    SRS015663
2.1G    SRS015745
4.0K    SRS015752
5.3G    SRS015755
9.1G    SRS015762
8.7G    SRS015782
1.8G    SRS015793
11G    SRS015794
9.2G    SRS015797
2.4G    SRS015799
6.2G    SRS015803
9.2G    SRS015854
38G    SRS015890
12G    SRS015893
1.9G    SRS015895
7.4G    SRS015899
1.6G    SRS015921
5.3G    SRS015937
8.0G    SRS015941
5.4G    SRS015947
8.1G    SRS015960
4.1G    SRS015985
6.5G    SRS015989
2.3G    SRS015996
8.3G    SRS016002
13G    SRS016018
1020M    SRS016033
8.8G    SRS016037
1.9G    SRS016039
4.4G    SRS016043
7.1G    SRS016056
11G    SRS016086
1.1G    SRS016088
5.5G    SRS016092
9.7G    SRS016095
598M    SRS016105
1012M    SRS016111
875M    SRS016188
1.9G    SRS016191
1.7G    SRS016196
6.2G    SRS016200
12G    SRS016203
9.1G    SRS016225
9.2G    SRS016267
1.1G    SRS016292
1.4G    SRS016297
8.6G    SRS016319
6.3G    SRS016331
11G    SRS016335
8.3G    SRS016342
2.0G    SRS016349
4.0K    SRS016360
1.9G    SRS016434
11G    SRS016495
7.4G    SRS016501
1.5G    SRS016503
1.1G    SRS016513
1.1G    SRS016516
12G    SRS016517
9.1G    SRS016529
2.4G    SRS016533
1.4M    SRS016541
1.5G    SRS016553
1.8G    SRS016559
9.6G    SRS016569
7.2G    SRS016575
1.4G    SRS016581
2.1G    SRS016584
13G    SRS016585
4.3G    SRS016600
5.9G    SRS016740
7.8G    SRS016746
1.5G    SRS016752
11G    SRS016753
3.1G    SRS016944
12G    SRS016954
13G    SRS016989
15G    SRS017007
3.6G    SRS017013
5.0G    SRS017025
1.7G    SRS017044
7.1G    SRS017076
3.1G    SRS017080
3.0G    SRS017088
2.7M    SRS017103
9.2G    SRS017120
4.6G    SRS017127
11G    SRS017139
1.5G    SRS017156
1.1M    SRS017191
15G    SRS017209
5.6G    SRS017215
9.5G    SRS017227
509M    SRS017244
6.9G    SRS017247
6.6G    SRS017304
12G    SRS017307
9.9G    SRS017433
8.5G    SRS017439
2.5G    SRS017441
13G    SRS017445
1.4G    SRS017451
2.6G    SRS017497
9.1G    SRS017511
1.5G    SRS017520
10G    SRS017521
9.3G    SRS017533
2.1G    SRS017537
2.3G    SRS017687
7.1G    SRS017691
1.5G    SRS017697
2.5G    SRS017700
11G    SRS017701
7.0G    SRS017713
9.3G    SRS017808
2.5G    SRS017810
4.0K    SRS017814
1.5G    SRS017820
11G    SRS017821
7.4G    SRS017849
6.4G    SRS017851
12G    SRS018133
712K    SRS018145
9.5G    SRS018149
10G    SRS018157
9.5G    SRS018300
1.3G    SRS018312
12G    SRS018313
5.0G    SRS018329
5.7G    SRS018337
11G    SRS018351
7.7G    SRS018357
3.5G    SRS018359
1.5G    SRS018369
9.4G    SRS018394
9.9G    SRS018427
11G    SRS018439
9.4G    SRS018443
1.5G    SRS018463
4.0K    SRS018569
11G    SRS018573
8.0G    SRS018575
18G    SRS018585
2.3M    SRS018591
16G    SRS018656
13G    SRS018661
7.6G    SRS018665
966M    SRS018671
7.4G    SRS018739
3.4G    SRS018769
2.2G    SRS018774
6.5G    SRS018778
1.2G    SRS018784
12G    SRS018791
2.5M    SRS018817
7.1G    SRS018969
2.3G    SRS018971
3.9G    SRS018975
2.9G    SRS018978
4.6G    SRS018981
37G    SRS018984
1.6G    SRS019015
4.8G    SRS019016
2.0G    SRS019019
11G    SRS019022
1.5G    SRS019024
1.8G    SRS019025
3.3G    SRS019026
4.0K    SRS019027
6.8G    SRS019028
3.7G    SRS019029
11G    SRS019030
2.1G    SRS019033
4.0K    SRS019039
3.8G    SRS019045
3.3G    SRS019063
3.2G    SRS019064
1013M    SRS019067
37G    SRS019068
9.6G    SRS019071
4.0G    SRS019073
9.1G    SRS019077
2.0G    SRS019081
1.3G    SRS019087
2.0G    SRS019116
4.0K    SRS019119
1.8G    SRS019120
9.0G    SRS019122
3.7G    SRS019124
6.5G    SRS019125
6.4G    SRS019126
4.9G    SRS019127
5.8G    SRS019128
2.5G    SRS019129
8.5G    SRS019161
4.0K    SRS019215
7.1G    SRS019219
1.4G    SRS019221
3.9G    SRS019225
5.5G    SRS019245
11G    SRS019267
7.4G    SRS019327
1.9G    SRS019329
3.9G    SRS019333
1.1G    SRS019339
1.2G    SRS019379
3.7G    SRS019381
988M    SRS019386
4.4G    SRS019387
9.1G    SRS019389
3.6G    SRS019391
9.4G    SRS019397
7.9G    SRS019582
4.0G    SRS019587
3.5G    SRS019591
540M    SRS019597
1.2G    SRS019600
11G    SRS019601
8.5G    SRS019607
9.9G    SRS019685
9.5G    SRS019787
1.6G    SRS019867
3.6G    SRS019872
2.1M    SRS019894
5.0G    SRS019906
12G    SRS019910
11G    SRS019968
9.5G    SRS019974
2.1G    SRS019976
7.7G    SRS019980
1.6G    SRS019986
1.7G    SRS019989
8.1G    SRS020220
2.5G    SRS020222
8.3G    SRS020226
1.5G    SRS020232
13G    SRS020233
5.0G    SRS020261
7.1G    SRS020263
9.7G    SRS020328
6.1G    SRS020334
1.6G    SRS020336
9.3G    SRS020340
1.7G    SRS020349
3.4G    SRS020386
9.3G    SRS020856
3.2G    SRS020858
4.7G    SRS020862
1.4G    SRS020868
11G    SRS020869
3.0G    SRS021473
7.1G    SRS021477
1.6G    SRS021483
12G    SRS021484
712K    SRS021496
9.1G    SRS021948
8.8G    SRS021954
3.7G    SRS021960
1.4G    SRS021969
2.9G    SRS021986
1.3G    SRS022006
9.3G    SRS022071
8.8G    SRS022077
1.7G    SRS022079
4.2G    SRS022083
1.4G    SRS022092
1.9G    SRS022129
5.7G    SRS022137
9.3G    SRS022143
2.4G    SRS022145
3.4G    SRS022149
1.1G    SRS022158
2.1G    SRS022524
4.7G    SRS022530
1.1G    SRS022532
7.3G    SRS022536
467M    SRS022545
7.0G    SRS022602
2.8M    SRS022609
11G    SRS022621
3.9G    SRS022625
1.2M    SRS022645
7.4G    SRS022713
7.8G    SRS022719
1.4G    SRS022721
4.0K    SRS022725
1.3G    SRS022734
12G    SRS023176
1.7G    SRS023212
6.5G    SRS023346
8.2G    SRS023352
625M    SRS023354
3.8G    SRS023358
8.5G    SRS023468
9.9G    SRS023526
1.4G    SRS023534
5.5G    SRS023538
6.3G    SRS023557
9.0G    SRS023583
1.6G    SRS023591
9.4G    SRS023595
7.1G    SRS023604
5.5G    SRS023617
12G    SRS023829
8.5G    SRS023835
1.8G    SRS023837
4.2G    SRS023841
2.5G    SRS023847
1.5G    SRS023850
716K    SRS023914
868K    SRS023926
9.4G    SRS023930
840K    SRS023938
11G    SRS023958
5.5G    SRS023964
1.6G    SRS023970
13G    SRS023971
3.4G    SRS023987
9.3G    SRS024009
8.9G    SRS024015
2.8G    SRS024017
9.0G    SRS024021
1.8G    SRS024064
1.6G    SRS024068
12G    SRS024075
9.5G    SRS024081
6.0G    SRS024087
11G    SRS024132
9.2G    SRS024138
2.9G    SRS024140
4.3G    SRS024144
2.3M    SRS024265
2.0M    SRS024277
4.4G    SRS024281
2.7M    SRS024289
1.7G    SRS024301
1.8G    SRS024310
4.0K    SRS024318
12G    SRS024331
2.8G    SRS024347
11G    SRS024355
8.5G    SRS024375
2.4G    SRS024377
7.5G    SRS024381
14G    SRS024388
3.3G    SRS024424
1.5G    SRS024428
12G    SRS024435
11G    SRS024441
8.7G    SRS024447
2.6G    SRS024470
3.2G    SRS024482
9.4G    SRS024549
1.8G    SRS024557
5.4G    SRS024561
1.4G    SRS024567
9.2G    SRS024580
5.4G    SRS024596
5.4G    SRS024598
1.7G    SRS024620
11G    SRS024625
5.2G    SRS024637
1.5G    SRS024641
8.3G    SRS024649
2.8G    SRS024655
7.7G    SRS042131
5.2G    SRS042284
1.2G    SRS042428
1.8G    SRS042457
9.4G    SRS042628
9.0G    SRS042643
1.3G    SRS042858
9.1G    SRS042910
7.6G    SRS042984
11G    SRS043001
5.6G    SRS043018
2.3G    SRS043239
9.4G    SRS043411
9.3G    SRS043422
1.4G    SRS043646
11G    SRS043663
3.7G    SRS043676
12G    SRS043701
5.0G    SRS043755
4.9G    SRS043772
1.7G    SRS043803
4.0K    SRS044366
9.0G    SRS044373
1.9G    SRS044474
7.2G    SRS044486
1.2G    SRS044626
4.9G    SRS044662
17G    SRS044742
10G    SRS045004
4.0K    SRS045049
3.7G    SRS045127
7.4G    SRS045197
1.6G    SRS045254
2.7G    SRS045262
4.0K    SRS045313
10G    SRS045606
9.3G    SRS045645
11G    SRS045713
8.6G    SRS045715
2.1G    SRS045978
1.4G    SRS046344
1.8G    SRS046623
5.4G    SRS046686
5.4G    SRS046688
1.1G    SRS046973
8.9G    SRS047014
7.5G    SRS047044
4.0K    SRS047100
9.0G    SRS047113
7.9G    SRS047210
7.9G    SRS047219
2.1G    SRS047225
1.1G    SRS047254
1.4G    SRS047265
1.7G    SRS047335
1008K    SRS047634
36M    SRS047708
8.7G    SRS047824
1.9G    SRS047844
7.7G    SRS048164
9.0G    SRS048411
2.4G    SRS048536
3.0G    SRS048719
6.4G    SRS048791
4.6G    SRS048870
8.6G    SRS049147
7.6G    SRS049164
1.1G    SRS049237
4.7G    SRS049268
677M    SRS049283
6.7G    SRS049318
9.0G    SRS049389
11G    SRS049712
1.1G    SRS049744
11G    SRS049900
9.9G    SRS049959
11G    SRS049995
1.7G    SRS050007
1.1G    SRS050025
3.6G    SRS050029
1.2G    SRS050184
10G    SRS050244
8.7G    SRS050299
12G    SRS050422
4.0K    SRS050484
4.2G    SRS050628
12G    SRS050669
4.0K    SRS050752
8.2G    SRS050925
9.9G    SRS051031
1.5G    SRS051116
4.8G    SRS051244
3.5G    SRS051378
1.2G    SRS051505
1.2G    SRS051600
4.4G    SRS051613
7.9G    SRS051791
616M    SRS051882
9.2G    SRS051930
7.1G    SRS051941
12G    SRS052027
4.6G    SRS052227
1.1G    SRS052330
1.3G    SRS052590
3.6G    SRS052604
1.2G    SRS052620
2.1G    SRS052668
7.5G    SRS052697
997M    SRS052756
4.0K    SRS052874
7.0G    SRS052876
2.7G    SRS052988
8.4G    SRS053214
9.8G    SRS053335
9.0G    SRS053398
1.3G    SRS053437
6.1G    SRS053584
12G    SRS053603
9.1G    SRS053630
8.2G    SRS053854
9.4G    SRS053917
1.5G    SRS054061
4.6G    SRS054430
3.8G    SRS054569
13G    SRS054590
9.6G    SRS054653
8.9G    SRS054687
1.1G    SRS054776
8.3G    SRS054956
1.5G    SRS054962
2.8G    SRS055118
902M    SRS055298
6.3G    SRS055378
4.0G    SRS055401
3.9G    SRS055426
7.8G    SRS055450
2.4G    SRS055495
7.5G    SRS055982
1.4G    SRS056157
1.2G    SRS056210
5.3G    SRS056259
6.6G    SRS056323
11G    SRS056519
8.0G    SRS056622
670M    SRS056695
4.2G    SRS056796
2.9G    SRS056892
5.5G    SRS056906
2.3G    SRS057022
2.7G    SRS057083
4.8G    SRS057182
7.9G    SRS057205
500K    SRS057290
4.2G    SRS057355
6.5G    SRS057478
8.3G    SRS057539
9.6G    SRS057692
8.2G    SRS057717
8.1G    SRS057791
1.8G    SRS057807
5.8G    SRS058053
2.2G    SRS058105
5.6G    SRS058182
4.0K    SRS058186
1.8G    SRS058213
5.3G    SRS058221
13G    SRS058336
9.7G    SRS058723
940K    SRS058770
6.0G    SRS058808
5.0G    SRS062427
1.1G    SRS062520
8.1G    SRS062540
5.2G    SRS062544
1.1G    SRS062713
1.5G    SRS062752
10G    SRS062761
6.5G    SRS062878
1.1G    SRS063035
11G    SRS063040
979M    SRS063178
11G    SRS063193
3.3G    SRS063215
1.5G    SRS063272
1.8M    SRS063287
6.6G    SRS063288
1.9G    SRS063351
877M    SRS063417
1.3G    SRS063478
6.0G    SRS063603
8.3G    SRS063932
9.0G    SRS063985
6.4G    SRS063999
4.2G    SRS064219
7.3G    SRS064276
5.5G    SRS064329
1.4G    SRS064376
7.2G    SRS064423
7.4G    SRS064449
5.5G    SRS064493
11G    SRS064557
2.3G    SRS064645
1.3G    SRS064704
7.8G    SRS064774
1.1G    SRS064809
3.8G    SRS065099
110M    SRS065133
984M    SRS065142
489M    SRS065179
11G    SRS065278
5.2G    SRS065310
1.7G    SRS065335
1.9M    SRS065347
119M    SRS065431
11G    SRS065504
11G    SRS075398
14G    SRS075404
2.6G    SRS075406
13G    SRS075410
1.6G    SRS075419
8.9G    SRS077730
7.7G    SRS077736
1.5G    SRS077738
1023M    SRS077751
6.7G    SRS078176
486M    SRS078182
16G    SRS078197
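
The two listings above are easy to summarize with a few lines of Python. This is only a sketch: I am assuming the lines look like the output of du -sh (a size with a K, M, G, or T suffix in powers of 1024, then the sample accession), and the 1 MiB threshold used to flag suspiciously small samples is arbitrary.

```python
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3, "T": 1024 ** 4}


def parse_size(text):
    """Convert a human-readable size such as '3.8G' into bytes."""
    text = text.strip()
    if text and text[-1] in UNITS:
        return float(text[:-1]) * UNITS[text[-1]]
    return float(text)


def summarize(listing, tiny_threshold=1024 ** 2):
    """Return (total bytes, accessions smaller than tiny_threshold)."""
    total = 0.0
    tiny = []
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank lines and headings
        size = parse_size(parts[0])
        total += size
        if size < tiny_threshold:
            tiny.append(parts[1])
    return total, tiny


if __name__ == "__main__":
    example = "3.8G    ERS006485\n1.6G    ERS006486\n4.0K    SRS013876\n"
    total, tiny = summarize(example)
    print(f"{total / 1024 ** 3:.1f} GiB in total, tiny samples: {tiny}")
```

The samples reported as 4.0K are the ones worth checking by hand, since at that size they hold essentially no sequence data.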



Daily scrums in bioinformatics research

Since I started my doctorate, I have been using the Scrum method for managing all of my doctoral projects (product backlogs, sprint backlogs, roadmaps, stuff like that).

This summer, we had a software engineering intern in our group. To manage the intern (he reported to me), I decided (before his arrival) that we were going to hold daily scrums. Daily scrums are timeboxed meetings that last a maximum of 15 minutes each day. And each day, the daily scrum must be held at the same location and at the same time in order to create an awesome habit in the group.

In such a meeting, a Scrum Master asks everyone three exciting questions. On the 5th floor of our building, I am the Scrum Master, which means that I facilitate and regulate our scrum meetings. These three questions are (custom flavour):

1. What did you do yesterday?
2. What will you do today?
3. Do you have any roadblocks?

So we (me, the intern, and other people on our floor) met every day for the whole duration of the intern's stay in our group.

But we stopped after the intern's departure (I don't remember why exactly, but one of the reasons was that nobody was reporting directly to me after the intern's departure).

Today, I organized a daily scrum. After that, we collectively decided that scrum meetings were going to happen every day at 09:15 on the fifth floor.

Great!


A playground for actors and fellowship application progress

Lately, I have been partly busy with paperwork related to my postdoctoral fellowship applications. While investing quite a bit of myself into these applications, I am also researching and devising better ways to create massively parallel software tools for genomics. I came up with the idea of "RayPlatform Actor Playground" and implemented an illustrious set of classes that embody the ideas conveyed by the "Playground."
The user story that motivated this novel playground for actors is that of what I named "Ray Surveyor." The purpose of this workflow is to compare genomic content between samples using their DNA content alone, without relying on any reference.

This Surveyor business is simply called using a new option for Ray: "-run-surveyor". An example command line is available here. All the required runtime code is implemented within 4 types of actors. These actor types are: Mother, CoalescenceManager, GenomeGraphReader, and StoreKeeper. Mothers are the only actors that are spawned by the system initially. They alone decide to spawn the other types.
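
To give an idea of the model, here is a toy sketch in Python. RayPlatform itself is not written like this and none of these classes mirror its real API. The actor names are borrowed from the list above, but what each actor does here (a Mother spawning one StoreKeeper plus one GenomeGraphReader per input, readers forwarding items and then dying) is invented for the illustration, and CoalescenceManager is left out.

```python
from collections import deque


class Actor:
    """Base class: an actor only reacts to messages it receives."""

    def __init__(self):
        self.name = None
        self.playground = None

    def send(self, destination, message):
        self.playground.post(self.name, destination, message)

    def receive(self, sender, message):
        raise NotImplementedError


class Playground:
    """Hosts actors and delivers their messages one at a time."""

    def __init__(self):
        self.actors = {}
        self.queue = deque()
        self.next_name = 0

    def spawn(self, actor):
        actor.name, actor.playground = self.next_name, self
        self.actors[actor.name] = actor
        self.next_name += 1
        return actor.name

    def kill(self, name):
        self.actors.pop(name, None)

    def post(self, sender, destination, message):
        self.queue.append((sender, destination, message))

    def run(self):
        delivered = 0
        while self.queue:
            sender, destination, message = self.queue.popleft()
            actor = self.actors.get(destination)
            if actor is not None:  # messages to dead actors are dropped
                actor.receive(sender, message)
                delivered += 1
        return delivered


class StoreKeeper(Actor):
    def __init__(self):
        super().__init__()
        self.stored = []

    def receive(self, sender, message):
        self.stored.append(message)


class GenomeGraphReader(Actor):
    def __init__(self, items):
        super().__init__()
        self.items = items

    def receive(self, sender, message):
        # The message carries the name of the StoreKeeper to feed.
        for item in self.items:
            self.send(message, item)
        self.playground.kill(self.name)  # work done: the reader dies


class Mother(Actor):
    def __init__(self, inputs):
        super().__init__()
        self.inputs = inputs
        self.keeper = None

    def receive(self, sender, message):
        if message == "BOOT":
            self.keeper = self.playground.spawn(StoreKeeper())
            for items in self.inputs:
                reader = self.playground.spawn(GenomeGraphReader(items))
                self.send(reader, self.keeper)


if __name__ == "__main__":
    playground = Playground()
    mother = playground.spawn(Mother([["a", "b"], ["c"]]))
    playground.post(None, mother, "BOOT")
    playground.run()
    print(sorted(playground.actors))
```

Only the Mother is spawned from the outside. Everything else comes to life because a message told the Mother to boot.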

The image below shows a typical messaging network that is used by these actors within many playgrounds at once.

Figure 1: The playground model for actors. Each RayPlatform node has 1 playground in which actors can send messages, receive messages, spawn new actors, or die from old age.

In Ray Surveyor, several patterns from the great book Enterprise Integration Patterns are used (the patterns from this book are available online!). In particular, the patterns Aggregator -- which is useful to group small related objects into a larger payload -- and Request-Reply -- which is a paramount requirement for regulating messaging processes in the presence of varying transport bandwidth and/or delivery latency -- are put to good use.
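
Here is a small Python sketch of how these two patterns can work together. It is an illustration under my own assumptions, not code taken from Ray Surveyor: the class names, the capacity of 3, and the rule of keeping a single payload in flight are all made up.

```python
from collections import deque


class Aggregator:
    """Group small objects into larger payloads before delivery."""

    def __init__(self, capacity, deliver):
        self.capacity = capacity
        self.deliver = deliver  # callable that takes one payload (a list)
        self.buffer = []

    def add(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self):
        if self.buffer:
            payload, self.buffer = self.buffer, []
            self.deliver(payload)


class RequestReplySender:
    """Send the next payload only after the previous one got a reply."""

    def __init__(self, transport):
        self.transport = transport  # callable that puts a payload on the wire
        self.pending = deque()
        self.awaiting_reply = False

    def request(self, payload):
        self.pending.append(payload)
        self._pump()

    def on_reply(self):
        self.awaiting_reply = False
        self._pump()

    def _pump(self):
        if not self.awaiting_reply and self.pending:
            self.awaiting_reply = True
            self.transport(self.pending.popleft())


if __name__ == "__main__":
    wire = []
    sender = RequestReplySender(wire.append)
    aggregator = Aggregator(capacity=3, deliver=sender.request)
    for item in range(7):
        aggregator.add(item)
    aggregator.flush()  # push out the last, partially filled payload
    print(wire)  # only the first payload is on the wire so far
    sender.on_reply()
    print(wire)
```

The aggregation reduces the number of messages, and the reply acts as a brake: a slow receiver simply replies later, so the sender never floods it.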

I plan to finish the first features of the surveyor this week. The thing will generate a Gramian matrix for a bunch of biological samples (for which we have DNA sequencing data).

Fellowship application progress

I have steadily made progress with my fellowship applications. Last week, I added two fellowships to the list of those I am applying to. I wrote a fancy page with details about every fellowship to which I am submitting (or have submitted) an application.


Fonds de recherche du Québec Postdoctoral research scholarship
Argonne National Laboratory Named Postdoctoral Fellowships
Banting Postdoctoral Fellowships (I will complete this tomorrow.)
Canadian Institutes of Health Research Fellowships
James Hardy Wilkinson Fellowship in Scientific Computing
Margaret Butler Fellowship in Computational Science

In summary, the Banting fellowships carried the heaviest paperwork burden, and my application is almost completed and submitted. The 3 remaining applications (CIHR Fellowships, Wilkinson Fellowship, and Butler Fellowship) require far less energy to fill out and send.

Doctoral thesis: is the beginning of the end in sight?

It sure is. My director has completed his preliminary reading (prélecture). My codirector is halfway through the depths of my thesis. After that, it will be the initial deposit.

Bonus feature

To reward readers for consuming all the bits of this lengthy post, here is a nice picture of a metagenomic repeat that is unwinding for the greater good.

Figure 2: A metagenomic repeat that is unwinding for the greater good.