2014-08-02

The public datasets from the DOE/JGI Great Prairie Soil Metagenome Grand Challenge



I am working on a couple of very large public metagenomics datasets from the Department of Energy (DOE) Joint Genome Institute (JGI). These datasets were produced in the context of the Grand Challenge program.

Professor Janet Jansson was the Principal Investigator for the proposal named Great Prairie Soil Metagenome Grand Challenge (Proposal ID: 949).


Professor C. Titus Brown wrote a blog article about this Grand Challenge.
Moreover, the Brown research group published at least one paper using these Grand Challenge datasets (assembly with digital normalization and partitioning).

Professor James Tiedje presented the Grand Challenge at the 2012 Metagenomics Workshop.

Alex Copeland presented interesting work at Sequencing, Finishing and Analysis in the Future (SFAF) in 2012 related to this Grand Challenge.



Jansson's Grand Challenge included 12 projects. They are listed below; each entry names the sample site and the type of soil.

  1. Great Prairie Soil Metagenome Grand Challenge: Kansas, Cultivated corn soil metagenome reference core (402463)
  2. Great Prairie Soil Metagenome Grand Challenge: Kansas, Native Prairie metagenome reference core (402464)
  3. Great Prairie Soil Metagenome Grand Challenge: Kansas, Native Prairie metagenome reference core (402464) (I don't know why it's listed twice)
  4. Great Prairie Soil Metagenome Grand Challenge: Kansas soil pyrotag survey (402466)
  5. Great Prairie Soil Metagenome Grand Challenge: Iowa, Continuous corn soil metagenome reference core (402461)
  6. Great Prairie Soil Metagenome Grand Challenge: Iowa, Native Prairie soil metagenome reference core (402462)
  7. Great Prairie Soil Metagenome Grand Challenge: Iowa soil pyrotag survey (402465)
  8. Great Prairie Soil Metagenome Grand Challenge: Wisconsin, Continuous corn soil metagenome reference core (402460)
  9. Great Prairie Soil Metagenome Grand Challenge: Wisconsin, Native Prairie soil metagenome reference core (402459)
  10. Great Prairie Soil Metagenome Grand Challenge: Wisconsin, Restored Prairie soil metagenome reference core (402457)
  11. Great Prairie Soil Metagenome Grand Challenge: Wisconsin, Switchgrass soil metagenome reference core (402458)
  12. Great Prairie Soil Metagenome Grand Challenge: Wisconsin soil pyrotag survey (402456)

I thank the Jansson research group for making these datasets public so that I don't have to look further for large politics-free metagenomics datasets.


Table 1: Number of files, reads, and bases in the Grand Challenge datasets. Most of the sequences are paired reads.

| Dataset | File count | Read count | Base count |
|---|---|---|---|
| Iowa_Continuous_Corn_Soil | 25 | 2 055 601 258 | 196 708 830 076 |
| Iowa_Native_Prairie_Soil | 25 | 3 750 844 486 | 326 986 888 235 |
| Kansas_Cultivated_Corn_Soil | 30 | 2 677 222 281 | 272 276 185 410 |
| Kansas_Native_Prairie_Soil | 33 | 5 126 775 452 | 597 933 511 278 |
| Wisconsin_Continuous_Corn_Soil | 18 | 1 912 865 700 | 192 128 891 088 |
| Wisconsin_Native_Prairie_Soil | 20 | 2 098 317 886 | 211 016 377 208 |
| Wisconsin_Restored_Prairie_Soil | 6 | 347 778 670 | 52 514 579 170 |
| Wisconsin_Switchgrass_Soil | 7 | 448 382 766 | 58 323 428 574 |
| Total | 164 | 18 417 788 499 | 1 907 888 691 039 |


At Argonne we are using these datasets to develop a next-generation metagenomics assembler named "Spate", built on top of the Thorium actor engine. The word spate means a large number of similar things or events appearing or occurring in quick succession. With the actor model, every single message is an active message. Active messages are very neat, and there are a lot of them with the actor model.
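
To make the idea a bit more concrete, here is a toy sketch in Python of what "every message is an active message" can look like: each message carries a tag, and the receiving actor runs the handler registered for that tag when the message arrives. This is not Thorium's API and not how Spate is written; `Actor`, `System`, the message tags and `build_demo` are all hypothetical names made up for the illustration, and the sketch runs in a single thread.

```python
from collections import deque


class Actor:
    """Hypothetical minimal actor: private state plus one handler per message tag."""

    def __init__(self, name):
        self.name = name
        self.handlers = {}

    def on(self, tag, handler):
        # Register the function to run when a message with this tag arrives.
        self.handlers[tag] = handler

    def receive(self, system, tag, payload):
        # The message is "active": its tag selects the code that runs on arrival.
        self.handlers[tag](self, system, payload)


class System:
    """Hypothetical single-threaded runtime that delivers messages in FIFO order."""

    def __init__(self):
        self.actors = {}
        self.queue = deque()

    def spawn(self, actor):
        self.actors[actor.name] = actor
        return actor.name

    def send(self, destination, tag, payload=None):
        self.queue.append((destination, tag, payload))

    def run(self):
        delivered = 0
        while self.queue:
            destination, tag, payload = self.queue.popleft()
            self.actors[destination].receive(self, tag, payload)
            delivered += 1
        return delivered


def build_demo():
    """A reader actor sends one message per read to a counter actor."""
    system = System()

    counter = Actor("counter")
    counter.total = 0

    def add_bases(actor, system, base_count):
        actor.total += base_count

    counter.on("ADD_BASES", add_bases)

    reader = Actor("reader")

    def start(actor, system, reads):
        for read in reads:
            system.send("counter", "ADD_BASES", len(read))

    reader.on("START", start)

    system.spawn(counter)
    system.spawn(reader)
    system.send("reader", "START", ["ACGT", "GGC"])
    delivered = system.run()
    return counter.total, delivered
```

Calling `build_demo()` delivers one START message and one ADD_BASES message per read. Even in this tiny example most of the traffic is small messages between actors, which is the sense in which there are a lot of them.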

