2013-10-30

Pulling next-generation sequences from the European Nucleotide Archive (ENA)

Today I read the abstract and the methods of a PNAS (Proceedings of the National Academy of Sciences) paper entitled "Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease". In this paper, the researchers sequenced the DNA of a large number of Staphylococcus aureus isolates collected from patients over a period of time. This is science.

So this is nice and all, but after reading some bits of it, I wanted the data to do my own tests.

Getting sequence data is actually very easy, thanks to the International Nucleotide Sequence Database Collaboration (INSDC). The INSDC has 3 members (alphabetical order):
  • DNA Data Bank of Japan (DDBJ)
  • European Nucleotide Archive (ENA)
  • GenBank (raw sequence data is stored in the Sequence Read Archive (SRA))
In a nutshell, everything submitted to one of these gets mirrored to the others. The Metadata Model is also quite nice.

For me, SRA is unusable because I have to download data files in the SRA format. This format, which is described in the Sequence Read Archive Handbook, only gets in the way and provides no benefit to anyone (this is my opinion and not necessarily a fact).

I mostly use ENA. ENA's user interface is much simpler, and it supports a rich set of HTTP queries that return HTML (HyperText Markup Language), XML (Extensible Markup Language), or TSV (tab-separated values). These features are very useful for searching with keywords, gathering metadata, or fetching raw data. Furthermore, ENA is quite beautiful with all the green. My last name is Boisvert (bois means wood and vert means green), so I like green.

DDBJ is really cool too, but unlike ENA, it offers no way of getting metadata in XML. The nice thing about DDBJ is that they compress their data with bzip2 instead of the gzip used by ENA.
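
Since the two archives use different compression, here is a minimal Python sketch of a reader that handles both. It assumes the compression can be told from the file extension (.bz2 or .gz); the function names open_reads and count_fastq_records are made up for this example, and the record count assumes plain 4-line FASTQ records.

```python
import bz2
import gzip


def open_reads(path):
    """Open a FASTQ file as text, whether it is bzip2, gzip, or uncompressed.

    Assumption: the extension reflects the compression actually used.
    """
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def count_fastq_records(path):
    """Count records, assuming each record spans exactly 4 lines."""
    with open_reads(path) as handle:
        return sum(1 for _ in handle) // 4
```

The rest of a script can then call open_reads() without caring which archive the file came from.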

Here are a bunch of query examples with the accession ERA087387 that I found in the PNAS paper above.

Accession ERA087387:
http://www.ebi.ac.uk/ena/data/view/ERA087387 (HTML)
http://www.ebi.ac.uk/ena/data/view/ERA087387&display=xml (XML)


Samples from ERS093118 to ERS093127:
http://www.ebi.ac.uk/ena/data/view/ERS093118-ERS093127 (HTML)
http://www.ebi.ac.uk/ena/data/view/ERS093118-ERS093127&display=xml (XML)
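
The view links above all follow the same pattern, so they are easy to generate in a script. This is a small Python sketch that only reproduces the URL shapes shown in this post; the function name ena_view_url and its parameters are made up for illustration, and the endpoints may have changed since this was written.

```python
ENA_VIEW_BASE = "http://www.ebi.ac.uk/ena/data/view/"


def ena_view_url(accession, last_accession=None, xml=False):
    """Build an ENA view URL for one accession or a first-last range.

    The "&display=xml" suffix is copied from the links in this post.
    """
    target = accession
    if last_accession is not None:
        # Ranges are written as FIRST-LAST, as in ERS093118-ERS093127.
        target = f"{accession}-{last_accession}"
    url = ENA_VIEW_BASE + target
    if xml:
        url += "&display=xml"
    return url


if __name__ == "__main__":
    print(ena_view_url("ERA087387"))
    print(ena_view_url("ERS093118", "ERS093127", xml=True))
```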

File list for sample ERS093118:
http://www.ebi.ac.uk/ena/data/view/ERS093118 (HTML)
http://www.ebi.ac.uk/ena/data/warehouse/filereport?accession=ERS093118&result=read_run&fields=scientific_name,instrument_model,fastq_md5,fastq_ftp (TSV)
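
The TSV file report is the handy one for scripting, because it pairs each FASTQ location with its MD5 checksum. Below is a hedged Python sketch that builds the report URL and turns the downloaded text into one entry per file. It rests on two assumptions that are not stated in this post: that runs with several files list them separated by semicolons in both fastq_ftp and fastq_md5, and that the FTP locations may come without the "ftp://" scheme. The function names are hypothetical.

```python
import csv
import io

FILEREPORT_BASE = "http://www.ebi.ac.uk/ena/data/warehouse/filereport"


def filereport_url(accession, fields):
    """Build the read_run file report URL in the form used in this post."""
    return (
        f"{FILEREPORT_BASE}?accession={accession}"
        f"&result=read_run&fields={','.join(fields)}"
    )


def parse_filereport(tsv_text):
    """Return one dict per FASTQ file with its URL and expected MD5."""
    files = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        # Assumption: multiple files per run are separated by ";".
        urls = [u for u in (row.get("fastq_ftp") or "").split(";") if u]
        md5s = [m for m in (row.get("fastq_md5") or "").split(";") if m]
        for url, md5 in zip(urls, md5s):
            if "://" not in url:
                # Assumption: the report may omit the scheme.
                url = "ftp://" + url
            files.append({
                "scientific_name": row.get("scientific_name") or "",
                "instrument_model": row.get("instrument_model") or "",
                "url": url,
                "md5": md5,
            })
    return files
```

The text to parse would be whatever the URL returns, for example via urllib.request.urlopen(...).read().decode(); each md5 can then be compared against hashlib.md5 of the downloaded file.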


Finally, I think that all the huge projects like the Human Microbiome Project, the 1000 Genomes Project, ENCODE, and so on should use Globus Online to allow people to pull all the data with GridFTP (which is faster).

Edit 2013-11-05: Jonathan Trow from SRA kindly pointed out that NCBI also supports XML export.

To get sample metadata in XML for an accession, simply download this:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=xml&term=SRP011011 

To get the run information as a CSV file:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?save=efetch&db=sra&rettype=runinfo&term=SRP011011
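
The two NCBI links differ only in the rettype parameter, so one small helper covers both. This Python sketch just reproduces the query strings shown above and reads the CSV into dictionaries keyed by whatever column names the header provides; no particular columns are assumed, and the function names sra_export_url and read_runinfo are made up for this example.

```python
import csv
import io
from urllib.parse import urlencode

SRA_CGI = "http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi"


def sra_export_url(term, rettype="xml"):
    """Build the export URL; rettype is "xml" or "runinfo" as in the links above."""
    query = urlencode({
        "save": "efetch",
        "db": "sra",
        "rettype": rettype,
        "term": term,
    })
    return f"{SRA_CGI}?{query}"


def read_runinfo(csv_text):
    """Parse the runinfo CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


if __name__ == "__main__":
    print(sra_export_url("SRP011011"))
    print(sra_export_url("SRP011011", rettype="runinfo"))
```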



