Next generation sequencing
- Illumina® (Solexa) Genome Analyzer™ and HiSeq
- Roche 454 Sequencer™
- Applied Biosystems SOLiD™
- FASTQ: derived from the FASTA format with the addition of quality scores. Each read from a sequencer comprises four lines: an identifier line, a sequence line, a second identifier line (or just a + character) and finally a quality line. This typically forms the input to a mapping program (along with a FASTA reference genome). A typical human exome FASTQ file might be around 10-15GB, which can be compressed to 5-6GB using gzip.
- SAM format: mapped/aligned sequences, with details of the alignment, mapping quality etc. This usually contains a subset of the raw reads (as some will have been discarded at the mapping stage). The SAM (or BAM) file is typically used as the substrate for variant calling algorithms and other analyses.
- BAM format: binary version of SAM. A typical human exome BAM file might be around 2GB in size.
- BED format: annotation format that describes genome regions, with the optional addition of annotation data for display of genome browser tracks.
- Other annotation formats: a number of other formats are suitable for generating tracks in genome browsers.
- VCF: variant call format. This contains details about the number of reads at variant sites in the genome, plus a range of quality information. A typical human exome VCF file might contain about 20,000 lines.
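The four-line FASTQ layout described above can be read with a few lines of Python. This is a minimal sketch rather than a replacement for an established parser: the names `FastqRecord`, `read_fastq`, `phred_scores` and `open_fastq` are made up for illustration, it assumes each record occupies exactly four lines (no wrapped sequences), and it assumes quality characters use the Phred+33 offset (some older Illumina pipelines used an offset of 64).

```python
import gzip
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # text after the leading '@'
    sequence: str    # base calls
    quality: str     # raw quality string, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream, assuming four lines per read."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence/quality length mismatch in: " + header.strip())
        yield FastqRecord(header[1:].strip(), sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert a quality string to integer scores (offset 33 is an assumption)."""
    return [ord(char) - offset for char in quality]


def open_fastq(path: str) -> TextIO:
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")
```

For example, `list(read_fastq(io.StringIO("@read1\nACGT\n+\nIIII\n")))` yields a single record, and `phred_scores("IIII")` gives a score of 40 for each base. Reading the gzip-compressed file directly through `open_fastq` avoids having to keep the larger uncompressed copy on disk.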
Projects with available data
- 1000 genomes
- National Institute of Environmental Health Sciences SNP project
Complete exome sequencing data for 88 EGP samples, with VCF and BAM data available
Harvard University initiative for genome data sharing: aiming for 100,000 participants, currently only a limited quantity of data
e.g. one Yoruban human genome (sample NA18507) is available, plus analysed indel and SNP information
Selective approaches to NGS
- Sequence capture arrays – exome, gene list, specific GWAS-hit regions etc
- PCR amplification – suitable for smaller scale
- Pooling to maximise throughput (“barcoded” or anonymous)
- FAIRE-Seq: identify regions of open chromatin, where regulatory proteins bind (formaldehyde-assisted isolation of regulatory elements)
- MAINE-Seq: identify regions of closed chromatin (MNase-mediated purification of mononucleosomes to extract histone-bound DNA for sequencing)
- ChIP-Seq: identify where transcription factors (TFs) bind on nuclear DNA, using an antibody to the TF (chromatin immunoprecipitation sequencing)
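For the "barcoded" pooling approach listed above, reads from a pooled run have to be assigned back to their samples. The following is a rough sketch of one way this could be done, under several assumptions that are not from the text: the barcode sits at the very start of each read, all barcodes have the same length, and only exact matches are accepted. The function name `demultiplex` and the "unassigned" bin are hypothetical; real pipelines often read barcodes from a separate index read and tolerate mismatches.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, str]  # (identifier, sequence)


def demultiplex(reads: Iterable[Read],
                barcodes: Dict[str, str]) -> Dict[str, List[Read]]:
    """Group reads by sample using an exact-match barcode at the read start.

    `barcodes` maps barcode sequence -> sample name (illustrative layout).
    The barcode is trimmed from assigned reads; reads with no matching
    barcode are kept untrimmed under the key "unassigned".
    """
    lengths = {len(barcode) for barcode in barcodes}
    if len(lengths) != 1:
        raise ValueError("expected at least one barcode, all of equal length")
    barcode_length = lengths.pop()

    by_sample: Dict[str, List[Read]] = defaultdict(list)
    for identifier, sequence in reads:
        prefix = sequence[:barcode_length]
        sample = barcodes.get(prefix)
        if sample is None:
            by_sample["unassigned"].append((identifier, sequence))
        else:
            by_sample[sample].append((identifier, sequence[barcode_length:]))
    return dict(by_sample)
```

Anonymous pooling, by contrast, carries no barcode, so reads cannot be traced back to an individual sample in this way.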
an online forum – extremely useful for NGS
- Service providers: check with your University. This is a rapidly changing field and most universities are beginning to run systems in-house. Alternatively, commercial NGS services are available in many countries.
- ABI SOLiD: