Influenza Virus High-Throughput Sequencing Pipeline
Sequencing Strategies. The NIAID/JCVI Influenza Genome Sequencing Project (IGSP) uses two high-throughput pipelines: an amplicon-based Sanger sequencing pipeline and a multiplexed Next Generation sequencing pipeline. For both pipelines, initial sample preparation uses a specialized multi-segment RT-PCR (M-RTPCR) procedure that is well suited to high-throughput nucleotide sequencing and amplifies the genome of all subtypes of Influenza A virus. The technique amplifies all eight genome segments without prior knowledge of the virus's sequence. All samples undergo this step to standardize sample extraction and to enrich for full-length influenza genomes. RNA from specimens, ideally sent as frozen allantoic fluid or tissue culture supernatant, serves as the template for amplifying the influenza genomes. Quality control assays are performed on the full-genome amplicons to confirm that the entire genome has been amplified.
Human influenza samples are currently sequenced using the Sanger sequencing method. Degenerate primers are designed from aligned reference genomes using a computational PCR primer design pipeline developed at JCVI to produce tiled amplicons with an optimal length of 550 bp and a 100 bp overlap, in order to provide six-fold sequence coverage of the influenza genome. An M13 sequence tag is added to the 5' end of each degenerate primer and is used for sequencing. Primers are arranged in a 96-well plate format, and all PCR reactions for each sample are performed in one plate. Sequencing reactions are performed using BigDye Terminator chemistry (Applied Biosystems). Each amplicon is sequenced from both ends using M13 primers, and sequencing reactions are analyzed on an ABI 3730 sequencer. Raw sequence data are trimmed to remove primer-derived and low-quality sequence, and gene sequences are assembled using a viral assembly tool.
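The tiling scheme described above (550 bp amplicons with a 100 bp overlap) can be illustrated with a short Python sketch. This is not the JCVI primer design pipeline; the function name `tile_amplicons` and the choice to anchor the last amplicon to the segment end are assumptions made for illustration, and degenerate primer selection is not modeled.

```python
def tile_amplicons(segment_length, amplicon_length=550, overlap=100):
    """Return 0-based, half-open (start, end) windows that tile one segment.

    Hypothetical helper: only window placement is modeled, not primer design.
    """
    if overlap >= amplicon_length:
        raise ValueError("overlap must be smaller than amplicon length")
    if segment_length <= amplicon_length:
        # Short segments fit in a single amplicon.
        return [(0, segment_length)]

    step = amplicon_length - overlap
    windows = []
    start = 0
    while start + amplicon_length < segment_length:
        windows.append((start, start + amplicon_length))
        start += step
    # Assumption: the final window is anchored to the segment end so the
    # terminus is covered; its overlap with the previous window is >= overlap.
    windows.append((segment_length - amplicon_length, segment_length))
    return windows


if __name__ == "__main__":
    # Example with a 2341-nt segment (the typical length of influenza A PB2).
    for start, end in tile_amplicons(2341):
        print(start, end)
```

Under these assumptions a 2341-nt segment is covered by five amplicons, with the last one shifted back so that it ends exactly at the segment terminus.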
Avian influenza samples are subjected to a multiplexed Next Generation sequencing strategy that enables sequencing of 100-200 samples in a single sequencer run. Hybrid Next Generation sequencing technologies (Roche 454 and Illumina GAIIx) can have significant advantages over Sanger sequencing platforms owing to their greater average sequence coverage and the clonal nature of the sequence reads. The Sequence Independent Single Primer Amplification (SISPA) method is used to barcode samples so that multiple samples can be sequenced per Next Generation sequencing run. The SISPA method randomly amplifies RNA or DNA and has been used to generate consensus sequences for 100 samples at >200X average depth of coverage per 454 or Illumina run. The activities at JCVI include 1) barcoding of amplicons from individual viruses, 2) library construction from a pool of barcoded amplicons, 3) high-throughput sequencing with 454 and Illumina technologies, 4) deconvolution of barcoded samples and assembly using a software pipeline developed at JCVI, 5) closure of samples using standard Sanger-based finishing methods, and finally 6) submission of assembled sequences, metadata and sequence files to NCBI.
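As a rough check on the multiplexing figures above, the sequence yield needed for a given depth can be estimated with simple arithmetic. The sketch below assumes an influenza A genome of roughly 13,500 nt (an approximate figure that is not stated in this text), perfectly even pooling, and no off-target reads; real runs deviate from all three, so the result is a lower bound rather than a planning number.

```python
# Approximate total length of the eight influenza A segments (assumption).
FLU_A_GENOME_LENGTH = 13_500


def required_bases(n_samples, genome_length, target_depth):
    """Bases of on-target sequence needed to reach target_depth for every sample."""
    return n_samples * genome_length * target_depth


def average_depth(total_bases, n_samples, genome_length):
    """Average depth per sample if total_bases are spread evenly across samples."""
    return total_bases / (n_samples * genome_length)


if __name__ == "__main__":
    needed = required_bases(100, FLU_A_GENOME_LENGTH, 200)
    print(f"{needed / 1e6:.0f} Mb of on-target sequence for 100 samples at 200X")
```

With these assumptions, 100 samples at 200X correspond to about 270 Mb of usable sequence.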
The combination of SISPA and Next Generation sequencing is optimal for tackling viral genomes that are highly polymorphic or contain unknown sequences. However, owing to biased sequence coverage, many of these viral genomes require substantial finishing by targeted Sanger sequencing of problematic regions. A primary objective is therefore to remove or reduce the biases in initial sequence coverage. When this has been accomplished, we envision that all viral sequencing at JCVI will be conducted on Next Generation platforms.
Assembly of Next Generation Sequencing Data. An automated assembly pipeline has been established to assemble hundreds of viral genomes that have been sequenced using a multiplex barcoding protocol for 454 and Illumina. The assembly method involves deconvolution and binning of barcoded reads, followed by trimming of barcodes and random hexamers. Binned reads are assembled de novo and compared against a database of full-length genomic sequences using BLASTN to select the best reference for a mapping assembly. Consensus sequences are generated, and variations are identified by heuristic algorithms that compare the output of multiple sequencing technologies.
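The binning, trimming and reference-selection steps described above can be sketched in Python as follows. This is a simplified illustration, not the JCVI pipeline: it assumes each read starts with an exact barcode followed by a six-base random hexamer, and it picks the reference with the highest summed bit score from BLASTN tabular output in the default 12-column order. The function names, the barcode sequences in the usage note, and the scoring rule are all hypothetical.

```python
from collections import defaultdict

HEXAMER_LENGTH = 6  # length of the random hexamer that follows the barcode


def bin_and_trim(reads, barcodes, hexamer_length=HEXAMER_LENGTH):
    """Assign reads to samples by exact 5' barcode match, then trim.

    reads: iterable of (read_id, sequence)
    barcodes: dict mapping sample name -> barcode sequence (uppercase)
    Returns (bins, unassigned_ids). Reads with nothing left after trimming
    are dropped. Assumption: no mismatches are tolerated in the barcode.
    """
    bins = defaultdict(list)
    unassigned = []
    for read_id, seq in reads:
        seq = seq.upper()
        for sample, barcode in barcodes.items():
            if seq.startswith(barcode):
                trimmed = seq[len(barcode) + hexamer_length:]
                if trimmed:
                    bins[sample].append((read_id, trimmed))
                break
        else:
            unassigned.append(read_id)
    return dict(bins), unassigned


def best_reference(blast_rows):
    """Pick the subject with the highest summed bit score.

    blast_rows: tab-separated lines in BLAST tabular format with the default
    column order (subject ID in column 2, bit score in column 12).
    Assumption: summed bit score is used here as the selection rule.
    """
    totals = defaultdict(float)
    for row in blast_rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comments or malformed lines
        totals[fields[1]] += float(fields[11])
    if not totals:
        return None
    return max(totals, key=totals.get)
```

For example, with `barcodes = {"sampleA": "ACGTAC", "sampleB": "TGCATG"}`, a read `"ACGTAC" + "NNNNNN" + "AGCAAAAGCAGG"` is placed in the `sampleA` bin as `AGCAAAAGCAGG`, and a read that matches neither barcode is returned in the unassigned list. A production demultiplexer would typically also tolerate sequencing errors in the barcode and handle barcodes at both read ends.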
Finishing and Sequence Submissions. Assemblies are edited computationally and manually. When there is insufficient underlying sequence information, the sample is entered into the secondary sequencing pipeline. In the secondary pipeline, samples are re-amplified either with existing primers or with primers designed from the problematic sequence assembly itself, and are then resequenced. The completed sequences are validated using the NCBI validator and submitted to the NCBI Influenza Virus Resource.
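One way to flag assemblies with insufficient underlying sequence is to scan per-base depth for low-coverage stretches. The sketch below is illustrative only: the criteria actually used to send a sample to the secondary pipeline are not stated here, and both the function name `low_coverage_regions` and the `min_depth` threshold are hypothetical.

```python
def low_coverage_regions(depths, min_depth=2):
    """Return 0-based, half-open (start, end) intervals with depth < min_depth.

    depths: per-base read depth along one assembled segment.
    min_depth: hypothetical threshold; choose it to match the finishing policy.
    """
    regions = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos  # open a new low-coverage interval
        elif start is not None:
            regions.append((start, pos))  # close the current interval
            start = None
    if start is not None:
        # The low-coverage interval runs to the end of the segment.
        regions.append((start, len(depths)))
    return regions
```

Each interval returned could then serve as a target for re-amplification and resequencing, for instance by designing primers that flank the interval.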
