Faster Genome Analysis Enabling Clinical Application, Population-Scale Translational Research
Genome analysis pipelines are getting faster. These computational advances promise to alleviate the notorious analysis bottleneck that has challenged clinical adoption of genome sequencing. To achieve widespread clinical relevance, time to results must be cut significantly; to enable the next wave of understanding about the genetic origins of disease, analysis pipelines must also be robust enough to accommodate population-sized datasets of tens of thousands of genomes. Experts believe the technology to overcome these analysis challenges is now entering the marketplace. As next-generation sequencing (NGS) instruments become more commonplace in laboratories and churn out raw data at ever faster rates, access to scalable analysis tools becomes even more critical. Optimized analysis workflow solutions are the missing link, able to transform big data into clinically actionable information or scientific discoveries. Getting from raw base pair data to a report of pathogenic variants requires multiple computational steps: alignment, deduplication, realignment, recalibration, and variant discovery. The resulting variant call format (VCF) file then requires tertiary analysis to match variants with clinically relevant information. Current analysis approaches can take weeks to complete and require bioinformatics expertise and computing infrastructure whose cost can exceed the price of generating the sequencing data itself. Taken […]
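The secondary-analysis steps listed above are commonly wired together from open-source tools. The sketch below is a minimal, generic example of such a workflow in Python, assuming BWA-MEM, samtools, and GATK4 are installed; it is not Churchill's or any vendor's pipeline, and the file paths, sample name, thread count, and exact flags are illustrative placeholders to verify against each tool's documentation.

```python
"""
Illustrative sketch of a generic secondary-analysis workflow: alignment,
deduplication, base-quality recalibration, and variant discovery.
Paths, sample names, and resource files are placeholders.
"""
import subprocess

REF = "ref/GRCh38.fa"             # reference genome (placeholder path)
KNOWN_SITES = "ref/dbsnp.vcf.gz"  # known variant sites for recalibration (placeholder)

def run(cmd: str) -> None:
    """Run one shell stage, failing loudly if it errors."""
    print(f"[pipeline] {cmd}")
    subprocess.run(cmd, shell=True, check=True)

def secondary_analysis(sample: str, fq1: str, fq2: str) -> str:
    # 1. Alignment: map paired-end reads with BWA-MEM, then coordinate-sort.
    run(f"bwa mem -t 8 -R '@RG\\tID:{sample}\\tSM:{sample}' {REF} {fq1} {fq2} "
        f"| samtools sort -o {sample}.sorted.bam -")
    run(f"samtools index {sample}.sorted.bam")

    # 2. Deduplication: flag PCR/optical duplicate read pairs.
    run(f"gatk MarkDuplicates -I {sample}.sorted.bam "
        f"-O {sample}.dedup.bam -M {sample}.dup_metrics.txt")
    run(f"samtools index {sample}.dedup.bam")

    # 3. Base-quality recalibration (indel realignment, a separate step in
    #    older GATK releases, is now handled within the haplotype caller).
    run(f"gatk BaseRecalibrator -I {sample}.dedup.bam -R {REF} "
        f"--known-sites {KNOWN_SITES} -O {sample}.recal.table")
    run(f"gatk ApplyBQSR -I {sample}.dedup.bam -R {REF} "
        f"--bqsr-recal-file {sample}.recal.table -O {sample}.recal.bam")

    # 4. Variant discovery: emit a VCF ready for tertiary (clinical) annotation.
    run(f"gatk HaplotypeCaller -I {sample}.recal.bam -R {REF} -O {sample}.vcf.gz")
    return f"{sample}.vcf.gz"

if __name__ == "__main__":
    vcf = secondary_analysis("NA12878", "NA12878_R1.fastq.gz", "NA12878_R2.fastq.gz")
    print(f"VCF ready for tertiary analysis: {vcf}")
```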
An Automated Solution

To overcome the challenges of analyzing these large amounts of data, White and his team developed a computational pipeline called “Churchill.” By applying novel computational techniques, the fully automated Churchill can analyze a whole genome in 77 minutes. Its developers predict that the platform’s speed will have a “major impact” in clinical diagnostic sequencing. Churchill’s algorithm was licensed to Columbus-based GenomeNext for commercialization as a secure software-as-a-service. “Accuracy and speed are extremely important even if you are dealing with one sample,” says James Hirmas, CEO of GenomeNext. “If it takes two days to get through the sequencing and then two weeks of analysis to determine the pathologic variant, that is too long to be relevant for a critically ill newborn.”

According to a Jan. 20 article in Genome Biology, Churchill’s performance was validated using the Genome in a Bottle Consortium reference sample. Churchill demonstrated high overall sensitivity (99.7 percent), accuracy (99.9 percent), and diagnostic effectiveness (99.7 percent), the highest of the three pipelines assessed. The other pipelines tested were the Genome Analysis Toolkit-Queue (using scatter-gather parallelization) and HugeSeq (using chromosomal parallelization). The developers say Churchill’s deterministic performance “sets an NGS analysis standard of 100 percent reproducibility, without sacrificing data quality.” “We aren’t naive to think that other groups aren’t trying to do this and they may achieve comparable speed in the future,” Hirmas tells DTET. “So the issue is quality. The hidden dark secret of genome analysis tools is determinism and reproducibility.”

Churchill divides the genome into thousands of smaller regions and processes them in parallel. While this sounds obvious, development was “challenging.” White says that central to Churchill’s parallelization strategy is a novel deterministic algorithm that enables division of the workflow across many genomic regions with fixed boundaries, or ‘subregions.’ “This division of work, if naively implemented, would have major drawbacks: read pairs spanning subregional boundaries would be permanently separated, leading to incomplete deduplication and variants on boundary edges would be lost,” White writes in Genome Biology. “To overcome this challenge, Churchill utilizes both an artificial chromosome, where interchromosomal or boundary-spanning read pairs are processed, and overlapping subregional boundaries, which together maintain data integrity and enable significant performance improvements.”

Churchill’s speed is also highly scalable, enabling full analysis of the raw 1000 Genomes Project sequence dataset in a week using cloud resources, which the developers say demonstrates its utility for population-scale genomic analysis. Churchill identified 41.2 million variants in the dataset, with 34.4 million variant sites in common between Churchill and the 1000 Genomes Project’s own analysis. Analyzing the 1,088 low-coverage whole-genome samples cost approximately $12,000 in total, inclusive of data storage and processing, White says.

Hirmas tells DTET that the company’s platform is well suited to both clinical laboratories and research entities engaged in large-scale genomic studies. Sequencing, Hirmas explains, is run in batch jobs, and it is more economical, depending on instrument size, to run 20 or even 50 samples in a tube.
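To make the subregion strategy concrete, the toy Python sketch below splits chromosomes into fixed-boundary chunks with a small overlap and routes read pairs whose mates fall in different subregions, or on different chromosomes, to a separate “artificial chromosome” bucket. This illustrates the idea as described in the Genome Biology paper, not Churchill’s actual implementation; the chromosome lengths, chunk sizes, and example read pairs are invented.

```python
"""
Toy illustration of fixed-boundary subregions with overlap, plus an
'artificial chromosome' bucket for boundary-spanning or interchromosomal
read pairs, so deduplication and edge variants are not lost.
"""
from collections import defaultdict

CHROM_LENGTHS = {"chr1": 1_000_000, "chr2": 800_000}  # toy lengths
SUBREGION_SIZE = 250_000
OVERLAP = 1_000  # padding so edge variants are called in both neighbors, then reconciled

def make_subregions():
    """Build fixed-boundary subregions; 'padded' shows the overlap used for calling."""
    regions = []
    for chrom, length in CHROM_LENGTHS.items():
        for start in range(0, length, SUBREGION_SIZE):
            end = min(start + SUBREGION_SIZE, length)
            regions.append({
                "id": f"{chrom}:{start}-{end}",
                "chrom": chrom, "start": start, "end": end,
                "padded": (max(0, start - OVERLAP), min(length, end + OVERLAP)),
            })
    return regions

def assign_region(chrom, pos, regions):
    """Return the id of the unpadded subregion owning a coordinate."""
    for r in regions:
        if r["chrom"] == chrom and r["start"] <= pos < r["end"]:
            return r["id"]
    return None

def bucket_read_pairs(pairs, regions):
    """Route each pair to one subregion, or to the artificial chromosome when
    its mates land in different subregions or on different chromosomes."""
    buckets = defaultdict(list)
    for name, (chrom1, pos1), (chrom2, pos2) in pairs:
        r1 = assign_region(chrom1, pos1, regions)
        r2 = assign_region(chrom2, pos2, regions)
        buckets[r1 if r1 == r2 else "artificial_chrom"].append(name)
    return buckets

if __name__ == "__main__":
    regions = make_subregions()
    pairs = [
        ("pair_ok",         ("chr1", 100),     ("chr1", 350)),      # same subregion
        ("pair_boundary",   ("chr1", 249_900), ("chr1", 250_200)),  # spans a boundary
        ("pair_interchrom", ("chr1", 500),     ("chr2", 900)),      # interchromosomal
    ]
    for region, names in bucket_read_pairs(pairs, regions).items():
        print(region, names)
```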
While a backlog of 50 samples doesn’t approach the thousands of genomes associated with population-scale genomics, even 50 genomes can be problematic for a lab if each takes two weeks to analyze. Offering fast genome analysis as a service in the cloud is expected to accelerate clinical adoption of whole-exome and whole-genome sequencing and to put the technology within reach of smaller laboratories. Genome analysis as a service eliminates many of the upfront costs and ongoing overhead expenses tied to developing analysis in-house. Labs can get tests up and running faster without the capital outlay needed to procure computing infrastructure, and they don’t have to assemble hard-to-find bioinformatics teams. These commercial systems are scalable, meaning laboratories have access to the computational power they need when volumes are high but aren’t managing the overhead of idle on-site capacity when testing volumes are low. Finally, despite uncertainty over added regulation of sequencing-based testing and evolving security policies, GenomeNext and other emerging software-as-a-service genome analysis companies are building their systems to meet security and other laboratory regulations. For instance, with these services, clinical laboratories can lock down their analysis pipelines to meet CLIA and College of American Pathologists requirements.
Population-Scale Data Analysis

While the clinical implications of speedier genome analysis are clear, speed is also paramount to analyzing more genomes for translational research. It took the 1000 Genomes Project six years to sequence 2,504 individuals, analyze the genomes, and release final population variant frequencies. Experts say larger-scale genomic studies can address an often-heard frustration: sequencing the human genome has not yielded the understanding of the genetic etiology of common diseases that many scientists had hoped for. The next series of breakthroughs in medicine will depend on population-sized comparisons of hundreds of thousands, or maybe even millions, of genomes to crack the root genetic causes of disease. To analyze that many genomes in a meaningful timeframe, networked cloud computers will be necessary to generate enough processing power. But to take full advantage of the data captured in the growing repository of sequenced DNA, large amounts of genomic information must be able to be transferred, shared, and re-analyzed in a secure fashion.

“The unfolding calamity in genomics is that a great deal of this life-saving information, though already collected, is inaccessible,” Antonio Regalado writes in MIT Technology Review. “The risk of not getting data sharing right is that the genome revolution could sputter.” The ‘Internet of DNA,’ a global network of millions of genomes, was named one of MIT Technology Review’s top 10 breakthrough technologies for 2015, and the magazine believes this kind of genome sharing may be achievable in the next two years. Regalado says DNA sequencing instruments will be able to produce 85 petabytes of data this year worldwide and twice that much in 2019. By comparison, all of Netflix’s master copies of movies take up 2.6 petabytes of storage. Genome sequencing is “largely detached,” he says, from “our greatest tool for sharing information: the Internet.” The data from the 200,000 genomes already sequenced are largely stored in disparate systems and, when shared, are “moved around in hard drives and delivered by FedEx trucks.”

Culturally, scientists are often reluctant to share genetic data, in part because of the legal risks surrounding privacy rules and the threat of security breaches. Patient privacy policy is slowly evolving to reflect the genomic and Internet era. In late March, the National Institutes of Health (NIH) issued a position statement on the use of cloud computing services for analysis of controlled-access data. The agency decided, “In light of the advances made in security protocols for cloud computing in the past several years and given the expansion in the volume and complexity of genomic data generated by the research community, the NIH is now allowing investigators to request permission to transfer controlled-access genomic and associated phenotypic data obtained from NIH-designated data repositories ... to public or private cloud systems for data storage and analysis.”

Comfort with genomic data sharing will also grow with the development of enhanced security measures, such as advanced encryption methods. According to a March 23 Nature News article, significant progress has been made using homomorphic encryption to analyze genetic data. At the iDASH Privacy & Security Workshop (San Diego; March 16), groups demonstrated that they could find disease-associated gene variants in about ten minutes using the method.
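In broad strokes, homomorphic encryption lets an untrusted server compute on ciphertexts and return an encrypted answer that only the data owner can decrypt; the mechanics are described in more detail in the next paragraph. The sketch below is a toy illustration of that encrypt-compute-decrypt round trip using the additively homomorphic Paillier scheme via the open-source python-paillier (phe) package, an assumed dependency. It is a deliberate simplification: the iDASH demonstrations used more general homomorphic schemes, and the allele-counting task and genotype values here are invented for illustration.

```python
"""
Toy encrypt-locally / compute-in-the-cloud / decrypt-locally round trip
using the additively homomorphic Paillier scheme (python-paillier, 'phe').
"""
from phe import paillier

# --- data owner (local machine) --------------------------------------------
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Genotypes at one variant site for a small cohort, coded as the number of
# alternate alleles each person carries (0, 1, or 2). Invented values.
genotypes = [0, 1, 2, 0, 1, 1, 0, 2]
encrypted_genotypes = [public_key.encrypt(g) for g in genotypes]

# --- untrusted server (cloud) ----------------------------------------------
# The server holds only the public key and ciphertexts; it sums the encrypted
# genotypes without ever seeing the plaintext values.
encrypted_total = public_key.encrypt(0)
for ciphertext in encrypted_genotypes:
    encrypted_total = encrypted_total + ciphertext  # ciphertext-ciphertext addition

# --- data owner (local machine) --------------------------------------------
alt_count = private_key.decrypt(encrypted_total)
alt_freq = alt_count / (2 * len(genotypes))
print(f"alternate allele count = {alt_count}, frequency = {alt_freq:.3f}")
assert alt_count == sum(genotypes)  # matches the unencrypted calculation
```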
In homomorphic encryption, data is encrypted on a local computer and the scrambled data is uploaded to the cloud. Computations are performed directly on the encrypted data in the cloud, and an encrypted result is sent back to the local computer, which decrypts the answer. The cloud-based computational pipeline never ‘sees’ the raw data, yet, the cryptographers say, the scheme gives the same result as calculations on unencrypted data. Early versions of the technology were hampered by how long analysis of encrypted data took. While the cryptographers at the workshop acknowledge that the encrypted technologies are still slower than analysis pipelines using raw data, they are “encouraged.” Five teams demonstrated homomorphic encryption schemes that could examine data from 400 people within about 10 minutes and pick out a disease-linked variant from among 311 spots where the genome is known to vary. It took 30 minutes to analyze 5,000-base-pair stretches (a little larger than the size of a typical gene), while for larger stretches of sequence data (100,000 base pairs, or about 0.003 percent of the overall genome) analysis was not always possible, or took hours, and consumed up to 100 times more memory than computing on unencrypted data, Nature News reports. “The same calculation that took a day and a half in 2012 now takes us five minutes to do,” Shai Halevi, Ph.D., a researcher in cryptography and information security at the IBM Thomas J. Watson Research Center (Yorktown Heights, N.Y.), tells Nature News. “Now is the time to ask, is this fast enough to be usable?”

Takeaway: The industry is poised to dramatically reduce the time needed to analyze a genome from weeks to hours. This profound improvement in speed, combined with enhanced data sharing, management, and security methods, will accelerate both clinical adoption of whole-exome and whole-genome testing and the ability to conduct large-scale genome analysis for translational research studies.