By Lori Solomon, Editor, Diagnostic Testing & Emerging Technologies
Genomics may generate more data than the largest big-data producers in other industries by 2025, according to a study published July 7 in PLOS Biology. The study compared genomics’ data requirements with those of the current biggest generators of data (astronomy, YouTube, and Twitter) across the life cycle of a dataset: acquisition, storage, distribution, and analysis.
Across astronomy, YouTube, and Twitter, data acquisition is expected to grow by up to two orders of magnitude between now and 2025, with projections of an estimated 1,350 hours of video uploaded to YouTube per minute (one to two exabytes of video data per year) over the next 10 years. By comparison, the amount of sequencing data produced doubles every seven months, the authors say. Current estimates based on reads at the Sequence Read Archive are that 3.6 petabytes of raw data exist (32,000 microbial genomes, 5,000 plant and animal genomes, and 250,000 human genomes). However, current worldwide sequencing capacity is already estimated to be 35 petabases per year, making exabase-scale sequencing output per year reachable anywhere between five and 10 years from now.
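The growth arithmetic behind these figures can be checked with a short calculation. The following is only a rough sketch, not the authors' model: it assumes steady exponential growth starting from the 35 petabases per year capacity figure, and the function name and the 12- and 18-month doubling periods are illustrative assumptions that do not come from the study.

```python
import math

PETABASE = 10**15
EXABASE = 10**18


def years_to_reach(current, target, doubling_months):
    """Years for `current` to grow to `target` if it doubles every `doubling_months`.

    Assumes constant exponential growth, which is a simplification.
    """
    if current <= 0 or target <= 0 or doubling_months <= 0:
        raise ValueError("all arguments must be positive")
    if target <= current:
        return 0.0
    doublings = math.log2(target / current)  # number of doublings required
    return doublings * doubling_months / 12


# Starting point taken from the article: 35 petabases per year of capacity.
current_capacity = 35 * PETABASE

# Seven months is the doubling time cited in the article; 12 and 18 months
# are hypothetical slower rates included only for comparison.
for months in (7, 12, 18):
    years = years_to_reach(current_capacity, EXABASE, months)
    print(f"doubling every {months:>2} months: {years:.1f} years to 1 exabase/year")
```

Under these assumptions, a sustained seven-month doubling time would cross the exabase threshold in under three years, while an 18-month doubling time would take roughly seven years. Installed capacity is not the same as data actually produced, so the result is best read as an order-of-magnitude check.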
The authors say, though, that these sequencing data acquisition estimates are "dwarfed" by the "reasonable possibility" that 25 percent of the population in developed nations will have their genomes sequenced by 2025. This significant expected growth in human genome sequencing is partially based upon population-scale sequencing projects announced by governments worldwide (United Kingdom, Saudi Arabia, Iceland, United States, and China).
"We therefore estimate between 100 million and as many as 2 billion human genomes could be sequenced by 2025, representing four to five orders of magnitude growth in ten years and far exceeding the growth for the three other big data domains," write the authors, led by Zachary Stephens of the University of Illinois at Urbana-Champaign. "Indeed, this number could grow even larger, especially since new single-cell genome sequencing technologies are starting to reveal previously unimagined levels of variation, especially in cancers, necessitating sequencing the genomes of thousands of separate cells in a single tumor."
In the realm of data storage, total genomic data could also far exceed the demands of the other large data generators. Already the 20 largest sequencing institutions use 100 petabytes of storage, versus Twitter’s 0.5 petabytes per year. The largest astronomy data center is currently estimated to use 100 petabytes, and YouTube’s storage is estimated at 100 petabytes to 1 exabyte, the authors report. However, substantial efforts are being made to enhance data reduction and real-time ’omics analysis, which could make net storage requirements similar to those of astronomy and YouTube.
The authors say that "concerted" community-wide planning is needed now to address the "genomical" challenges that will emerge in the next 10 years, particularly in distribution, where cloud-based solutions are likely to be the most practical, and in analysis, where large-scale machine-learning systems will be needed.