By Lori Solomon, Editor, Diagnostic Testing & Emerging Technologies
The founders of the Allele Frequency Community believe that creating a shared pool of anonymized genomic variant information can make genome sequencing data more useful and actionable, particularly in ethnic subpopulations. The founding members of the community, with the help of Qiagen (Germany) and its bioinformatics solutions, recently began sharing pooled allele frequency statistics to aid both patient care and biomedical research.
Non-personally identifiable statistics from members’ samples drive the community-led effort. Hundreds of founding members have pooled high-quality human exome- and genome-wide variant call datasets in a "secure, anonymized" fashion to create what they say is the largest freely accessible, hosted community database of allele frequencies available to date. Access to the extensive database is granted in exchange for laboratories’ agreement to contribute their own data. Variant statistics are shared as aggregated allele frequencies, that is, the rate at which certain DNA variants are seen in certain populations.
Qiagen Bioinformatics agreed to host the data and make it accessible via its genome interpretation system. Specifically, variant call frequency-level data are stored in a laboratory’s private, HIPAA- and Safe Harbor-certified Qiagen-hosted account. Computations occur across these opted-in accounts to generate the pooled allele frequency statistics.
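As a rough illustration of how pooled statistics can be derived without exchanging sample-level records, the following minimal Python sketch sums per-laboratory allele counts and divides by the total number of alleles called. It is not Qiagen's implementation; the function name, the variant keys and the counts are hypothetical and chosen only for the example.

```python
from collections import defaultdict


def pooled_allele_frequencies(lab_counts):
    """Combine per-laboratory counts into pooled allele frequencies.

    lab_counts: iterable of dicts mapping a variant key to a tuple
    (alt_allele_count, total_alleles_called). Only these aggregate
    counts are needed; no individual genotypes are passed in.
    """
    totals = defaultdict(lambda: [0, 0])
    for counts in lab_counts:
        for variant, (alt, total) in counts.items():
            totals[variant][0] += alt
            totals[variant][1] += total
    # Frequency = pooled alternate-allele count / pooled alleles called.
    return {v: alt / total for v, (alt, total) in totals.items() if total > 0}


# Hypothetical counts from two laboratories (illustrative only).
lab_a = {"chr1:12345:A>G": (3, 200), "chr2:500:C>T": (1, 200)}
lab_b = {"chr1:12345:A>G": (7, 800)}

pooled = pooled_allele_frequencies([lab_a, lab_b])
for variant, freq in sorted(pooled.items()):
    print(variant, round(freq, 4))
```

Summing counts before dividing weights each laboratory by how many alleles it actually called, which is why the sketch does not simply average per-laboratory frequencies.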
In what organizers call a "virtuous cycle," data from more than 104,000 samples, including more than 13,000 whole genomes and representing more than 100 countries of origin, have been collected since the community launched earlier this year. The organizers say that since launch, participating scientists have reported an average false-positive reduction of 43 percent, showing how efficient and responsible data sharing can eliminate some of the headaches associated with genome interpretation.
While a prospective disease-causing variant may be classified as “rare” based on publicly available sequence information, it may in fact be a polymorphism in an ethnic population under-represented in public databases, Ramon Felciano, the chief technology officer at Qiagen, explains in an editorial published Aug. 12 in The Scientist. Without variant information specific to individual ethnic groups, clinical geneticists must invest a significant amount of time looking into these supposed disease-causing "red herring variants … diverting resources from the real causative variant."
Understanding rare variants requires large datasets. Yet sharing of genomic information is frequently stymied by institutions’ strict privacy protocols, even though laboratories often collect their own private allele frequency libraries.
"With so many groups around the world generating genomic data at a breakneck pace, it’s tempting to assume that this limitation will quickly fall away as the data volume grows," Felciano writes. "But we have seen time and again that unless key information from these data sets is pooled, there is little community benefit."
In addition to members’ contributions, high-quality public resources like ExAC, CG Diversity and Exome Variant Server are incorporated into the Allele Frequency Community database. Although the community reports that in just a few months it has already become the “world’s largest source of ethnically diverse genomic data,” it plans to enhance ethnic diversity further. Ethnic subpopulation frequencies will be provided once a population reaches a minimum size of 100 patients.
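The minimum-size rule for subpopulations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the community's actual procedure: the function name, the population labels and the counts are hypothetical, and the sketch assumes diploid sites with two alleles called per patient.

```python
MIN_POPULATION_SIZE = 100  # minimum number of patients, as stated above


def subpopulation_frequencies(counts_by_population, min_size=MIN_POPULATION_SIZE):
    """Report a variant's frequency only for populations that are large enough.

    counts_by_population: dict mapping a population label to a tuple
    (alt_allele_count, n_patients). Assumes two alleles per patient.
    """
    result = {}
    for population, (alt, n_patients) in counts_by_population.items():
        if n_patients >= min_size:
            result[population] = alt / (2 * n_patients)
    return result


# Hypothetical counts for one variant (illustrative only).
example = {"population_A": (30, 150), "population_B": (2, 40)}

freqs = subpopulation_frequencies(example)
print(freqs)  # population_B is withheld because it has fewer than 100 patients
```

Withholding frequencies for small groups avoids reporting estimates based on very few patients.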