< TGP QC > Quality Control of large genome datasets.
Importance
The 1000 Genomes Project (TGP) is a foundational resource which serves the biomedical community as a standard reference cohort for human genetic variation. There are now five public versions of these genomes. Given the importance of this high-impact reference set of human variation, validating the quality of each release is a crucial concern.
Information

  • The various versions of the TGP were evaluated by focusing on a small subset of samples, by comparing variants reported via different technologies and by comparison to 'truth set' calls from the Genome In A Bottle (GIAB) Consortium.
  • The availability of multiple versions of the same dataset enables a different type of quality control: cross-comparison among the versions. This supports identification of version-specific results and data processing failures throughout the cohort, whereas benchmarking against a small subset of high-quality genomes leaves most of the cohort unassessed.
  • We used genome fingerprints and some additional statistics to compare the five versions of the TGP (TGP37, TGP38L, TGP38S, TGP38X and TGP38H) and their associated related samples, in terms of (a) the set of genomes analyzed, (b) known and cryptic relatedness within each cohort, (c) patterns of SNV and genotype concordance between versions, and (d) phasing concordance.
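
As a rough illustration of points (a) and (c) above, the following minimal sketch compares the sample sets of two dataset versions and computes genotype concordance for one sample over the sites called in both. It is not the genome fingerprint method used in the study. The function names, the toy data, and the choice of sample IDs are hypothetical, and the sketch assumes both versions use the same coordinate system, which does not hold between versions built on different reference assemblies without a coordinate conversion step.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical representation: (chromosome, position) -> unphased allele pair.
Genotypes = Dict[Tuple[str, int], Tuple[int, int]]


def compare_sample_sets(
    a: Iterable[str], b: Iterable[str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (shared, only_in_a, only_in_b) sample IDs."""
    set_a, set_b = set(a), set(b)
    return set_a & set_b, set_a - set_b, set_b - set_a


def genotype_concordance(g1: Genotypes, g2: Genotypes) -> Tuple[float, int]:
    """Return (fraction of matching genotypes, number of shared sites).

    Only sites called in both versions are compared. Allele order is
    ignored, so (0, 1) and (1, 0) count as the same unphased genotype.
    """
    shared_sites = g1.keys() & g2.keys()
    if not shared_sites:
        return float("nan"), 0
    matches = sum(
        1 for site in shared_sites if sorted(g1[site]) == sorted(g2[site])
    )
    return matches / len(shared_sites), len(shared_sites)


# Toy data for illustration only; these are not real genotype calls.
samples_v1 = ["HG00096", "HG00097", "NA12878"]
samples_v2 = ["HG00097", "NA12878", "NA12891"]
shared, only_v1, only_v2 = compare_sample_sets(samples_v1, samples_v2)

calls_v1 = {("chr1", 100): (0, 1), ("chr1", 200): (1, 1), ("chr1", 300): (0, 0)}
calls_v2 = {("chr1", 100): (1, 0), ("chr1", 200): (0, 1), ("chr1", 400): (0, 1)}
concordance, n_sites = genotype_concordance(calls_v1, calls_v2)

print(sorted(shared), sorted(only_v1), sorted(only_v2))
print(concordance, n_sites)
```

A sample that is present in one version but absent from another, or whose concordance is much lower than that of the rest of the cohort, would be a candidate for closer inspection.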
Downloadables
Communication

  • Questions, comments, bug reports, and suggestions for improvements or additional data sets are most welcome!
  • Would you like to receive notification of updates to the genome dataset QC project? Follow via Twitter, or send me a note.
  • If you find these methods and resources useful for your work, please cite:
    Robinson M, Joshi A, Vidyarthi A, Maccoun M, Rangavajjhala S and Glusman G. Quality control of large genome datasets. HGG Advances 2022, 3(3): 100123.