Genome Fingerprints
Genome fingerprints enable ultrafast comparison of personal genomes.
This project is related to (but distinct from) Genotype Fingerprints.
Importance

  • Personal genome sequences contain the information required for assessing genetic risks, matching genetic backgrounds between cases and controls in medical research, and detecting duplicate individuals or close relatives for medical, legal, or historical reasons. Research purposes served by personal genome sequencing include classifying individuals by population, reconstructing human history, assessing and controlling the quality of the sequence information itself, computing kinship matrices to support genome-wide association studies (GWAS), and combining data sets for meta-analysis.
  • Many of these applications involve comparing two or more personal genomes. However, the size, complexity, and diversity of the representations in which personal genomes are stored make comparing them in their existing forms slow and error-prone. Such comparisons are therefore hard to scale from pairs to the hundreds, thousands, or millions of individuals we will soon wish to compare in order to provide improved, personalized medical care.
  • We developed an ultrafast method for comparing personal genomes; our method is akin to locality-sensitive hashing. We transform the standard genome representation (lists of variants relative to a reference) into 'genome fingerprints' that can be readily compared across sequencing technologies and reference versions. Because of their reduced size, computation on the genome fingerprints is fast and requires little memory. This enables scaling up a variety of important genome analyses, including quantifying relatedness, recognizing duplicate sequenced genomes in a set, reconstructing populations, and many others. The original genome representation cannot be reconstructed from its fingerprint; the method thus has significant implications for privacy-preserving genome analytics.
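
The transformation described above can be sketched roughly as follows. This is a minimal illustration, not the project's own code. It assumes that a fingerprint is a table of counts indexed by the substitution types of two consecutive single-nucleotide variants and by the distance between them modulo L, and it uses a plain Pearson correlation as a stand-in similarity measure. The function names (fingerprint, similarity), the input tuple layout, and the omission of any filtering or normalization steps are assumptions made for this example.

```python
from collections import Counter
from itertools import product
from math import sqrt

BASES = "ACGT"
# The 12 possible single-nucleotide substitutions, e.g. "AC" means A>C.
SNVS = [ref + alt for ref in BASES for alt in BASES if ref != alt]
# One key per ordered pair of consecutive substitutions: 12 * 12 = 144 keys.
KEYS = [first + second for first, second in product(SNVS, SNVS)]


def fingerprint(variants, L=20):
    """Return a flat count vector of length 144 * L.

    `variants` is an iterable of (chrom, pos, ref, alt) tuples, assumed to be
    sorted by chromosome and position (hypothetical input layout).
    """
    counts = Counter()
    prev = None
    for chrom, pos, ref, alt in variants:
        # Keep only simple substitutions; skip indels and other variant types.
        if ref + alt not in SNVS:
            continue
        if prev is not None and prev[0] == chrom and pos > prev[1]:
            key = prev[2] + prev[3] + ref + alt
            # Only the distance to the previous variant is used,
            # never the absolute coordinate.
            counts[(key, (pos - prev[1]) % L)] += 1
        prev = (chrom, pos, ref, alt)
    return [counts[(key, offset)] for key in KEYS for offset in range(L)]


def similarity(fp_a, fp_b):
    """Pearson correlation between two fingerprints of equal length."""
    n = len(fp_a)
    mean_a, mean_b = sum(fp_a) / n, sum(fp_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(fp_a, fp_b))
    var_a = sum((a - mean_a) ** 2 for a in fp_a)
    var_b = sum((b - mean_b) ** 2 for b in fp_b)
    if var_a == 0 or var_b == 0:
        return 0.0  # correlation is undefined for a constant vector
    return cov / sqrt(var_a * var_b)
```

In this sketch the fingerprint has a fixed size (144 * L counts) regardless of how many variants the genome contains, which is what makes pairwise comparison cheap. Because only distances between neighboring variants enter the table, shifting all coordinates by a constant leaves the sketch's output unchanged.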
Using it locally

  • Download the code and data sets:
    • Get the full code set from GitHub.
    • Download serialized fingerprints (L=20 and L=120) for the 2504 genomes in the 1000 Genomes Project: 1000genomesFingerprints.tar.gz [123 MB].
    • Download normalized fingerprints (L=200) for TGP37: TGP37 [158 MB].
    • Download normalized fingerprints (L=200) for TGP37r: TGP37r [2 MB].
    • Download normalized fingerprints (L=200) for TGP38L: TGP38L [158 MB].
    • Download normalized fingerprints (L=200) for TGP38C: TGP38C [158 MB].
    • Download normalized fingerprints (L=200) for TGP38H: TGP38H [158 MB].
    • Download normalized fingerprints (L=200) for TGP38N: TGP38N [158 MB].
    • Download normalized fingerprints (L=200) for TGP38X: TGP38X [161 MB].
    • Download normalized fingerprints (L=200) for TGP38S: TGP38S [161 MB].
    • Download normalized fingerprints (L=200) for TGP38Xr: TGP38Xr [10 MB].
    • Download normalized fingerprints (L=200) for TGP38Sr: TGP38Sr [10 MB].
    • Download normalized fingerprints (L=200) for TGP38Nr: TGP38Nr [44 MB].
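
Once an archive has been downloaded, its contents can be inspected before unpacking. The helper below is a small sketch using Python's standard tarfile module; the name list_archive is hypothetical, and nothing is assumed about how the files inside the archive are laid out or formatted.

```python
import tarfile


def list_archive(path, limit=10):
    """Return up to `limit` (name, size in bytes) pairs for files in a tar archive."""
    entries = []
    # "r:*" lets tarfile detect the compression (gzip, bz2, xz, or none).
    with tarfile.open(path, "r:*") as archive:
        for member in archive:
            if member.isfile():
                entries.append((member.name, member.size))
                if len(entries) >= limit:
                    break
    return entries
```

For example, calling list_archive("1000genomesFingerprints.tar.gz") from the download directory would return the first ten file names and sizes in that archive.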
Communication

  • Questions, comments, bug reports, and suggestions for improvements or additional data sets are most welcome!
  • Would you like to receive notification of updates to the Genome Fingerprints method? Follow via Twitter, or send me a note.
  • If you find Genome Fingerprints useful for your work, please cite:
    Glusman G, Mauldin DE, Hood L and Robinson M. Ultrafast comparison of personal genomes via precomputed genome fingerprints. Front. Genet. 2017 8:136.