Data fingerprints' homepage

Data Fingerprints	Data fingerprints enable fast and simple comparison of semi-structured data. This project is related to (but distinct from) the Genome Fingerprints project.
Summary	We present a locality-sensitive hashing strategy for summarizing semi-structured data (e.g., in JSON or XML formats) into ‘data fingerprints’: highly compressed representations which cannot recreate details in the data, yet simplify and greatly accelerate the comparison and clustering of semi-structured data by preserving similarity relationships. Computation on data fingerprints is fast: in one example involving complex simulated medical records, the average time to encode one record was 0.53 seconds, and the average pairwise comparison time was 3.75 microseconds. Both processes are trivially parallelizable. Applications include detection of duplicates, clustering and classification of semi-structured data, which support larger goals including summarizing large and complex data sets, quality assessment, and data mining.
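The general idea of a similarity-preserving fingerprint can be sketched in a few lines of Python. This is only an illustration, not the project's actual algorithm (see the GitHub code and the cited preprint for that). It assumes a simple signed feature-hashing scheme, SHA-256 as the hash function, and cosine similarity for comparison. The function names `flatten`, `fingerprint`, and `similarity`, and the toy records, are hypothetical.

```python
import hashlib
import json
import math


def flatten(node, path=""):
    """Yield "path=value" tokens for every leaf of a parsed JSON document."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from flatten(child, f"{path}/{key}")
    elif isinstance(node, list):
        # List positions are ignored, so reordering a list leaves tokens unchanged.
        for child in node:
            yield from flatten(child, path)
    else:
        yield f"{path}={node!r}"


def fingerprint(document, length=200):
    """Hash each token into one of `length` bins, adding a pseudo-random sign."""
    vector = [0.0] * length
    for token in flatten(document):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % length
        sign = 1.0 if digest[8] % 2 == 0 else -1.0
        vector[index] += sign
    return vector


def similarity(a, b):
    """Cosine similarity of two fingerprints (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy records (made up for illustration): A and B differ only in list order.
record_a = json.loads('{"name": "Ann", "conditions": ["asthma", "flu"], "age": 41}')
record_b = json.loads('{"name": "Ann", "conditions": ["flu", "asthma"], "age": 41}')
record_c = json.loads('{"name": "Bob", "conditions": ["fracture"], "age": 67}')

print("A vs B:", similarity(fingerprint(record_a), fingerprint(record_b)))
print("A vs C:", similarity(fingerprint(record_a), fingerprint(record_c)))
```

In this sketch the fingerprint has a fixed length regardless of record size, so comparing two records costs the same small number of arithmetic operations however complex the originals are, and the original values cannot be read back from the vector. Each record is encoded independently and each pair is compared independently, which is why both steps parallelize easily.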
Using it locally	Download the code and data sets: Get the full code set from GitHub. Data fingerprints (L=200) for the SyntheticMass sample data set (FHIR DSTU3): SyntheticMass.200.gz.
Communication	Questions, comments, bug reports, and suggestions for improvements or additional data sets are most welcome! If you find Data Fingerprints useful for your work, please cite: Robinson M, Hadlock J, Yu J, Khatamian A, Aravkin AY, Deutsch EW, Price ND, Huang S, Glusman G. (2018) Fast and simple comparison of semi-structured data, with emphasis on electronic health records. Preprint: BIORXIV/2018/293183