- We present a locality-sensitive hashing strategy for summarizing semi-structured data (e.g., in JSON or XML formats) into ‘data fingerprints’: highly compressed representations that cannot be used to recreate details of the data, yet preserve similarity relationships and thereby simplify and greatly accelerate the comparison and clustering of semi-structured data.
- Computation on data fingerprints is fast: in one example involving complex simulated medical records, the average time to encode one record was 0.53 seconds, and the average pairwise comparison time was 3.75 microseconds. Both processes are trivially parallelizable.
- Applications include duplicate detection, clustering, and classification of semi-structured data, which in turn support larger goals such as summarizing large and complex data sets, quality assessment, and data mining.
- Questions, comments, bug reports, and suggestions for improvements or additional data sets are most welcome!
- Would you like to receive notification of updates to the Data Fingerprints method? Follow via Twitter, or send me a note.
- If you find Data Fingerprints useful for your work, please cite:
Robinson M, Hadlock J, Yu J, Khatamian A, Aravkin AY, Deutsch EW, Price ND, Huang S and Glusman G. (2018) Fast and simple comparison of semi-structured data, with emphasis on electronic health records. Preprint: BIORXIV/2018/293183
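
To make the fingerprinting idea described above more concrete, here is a minimal sketch in Python. It is not the Data Fingerprints algorithm from the preprint: it swaps in plain signed feature hashing (the "hashing trick") as a stand-in, and the names `fingerprint`, `similarity`, and `FINGERPRINT_LENGTH`, as well as the path-based tokenization, are assumptions made for illustration only. It shows the general pattern: a nested record is flattened into tokens, each token is hashed into a short fixed-length vector, and records are then compared through their vectors alone.

```python
import hashlib
import json
import math

# Hypothetical fingerprint length, chosen only for this illustration.
FINGERPRINT_LENGTH = 64


def _bucket_and_sign(token, length):
    """Map a token to a vector position and a +1/-1 sign using SHA-256."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % length
    sign = 1.0 if digest[8] & 1 else -1.0
    return index, sign


def _walk(value, path, tokens):
    """Flatten a nested JSON-like value into 'path=value' tokens."""
    if isinstance(value, dict):
        for key, child in value.items():
            _walk(child, f"{path}/{key}", tokens)
    elif isinstance(value, list):
        for child in value:
            _walk(child, f"{path}[]", tokens)
    else:
        tokens.append(f"{path}={json.dumps(value)}")


def fingerprint(record, length=FINGERPRINT_LENGTH):
    """Summarize a record as a fixed-length, unit-normalized vector.

    Illustrative stand-in (signed feature hashing), not the published method.
    """
    tokens = []
    _walk(record, "", tokens)
    vector = [0.0] * length
    for token in tokens:
        index, sign = _bucket_and_sign(token, length)
        vector[index] += sign
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def similarity(fp_a, fp_b):
    """Dot product of two fingerprints (cosine similarity for unit vectors)."""
    return sum(x * y for x, y in zip(fp_a, fp_b))


# Made-up records, for demonstration only.
record_a = {"name": "Ann", "age": 40, "meds": ["aspirin", "statin"]}
record_b = {"name": "Ann", "age": 41, "meds": ["aspirin", "statin"]}
print(similarity(fingerprint(record_a), fingerprint(record_b)))
```

Because each record is reduced to a short numeric vector, a pairwise comparison costs a single dot product regardless of how large or deeply nested the original record was, and the original values cannot be read back out of the vector. Encoding and comparison are also independent per record and per pair, which is why both steps parallelize easily.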