Known issues: Kaviar public release build 160204-Public

Issues addressed since previous release
---------------------------------------

  o The data source UK10K included duplicate entries for many variants.
    This caused inflated values for AC and AF for many Kaviar variants
    seen in UK10K data. These duplicates have been filtered out.

  o Inova CGI indels no longer have an extra reference base prepended.

  o The number of entries where the sum of MAF values is greater than 1
    is greatly reduced, to 211 entries.

Deficiencies which may be addressed in future releases:
-------------------------------------------------------

  o ISB genomes will be normalized in the next release.

  o The genome of David Ewing Duncan, previously in Kaviar, was omitted.
    It will re-appear in the next release.

  o 86 ISB genomes that appeared in the Jan. 2015 build are omitted for
    technical reasons. They may re-appear in a future release.

  o Although data sources have been normalized according to
    http://genome.sph.umich.edu/wiki/Variant_Normalization, this
    normalization is site-oriented rather than allele-oriented. All
    alleles at a multi-allelic site in a particular data source are
    normalized together. For example, from hypothetical SOURCE1:

      chr1 10001 AA ACC,G

    This multi-allelic site cannot be further normalized using the
    above technique. In Kaviar, it appears as two separate variants:

      chr1 10001 AA ACC SOURCE1
      chr1 10001 AA G   SOURCE1

    The first appears unnormalized. Further, this can lead to
    equivalent variants being listed separately in Kaviar. For example:

      chr1 10001 AA ACC SOURCE1
      chr1 10002 A  CC  SOURCE2

  o VCF files may contain multiple lines describing the same variant
    due to site-oriented variant normalization. Sum the allele
    frequencies on those lines to obtain the correct allele frequency
    for the variant.

  o When an indel and a SNV occur at the same position, the VCF
    representation of the SNV appears as a dinucleotide substitution at
    the preceding position. (This is because the VCF specification
    requires that indels be prepended with a reference base, and in
    these cases our VCF creation process prepends that reference base
    to all variant alleles at that position.)

  o For the following data sources, only the first two variant alleles
    reported in the source at each position are included in Kaviar:

      Wellderly
      SSIP
      Malay

  o Allele frequency computation assumes complete coverage for all
    genomes. Therefore, for non-reference alleles, Kaviar's allele
    frequency is always a lower bound. An upcoming release will include
    position-specific AN values and therefore more accurate allele
    frequencies.

Other issues to be aware of:
----------------------------

  o Data sources are of varying depth and quality. Caveat emptor.
    Assume that the set of variants with AC=1 (over 50% of the variants
    in Kaviar) has a high error rate.

  o Allele frequency computation for indels and complex variants is
    quite imperfect and should not be relied upon. In particular,
    multi-allelic positions with indels and complex variants sometimes
    have a sum of AF values greater than 1.

  o Support for hg18 has been discontinued.

  o When computing allele frequencies, dbSNP contributes one count if
    the allele is seen in no other data source. Otherwise, it
    contributes no count.

  o Some variants are lost during liftover, specifically those that
    remap to a different chromosome. Most Kaviar data sources were
    provided for hg19, so Kaviar is most complete for hg19.

  o Although presumably all 1000 Genomes variants were contributed to
    dbSNP, 0.26% of the 1000 Genomes SNVs in Kaviar lack an rsid for
    dbSNP 146. The reason is unknown.

Issues addressed in release 160113-Public
-----------------------------------------

  o Variants are normalized according to
    http://genome.sph.umich.edu/wiki/Variant_Normalization for all data
    sources except ISB. This resolves the issue of some variants having
    excess context (e.g. ACGT -> ACGG at 1001 normalizes to T -> G at
    1004).

  o Genome hgdp00521 from the Simons Foundation is no longer missing
    variants for the last ~4m nt.

  o Previously some variants did not display a variant sequence in the
    web browser output. This has been corrected.

  o VCF files no longer have positions with no value for AF.

  o The first few variants for chrM in the VCF files no longer have a
    start position of 0. This used to cause bedtools to quit.

  o VCF files no longer include variant positions with AC=0 and
    maf=0.0.

  o chrM variants in Illumina genomes are mapped to a slightly
    different reference genome for hg19. These have been remapped to
    the standard reference genome for Kaviar.

Issues addressed in release 150810-Public
-----------------------------------------

  o Default reference genome is now hg38.

  o For text output from the web interface, a new column is added for
    AF.

  o For table, text, and json format output from the web interface, the
    AC value for each data source (when >1) is appended to the name of
    the data source, in parentheses. Adjust parsers to accommodate.

  o Data sources and allele frequencies are no longer displayed for the
    reference allele. They will be re-introduced when we roll out
    enhanced allele frequency calculations.

  o Data sources for alleles are no longer separated into HOM and HET.

  o Indels and substitutions are included from several additional data
    sources, most notably ISB founders. See Kaviar Ranged Sources for
    the complete list.

  o VCF files contain more metadata.

  o In VCF files, variants that are seen only in dbSNP do not have a DS
    field in the INFO column.
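
Example: merging equivalent variant lines
-----------------------------------------

The notes above on site-oriented normalization advise summing allele
frequencies when several VCF lines describe the same variant. The
following is a minimal Python sketch of one way a consumer might do
that. It is not part of Kaviar. The function names trim_alleles and
sum_af, the tuple layout, and the AF values are hypothetical. The
sketch only trims bases shared by REF and ALT; it does not left-align
indels against the reference sequence, which full normalization
requires.

```python
from collections import defaultdict


def trim_alleles(pos, ref, alt):
    """Trim bases shared by ref and alt, keeping at least one base in each.

    Hypothetical helper: allele-level trimming only, no left-alignment.
    """
    # Trim shared trailing bases first; the position is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, advancing the position each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def sum_af(records):
    """Sum AF over records that describe the same trimmed variant.

    records: iterable of (chrom, pos, ref, alt, af) tuples (assumed layout).
    Returns a dict keyed by (chrom, pos, ref, alt) after trimming.
    """
    totals = defaultdict(float)
    for chrom, pos, ref, alt, af in records:
        key = (chrom,) + trim_alleles(pos, ref, alt)
        totals[key] += af
    return dict(totals)


# The two equivalent lines from the example above, with made-up AF values.
records = [
    ("chr1", 10001, "AA", "ACC", 0.25),
    ("chr1", 10002, "A", "CC", 0.5),
]
merged = sum_af(records)
```

Here both input lines collapse onto the single key
("chr1", 10002, "A", "CC"), and trim_alleles(1001, "ACGT", "ACGG")
gives (1004, "T", "G"), matching the normalization example in the
160113-Public notes. Because Kaviar's AF values are lower bounds and
are unreliable for indels and complex variants, the summed value
should be treated with the same caution.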