Kaviar -- Known issues

Known issues: Kaviar public release build 160204-Public

Issues addressed since previous release
---------------------------------------

o  The data source UK10K included duplicate entries for many variants.
   This caused inflated values for AC and AF for many Kaviar variants seen
   in UK10K data. These duplicates have been filtered out.

o  Inova CGI indels no longer have an extra reference base prepended.

o  The number of entries where the sum of MAF values is greater than 1
   is greatly reduced, to 211 entries.


Deficiencies which may be addressed in future releases:
-------------------------------------------------------

o  ISB genomes will be normalized in the next release.

o  The genome of David Ewing Duncan, previously in Kaviar, was omitted.
   It will re-appear in the next release.

o  86 ISB genomes that appeared in the Jan. 2015 build are omitted
   for technical reasons. They may re-appear in a future release.

o  Although data sources have been normalized according to
   http://genome.sph.umich.edu/wiki/Variant_Normalization,
   this normalization is site-oriented rather than allele-oriented.
   All alleles at a multi-allelic site in a particular data source
   are normalized together. For example, from hypothetical SOURCE1:
     chr1  10001   AA    ACC,G
   This multi-allelic site cannot be further normalized using the
   above technique. In Kaviar, it appears as two separate variants:
     chr1  10001   AA    ACC       SOURCE1
     chr1  10001   AA    G         SOURCE1
   The first appears unnormalized.
   This can also lead to equivalent variants being listed separately
   in Kaviar. For example:
     chr1  10001   AA  ACC         SOURCE1
     chr1  10002   A   CC          SOURCE2

o  VCF files may contain multiple lines describing the same variant
   due to site-oriented variant normalization. Sum the allele frequencies
   to obtain the correct allele frequency for the variant.

o  When an indel and a SNV occur at the same position, the VCF
   representation of the SNV appears as a dinucleotide substitution
   at the preceding position. (This is because the VCF specification
   requires that indels be prepended with a reference base, and in
   these cases our VCF creation process prepends that reference base
   to all variant alleles at that position.)

o  For the following data sources, only the first two variant alleles
   reported in the source at each position are included in Kaviar:
     Wellderly
     SSIP
     Malay

o  Allele frequency computation assumes complete coverage for all genomes.
   Therefore, for non-reference alleles, Kaviar's allele frequency
   is always a lower bound. An upcoming release will include position-
   specific AN values and therefore more accurate allele frequencies.
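
The lower-bound claim above follows from the denominator: if AF is computed
as AC / AN with AN fixed at the total number of chromosomes, any genome that
lacks coverage at a position inflates AN and so deflates AF. A minimal sketch,
assuming diploid autosomal sites (two chromosomes per genome); the function
names and the numbers are hypothetical and are not taken from Kaviar's code:

```python
def af_full_coverage(ac, n_genomes, ploidy=2):
    """AF when every genome is assumed to be covered at the position."""
    return ac / (ploidy * n_genomes)


def af_position_specific(ac, n_covered_genomes, ploidy=2):
    """AF when only genomes actually covered at the position are counted."""
    return ac / (ploidy * n_covered_genomes)


# Hypothetical position: 3 observations among 100 genomes, 60 of them covered.
assumed = af_full_coverage(3, 100)       # 3 / 200
adjusted = af_position_specific(3, 60)   # 3 / 120
assert assumed <= adjusted
```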

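The equivalent-variant and duplicate-line issues above can be worked around
downstream by trimming each allele pair on its own and then summing AF over
records that collapse to the same key. A minimal sketch, not Kaviar's pipeline:
trim_pair and merge_equivalent are hypothetical names, the record tuples are an
assumed in-memory form, and the AF values are made up. It only trims shared
flanking bases; the procedure at the Variant_Normalization URL above also
left-aligns against the reference sequence, which this sketch omits, so
equivalent variants inside repeats will not be merged.

```python
from collections import defaultdict


def trim_pair(pos, ref, alt):
    """Trim bases shared by ref and alt, keeping at least one base in each."""
    # Trim shared trailing bases first; this does not move the position.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Then trim shared leading bases, advancing the position each time.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def merge_equivalent(records):
    """Sum AF over (chrom, pos, ref, alt, af) records that trim to one key."""
    totals = defaultdict(float)
    for chrom, pos, ref, alt, af in records:
        totals[(chrom, *trim_pair(pos, ref, alt))] += af
    return dict(totals)


# The two equivalent records from the example above, with made-up AF values.
records = [
    ("chr1", 10001, "AA", "ACC", 0.01),
    ("chr1", 10002, "A", "CC", 0.02),
]
merged = merge_equivalent(records)
# Both records collapse to the single key ("chr1", 10002, "A", "CC").
```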

Other issues to be aware of:
----------------------------

o  Data sources are of varying depth and quality. Caveat emptor. Assume that
   the set of variants with AC=1 (over 50% of the variants in Kaviar) has a
   high error rate.

o  Allele frequency computation for indels and complex variants is quite
   imperfect and should not be relied upon. In particular,
   multi-allelic positions with indels and complex variants sometimes
   have a sum of AF values greater than 1.

o  Support for hg18 has been discontinued.

o  When computing allele frequencies, dbSNP provides one count if the allele
   is seen in no other data source. Otherwise, it provides no count.

o  Some variants are lost during liftover, specifically those that remap to
   a different chromosome. Most Kaviar data sources were provided for hg19,
   so Kaviar is most complete for hg19.

o  Although presumably all 1000 Genomes variants were contributed to dbSNP,
   0.26% of the 1000 Genomes SNVs in Kaviar lack an rsid for dbSNP 146.
   Reason unknown.
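
The dbSNP counting rule above can be stated as a small function. This is a
sketch of the rule as described, with a hypothetical function name and a plain
dict of per-source counts standing in for whatever structure Kaviar uses
internally:

```python
def total_allele_count(source_counts, in_dbsnp):
    """Total AC for one allele under the dbSNP rule described above.

    source_counts maps non-dbSNP source names to observed counts.
    """
    observed = sum(source_counts.values())
    if observed == 0 and in_dbsnp:
        # Seen only in dbSNP: dbSNP contributes a single count.
        return 1
    # Seen elsewhere (or nowhere): dbSNP contributes nothing.
    return observed
```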

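Positions where the AF values sum to more than 1, as noted above for indels
and complex variants, can be flagged before downstream use. A minimal sketch
over an assumed list of (chrom, pos, af) tuples, one per variant allele; the
function name and the small tolerance for floating-point rounding are choices
made for this example:

```python
from collections import defaultdict


def positions_with_af_over_one(alleles, tol=1e-9):
    """Return (chrom, pos) keys whose summed AF exceeds 1."""
    sums = defaultdict(float)
    for chrom, pos, af in alleles:
        sums[(chrom, pos)] += af
    return sorted(key for key, total in sums.items() if total > 1 + tol)
```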

Issues addressed in release 160113-Public
-----------------------------------------

o  Variants are normalized according to
   http://genome.sph.umich.edu/wiki/Variant_Normalization
   for all data sources except ISB.
   This resolves the issue of some variants having excess context
   (e.g. ACGT -> ACGG at 1001 normalizes to T -> G at 1004).

o  Genome hgdp00521 from the Simons Foundation is no longer
   missing variants for the last ~4 million nt.

o  Previously some variants did not display a variant sequence in
   the web browser output. This has been corrected.

o  VCF files no longer have positions with no value for AF.

o  The first few variants for chrM in the VCF files no longer
   have a start position of 0. This used to cause bedtools to quit.

o  VCF files no longer include variant positions with AC=0 and MAF=0.0.

o  chrM variants in Illumina genomes are mapped to a slightly different
   reference genome for hg19. These have been remapped to the standard
   reference genome for Kaviar.


Issues addressed in release 150810-Public
-----------------------------------------

o  The default reference genome is now hg38.

o  For text output from the web interface, a new column is added for AF.

o  For table, text, and JSON output from the web interface, the AC value
   for each data source (when >1) is appended to the name of the data source,
   in parentheses. Adjust parsers to accommodate.

o  Data sources and allele frequencies are no longer displayed for the
   reference allele. They will be re-introduced when we roll out enhanced
   allele frequency calculations.

o  Data sources for alleles are no longer separated into HOM and HET.

o  Indels and substitutions are included from several additional data sources,
   most notably ISB founders. See Kaviar Ranged Sources for the complete list.

o  VCF files contain more metadata.

o  In VCF files, variants that are seen only in dbSNP do not have a DS field
   in the INFO column.
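
For parsers affected by the per-source AC suffix noted above, a tolerant
pattern is one option. This sketch assumes the count is a trailing run of
digits in parentheses, optionally preceded by whitespace; the exact spacing in
Kaviar output is not specified here, and the function name is hypothetical:

```python
import re

_SOURCE_WITH_AC = re.compile(r"^(?P<name>.*?)\s*\((?P<ac>\d+)\)$")


def split_source_and_ac(field):
    """Split 'NAME(3)' or 'NAME (3)' into ('NAME', 3).

    Returns (field, None) when no count is appended, which per the note
    above means the count for that source is not greater than 1.
    """
    field = field.strip()
    match = _SOURCE_WITH_AC.match(field)
    if match is None:
        return field, None
    return match.group("name"), int(match.group("ac"))
```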