Skew Metrics: Novel metrics for quantifying bacterial genome composition skews
Importance

  • Bacterial genomes display significant compositional biases, both in terms of G+C content and in skews (strand asymmetry in ‘T’ vs. ‘A’ and ‘G’ vs. ‘C’ usage). These biases arise from the complex interplay of differential mutation rates and multiple selective constraints, particularly for energy efficiency.
  • Extreme examples of compositional biases are found among species in the family Borreliaceae, which comprises a variety of tick-borne spirochetes and includes species causing Lyme disease (genus Borreliella, originally Borrelia) as well as those causing relapsing fever (genus Borrelia).
  • The greatly expanded availability of complete bacterial genome sequences now makes large-scale comparative genomics studies possible. A much larger number of bacterial genomes have been drafted, assembled to different levels of contiguity (contigs, scaffolds) and tentatively annotated using automated pipelines. Most existing methods for analyzing compositional biases and skews rely on fully or mostly contiguous genomic sequence and on detailed gene annotation; such methods are much less applicable to drafted, incomplete genomes.
  • We present here three novel metrics for quantitative analysis of genome skews. Our metrics are robust to assembly status and work well on incomplete genomes with draft annotation. Using these metrics, we analyzed a large collection of bacterial genomes, both complete and drafted. We identified several groups of species and genera that present as outliers for one or more of the novel metrics. Like B. burgdorferi, these outlier species are frequently pathogenic and tend to have unusual lifestyles.
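
For readers unfamiliar with the term, the "skews" discussed above can be illustrated on a single sequence. The sketch below is a minimal example and not the code distributed with this project. It assumes the commonly used normalized-difference definitions, (T−A)/(T+A) and (G−C)/(G+C), which this page does not spell out. The function names `skews` and `reverse_complement` are hypothetical.

```python
from collections import Counter


def skews(seq):
    """Return (TA skew, GC skew) for one strand of DNA.

    Assumed definitions (not stated on this page):
    TA skew = (T - A) / (T + A), GC skew = (G - C) / (G + C).
    Characters other than A, C, G, T are ignored.
    """
    counts = Counter(seq.upper())
    a, c, g, t = (counts[base] for base in "ACGT")
    ta = (t - a) / (t + a) if (t + a) else 0.0
    gc = (g - c) / (g + c) if (g + c) else 0.0
    return ta, gc


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


# Toy sequence, for illustration only.
forward = "GGGCTTTA"
print(skews(forward))
print(skews(reverse_complement(forward)))
```

Under these definitions, both skews change sign when the same region is read from the opposite strand, which is why they are described as strand asymmetries.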
Explore

  • The three metrics are the cross-skew, the dot-skew and the residual skew.
    The first two metrics (cross-skew and dot-skew) are computed from the characteristics of a single bacterial genome; they quantify the strength of, and the relationship between, the mutation and selection pressures on genes on the leading vs. lagging strands.
    The third metric (residual skew) capitalizes on the current availability of thousands of complete or drafted bacterial genomes to empirically assess how unusual a genome’s skews are relative to the expected values as learned from other genomes.

  • We computed the new metrics on a collection of 7738 bacterial genomes. You can explore the results here, download the results (below), and obtain code for computing the metrics on your own data.

  • The plot to the right is interactive. Each point represents one bacterial species. You can mouse over points to get more information in a tooltip, and use the selector under the graph to choose a genus to highlight, or click on a point to select all species in the same genus. You can also explore the three metrics in more detail. (opens in a separate tab/page)

  • You can also produce skew plots for individual bacterial species.

  • Explore the space of skew vector angles.

  • These interactive plots are powered by the fantastic Vega-Lite tool.
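
This page does not give formulas for the cross-skew and dot-skew. Going only by their names and by the mention of skew vector angles, one plausible reading is that each strand's genes are summarized as a two-dimensional skew vector (TA skew, GC skew), and that the two metrics are the cross and dot products of the leading-strand and lagging-strand vectors. The sketch below illustrates that reading only. It is an assumption, not the published definition, and the function names and input values are hypothetical.

```python
import math


def dot_and_cross(u, v):
    """Dot product and 2D (scalar) cross product of two skew vectors.

    Each vector is assumed to be (TA skew, GC skew). Treating these
    products as the "dot-skew" and "cross-skew" is an assumption made
    for illustration.
    """
    (ux, uy), (vx, vy) = u, v
    dot = ux * vx + uy * vy
    cross = ux * vy - uy * vx
    return dot, cross


def angle_between(u, v):
    """Signed angle, in degrees, from vector u to vector v."""
    dot, cross = dot_and_cross(u, v)
    return math.degrees(math.atan2(cross, dot))


# Made-up mean skew vectors, not measured from any genome.
leading = (0.10, 0.20)
lagging = (-0.10, -0.20)
print(dot_and_cross(leading, lagging))
print(angle_between(leading, lagging))
```

With these made-up inputs the two vectors point in opposite directions, so the dot product is negative and the cross product is zero. Vectors that are not collinear would give a nonzero cross product.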
Using it locally
Communication

  • Questions, comments, bug reports, and suggestions for improvements or additional data sets are most welcome!
  • Would you like to be notified of updates to the skew metrics method? Follow via Twitter, or send me a note.
  • If you find these skew metrics useful for your work, please cite:
    Joesch-Cohen LM, Robinson M, Jabbari N, Lausted C and Glusman G. Novel metrics for quantifying bacterial genome skews. 2017. BIORXIV/2017/176370