Skew metrics of bacterial genomes

Novel metrics for quantifying bacterial genome composition skews

Importance

Bacterial genomes display significant compositional biases, both in terms of G+C content and in skews (strand asymmetry in ‘T’ vs. ‘A’ and ‘G’ vs. ‘C’ usage). These biases arise from the complex interplay of differential mutation rates and multiple selective constraints, particularly for energy efficiency.
Extreme examples of compositional biases are found among species in the family Borreliaceae, which comprises a variety of tick-borne spirochetes and includes species causing Lyme disease (genus Borreliella, originally Borrelia) as well as those causing relapsing fever (genus Borrelia).
Thanks to the greatly expanded availability of complete bacterial genome sequences, large-scale comparative genomics studies are now possible. An even larger number of bacterial genomes exist only as drafts, assembled to varying levels of contiguity (contigs, scaffolds) and tentatively annotated by automated pipelines. Most existing methods for analyzing compositional biases and skews rely on fully or mostly contiguous genomic sequence and on detailed gene annotation; such methods are much less applicable to draft, incomplete genomes.
We present here three novel metrics for quantitative analysis of genome skews. Our metrics are robust to assembly status and work well on incomplete genomes with draft annotation. Using these metrics, we analyzed a large collection of bacterial genomes, both complete and draft. We identified several groups of species and genera that are outliers for one or more of the novel metrics. These outlier species are frequently pathogenic and tend to have unusual lifestyles, as is the case for the Lyme disease agent B. burgdorferi.
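
To make the notion of a skew concrete, here is a minimal Python sketch of the 'G' vs. 'C' and 'T' vs. 'A' asymmetries for a single sequence. It assumes the conventional normalization (X - Y) / (X + Y); the function names are illustrative and are not taken from skew_metrics.pl, which may differ in detail.

```python
from collections import Counter


def skew(seq: str, a: str, b: str) -> float:
    """Return (count(a) - count(b)) / (count(a) + count(b)).

    Assumed normalization; returns 0.0 if neither base occurs.
    """
    counts = Counter(seq.upper())
    total = counts[a] + counts[b]
    return (counts[a] - counts[b]) / total if total else 0.0


def gc_skew(seq: str) -> float:
    """'G' vs. 'C' asymmetry on the given strand."""
    return skew(seq, "G", "C")


def ta_skew(seq: str) -> float:
    """'T' vs. 'A' asymmetry on the given strand."""
    return skew(seq, "T", "A")


# Toy sequence (not real data): G=3, C=1, T=4, A=2
example = "GGGCTTTTAA"
print(gc_skew(example))  # (3 - 1) / (3 + 1) = 0.5
print(ta_skew(example))  # (4 - 2) / (4 + 2), about 0.333
```

Because the skew is a ratio of counts, it can be computed on any stretch of sequence, which is one reason such measures can be applied to contigs as well as to complete chromosomes.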

Explore

The three metrics are the cross-skew, the dot-skew and the residual skew.
The first two metrics (cross-skew and dot-skew) are computed based on the characteristics of a single bacterial genome; they quantify the strength and relationship between the mutation and selection pressures on genes on the leading vs. lagging strands.
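
The names of the first two metrics suggest vector products. The sketch below is a guess at that idea, not the published definition: it assumes each gene group (leading-strand and lagging-strand) is summarized as a two-dimensional vector (TA skew, GC skew), and that the cross-skew and dot-skew correspond to the cross and dot products of those two vectors. The function names and the input values are hypothetical; consult skew_metrics.pl and the paper for the authoritative computation.

```python
import math


def cross_and_dot(leading, lagging):
    """Cross product (z component) and dot product of two 2D vectors.

    Each argument is an assumed (TA skew, GC skew) pair for one gene group.
    """
    (x1, y1), (x2, y2) = leading, lagging
    cross = x1 * y2 - y1 * x2
    dot = x1 * x2 + y1 * y2
    return cross, dot


def angle_between(leading, lagging) -> float:
    """Signed angle in degrees from the leading to the lagging vector."""
    cross, dot = cross_and_dot(leading, lagging)
    return math.degrees(math.atan2(cross, dot))


# Made-up skew vectors for illustration only
leading_genes = (0.02, 0.05)
lagging_genes = (-0.01, 0.03)
cross, dot = cross_and_dot(leading_genes, lagging_genes)
print(cross, dot, angle_between(leading_genes, lagging_genes))
```

Under this reading, the dot product is large when the two vectors are long and point the same way, while the cross product is large when they are long and point in different directions.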
The third metric (residual skew) capitalizes on the current availability of thousands of complete or drafted bacterial genomes to empirically assess how unusual a genome’s skews are relative to the expected values as learned from other genomes.
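
The general idea of a residual (observed value minus the value expected from a model fitted to other genomes) can be sketched as follows. This is only an illustration: it uses an ordinary least-squares line in place of the lqs fit whose parameters are offered for download below, the predictor variable is a placeholder, and all numbers are made up.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def residual(x, y, slope, intercept):
    """Observed y minus the value expected from the fitted line."""
    return y - (slope * x + intercept)


# Hypothetical reference genomes: placeholder predictor vs. observed value
reference_x = [0.0, 1.0, 2.0, 3.0]
reference_y = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(reference_x, reference_y)

# A new genome whose observed value sits above the expected trend
print(residual(4.0, 10.0, slope, intercept))  # 10 - (2 * 4 + 1) = 1.0
```

A genome with a residual far from zero is one whose skews are unusual relative to what the reference collection would predict.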

We computed the new metrics on a collection of 7738 bacterial genomes. You can explore the results on this page, download them (see "Using it locally" below), and obtain the code for computing the metrics on your own data.

The plot to the right is interactive. Each point represents one bacterial species. Mouse over a point to see more information in a tooltip. To highlight a genus, use the selector under the graph, or click on a point to select all species in the same genus. You can also explore the three metrics in more detail (opens in a separate tab).

You can also produce skew plots for individual bacterial species.

Explore the space of skew vector angles.

These interactive plots are powered by the fantastic Vega-Lite tool.

Using it locally

Download the code and data sets:
- Get the full code set from GitHub.
- Download the stand-alone script: skew_metrics.pl [9603 bytes].
- Download the lqs parameters for computing the residual skew: lqs_parameters.txt [248 bytes].
- Download pre-computed skews for 7738 bacterial species: Excel table [1.5 MB] or in JSON format [5.7 MB].

Communication

Questions, comments, bug reports, and suggestions for improvements or additional data sets are most welcome!
If you find these skew metrics useful for your work, please cite:
Joesch-Cohen LM, Robinson M, Jabbari N, Lausted C, Glusman G. Novel metrics for quantifying bacterial genome composition skews. BMC Genomics. 2018;19:528.