--- title: "metabinR" author: - name: "Anestis Gkanogiannis" email: anestis@gkanogiannis.com package: metabinR output: BiocStyle::html_document: toc: true vignette: > %\VignetteIndexEntry{metabinR} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` # About metabinR `metabinR` performs abundance- and composition-based binning on metagenomic samples directly from FASTA or FASTQ files. Abundance-based binning (AB) analyzes long k-mers (k > 8); composition-based binning (CB) analyzes short k-mers (k < 8); hierarchical binning (ABxCB) chains the two. The heavy lifting is implemented in Java and called via [rJava](https://cran.r-project.org/package=rJava). From v2.0.0 the R API returns an S4 `MetabinResult` object and accepts Biostrings / ShortRead inputs in addition to file paths. # Installation ```{r, eval=FALSE} if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("metabinR") ``` A JDK (Java >= 17) must be available before installing `metabinR`. # Preparation ## JVM heap size JVM flags are passed through the `java.parameters` option. The `-Xmx` flag controls the maximum heap: `-Xmx1500M` or `-Xmx3G`, etc. Set this **before** loading the package. ```{r, eval=TRUE, message=FALSE} options(java.parameters = "-Xmx1500M") library(metabinR) library(data.table) library(dplyr) library(ggplot2) library(gridExtra) library(cvms) library(sabre) library(Biostrings) ``` To customise additional JVM flags (on top of the `-Xmx` heap setting), set `options(metabinR.jvm.flags = c("-XX:+UseG1GC", ...))` before the package is loaded. The defaults returned by `metabinR_jvm_options()` already enable G1GC and string deduplication. # The `MetabinResult` class Each binning function returns a `MetabinResult` S4 object: * `assignments(res)` — `DataFrame` of per-read cluster assignments. * `nClusters(res)` — number of clusters produced. * `parameters(res)` — list of parameters actually used. * `algorithm(res)` — `"AB"`, `"CB"`, or `"ABxCB"`. * `as.data.frame(res)` — the v1.x tabular layout for back-compat. # Abundance based binning example The toy simulated metagenome contains 26,664 Illumina reads (13,332 pairs of 2x150bp) sampled from 10 bacterial genomes across two abundance classes (high vs. low). Abundance ground truth: ```{r, eval=TRUE, message=FALSE, warning=FALSE} abundances <- read.table( system.file("extdata", "distribution_0.txt", package = "metabinR"), col.names = c("genome_id", "abundance", "AB_id")) ``` Read-level ground truth: ```{r, eval=TRUE, message=FALSE, warning=FALSE} reads.mapping <- fread( system.file("extdata", "reads_mapping.tsv.gz", package = "metabinR")) %>% merge(abundances[, c("genome_id", "AB_id")], by = "genome_id") %>% arrange(anonymous_read_id) ``` Run abundance-based binning with 10-mers into 2 clusters: ```{r, eval=TRUE, message=FALSE, warning=FALSE} res.AB <- abundance_based_binning( system.file("extdata", "reads.metagenome.fasta.gz", package = "metabinR"), numOfClustersAB = 2, kMerSizeAB = 10, dryRun = FALSE, outputAB = "vignette" ) res.AB ``` `res.AB` is a `MetabinResult`. Extract the per-read table and pull the cluster labels out of it: ```{r, eval=TRUE, message=FALSE, warning=FALSE} assignments.AB <- as.data.frame(res.AB) %>% arrange(read_id) ``` `abundance_based_binning()` wrote one FASTA per cluster and a k-mer count histogram: ```{r, eval=TRUE, message=FALSE, warning=FALSE} histogram.AB <- read.table("vignette__AB.histogram.tsv", header = TRUE) ggplot(histogram.AB, aes(x = counts, y = frequency)) + geom_area() + labs(title = "kmer counts histogram") + theme_bw() ``` Evaluate against the abundance-class ground truth: ```{r, eval=TRUE, message=FALSE, warning=FALSE} eval.AB.cvms <- cvms::evaluate( data = data.frame( prediction = as.character(assignments.AB$AB), target = as.character(reads.mapping$AB_id), stringsAsFactors = FALSE), target_col = "target", prediction_cols = "prediction", type = "binomial" ) eval.AB.sabre <- sabre::vmeasure( as.character(assignments.AB$AB), as.character(reads.mapping$AB_id)) p <- cvms::plot_confusion_matrix(eval.AB.cvms) + labs(title = "Confusion Matrix", x = "Target Abundance Class", y = "Predicted Abundance Class") tab <- as.data.frame( c( Accuracy = round(eval.AB.cvms$Accuracy, 4), Specificity = round(eval.AB.cvms$Specificity, 4), Sensitivity = round(eval.AB.cvms$Sensitivity, 4), Fscore = round(eval.AB.cvms$F1, 4), Kappa = round(eval.AB.cvms$Kappa, 4), Vmeasure = round(eval.AB.sabre$v_measure, 4) ) ) grid.arrange(p, ncol = 1) knitr::kable(tab, caption = "AB binning evaluation", col.names = NULL) ``` # Composition based binning example Read-level ground truth (bacterial genome of origin): ```{r, eval=TRUE, message=FALSE, warning=FALSE} reads.mapping <- fread( system.file("extdata", "reads_mapping.tsv.gz", package = "metabinR")) %>% arrange(anonymous_read_id) ``` Run composition-based binning with 4-mers into 10 clusters: ```{r, eval=TRUE, message=FALSE, warning=FALSE} res.CB <- composition_based_binning( system.file("extdata", "reads.metagenome.fasta.gz", package = "metabinR"), numOfClustersCB = 10, kMerSizeCB = 4, dryRun = TRUE, outputCB = "vignette" ) assignments.CB <- as.data.frame(res.CB) %>% arrange(read_id) ``` As a pure clustering task, evaluate with extrinsic measures: ```{r, eval=TRUE, message=FALSE, warning=FALSE} eval.CB.sabre <- sabre::vmeasure( as.character(assignments.CB$CB), as.character(reads.mapping$genome_id)) tab <- as.data.frame( c( Vmeasure = round(eval.CB.sabre$v_measure, 4), Homogeneity = round(eval.CB.sabre$homogeneity, 4), Completeness = round(eval.CB.sabre$completeness, 4) ) ) knitr::kable(tab, caption = "CB binning evaluation", col.names = NULL) ``` # Hierarchical (2-step ABxCB) binning example ```{r, eval=TRUE, message=FALSE, warning=FALSE} res.ABxCB <- hierarchical_binning( system.file("extdata", "reads.metagenome.fasta.gz", package = "metabinR"), numOfClustersAB = 2, kMerSizeAB = 10, kMerSizeCB = 4, dryRun = TRUE, outputC = "vignette" ) assignments.ABxCB <- as.data.frame(res.ABxCB) %>% arrange(read_id) eval.ABxCB.sabre <- sabre::vmeasure( as.character(assignments.ABxCB$ABxCB), as.character(reads.mapping$genome_id)) tab <- as.data.frame( c( Vmeasure = round(eval.ABxCB.sabre$v_measure, 4), Homogeneity = round(eval.ABxCB.sabre$homogeneity, 4), Completeness = round(eval.ABxCB.sabre$completeness, 4) ) ) knitr::kable(tab, caption = "ABxCB binning evaluation", col.names = NULL) ``` # In-memory inputs (Biostrings / ShortRead) Instead of a path, you can pass a `DNAStringSet`, `QualityScaledDNAStringSet`, or `ShortReadQ`. The Java backend reads from disk, so non-file inputs are staged to a tempfile transparently. ```{r, eval=TRUE, message=FALSE, warning=FALSE} reads <- Biostrings::readDNAStringSet( system.file("extdata", "reads.metagenome.fasta.gz", package = "metabinR")) res.AB.mem <- abundance_based_binning( reads, numOfClustersAB = 2, kMerSizeAB = 10, dryRun = TRUE ) identical(nrow(assignments(res.AB.mem)), nrow(assignments(res.AB))) ``` Clean up files written by the AB run: ```{r, eval=TRUE, message=FALSE, warning=FALSE} unlink("vignette__*") ``` # Session Info ```{r setup} utils::sessionInfo() ```