| Title: | Single Cell Ecological Taxonomic Analysis |
|---|---|
| Description: | Tools for compositional and other sample-level ecological analyses and visualizations tailored for single-cell RNA-seq data. SETA includes functions for taxonomizing celltypes, normalizing data, performing statistical tests, and visualizing results. Several tutorials are included to guide users and introduce them to key concepts. SETA is meant to teach users about statistical concepts underlying ecological analysis methods so they can apply them to their own single-cell data. |
| Authors: | Kyle Kimler [aut, cre] (ORCID: <https://orcid.org/0000-0003-4735-9064>), Marc Elosua-Bayes [aut] |
| Maintainer: | Kyle Kimler <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 1.2.0 |
| Built: | 2026-05-30 11:06:44 UTC |
| Source: | https://github.com/bioc/SETA |
SETA provides tools for compositional analysis of single-cell RNA-seq data, applying ecological principles and compositional data analysis (CoDA) methods to understand cell-type abundance patterns. The package offers a comprehensive workflow for extracting cell-type counts, applying various compositional transforms, performing latent space analysis, and visualizing results.
SETA treats cell-type abundance data as compositional data, similar to how ecologists analyze species abundance in environmental samples. This approach is appropriate because cell-type proportions sum to 1 (or 100 in one cell type affect all others.
The package implements several compositional transforms:
CLR (Centered Log-Ratio): Centers log-transformed data around the geometric mean
ALR (Additive Log-Ratio): Uses a reference cell type as denominator
ILR (Isometric Log-Ratio): Projects data onto orthonormal basis
Balance transforms: User-defined log-ratios between groups of cell types
SETA also provides multi-resolution analysis capabilities, allowing users to analyze data at different taxonomic levels (e.g., broad cell types vs. specific subtypes).
This package provides functions that return various data structures:
setaCounts(): Returns a sample-by-cell-type count matrix
setaTransform(): Returns a list with transformed counts and method information
setaLatent(): Returns a list with latent space coordinates, loadings, and variance explained
setaDistances(): Returns a data frame with pairwise distances between samples
setaTaxonomyDF(): Returns a data frame with hierarchical taxonomy information
taxonomy_to_tbl_graph(): Returns a tbl_graph object for visualization
Key functions include:
setaCounts: Extract cell-type count matrices from single-cell objects
setaTransform: Apply compositional transforms (CLR, ALR, ILR, balance)
setaLatent: Perform dimensionality reduction (PCA, PCoA, NMDS)
setaDistances: Calculate compositional distances between samples
setaTaxonomyDF: Create hierarchical taxonomies for multi-resolution analysis
taxonomy_to_tbl_graph: Convert taxonomies to graph objects for visualization
For detailed examples, see the package vignettes:
vignette("introductory_vignette", package = "SETA")
vignette("comparing_samples", package = "SETA")
vignette("reference_frames", package = "SETA")
The package is designed to be educational, teaching users about compositional data analysis principles while providing practical tools for single-cell research.
Kyle Kimler [email protected] (aut, cre)
Marc Elosua-Bayes (aut)
Aitchison, J. (1982). The Statistical Analysis of Compositional Data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2), 139-177.
Greenacre, M. (2018). Compositional Data Analysis in Practice. CRC Press.
Pawlowsky-Glahn, V., Egozcue, J. J., & Tolosana-Delgado, R. (2015). Modeling and Analysis of Compositional Data. John Wiley & Sons.
Useful links:
Report bugs at https://github.com/CellDiscoveryNetwork/SETA/issues
mockSeurat/mockSCE/mockLong are designed to generate synthetic single-cell
data. These data are not meant to represent biologically
meaningful use-cases, but are solely intended for use in examples, for
unit-testing, and to demonstrate SETA's general functionality.
Please don't use it in real life.
mockSeurat(ng = 200, nc = 50, nt = 3, ns = 4, nb = 2) mockLong(nc = 50, nt = 3, ns = 4, nb = 2, useBatch = TRUE) mockCount(df = mockLong()) mockSCE(nc = 500, nt = 3, ns = 4, nb = 2, useBatch = TRUE) makeTypeHierarchy(type_levels)mockSeurat(ng = 200, nc = 50, nt = 3, ns = 4, nb = 2) mockLong(nc = 50, nt = 3, ns = 4, nb = 2, useBatch = TRUE) mockCount(df = mockLong()) mockSCE(nc = 500, nt = 3, ns = 4, nb = 2, useBatch = TRUE) makeTypeHierarchy(type_levels)
ng, nc, nt, ns, nb
|
integer scalar specifying the number of genes, cells, types (groups), samples, and batches to simulate. |
useBatch |
logical scalar indicating whether to include a batch metadata column |
df |
data.frame in the format of 'mockLong()'. |
type_levels |
character vector of type levels representing cell types to be assigned to fine, mid and broad annotations |
mockSeurat returns a Seurat object
with rows = genes, columns = single cells, and cell metadata
column type containing group identifiers.
mockLong returns a data.frame object
with rows = cells, columns = cell metadata
column fine_type, mid_type, broad_type
containing group identifiers at different hierarchical levels.
mockCount returns a data.frame object
with rows = type x sample, columns = metadata
column bc containing the number of cells per type x sample.
mockSCE returns a SingleCellExperiment object
with rows = genes, columns = single cells, and cell metadata
column type containing group identifiers.
makeTypeHierarchy returns a list of as many elements as levels
in the hierarchy, with names corresponding to the type levels and
values containing the corresponding type identifiers at that level.
seu <- mockSeurat() sce <- mockSCE() hierarchy <- makeTypeHierarchy(c("Lineage", "Type", "State"))seu <- mockSeurat() sce <- mockSCE() hierarchy <- makeTypeHierarchy(c("Lineage", "Type", "State"))
* **character vector of leaf names** present in 'colnames(counts)' * **character vector of higher level labels** that appear in a column of 'taxonomyDF' ('taxonomy_col') * **numeric vector of column indices**
resolveGroup(spec, counts, taxonomyDF = NULL, taxonomy_col = NULL)resolveGroup(spec, counts, taxonomyDF = NULL, taxonomy_col = NULL)
spec |
A character or numeric vector specifying a group. See Details. |
counts |
Numeric matrix: samples |
taxonomyDF |
A |
taxonomy_col |
Character. Which column of |
If higher level labels are supplied, the function returns *all leaves* (finest level labels = 'rownames(taxonomyDF)') whose 'taxonomy_col' entry matches the requested label(s).
An integer vector of column indices inside 'counts'.
## Mini counts matrix mat <- matrix(1, 3, 4, dimnames = list(NULL, c("AT1", "AT2", "Fib1", "Fib2"))) ## Taxonomy table taxDF <- data.frame( broad_type = c("Epi","Epi","Stroma","Stroma"), row.names = c("AT1","AT2","Fib1","Fib2") ) ## Resolve by leaf names resolveGroup(c("AT1","AT2"), mat, taxDF, "broad_type") ## Resolve by higher level label resolveGroup("Stroma", mat, taxDF, "broad_type")## Mini counts matrix mat <- matrix(1, 3, 4, dimnames = list(NULL, c("AT1", "AT2", "Fib1", "Fib2"))) ## Taxonomy table taxDF <- data.frame( broad_type = c("Epi","Epi","Stroma","Stroma"), row.names = c("AT1","AT2","Fib1","Fib2") ) ## Resolve by leaf names resolveGroup(c("AT1","AT2"), mat, taxDF, "broad_type") ## Resolve by higher level label resolveGroup("Stroma", mat, taxDF, "broad_type")
Applies the ALR transform to an integer matrix of counts using a specified reference taxon. Samples are in rows and taxa in columns.
setaALR(counts, ref, pseudocount = 1)setaALR(counts, ref, pseudocount = 1)
counts |
A numeric matrix with rows as samples and columns as taxa. |
ref |
Either the reference taxon name (a character string, which must
appear in |
pseudocount |
Numeric. Added to every count to avoid |
For each sample, the transform computes
, where
is the pseudocount, for all taxa except the reference.
A list with:
A string indicating the ALR transform with the reference taxon.
A matrix with one row per sample and (n_taxa - 1) columns.
# Example with 2 samples and 2 taxa: mat <- matrix(c(1, 2, 4, 8), nrow = 2, byrow = TRUE) colnames(mat) <- c("TaxonA", "TaxonB") # Using TaxonA as the reference. out <- setaALR(mat, ref = "TaxonA", pseudocount = 0) out$counts# Example with 2 samples and 2 taxa: mat <- matrix(c(1, 2, 4, 8), nrow = 2, byrow = TRUE) colnames(mat) <- c("TaxonA", "TaxonB") # Using TaxonA as the reference. out <- setaALR(mat, ref = "TaxonA", pseudocount = 0) out$counts
'setaBalance()' computes *one or more* biologically meaningful balances (log-ratios) from a count matrix. Each balance is defined by two groups of taxa: **numerator** ('num') and **denominator** ('denom'). Groups may be given as leaf names, higher-level labels (resolved through a 'taxonomyDF'), or column indices. The resulting balance will be positive if weighted in the numerator direction, and negative toward the denominator.
setaBalance( counts, balances, taxonomyDF = NULL, taxonomy_col = NULL, normalize_to_parent = FALSE, pseudocount = 1 )setaBalance( counts, balances, taxonomyDF = NULL, taxonomy_col = NULL, normalize_to_parent = FALSE, pseudocount = 1 )
counts |
Numeric matrix with rows = samples and columns = taxa. |
balances |
A single balance (list with 'num', 'denom') **or** a named list of such lists for multiple balances. |
taxonomyDF |
Optional. A data frame from [setaTaxonomyDF()] used to expand higher-level labels into their descendant leaves. |
taxonomy_col |
Character. Column in 'taxonomyDF' whose values should match any higher-level labels given in 'balances'. |
normalize_to_parent |
Logical (default 'FALSE'). If 'TRUE', each sample
is re-closed to the sub-composition formed by the union of |
pseudocount |
Numeric. Value added to every count to avoid 'log(0)'. Default '1'. |
For every balance and every sample the function returns
where is the geometric mean of the (pseudocount-adjusted) counts in
the respective group.
A list with
"balance".
Matrix with dimensions samples balances. Column names are the
balance names (or "Balance1" if unnamed).
# Toy metadata & taxonomy table (from setaTaxonomyDF documentation) meta <- data.frame( bc = paste0("cell", 1:6), fine_type = c("AT1","AT2","AT1","Fib1","Fib1","AT2"), mid_type = c("Alv","Alv","Alv","Fib","Fib","Alv"), broad_type = c("Epi","Epi","Epi","Stroma","Stroma","Epi") ) taxDF <- setaTaxonomyDF(meta, resolution_cols = c("broad_type","mid_type","fine_type")) # Fake counts (2 samples x n_taxa leaves) set.seed(687) cnt <- matrix(rpois(2 * 3, 10), nrow = 2) colnames(cnt) <- rownames(taxDF) # (a) One balance: Epi vs Stroma (broad_type level) bal1 <- list(num = "Epi", denom = "Stroma") out1 <- setaBalance(cnt, bal1, taxonomyDF = taxDF, taxonomy_col = "broad_type") out1$counts # (b) Two balances in one call bals <- list( epi_vs_stroma = list(num = "Epi", denom = "Stroma"), AT1_vs_AT2 = list(num = "AT1", denom = "AT2") ) out2 <- setaBalance(cnt, bals, taxonomyDF = taxDF, taxonomy_col = "fine_type") out2$counts# Toy metadata & taxonomy table (from setaTaxonomyDF documentation) meta <- data.frame( bc = paste0("cell", 1:6), fine_type = c("AT1","AT2","AT1","Fib1","Fib1","AT2"), mid_type = c("Alv","Alv","Alv","Fib","Fib","Alv"), broad_type = c("Epi","Epi","Epi","Stroma","Stroma","Epi") ) taxDF <- setaTaxonomyDF(meta, resolution_cols = c("broad_type","mid_type","fine_type")) # Fake counts (2 samples x n_taxa leaves) set.seed(687) cnt <- matrix(rpois(2 * 3, 10), nrow = 2) colnames(cnt) <- rownames(taxDF) # (a) One balance: Epi vs Stroma (broad_type level) bal1 <- list(num = "Epi", denom = "Stroma") out1 <- setaBalance(cnt, bal1, taxonomyDF = taxDF, taxonomy_col = "broad_type") out1$counts # (b) Two balances in one call bals <- list( epi_vs_stroma = list(num = "Epi", denom = "Stroma"), AT1_vs_AT2 = list(num = "AT1", denom = "AT2") ) out2 <- setaBalance(cnt, bals, taxonomyDF = taxDF, taxonomy_col = "fine_type") out2$counts
,
where is the geometric mean of the row.Centered Log-Ratio (CLR) Transform
Applies a CLR transform to a matrix of counts.
Samples should be in rows and taxa (cell types) in columns.
For each sample, the transform computes
,
where is the geometric mean of the row.
setaCLR(counts, pseudocount = 1)setaCLR(counts, pseudocount = 1)
counts |
An integer matrix of cell-type counts with samples in rows. |
pseudocount |
Numeric. Added to all entries to avoid |
The CLR transform is defined sample-wise as
where
for sample , and is the number of taxa. Here is the pseudocount.
A list with:
A string indicating the transform ("CLR").
A matrix of the same dimensions as the input after the CLR transform.
Aitchison, J. (1982). The Statistical Analysis of Compositional Data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2), 139–177.
mat <- matrix(c(1, 2, 4, 8), nrow = 2, byrow = TRUE) colnames(mat) <- c("Taxon1", "Taxon2") out <- setaCLR(mat, pseudocount = 0) out$countsmat <- matrix(c(1, 2, 4, 8), nrow = 2, byrow = TRUE) colnames(mat) <- c("Taxon1", "Taxon2") out <- setaCLR(mat, pseudocount = 0) out$counts
Given a long-form data.frame,
creates a type-by-sample matrix of cell counts.
Users can specify the column names for cell types, samples, and barcodes.
setaCounts(obj, cell_type_col = "type", sample_col = "sample", bc_col = "bc")setaCounts(obj, cell_type_col = "type", sample_col = "sample", bc_col = "bc")
obj |
A long-form data.frame, Seurat object, or SingleCellExperiment object. |
cell_type_col |
Column name for cell types (default "type") |
sample_col |
Column name for sample IDs (default "sample") |
bc_col |
Column name for barcodes (default "bc") Use '"rownames"' to extract barcodes from row names. |
A sample-by-celltype matrix of counts.
# For a data.frame with custom column names: set.seed(687) df <- data.frame( barcode = paste0("cell", 1:10), cellType = sample(c("Tcell", "Bcell"), 10, TRUE), sampleID = sample(c("sample1","sample2"), 10, TRUE) ) cmat <- setaCounts(df, cell_type_col = "cellType", sample_col = "sampleID", bc_col = "barcode") print(head(cmat))# For a data.frame with custom column names: set.seed(687) df <- data.frame( barcode = paste0("cell", 1:10), cellType = sample(c("Tcell", "Bcell"), 10, TRUE), sampleID = sample(c("sample1","sample2"), 10, TRUE) ) cmat <- setaCounts(df, cell_type_col = "cellType", sample_col = "sampleID", bc_col = "barcode") print(head(cmat))
Calculates a pairwise distance matrix between samples based on user-specified
or default ("euclidean") distance metrics. If used on CLR-transformed
data, the default Euclidean distance is the Aitchison distance,
which is commonly used in compositional data analysis (CoDA).
setaDistances(transformed_counts, method = "euclidean")setaDistances(transformed_counts, method = "euclidean")
transformed_counts |
Numeric matrix: rows as samples and columns as taxa
(e.g., output from |
method |
Character. Distance metric for |
This function calculates distances between samples.
Output is a long-form structure convenient to merge with sample-level
metadata using merge or left_join.
A long-form data.frame with three columns:
Sample ID of the first sample in the pairwise comparison.
Sample ID of the second sample in the pairwise comparison.
Numeric distance between the two samples.
Aitchison, J. (1982). The Statistical Analysis of Compositional Data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2), 139-177.
# Example CLR transformed data (2 samples, 3 taxa) mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE) colnames(mat) <- c("Taxon1", "Taxon2", "Taxon3") rownames(mat) <- c("SampleA", "SampleB") clr_mat <- setaCLR(mat, pseudocount = 0) # Calculate Euclidean (Aitchison) distance dist_df <- setaDistances(clr_mat) print(head(dist_df))# Example CLR transformed data (2 samples, 3 taxa) mat <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2, byrow = TRUE) colnames(mat) <- c("Taxon1", "Taxon2", "Taxon3") rownames(mat) <- c("SampleA", "SampleB") clr_mat <- setaCLR(mat, pseudocount = 0) # Calculate Euclidean (Aitchison) distance dist_df <- setaDistances(clr_mat) print(head(dist_df))
Isometric Log-Ratio (ILR) Transform Applies the ILR transform to an integer counts matrix. For each sample (row), the data are log-transformed (with an optional Box Cox like transformation) then projected onto an orthonormal Helmert basis, reducing dimensionality by one.
setaILR(counts, boxcox_p = 0, taxTree = NULL, pseudocount = 1)setaILR(counts, boxcox_p = 0, taxTree = NULL, pseudocount = 1)
counts |
An integer matrix of celltype counts with samples in rows. |
boxcox_p |
Numeric. If nonzero, a Box Cox type transform is applied to the log-values. Default is 0 (no Box Cox transformation). |
taxTree |
Unused. Reserved for future taxonomic-balance approaches. |
pseudocount |
Numeric. Added to avoid |
The ILR transform is computed as follows:
Add a pseudocount and take the natural logarithm:
If boxcox_p != 0, apply the Box Cox like transform:
Project the log-transformed data onto an orthonormal Helmert basis computed via QR decomposition.
A list with:
A string indicating the ILR transform.
If boxcox_p is nonzero,
the value is indicated in the method string.
A matrix of ILR-transformed values with
ncol(counts) - 1 columns
and the same number of rows (samples) as the input.
Aitchison, J. (1982). The Statistical Analysis of Compositional Data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2), 139-177.
# Example matrix: rows are samples, columns are cell types. mat <- matrix(c(1, 2, 4, 8), nrow = 2, byrow = TRUE) colnames(mat) <- c("A", "B") # ILR transformation reduces the dimension by 1. out <- setaILR(mat, boxcox_p = 0, pseudocount = 1) out$counts# Example matrix: rows are samples, columns are cell types. mat <- matrix(c(1, 2, 4, 8), nrow = 2, byrow = TRUE) colnames(mat) <- c("A", "B") # ILR transformation reduces the dimension by 1. out <- setaILR(mat, boxcox_p = 0, pseudocount = 1) out$counts
Given an object produced by one of the
seta* transform functions (e.g., setaCLR),
this function applies a dimension reduction method (PCA, PCoA, or NMDS) to
transform_obj$counts.
setaLatent(transform_obj, method = c("PCA", "PCoA", "NMDS"), dims = 2)setaLatent(transform_obj, method = c("PCA", "PCoA", "NMDS"), dims = 2)
transform_obj |
A list returned by, e.g., |
method |
A string specifying the dimension reduction method.
One of |
dims |
Integer number of dimensions to return. Default is 2. |
PCA: Uses stats::prcomp
on the rows of transform_obj$counts.
PCoA: Computes a distance matrix
via stats::dist, then
applies classical multidimensional scaling (stats::cmdscale).
NMDS: Uses MASS::isoMDS to compute
non-metric MDS from the distance matrix.
Each method returns a data frame of coordinates in latentSpace,
plus additional information specific to that method.
A list containing:
The chosen latent space method.
A data frame of coordinates in the chosen latent space
with dims columns.
For PCA, the loadings matrix. Otherwise NA.
Variance explained (for PCA or PCoA) or stress (for NMDS).
set.seed(687) mat <- matrix(rpois(20, lambda=5), nrow=4) # small 4x5 matrix colnames(mat) <- paste0("C", 1:5) clr_out <- setaCLR(mat) latent_pca <- setaLatent(clr_out, method="PCA", dims=2) latent_pca$latentSpaceset.seed(687) mat <- matrix(rpois(20, lambda=5), nrow=4) # small 4x5 matrix colnames(mat) <- paste0("C", 1:5) clr_out <- setaCLR(mat) latent_pca <- setaLatent(clr_out, method="PCA", dims=2) latent_pca$latentSpace
log2(CPM) Transform Computes the log2 counts-per-million (CPM) for each sample. Samples are in rows and taxa in columns.
setaLogCPM(counts, pseudocount = 1, size_factors = NULL, scale_factor = 1e+06)setaLogCPM(counts, pseudocount = 1, size_factors = NULL, scale_factor = 1e+06)
counts |
A numeric matrix with rows as samples and columns as taxa. |
pseudocount |
Numeric. Added to counts to avoid |
size_factors |
Optional numeric vector of library sizes for each sample.
If |
scale_factor |
Numeric. The scaling factor, typically |
The transform is
where is the pseudocount, is the per-sample library size, and
is scale_factor.
A list with:
The string "logCPM".
A matrix of the same dimensions with log2-transformed CPM values.
mat <- matrix(c(10, 20, 100, 200), nrow = 2, byrow = TRUE) out <- setaLogCPM(mat, pseudocount = 1) out$countsmat <- matrix(c(10, 20, 100, 200), nrow = 2, byrow = TRUE) out <- setaLogCPM(mat, pseudocount = 1) out$counts
This function extracts sample-level metadata from a dataframe. It ensures that each metadata column contains unique values per sample. If a metadata column contains non-unique values within any sample, that column is excluded from the output, and the user is notified via a warning. Useful when preparing metadata for visualizations or analyses where sample-level inspection is required.
setaMetadata(x, sample_col = "Sample ID", meta_cols = NULL)setaMetadata(x, sample_col = "Sample ID", meta_cols = NULL)
x |
An object of class dataframe which contains cell-level or sample-level metadata. |
sample_col |
Character. The sample identifier column name in source_obj. Default is 'sample_id'. This column is used to group the metadata. |
meta_cols |
Character vector. Names of metadata columns to retain. If NULL, all columns present in the source object are considered. However, only those columns where all entries are identical within each sample are included in the final output. |
A dataframe where each row corresponds to a unique sample and each column represents a metadata variable that has uniform values within samples. Columns with non-unique values within any sample are excluded, and a warning lists these columns.
# Using a Seurat object # if (requireNamespace("SeuratObject", quietly = TRUE)) { # meta_df <- setaMetadata(seurat_obj[[]], # sample_col="donor_id", # meta_cols=c("disease", "Severity")) # } # # Using a SingleCellExperiment object with default parameters # if (requireNamespace("SingleCellExperiment", quietly = TRUE)) { # meta_df <- setaMetadata(data.frame(colData(sce_obj))) # # Using a dataframe and extracting all possible metadata columns # meta_df <- setaMetadata(df) # }# Using a Seurat object # if (requireNamespace("SeuratObject", quietly = TRUE)) { # meta_df <- setaMetadata(seurat_obj[[]], # sample_col="donor_id", # meta_cols=c("disease", "Severity")) # } # # Using a SingleCellExperiment object with default parameters # if (requireNamespace("SingleCellExperiment", quietly = TRUE)) { # meta_df <- setaMetadata(data.frame(colData(sce_obj))) # # Using a dataframe and extracting all possible metadata columns # meta_df <- setaMetadata(df) # }
Percentage Transform Converts each row (sample) of a counts matrix to percentages of its row sum.
setaPercent(counts)setaPercent(counts)
counts |
A numeric matrix with rows as samples and columns as taxa. |
Useful for simplified comparisons and as an input to non-parametric tests.
A list with:
The string "percent".
A matrix of the same dimensions as
counts, where each row sums to 100.
mat <- matrix(c(1,2,4,8), nrow = 2, byrow = TRUE) out <- setaPercent(mat) out$countsmat <- matrix(c(1,2,4,8), nrow = 2, byrow = TRUE) out <- setaPercent(mat) out$counts
setaTaxonomyDF() converts **one long-form metadata data.frame**-typically colData(sce), seu[[]], or any frame you already have, into a tidy taxonomy table. Each row corresponds to a unique value of the *finest* label (the **last** element of 'resolution_cols'), and every coarser label sits in its own column.
setaTaxonomyDF( obj, resolution_cols = c("broad_type", "mid_type", "fine_type"), bc_col = "bc" )setaTaxonomyDF( obj, resolution_cols = c("broad_type", "mid_type", "fine_type"), bc_col = "bc" )
obj |
A data.frame or similar object containing cell metadata. |
resolution_cols |
A character vector of column names indicating hierarchical taxonomy (from broad to fine). |
bc_col |
Optional. The name of the column containing barcodes, or "rownames" if they are row names. |
## What the input must contain * exactly **one row per cell** * at least one **barcode** column (default '"bc"'). Pass 'bc_col = "rownames"' if barcodes live in 'rownames(obj)'. * **all** columns listed in 'resolution_cols'
No 'Seurat'/'SingleCellExperiment' objects are accepted here: extract their metadata/colData first, then hand it in as a 'data.frame'
## Value A 'data.frame' whose **rownames** are the finest label. If any finest label maps to more than one set of coarser labels the function should stop with an informative error.
A 'data.frame' with one row per unique value of the finest label (the last entry in 'resolution_cols'), and one column for each resolution level. The row names are set to the finest label values. If any finest label maps to more than one combination of coarser labels, the function stops with an informative error.
meta <- data.frame( bc = paste0("cell", 1:6), broad_type = c("Epi","Epi","Epi","Stroma","Stroma","Epi"), mid_type = c("Alv","Alv","Alv","Fib","Fib","Alv"), fine_type = c("AT1","AT2","AT1","Fib1","Fib1","AT2") ) setaTaxonomyDF(meta, resolution_cols = c("broad_type","mid_type","fine_type")) ## barcodes can be in rownames with bc_col = "rownames" (as in Seurat Object) rownames(meta) <- meta$bc meta$bc <- NULL setaTaxonomyDF(meta, resolution_cols = c("broad_type","mid_type","fine_type"), bc_col = "rownames")meta <- data.frame( bc = paste0("cell", 1:6), broad_type = c("Epi","Epi","Epi","Stroma","Stroma","Epi"), mid_type = c("Alv","Alv","Alv","Fib","Fib","Alv"), fine_type = c("AT1","AT2","AT1","Fib1","Fib1","AT2") ) setaTaxonomyDF(meta, resolution_cols = c("broad_type","mid_type","fine_type")) ## barcodes can be in rownames with bc_col = "rownames" (as in Seurat Object) rownames(meta) <- meta$bc meta$bc <- NULL setaTaxonomyDF(meta, resolution_cols = c("broad_type","mid_type","fine_type"), bc_col = "rownames")
counts matrix
should have rows as samples and columns as taxa. Optionally, you can supply
a taxonomy data frame to perform a within-lineage transform at a specified
resolution.Wrapper for Compositional Transforms with Optional Within-Lineage Resolutions
A convenience function that dispatches to one of the transforms:
CLR, ALR, ILR, percent, or logCPM. Note that the input counts matrix
should have rows as samples and columns as taxa. Optionally, you can supply
a taxonomy data frame to perform a within-lineage transform at a specified
resolution.
setaTransform( counts, method = c("CLR", "ALR", "ILR", "percent", "logCPM", "balance"), ref = NULL, taxTree = NULL, pseudocount = 1, size_factors = NULL, taxonomyDF = NULL, taxonomy_col = NULL, within_resolution = FALSE, balances = NULL, normalize_to_parent = FALSE )setaTransform( counts, method = c("CLR", "ALR", "ILR", "percent", "logCPM", "balance"), ref = NULL, taxTree = NULL, pseudocount = 1, size_factors = NULL, taxonomyDF = NULL, taxonomy_col = NULL, within_resolution = FALSE, balances = NULL, normalize_to_parent = FALSE )
counts |
A numeric matrix with rows as samples and columns as taxa. |
method |
A character string specifying which transform to apply.
One of |
ref |
Reference taxon (only used if |
taxTree |
Optional tree for ILR (not yet implemented). |
pseudocount |
Numeric, used by CLR, ALR, ILR, and logCPM. Default is 1. |
size_factors |
For logCPM scaling. If |
taxonomyDF |
Optional data frame specifying higher-level groupings
for each taxon. Row names of |
taxonomy_col |
The column of |
within_resolution |
Logical. If |
balances |
For '"balance"': a single balance list or a named list; |
normalize_to_parent |
Logical, passed to [setaBalance()]. |
A list with the following elements:
The core transform, e.g. \"CLR\", \"ALR\", etc.
Logical indicating if a within-lineage transform was used.
The name of the column in taxonomyDF used for
grouping (lineages) if within_resolution = TRUE, otherwise
NULL.
The resulting matrix after transformation, with the same
dimensions as the input counts.
mat <- matrix(c(1, 2, 4, 8, 3, 6, 9, 12), nrow = 2, byrow = TRUE) colnames(mat) <- c("TaxonA1", "TaxonA2", "TaxonB1", "TaxonB2") # Build a taxonomy data frame labeling lineages df_lineage <- data.frame( Lineage = c("LineageA", "LineageA", "LineageB", "LineageB"), row.names = colnames(mat) ) # Apply CLR transform to all columns together out1 <- setaTransform(mat, method = "CLR") # Apply CLR within each Lineage out2 <- setaTransform( mat, method = "CLR", taxonomyDF = df_lineage, taxonomy_col = "Lineage", within_resolution = TRUE )mat <- matrix(c(1, 2, 4, 8, 3, 6, 9, 12), nrow = 2, byrow = TRUE) colnames(mat) <- c("TaxonA1", "TaxonA2", "TaxonB1", "TaxonB2") # Build a taxonomy data frame labeling lineages df_lineage <- data.frame( Lineage = c("LineageA", "LineageA", "LineageB", "LineageB"), row.names = colnames(mat) ) # Apply CLR transform to all columns together out1 <- setaTransform(mat, method = "CLR") # Apply CLR within each Lineage out2 <- setaTransform( mat, method = "CLR", taxonomyDF = df_lineage, taxonomy_col = "Lineage", within_resolution = TRUE )
This function takes a data frame describing a hierarchical taxonomy across
multiple columns (e.g., broad -> mid -> fine). Each row represents a unique
path through the hierarchy. The function introduces a single root node
(named root_name) above the first hierarchy column, then constructs
a directed tree in which each level connects to the next.
After building the graph, it appends node-level metadata by looking up
which rows (and columns) in tax_df contain each node. This allows
you to color or facet by different levels of the taxonomy when using ggraph.
taxonomy_to_tbl_graph(tax_df, columns = NULL, root_name = "AllCells")taxonomy_to_tbl_graph(tax_df, columns = NULL, root_name = "AllCells")
tax_df |
A data frame with one row per unique path in the hierarchy.
For example, if your columns are |
columns |
A character vector of column names in |
root_name |
A character string naming the artificial root node,
inserted above the first hierarchy column. Default is |
1. The function first builds an edge list
Root -> level1 for each row
level1 -> level2
...
level_{N-1} -> levelN
and removes duplicates, creating a single connected tree.
2. It then annotates each node with the best-known taxonomy data. For
a node named x, we look up all rows of tax_df where
x appears in columns, gather the distinct values from each
col, and store them joined with \"|\" if more than one
distinct value is found.
This means if a node is shared among multiple broad categories (uncommon, but
possible), that node's broad column will contain something like
\"Epithelial|Stromal\".
A tbl_graph object (directed) with a single root node. The node
data includes extra columns corresponding to each level in columns.
If a node corresponds to multiple categories at a given level, these are
combined with \"|\".
# Minimal example with a 3-level hierarchy (broad -> mid -> fine) tax_df_example <- data.frame( broad = c("Epithelial", "Epithelial", "Stromal"), mid = c("Alveolar", "Alveolar", "Fibroblast"), fine = c("AlveolarType1", "AlveolarType2", "Fibroblast1"), stringsAsFactors = FALSE ) library(tidygraph) library(ggraph) library(ggplot2) # Build a single-root tree and incorporate node metadata tbl_g <- taxonomy_to_tbl_graph( tax_df_example, columns = c("broad", "mid", "fine"), root_name = "AllCells" ) # Inspect node data (metadata for each node) as.data.frame(tbl_g, "nodes") # Visualize with ggraph, coloring by 'broad' level ggraph(tbl_g, layout = "tree") + geom_edge_diagonal() + geom_node_point(aes(color = broad), size = 3) + geom_node_text(aes(label = name), vjust = 1, hjust = 0.5) + theme_minimal() + labs(title = "Single-Root Taxonomy Tree")# Minimal example with a 3-level hierarchy (broad -> mid -> fine) tax_df_example <- data.frame( broad = c("Epithelial", "Epithelial", "Stromal"), mid = c("Alveolar", "Alveolar", "Fibroblast"), fine = c("AlveolarType1", "AlveolarType2", "Fibroblast1"), stringsAsFactors = FALSE ) library(tidygraph) library(ggraph) library(ggplot2) # Build a single-root tree and incorporate node metadata tbl_g <- taxonomy_to_tbl_graph( tax_df_example, columns = c("broad", "mid", "fine"), root_name = "AllCells" ) # Inspect node data (metadata for each node) as.data.frame(tbl_g, "nodes") # Visualize with ggraph, coloring by 'broad' level ggraph(tbl_g, layout = "tree") + geom_edge_diagonal() + geom_node_point(aes(color = broad), size = 3) + geom_node_text(aes(label = name), vjust = 1, hjust = 0.5) + theme_minimal() + labs(title = "Single-Root Taxonomy Tree")