Often, high-dimensional cytometry experiments collect tens or hundreds of thousands, or even millions, of cells in total, and it can be useful to downsample to a smaller, more computationally tractable number of cells, either for a final analysis or while developing code.
To do this, {tidytof} implements the
tof_downsample() verb, which allows downsampling using 3
methods: downsampling to an integer number of cells, downsampling to a
fixed proportion of the total number of input cells, or downsampling to
a fixed cellular density in phenotypic space.
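To build intuition for the first two strategies before using the package, the following is a minimal base-R sketch of per-group "constant" and "prop" style sampling. It is an illustration only, not {tidytof}'s implementation: the toy data frame `cells` and the helper `sample_within_groups()` are hypothetical names introduced here.

```r
# Base-R sketch of per-group downsampling (illustrative; not tidytof's code)
set.seed(2024) # make the random draws reproducible

# Hypothetical toy data: three clusters of unequal size
cells <- data.frame(
  cell_id = seq_len(600),
  cluster = rep(c("a", "b", "c"), times = c(300, 200, 100))
)

# Sample rows within each group; `n_fun` maps a group size to a sample size
sample_within_groups <- function(data, group, n_fun) {
  idx_by_group <- split(seq_len(nrow(data)), data[[group]])
  keep <- lapply(idx_by_group, function(idx) {
    # sample.int() avoids the surprising behavior of sample() on length-1 input
    idx[sample.int(length(idx), size = n_fun(length(idx)))]
  })
  data[sort(unlist(keep, use.names = FALSE)), , drop = FALSE]
}

# "constant"-style: the same number of cells per group (capped at group size)
constant_sample <- sample_within_groups(cells, "cluster", function(n) min(n, 50))

# "prop"-style: the same fraction of cells per group
prop_sample <- sample_within_groups(cells, "cluster", function(n) floor(n * 0.5))

table(constant_sample$cluster) # 50 cells in each cluster
table(prop_sample$cluster)     # 150, 100, and 50 cells
```

The practical difference is visible in the counts: constant sampling equalizes group sizes, whereas proportional sampling preserves the relative abundance of each group.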
tof_downsample()
Using {tidytof}’s built-in dataset
phenograph_data, we can see that the original size of the
dataset is 1000 cells per cluster, or 3000 cells in total:
data(phenograph_data)
phenograph_data |>
  dplyr::count(phenograph_cluster)
#> # A tibble: 3 × 2
#> phenograph_cluster n
#> <chr> <int>
#> 1 cluster1 1000
#> 2 cluster2 1000
#> 3 cluster3 1000
To randomly sample 200 cells per cluster, we can use
tof_downsample() with the “constant”
method:
phenograph_data |>
  # downsample
  tof_downsample(
    group_cols = phenograph_cluster,
    method = "constant",
    num_cells = 200
  ) |>
  # count the number of downsampled cells in each cluster
  count(phenograph_cluster)
#> # A tibble: 3 × 2
#> phenograph_cluster n
#> <chr> <int>
#> 1 cluster1 200
#> 2 cluster2 200
#> 3 cluster3 200
Alternatively, if we wanted to sample 50% of the cells in each
cluster, we could use the “prop” method:
phenograph_data |>
  # downsample
  tof_downsample(
    group_cols = phenograph_cluster,
    method = "prop",
    prop_cells = 0.5
  ) |>
  # count the number of downsampled cells in each cluster
  count(phenograph_cluster)
#> # A tibble: 3 × 2
#> phenograph_cluster n
#> <chr> <int>
#> 1 cluster1 500
#> 2 cluster2 500
#> 3 cluster3 500
Finally, we might want a slightly different approach to downsampling
that reduces the number of cells not to a fixed constant or proportion,
but to a fixed density in phenotypic space. For example, the following
scatterplot shows that some regions of phenotypic space in
phenograph_data contain more cells than others along
the cd34/cd38 axes:
# rescale values so that the maximum observed value maps to to[2] (1 by default);
# used below to label the fill legend as a relative density
rescale_max <-
  function(x, to = c(0, 1), from = range(x, na.rm = TRUE)) {
    x / from[2] * to[2]
  }
phenograph_data |>
  # preprocess all numeric columns in the dataset
  tof_preprocess(undo_noise = FALSE) |>
  # plot
  ggplot(aes(x = cd34, y = cd38)) +
  geom_hex() +
  coord_fixed(ratio = 0.4) +
  scale_x_continuous(limits = c(NA, 1.5)) +
  scale_y_continuous(limits = c(NA, 4)) +
  scale_fill_viridis_c(
    labels = function(x) round(rescale_max(x), 2)
  ) +
  labs(
    fill = "relative density"
)
To reduce the number of cells in our dataset until the local density
around each cell is relatively constant, we can use the
“density” method of tof_downsample():
phenograph_data |>
  tof_preprocess(undo_noise = FALSE) |>
  tof_downsample(method = "density", density_cols = c(cd34, cd38)) |>
  # plot
  ggplot(aes(x = cd34, y = cd38)) +
  geom_hex() +
  coord_fixed(ratio = 0.4) +
  scale_x_continuous(limits = c(NA, 1.5)) +
  scale_y_continuous(limits = c(NA, 4)) +
  scale_fill_viridis_c(
    labels = function(x) round(rescale_max(x), 2)
  ) +
  labs(
    fill = "relative density"
)
Thus, we can see that the density after downsampling is more uniform
(though not exactly uniform) across the range of
cd34/cd38 values in
phenograph_data.
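The general idea behind density-dependent downsampling can be sketched in base R as follows. This is a simplified illustration under our own assumptions, not the algorithm {tidytof} uses: here the local density is a plain count of neighbors within a fixed radius, and the simulated data, `radius`, and `target` are all hypothetical choices made for the example.

```r
# Base-R sketch of density-dependent thinning (illustrative; not tidytof's code)
set.seed(2024)

# Hypothetical 2-D data: one dense blob (900 cells) and one sparse blob (100 cells)
dense <- cbind(rnorm(900, mean = 0, sd = 0.3), rnorm(900, mean = 0, sd = 0.3))
sparse <- cbind(rnorm(100, mean = 3, sd = 0.3), rnorm(100, mean = 3, sd = 0.3))
x <- rbind(dense, sparse)
blob <- rep(c("dense", "sparse"), times = c(900, 100))

# Local density estimate: number of cells within `radius` (including the cell itself)
radius <- 0.3
pairwise_dist <- as.matrix(dist(x))
local_density <- rowSums(pairwise_dist <= radius)

# Keep each cell with probability target / density, capped at 1, so that
# cells in crowded regions are less likely to be retained
target <- unname(quantile(local_density, 0.1))
keep_prob <- pmin(1, target / local_density)
keep <- runif(nrow(x)) < keep_prob

# Fraction of cells retained in each blob
tapply(keep, blob, mean)
```

Because the retention probability is inversely proportional to the estimated density, the dense blob is thinned much more heavily than the sparse one, which is why the result is more uniform but, as noted above, not exactly uniform.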
For more details, check out the documentation for the 3 underlying
members of the tof_downsample_* function family (which are
wrapped by tof_downsample):
tof_downsample_constant
tof_downsample_prop
tof_downsample_density
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] tidyr_1.3.2 stringr_1.6.0
#> [3] HDCytoData_1.32.0 flowCore_2.24.0
#> [5] SummarizedExperiment_1.42.0 Biobase_2.72.0
#> [7] GenomicRanges_1.64.0 Seqinfo_1.2.0
#> [9] IRanges_2.46.0 S4Vectors_0.50.1
#> [11] MatrixGenerics_1.24.0 matrixStats_1.5.0
#> [13] ExperimentHub_3.2.0 AnnotationHub_4.2.0
#> [15] BiocFileCache_3.2.0 dbplyr_2.5.2
#> [17] BiocGenerics_0.58.1 generics_0.1.4
#> [19] forcats_1.0.1 ggplot2_4.0.3
#> [21] dplyr_1.2.1 tidytof_1.6.0
#> [23] rmarkdown_2.31
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 sys_3.4.3 jsonlite_2.0.0
#> [4] shape_1.4.6.1 magrittr_2.0.5 farver_2.1.2
#> [7] vctrs_0.7.3 memoise_2.0.1 sparsevctrs_0.3.6
#> [10] htmltools_0.5.9 S4Arrays_1.12.0 curl_7.1.0
#> [13] SparseArray_1.12.2 sass_0.4.10 parallelly_1.47.0
#> [16] bslib_0.11.0 httr2_1.2.2 lubridate_1.9.5
#> [19] cachem_1.1.0 buildtools_1.0.0 igraph_2.3.1
#> [22] lifecycle_1.0.5 iterators_1.0.14 pkgconfig_2.0.3
#> [25] Matrix_1.7-5 R6_2.6.1 fastmap_1.2.0
#> [28] future_1.70.0 digest_0.6.39 AnnotationDbi_1.74.0
#> [31] RSpectra_0.16-2 RSQLite_3.53.1 labeling_0.4.3
#> [34] filelock_1.0.3 cytolib_2.24.0 yardstick_1.4.0
#> [37] timechange_0.4.0 httr_1.4.8 polyclip_1.10-7
#> [40] abind_1.4-8 compiler_4.6.0 bit64_4.8.2
#> [43] withr_3.0.2 doParallel_1.0.17 S7_0.2.2
#> [46] viridis_0.6.5 DBI_1.3.0 ggforce_0.5.0
#> [49] MASS_7.3-65 lava_1.9.1 embed_1.2.2
#> [52] rappdirs_0.3.4 DelayedArray_0.38.2 tools_4.6.0
#> [55] otel_0.2.0 future.apply_1.20.2 nnet_7.3-20
#> [58] glue_1.8.1 grid_4.6.0 Rtsne_0.17
#> [61] recipes_1.3.2 gtable_0.3.6 tzdb_0.5.0
#> [64] class_7.3-23 data.table_1.18.4 hms_1.1.4
#> [67] utf8_1.2.6 tidygraph_1.3.1 XVector_0.52.0
#> [70] RcppAnnoy_0.0.23 ggrepel_0.9.8 BiocVersion_3.23.1
#> [73] foreach_1.5.2 pillar_1.11.1 RcppHNSW_0.7.0
#> [76] splines_4.6.0 tweenr_2.0.3 lattice_0.22-9
#> [79] survival_3.8-6 bit_4.6.0 RProtoBufLib_2.24.0
#> [82] tidyselect_1.2.1 maketools_1.3.2 Biostrings_2.80.1
#> [85] knitr_1.51 gridExtra_2.3 xfun_0.57
#> [88] graphlayouts_1.2.3 hardhat_1.4.3 timeDate_4052.112
#> [91] stringi_1.8.7 yaml_2.3.12 evaluate_1.0.5
#> [94] codetools_0.2-20 ggraph_2.2.2 tibble_3.3.1
#> [97] BiocManager_1.30.27 cli_3.6.6 uwot_0.2.4
#> [100] rpart_4.1.27 jquerylib_0.1.4 Rcpp_1.1.1-1.1
#> [103] globals_0.19.1 png_0.1-9 parallel_4.6.0
#> [106] gower_1.0.2 readr_2.2.0 blob_1.3.0
#> [109] listenv_0.10.1 glmnet_5.0 viridisLite_0.4.3
#> [112] ipred_0.9-15 ggridges_0.5.7 scales_1.4.0
#> [115] prodlim_2026.03.11 crayon_1.5.3 purrr_1.2.2
#> [118] rlang_1.2.0 KEGGREST_1.52.0