Install the released version from Bioconductor with `BiocManager::install("BreastSubtypeR")`.
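If you use BreastSubtypeR, please cite Yang et al. (2025); the full entry is in the reference list below.
For a BibTeX/LaTeX entry, run `toBibtex(citation("BreastSubtypeR"))` in R.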
Breast cancer (BC) is a biologically heterogeneous disease with intrinsic molecular subtypes (e.g., Luminal A, Luminal B, HER2-enriched, Basal-like, Normal-like) that inform biological interpretation and clinical decision-making. Clinical assays such as Prosigna provide standardized subtyping in the clinic, but research implementations have proliferated and diverge in pre-processing, gene mapping, and algorithmic assumptions, which reduces reproducibility and complicates cross-cohort analyses.
BreastSubtypeR consolidates multiple published gene-expression signature classifiers into a unified, assumption-aware Bioconductor package with:

- a unified multi-method API (run many classifiers in one call),
- AUTO mode for cohort-aware method selection,
- standardized, method-specific pre-processing for multiple input types (raw counts, FPKM, log2-processed arrays),
- Entrez ID–based probe/gene mapping,
- a local Shiny app (iBreastSubtypeR) for non-programmers.
- Multi-method API (`BS_Multi`): execute several classifiers in a single call and compare results side by side.
- Local Shiny app (`iBreastSubtypeR`): point-and-click analysis; data stay on your machine.
- `SummarizedExperiment` compatibility.

The package includes implementations of commonly used subtyping methods, grouped as nearest-centroid (NC-based) and single-sample predictor (SSP-based):
| Method id | Short description | Group | Reference |
|---|---|---|---|
| `parker.original` | Original PAM50 by Parker et al., 2009 | NC-based | Parker et al., 2009 |
| `genefu.scale` | PAM50 implementation as in the genefu R package (scaled version) | NC-based | Gendoo et al., 2016 |
| `genefu.robust` | PAM50 implementation as in the genefu R package (robust version) | NC-based | Gendoo et al., 2016 |
| `cIHC` | Conventional ER-balancing using immunohistochemistry (IHC) | NC-based | Ciriello et al., 2015 |
| `cIHC.itr` | Iterative version of cIHC | NC-based | Curtis et al., 2012 |
| `PCAPAM50` | Selects IHC-defined ER subsets, then uses Principal Component Analysis (PCA) to create ESR1 expression-based ER-balancing | NC-based | Raj-Kumar et al., 2019 |
| `ssBC` | Subgroup-specific gene-centering PAM50 | NC-based | Zhao et al., 2015 |
| `ssBC.v2` | Updated subgroup-specific gene-centering PAM50 with refined quantiles | NC-based | Fernandez-Martinez et al., 2020 |
| `AIMS` | Absolute Intrinsic Molecular Subtyping (AIMS) method | SSP-based | Paquet & Hallett, 2015 |
| `sspbc` | Single-Sample Predictors for Breast Cancer (AIMS adaptation) | SSP-based | Staaf et al., 2022 |
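To make the NC-based group more concrete, here is a minimal base-R sketch of nearest-centroid assignment: genes are median-centered across the cohort, and each sample is assigned to the centroid with the highest Spearman correlation. The centroids, gene names, and expression values are toy values invented for illustration. They are not the PAM50 centroids, and this is not the package's internal code.

```r
genes <- paste0("gene", 1:6)

# Hypothetical centroids (genes x subtypes); NOT the real PAM50 centroids
centroids <- cbind(
    LumA  = c( 1,  1, 0, -1, -1, 0),
    Basal = c(-1, -1, 0,  1,  1, 0)
)
rownames(centroids) <- genes

# Toy log-scale expression matrix (genes x samples)
expr <- cbind(
    s1 = c(9, 8, 5, 2, 1, 5),
    s2 = c(2, 1, 5, 9, 8, 5),
    s3 = c(8, 9, 5, 1, 2, 5)
)
rownames(expr) <- genes

# Gene-wise median centering across the cohort
centered <- sweep(expr, 1, apply(expr, 1, median))

# Spearman correlation of every sample with every centroid (samples x subtypes)
rho <- cor(centered, centroids, method = "spearman")

# Assign each sample to its best-correlated centroid
calls <- colnames(centroids)[max.col(rho, ties.method = "first")]
names(calls) <- colnames(expr)
calls
```

Because the centering step uses cohort-wide medians, the same sample can be centered differently, and possibly called differently, in a cohort with another subtype composition. That cohort dependence is what the ER-balancing variants in the table are meant to address.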
The examples below use small example datasets shipped with the
package. For your own data, provide a SummarizedExperiment
with clinical metadata in colData (e.g.,
PatientID, ER/HER2; for ROR: TSIZE,
NODE).
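Before wrapping your own data in a SummarizedExperiment, it can help to check the clinical table. The base-R sketch below assumes a hypothetical table whose column names follow the examples in the text; the value coding (for instance "+"/"-") and the numbers are illustrative assumptions, not the package's required format.

```r
# Hypothetical clinical table; values and coding are assumptions for illustration
clin <- data.frame(
    PatientID = c("P1", "P2", "P3"),
    ER        = c("+", "-", "+"),
    HER2      = c("-", "-", "+"),
    TSIZE     = c("1", "0", "1"),  # text on purpose, as often happens after CSV import
    NODE      = c(0, 2, 1),
    stringsAsFactors = FALSE
)

required <- c("PatientID", "ER", "HER2")
ror_cols <- c("TSIZE", "NODE")

# Which required columns are absent?
missing_cols <- setdiff(required, names(clin))

# Are the ROR-related columns numeric?
numeric_before <- vapply(clin[ror_cols], is.numeric, logical(1))

# Coerce any text columns to numeric
clin[ror_cols] <- lapply(clin[ror_cols], as.numeric)
numeric_after <- vapply(clin[ror_cols], is.numeric, logical(1))

missing_cols
numeric_before
numeric_after
```

In this toy table `TSIZE` starts as text and is numeric only after coercion. The checked table would then go into `colData`, with row names matching the sample columns of the expression matrix.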
1) Map & prepare (method-specific pre-processing + mapping)
```r
library(BreastSubtypeR)

# Example dataset shipped with the package
data("OSLO2EMIT0obj")

# Pre-processing: automatically apply tailored normalization, map probes/IDs to Entrez,
# and (optionally) impute missing values
data_input <- Mapping(
    OSLO2EMIT0obj$se_obj,
    method = "max",      # mapping strategy (example)
    RawCounts = FALSE,
    impute = TRUE,
    verbose = FALSE
)
```

Notes
`Mapping()` prepares expression inputs for downstream subtyping functions by:

- applying method-specific normalization, including a `2^x` transformation for SSP-based methods,
- mapping probes/IDs to Entrez IDs (strategy set by the `method` argument),
- optionally imputing missing values,
- returning inputs ready for `BS_Multi` or single-method callers.

See `?Mapping` for the full parameter list (e.g., `RawCounts`, `method`, `impute`, `verbose`) and Methods (Sections 2.3–2.4) in the paper for a complete description of the input/normalization pipeline.

2) Multi-method run (user-defined)
```r
methods <- c("parker.original", "PCAPAM50", "sspbc")

res <- BS_Multi(
    data_input = data_input,
    methods = methods,
    Subtype = FALSE,
    hasClinical = FALSE
)

# Per-sample calls (samples × methods, plus an entropy column)
head(res$res_subtypes, 5)
#>                parker.original PCAPAM50  sspbc   entropy
#> OSLO2EMIT0.001            LumA     LumA   LumB 0.9182958
#> OSLO2EMIT0.002           Basal    Basal  Basal 0.0000000
#> OSLO2EMIT0.003            LumA     LumA   LumA 0.0000000
#> OSLO2EMIT0.004            LumA     LumA   LumA 0.0000000
#> OSLO2EMIT0.005          Normal     LumA Normal 0.9182958
```

3) AUTO mode (cohort-aware selection) + visualize
AUTO evaluates cohort diagnostics (for example, ER/HER2 distribution, subtype purity, and subgroup sizes) and selects methods compatible with the cohort. It disables classifiers whose distributional assumptions would likely be violated.
```r
res_auto <- BS_Multi(
    data_input = data_input,
    methods = "AUTO",
    Subtype = FALSE,
    hasClinical = FALSE
)

# Visualize multi-method output and concordance
Vis_Multi(res_auto$res_subtypes)
```

AUTO logic (clarifications)
- `lower_ratio = 0.39`, `upper_ratio = 0.69`.
- `n_ERpos_threshold = 15`
- `n_ERneg_threshold = 18`
- `n_TN_threshold = 18` (currently aligned with ER−)
- `n_ERposHER2pos_threshold = n_ERposHER2neg_threshold = round(n_ERpos_threshold / 2)`
- `n_ERnegHER2pos_threshold = n_ERnegHER2neg_threshold = round(n_ERneg_threshold / 2)`

Notes. Thresholds are selection gates for method eligibility; they do not force a consensus call.
Provenance & future updates. The ER+ (15) and ER− (18) cohort minimums are simulation-based defaults. ER/HER2 subgroup thresholds (approx. half of each ER total) are heuristic and may be updated as additional simulation studies are completed. For TNBC, we currently use the ER− minimum (18) as the cohort cutoff; TN-specific thresholds may likewise be refined in future releases.
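The subgroup-size gates can be illustrated with a short base-R sketch. It recomputes the derived thresholds from the defaults quoted above and compares them with subgroup counts in a toy cohort. The cohort, the "+"/"-" coding, and the `gate` table are hypothetical, and the sketch does not reproduce how the package maps these gates to specific classifiers.

```r
# Defaults quoted in the text
n_ERpos_threshold <- 15
n_ERneg_threshold <- 18

# Derived subgroup minimums (about half of each ER total)
n_ERposHER2pos_threshold <- n_ERposHER2neg_threshold <- round(n_ERpos_threshold / 2)
n_ERnegHER2pos_threshold <- n_ERnegHER2neg_threshold <- round(n_ERneg_threshold / 2)

# Hypothetical toy cohort: 40 ER+ (10 HER2+, 30 HER2-) and 12 ER- (4 HER2+, 8 HER2-)
clin <- data.frame(
    ER   = rep(c("+", "-"), times = c(40, 12)),
    HER2 = rep(c("+", "-", "+", "-"), times = c(10, 30, 4, 8))
)
er_pos   <- clin$ER == "+"
her2_pos <- clin$HER2 == "+"

observed <- c(
    ERpos        = sum(er_pos),
    ERneg        = sum(!er_pos),
    ERposHER2pos = sum(er_pos & her2_pos),
    ERposHER2neg = sum(er_pos & !her2_pos),
    ERnegHER2pos = sum(!er_pos & her2_pos),
    ERnegHER2neg = sum(!er_pos & !her2_pos)
)
minimums <- c(
    ERpos        = n_ERpos_threshold,
    ERneg        = n_ERneg_threshold,
    ERposHER2pos = n_ERposHER2pos_threshold,
    ERposHER2neg = n_ERposHER2neg_threshold,
    ERnegHER2pos = n_ERnegHER2pos_threshold,
    ERnegHER2neg = n_ERnegHER2neg_threshold
)

# One row per subgroup: observed size, required minimum, and whether the gate is met
gate <- data.frame(observed, minimum = minimums, meets_minimum = observed >= minimums)
gate
```

In this toy cohort the ER+ subgroups clear their minimums and the ER− subgroups do not, so methods that rely on a sufficiently large ER− subgroup would be the ones at risk of being gated out.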
4) Single-method run
PAM50 (NC-based)
```r
res_pam <- BS_parker(
    se_obj = data_input$se_NC,  # object prepared for NC-based methods
    calibration = "Internal",
    internal = "medianCtr",
    Subtype = FALSE,
    hasClinical = FALSE
)
```

AIMS (SSP-based)
BreastSubtypeR routes the supplied input to the appropriate, method-specific pre-processing pipeline automatically; see `?BS_Multi` and Methods (Section 2.3) in the paper for details.

When to use AUTO

- Use `methods = "AUTO"` (i.e. `BS_Multi(methods = "AUTO", ...)`) for exploratory datasets or cohorts of unknown or skewed composition.
- Use AUTO when you want the package to select only classifiers compatible with the cohort (it disables methods whose assumptions appear violated).
- Call a single-method function directly when you need one specific classifier (e.g., `BS_parker()`).
- AUTO is designed to avoid misapplication of NC-based classifiers when cohort assumptions are violated; it does not produce a forced consensus label.

For users new to R, we offer an intuitive Shiny app for interactive molecular subtyping.
If needed, install UI dependencies and re-run:
The app runs locally; no data leave your machine.
What you can do:
- Upload expression, clinical, and feature-annotation tables (clinical
lives in colData).
- Run single methods, or run multiple classifiers at once with
BS_Multi and AUTO enabled for cohort-aware
selection.
- Choose 5-class (incl. Normal-like) or 4-class (AIMS is 5-class
only).
- Inspect per-sample concordance (entropy), heatmap and pie
summaries.
- Export Calls-only or Full metrics. ROR is available for NC methods
when TSIZE/NODE are present and numeric.
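The per-sample entropy values reported alongside multi-method calls appear consistent with base-2 Shannon entropy of the label distribution across methods. The sketch below computes that quantity in base R for three rows taken from the example output. `call_entropy` is a hypothetical helper written for illustration, and the package's exact implementation is not confirmed here.

```r
# Shannon entropy (base 2) of the subtype labels assigned to one sample
call_entropy <- function(calls) {
    p <- table(calls) / length(calls)
    -sum(p * log2(p))
}

# Three samples from the multi-method example output
calls <- data.frame(
    parker.original = c("LumA", "Basal", "Normal"),
    PCAPAM50        = c("LumA", "Basal", "LumA"),
    sspbc           = c("LumB", "Basal", "Normal"),
    row.names       = c("OSLO2EMIT0.001", "OSLO2EMIT0.002", "OSLO2EMIT0.005")
)

# One entropy value per sample (row)
entropy <- apply(calls, 1, call_entropy)
round(entropy, 7)
```

An entropy of 0 means all methods agree. With three methods, a 2-versus-1 split gives about 0.918, which matches the values shown for OSLO2EMIT0.001 and OSLO2EMIT0.005.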
The Shiny UI provides a “Load example data…” button that preloads a small demo cohort (expression, clinical, annotation). After loading, click Preprocess & map (Step 1), then proceed to analyses (Step 2).
Programmatic access to the same files:
BreastSubtypeR harmonises many published, signature-based classifiers but has known limitations:

- It is not a clinical-grade replacement for assays like Prosigna; clinical validation requires paired clinical assay data.
- AUTO selects compatible methods; it does not perform consensus voting by default.
Yang Q., Hartman J., Sifakis E.G. (2025). BreastSubtypeR: a unified R/Bioconductor package for intrinsic molecular subtyping in breast cancer research. NAR Genomics and Bioinformatics, 7(4):lqaf131. https://doi.org/10.1093/nargab/lqaf131
Parker JS, Mullins M, Cheang MCU, Leung S, Voduc D, Vickery T, et al. (2009). Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol, 27(8):1160–1167. https://doi.org/10.1200/JCO.2008.18.1370
Gendoo DMA, Ratanasirigulchai N, Schröder MS, Pare L, Parker JS, Prat A, Haibe-Kains B. (2016). Genefu: an R/Bioconductor package for computation of gene expression-based signatures in breast cancer. Bioinformatics, 32(7):1097–1099. https://doi.org/10.1093/bioinformatics/btv693
Ciriello G, Gatza ML, Beck AH, Wilkerson MD, Rhie SK, Pastore A, et al. (2015). Comprehensive molecular portraits of invasive lobular breast cancer. Cell, 163(2):506–519. https://doi.org/10.1016/j.cell.2015.09.033
Curtis C, Shah SP, Chin S-F, Turashvili G, Rueda OM, Dunning MJ, et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumors reveals novel subgroups. Nature, 486:346–352. https://doi.org/10.1038/nature10983
Raj-Kumar PK, Liu J, Hooke JA, Kovatich AJ, Kvecher L, Shriver CD, Hu H. (2019). PCA-PAM50 improves subtype assignment in ER-positive breast cancer. Sci Rep, 9:14386. https://doi.org/10.1038/s41598-019-44339-4
Zhao X, Rodland EA, Tibshirani R, Edvardsen H, Sauer T, Hovig E. (2015). Systematic evaluation of subtype prediction using gene expression profiles and intrinsic subtyping methods. Breast Cancer Res, 17:55. https://doi.org/10.1186/s13058-015-0520-4
Fernandez-Martinez A, Krop IE, Hillman DW, Polley M-YC, Parker JS, Huebner L, et al. (2020). Survival, pathologic response, and PAM50 subtype in stage II–III HER2-positive breast cancer treated with neoadjuvant chemotherapy and trastuzumab ± lapatinib. J Clin Oncol, 38(19):2140–2150. https://doi.org/10.1200/JCO.20.01276
Paquet ER, Hallett MT. (2015). Absolute assignment of breast cancer intrinsic molecular subtype. J Natl Cancer Inst, 107(1):357. https://doi.org/10.1093/jnci/dju357
Staaf J, Ringnér M, Vallon-Christersson J. (2022). Simple single-sample predictors for breast cancer subtype identification using gene expression data. npj Breast Cancer, 8:104. https://doi.org/10.1038/s41523-022-00465-3
sessionInfo()
#> R version 4.6.0 (2026-04-24)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] BreastSubtypeR_1.4.0 BiocStyle_2.40.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.42.0 gtable_0.3.6
#> [3] impute_1.86.0 circlize_0.4.18
#> [5] shape_1.4.6.1 rjson_0.2.23
#> [7] xfun_0.58 bslib_0.11.0
#> [9] ggplot2_4.0.3 GlobalOptions_0.1.4
#> [11] ggrepel_0.9.8 Biobase_2.72.0
#> [13] lattice_0.22-9 vctrs_0.7.3
#> [15] tools_4.6.0 generics_0.1.4
#> [17] stats4_4.6.0 parallel_4.6.0
#> [19] proxy_0.4-29 tibble_3.3.1
#> [21] cluster_2.1.8.2 pkgconfig_2.0.3
#> [23] Matrix_1.7-5 data.table_1.18.4
#> [25] RColorBrewer_1.1-3 S7_0.2.2
#> [27] S4Vectors_0.50.1 lifecycle_1.0.5
#> [29] compiler_4.6.0 farver_2.1.2
#> [31] stringr_1.6.0 Seqinfo_1.2.0
#> [33] codetools_0.2-20 ComplexHeatmap_2.28.0
#> [35] clue_0.3-68 class_7.3-23
#> [37] htmltools_0.5.9 sys_3.4.3
#> [39] buildtools_1.0.0 sass_0.4.10
#> [41] yaml_2.3.12 pillar_1.11.1
#> [43] crayon_1.5.3 jquerylib_0.1.4
#> [45] cachem_1.1.0 DelayedArray_0.38.2
#> [47] iterators_1.0.14 abind_1.4-8
#> [49] foreach_1.5.2 tidyselect_1.2.1
#> [51] digest_0.6.39 stringi_1.8.7
#> [53] dplyr_1.2.1 maketools_1.3.2
#> [55] fastmap_1.2.0 grid_4.6.0
#> [57] SparseArray_1.12.2 colorspace_2.1-2
#> [59] cli_3.6.6 magrittr_2.0.5
#> [61] S4Arrays_1.12.0 e1071_1.7-17
#> [63] withr_3.0.2 scales_1.4.0
#> [65] XVector_0.52.0 rmarkdown_2.31
#> [67] matrixStats_1.5.0 otel_0.2.0
#> [69] png_0.1-9 GetoptLong_1.1.1
#> [71] evaluate_1.0.5 knitr_1.51
#> [73] GenomicRanges_1.64.0 IRanges_2.46.0
#> [75] doParallel_1.0.17 rlang_1.2.0
#> [77] Rcpp_1.1.1-1.1 glue_1.8.1
#> [79] BiocManager_1.30.27 BiocGenerics_0.58.1
#> [81] jsonlite_2.0.0 R6_2.6.1
#> [83] MatrixGenerics_1.24.0