---
title: "Introduction to rhinotypeR"
author:
- name: "Martha Luka"
email: "marthaluka20@gmail.com"
- name: "Ruth Nanjala"
- name: "Wafaa Rashed"
- name: "Winfred Gatua"
- name: "Olaitan Awe"
output: BiocStyle::html_document
vignette: >
%\VignetteIndexEntry{Introduction to rhinotypeR}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include=FALSE}
options(repos = c(CRAN = "https://cran.rstudio.com/"))
```
## Background
The `rhinotypeR` package is designed to automate the genotyping of rhinoviruses using the VP4/2 genomic region. Having worked on rhinoviruses for a few years, I noticed that assigning genotypes after sequencing was particularly laborious, and needed several manual interventions. We, therefore, developed this package to address this challenge by streamlining the process.
It provides:
- Distance computation (`pairwiseDistances()`, `overallMeanDistance()`)
- Genotype assignment (`assignTypes()`)
- Visualization (`plotFrequency()`, `plotDistances()`, `plotTree()`, `SNPeek()`, `plotAA()`)
- Data helpers (`alignToRefs()` for optional alignment in R; `deleteMissingDataSites()` for global gap deletion)
- Prototype references (`getPrototypeSeqs()` if you prefer to align externally)
This vignette walks through two practical workflows:
**A. Align externally** (Your tool of choice) with `getPrototypeSeqs()` → merge with new data, align with tool of choice → import → `assignTypes()`.
**B. Align in R** (with packaged prototypes) using `alignToRefs()` → `assignTypes()`.
**Tip:** Genotype assignment relies on distances to prototype strains and requires that those prototypes are present in the alignment you pass to `assignTypes()`. The `alignToRefs()` helper makes this easy by appending packaged prototypes and aligning jointly.
## Installing the package
You can install rhinotypeR from Bioconductor using:
```{r, eval=FALSE}
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("rhinotypeR")
```
## Load rhinotypeR
```{r}
# Load package
library(rhinotypeR)
# Load example data
data(rhinovirusVP4, package = "rhinotypeR")
```
## Input expectations
- **Format:** FASTA (DNA for VP4/2; AA only for `plotAA()`)
- **Region:** VP4/2; sequences should be homologous
- **Length/QC:** We recommend ≥350 bp (typical ~420 bp) and careful trimming to the target region
- **Alignment:** Required before genotype assignment. Do it either in R (`alignToRefs()`) or outside R and then import (e.g., with `Biostrings::readDNAStringSet()`)
### Option A: Align externally
Prefer your own alignment tool? Use `getPrototypeSeqs()` to export the packaged prototypes to your local device. These should then be combined with the newly generated sequences, aligned using suitable software, curated and imported into R. There are a multitude of existing alignment software including [MAFFT](https://mafft.cbrc.jp/alignment/server/) and [MUSCLE](https://www.ebi.ac.uk/jdispatcher/).
For example, to download to the Desktop directory:
```{r, eval=FALSE}
getPrototypeSeqs("~/Desktop")
# You will get "~/Desktop/RVRefs.fasta"
# 2) Combine RVRefs.fasta with your new sequences and align in your tool (e.g., MAFFT).
# 3) Manually curate (trim to VP4/2 span, resolve poor columns), then save as FASTA.
```
Use the Biostrings package to read the curated alignment FASTA file into R:
```{r}
aln_curated <- Biostrings::readDNAStringSet(
system.file("extdata", "input_aln.fasta", package = "rhinotypeR")
)
```
### Option B: Align in R
Use this when you want a fully scripted path in R. `alignToRefs()` appends packaged references and runs `msa::msa()` using ClustalW/ClustalOmega/MUSCLE. It can optionally crop the final alignment to the non-gap span of a chosen prototype (`trimToRef = TRUE`, default; anchor with `refName`).
```{r eval=FALSE}
# Use package dataset: rhinovirusVP4
# Align user sequences + prototypes in R (choose method)
aln <- alignToRefs(
seqData = rhinovirusVP4,
method = "ClustalW", # or "ClustalOmega", "Muscle"
trimToRef = TRUE, # crop to reference non-gap span
refName = "JN855971.1_A107" # default anchor; change if desired
)
aln
```
**Why trimming and gap handling matter:**
- `trimToRef = TRUE` crops the alignment to the exact span covered by a chosen prototype — helpful when user reads spill over the target region
- `deleteGapsGlobally()` performs global deletion (drop any column with a gap in any sequence). This is different from pairwise deletion used within distance calculations. Use with care — this modifies the alignment
## Assign types
```{r}
# Input A
res <- assignTypes(aln_curated, model = "IUPAC", deleteGapsGlobally = FALSE, threshold = 0.105)
head(res)
```
```{r eval=FALSE}
# OR similarly for input B
res <- assignTypes(aln, model = "IUPAC", deleteGapsGlobally = FALSE, threshold = 0.105)
head(res)
```
**Assigning types: rules and outputs**
`assignTypes()` compares each query to prototypes and returns:
- `query` (sequence ID)
- `assignedType` (or "unassigned")
- `distance` (to nearest prototype — always reported, even if unassigned)
- `reference` (prototype accession/name used)
**Thresholds:** We default to `threshold = 0.105` (~10.5%) in line with common practice for VP4/2 (A/C) and close to B; adjust per your analysis or species.
## Plot results
```{r}
plotFrequency(res)
```
## Distances and summaries
```{r}
# Distances among all sequences
d <- pairwiseDistances(
fastaData = rhinovirusVP4,
model = "IUPAC",
deleteGapsGlobally = FALSE # set TRUE to apply global deletion inside the function
)
```
The distance matrix looks like:
```{r, echo=FALSE}
d[1:5, 1:7]
```
To visualize the pairwise distances using a heatmap or a phylogenetic tree:
```{r}
# Mean distance (overall diversity)
overallMeanDistance(rhinovirusVP4, model = "IUPAC")
# Visual summaries
## Heatmap
plotDistances(d)
## Simple tree (from distances)
sampled_distances <- d[1:30, 1:30]
plotTree(sampled_distances, hang = -1, cex = 0.6,
main = "A simple tree", xlab = "", ylab = "Genetic distance")
```
## SNP and amino acid views
The `SNPeek()` function visualizes single nucleotide polymorphisms (SNPs) in the sequences, with a selected sequence acting as the reference. To specify the reference sequence, move it to the bottom of the alignment before importing into R. Substitutions are color-coded by nucleotide:
- A = green
- T = red
- C = blue
- G = yellow
`SNPeek()` (nucleotide) and `plotAA()` (protein) support zooming and highlighting. To show all sequence names, increase the plotting device height (RStudio plot pane or `png(height=...)`).
```{r}
# SNP view (nucleotide)
SNPeek(rhinovirusVP4)
# AA view (requires an AAStringSet)
## Option 1 -- read an aligned amino acid sequence
aa_seq <- Biostrings::readAAStringSet(system.file("extdata", "test_translated.fasta", package = "rhinotypeR"))
plotAA(aa_seq)
```
```{r, eval=FALSE}
## Option 2 -- translating DNA
# Remove gaps before translate()
aln_no_gaps <- Biostrings::DNAStringSet(
gsub("-", "", as.character(rhinovirusVP4))
)
#translate
aa <- Biostrings::translate(aln_no_gaps)
aa_aln <- msa::msa(aa) # align as plotAA expects equal width
aa_aln <- as(aa_aln, "AAStringSet") # transform to AAStringSet
plotAA(aa_aln)
```
**Tip:** `show_only_highlighted = TRUE` focuses on chosen rows while keeping mismatch context relative to the reference.
## Quality and troubleshooting
* **Prototypes missing?** `assignTypes()` will error if prototypes are not present in your alignment. Use Option A (`alignToRefs()`) or Option B (`getPrototypeSeqs()` + external align).
* **Misaligned region?** Use `trimToRef = TRUE` and select an appropriate `refName` that spans your intended VP4/2 region.
* **Gaps distort distances?** Consider `deleteMissingDataSites()` (global deletion), or use a pairwise deletion model where appropriate in distance routines.
* **SNP/AA plots truncated?** Increase device height or use the `highlight_*` options and `show_only_highlighted = TRUE`.
## Conclusion
The `rhinotypeR` package simplifies the process of genotyping rhinoviruses and analyzing their genetic data. By automating various steps and providing visualization tools, it enhances the efficiency and accuracy of rhinovirus epidemiological studies.
## Session info
```{r}
sessionInfo()
```