---
title: "UPDhmm User Guide: From Detection to Postprocessing"
author: 
  - name: "Marta Sevilla Porras"
    affiliation:
      - "Universitat Pompeu Fabra (UPF)"
      - "Centro de Investigación Biomédica en Red (CIBERER)"
    email: "marta.sevilla@upf.edu"
  - name: "Carlos Ruiz Arenas"
    affiliation: 
      - "Universidad de Navarra (UNAV)"
    email: "cruizarenas@unav.es"
date: "`r Sys.Date()`"
package: "`r BiocStyle::pkg_ver('UPDhmm')`"
output: 
  BiocStyle::html_document:
    number_sections: false
    toc: true
    fig_caption: true
    toc_float: true
vignette: >
  %\VignetteIndexEntry{UPDhmm User Guide: From Detection to Postprocessing}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(
  comment = "",
  warning = FALSE,
  message = FALSE,
  cache = FALSE
)
```


# Background
Uniparental disomy (UPD) is a rare genetic condition where an individual inherits both copies of a chromosome from one parent, as opposed to the typical inheritance of one copy from each parent.
The extent of its contribution as a causative factor in rare genetic diseases remains largely unexplored.

UPDs can lead to disease either by inheriting a carrier pathogenic mutation as homozygous from a carrier parent or by causing errors in genetic imprinting.
Nevertheless, there are currently no standardized methods available for the detection and characterization of these events.

We have developed UPDhmm, an R/Bioconductor package that provides a tool to detect, classify, and locate uniparental disomy events in trio genetic data.

# Method overview
UPDhmm relies on a Hidden Markov Model (HMM) to identify regions with UPD.

In our HMM model, the observations are the genotype combinations of the father, mother, and proband at each genomic variant.
The hidden states represent five inheritance patterns:

- Normal (Mendelian inheritance)

- Maternal isodisomy

- Paternal isodisomy

- Maternal heterodisomy

- Paternal heterodisomy

Emission probabilities were defined based on the expected inheritance patterns, and the Viterbi algorithm was used to infer the most likely sequence of hidden states along the genome.

The final output consists of genomic segments whose inferred state deviates from Mendelian inheritance — i.e., likely UPD regions.

# Set-up the packages
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("UPDhmm")
BiocManager::install("regioneR")
BiocManager::install("karyoploteR")

```


# Loading libraries
```{r }
library(UPDhmm)
library(dplyr)
```

# Quick start
The input required by the package is a multisample VCF file containing genotypes for a trio (proband, mother, and father).

Requirements:

**Only biallelic variants are supported.**

**Accepted genotype (GT) formats: 0/0, 0/1, 1/0, 1/1 (or their phased equivalents 0|0, 0|1, 1|0, 1|1).**

For details on how to preprocess VCF files and prepare them in the appropriate format, please refer to the “Preprocessing and Input Preparation” vignette included with the package.


# Detecting and postprocessing UPD events

In this tutorial, we will perform a complete analysis using two trio VCF files included with the UPDhmm package.

By following the steps below, you will learn how to:

1. Load and preprocess the VCF files.

2. Detect UPD events using the HMM-based approach.

3. Apply quality filters to the detected events.

4. Collapse and summarize results per individual.

5. Perform postprocessing analyses to identify potential true UPDs by detecting recurrent or artifact-prone genomic regions across samples.


# 1. Load and preprocess the VCFs

```{r}
library(VariantAnnotation)
library(regioneR)
library(karyoploteR)

#Example files
files <- c(
  system.file(package = "UPDhmm", "extdata", "sample1.vcf.gz"),
  system.file(package = "UPDhmm", "extdata", "sample2.vcf.gz")
)


# Define the trio structure
trio_list <- list(
  list(
    proband = "NA19675",
    mother  = "NA19678",
    father  = "NA19679"
  ),
  list(
    proband = "NA19685",
    mother  = "NA19688",
    father  = "NA19689"
  )
)

# Read and preprocess each VCF
vcf_list <- mapply(function(file, trio) {
  vcf <- readVcf(file)
  vcfCheck(vcf,
           proband = trio$proband,
           mother  = trio$mother,
           father  = trio$father)
}, 
files, trio_list, SIMPLIFY = FALSE)


```

Each processed VCF is now ready for UPD detection.

# 2. Detect UPD events using the HMM-based approach

The principal function of `UPDhmm` package, `calculateEvents()`, is the central
function for identifying Uniparental Disomy (UPD) events. It takes the output
from the previous `vcfCheck()` function and systematically analyzes genomic
data, splitting the VCF into chromosomes and applying the Viterbi algorithm.


`calculateEvents()` function encompasses a serie of subprocesses for
identifying Uniparental disomies (UPDs):

(1) Split vcf into chromosomes
(2) Apply Viterbi algorithm to every chromosome
(3) Convert Viterbi results into contiguous genomic blocks by collapsing consecutive variants sharing the same inferred state and compute block-level statistics. Optionally, calculate depth-based ratios for each block.
(4) Calculate block-level statistics including Mendelian error counts.

In this example, we run the function in serial mode, but `calculateEvents()` now includes an optional argument, *BPPARAM*, which controls parallel execution settings and is passed to bplapply. By default, it uses BiocParallel::SerialParam() (serial execution). To enable parallelization, provide a BiocParallel backend, for example:


```{r}
events_list <- lapply(vcf_list, calculateEvents, add_ratios = TRUE)

head(events_list[[1]])

```
Note: when running under R CMD check or Bioconductor build systems, the number of workers may be automatically limited (usually less than 2).


## Results description

The `calculateEvents` function returns a `data.frame` containing all  
detected events in the provided trio. If no events are found, the function  
will return an empty `data.frame`.  

By default, this function only calculates the number of Mendelian errors as a block-level statistic. Sequencing depth ratios can be included by enabling the optional parameter `add_ratios = TRUE`, which defaults to `FALSE`.


| Column name         | Description                                                                                                                                                                                                                          |
|--------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `ID`               | Identifier of the proband (child sample)                                                                                                                                                                                              |
| `chromosome`         | Chromosome name                                                                                                                                                                                                                       |
| `start`            | Start position of the block (genomic coordinate)                                                                                                                                                                                      |
| `end`              | End position of the block                                                                                                                                                                                                             |
| `group`            | Predicted inheritance state (`iso_mat`, `iso_fat`, `het_mat`, `het_fat`)                                                                                                                                                              |
| `n_snps`           | Number of informative SNPs within the event                                                                                                                                                                                           |
| `ratio_proband`    | *Optional*. Ratio between the average sequencing depth inside the event and the depth outside it for the proband. A value close to 1 indicates balanced coverage; significant deviations may suggest copy-number changes.             |
| `ratio_mother`     | *Optional*. Same ratio computed for the maternal sample                                                                                                                                                                               |
| `ratio_father`     | *Optional*. Same ratio computed for the paternal sample                                                                                                                                                                               |
| `n_mendelian_error`| Number of Mendelian inconsistencies detected within the region                                                                                                                                                                        |

In this example, we run the function in serial mode, but `calculateEvents()` now includes an optional argument `BPPARAM` that controls parallel execution.

| Argument  | Description                                                                                                                                                                                                                                |
|-----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `BPPARAM` | Parallelization settings, passed to **`BiocParallel::bplapply()`**. By default, the function uses `BiocParallel::SerialParam()` (serial execution). To enable parallel computation, a BiocParallel backend can be specified — for example: |


```{r, eval=FALSE}

library(BiocParallel)
BPPARAM <- MulticoreParam(workers = min(2, parallel::detectCores()))
events_list <- lapply(vcf_list, calculateEvents, BPPARAM = BPPARAM)

```


# 3. Collapse and summarize results per individual

The output of `calculateEvents` needs some pre-processing before the interpretation of the results. On one hand, `calculateEvents` can detect small regions which are likely to be artifacts, instead of real UPD events. On the other, large UPD events, i.e. spanning across a whole chromosome, can be detected as multiple contiguous events. 

The function `collapseEvents()` handles both situations. First,`collapseEvents()` filters the detected events using the following parameters:  

- n_mendelian_error (default > 2): minimum number of Mendelian errors supporting the UPD event. Requiring multiple Mendelian inconsistencies reduces the likelihood that the signal is driven by sporadic genotyping errors.

- size (default = 500,000): minimum length (in base pairs) of the event to be considered, aimed at excluding short spurious segments frequently observed as false positives.

These thresholds were defined based on the evaluation conducted in the publication in which the package was introduced and validated, including simulations of UPD events of different sizes and application to a real trio-based cohort, with the aim of removing small events and excluding those likely driven by technical noise.

Second, `collapseEvents()` merges UPD events with the same predicted type occurring on the same chromosome into a single collapsed event. This step ensures that extended UPD regions initially detected as multiple adjacent segments are represented as a single biologically meaningful event per individual and chromosome.


```{r}
collapsed_list <- lapply(events_list, collapseEvents)
collapsed_all <- do.call(rbind.data.frame, c(collapsed_list, make.row.names = FALSE))
head(collapsed_all)
```
The resulting table contains one row per individual, chromosome, and UPD type, with the following columns:

| Column name             | Description                                                                 |
|-------------------------|-----------------------------------------------------------------------------|
| `ID`                    | Identifier of the proband                                                   |
| `chromosome`            | Chromosome                                                                  |
| `start` / `end`         | Genomic span of collapsed block                                             |
| `group`                 | Predicted state                                                             |
| `n_events`              | Number of events collapsed                                                  |
| `total_mendelian_error` | Sum of Mendelian errors                                                     |
| `total_size`            | Total genomic span covered                                                  |
| `total_snps`            | Total SNPs in the overlapping events                                        |
| `prop_covered`          | Proportion of the region covered by events                                   |
| `collapsed_events`      | Comma-separated list of merged events                                       |
| `ratio_proband`         | Weighted mean ratio of the proband across the collapsed events              |
| `ratio_mother`          | Weighted mean ratio of the mother across the collapsed events               |
| `ratio_father`          | Weighted mean ratio of the father across the collapsed events               |


The read-depth ratios are reported as supportive metrics to facilitate downstream interpretation. Values close to one indicate balanced coverage across trio members, whereas strong deviations may suggest copy-number gains or losses rather than true UPD events. Importantly, these ratios are not used as hard filtering criteria at this stage, but are provided to aid quality control and biological interpretation.

## 4. Perform postprocessing analyses

In this final step, we perform postprocessing analyses to refine the detected UPD events and highlight potentially true or recurrent regions across samples. This helps to distinguish genuine biological UPDs from artifact-prone genomic regions (e.g., repetitive or complex loci that may systematically produce false positives).

To explore whether similar events recur across individuals, UPDhmm provides two complementary functions:

- `identifyRecurrentRegions()` — Identifies recurrent genomic regions shared across multiple samples by clustering overlapping genomic intervals.  
  The function first filters events based on a Mendelian error threshold, then groups overlapping regions using an agglomerative hierarchical clustering approach applied chromosome-wise.  
  Clusters supported by a minimum number of distinct individuals are collapsed into recurrent genomic regions.  

  It returns a `GRanges` object containing the recurrent regions, along with metadata describing sample support and the individual events contributing to each region.

  **Arguments:**
  - `df` — The input data.frame containing the genomic regions to evaluate.  
    This table must include columns describing chromosome (`chromosome`), genomic coordinates (`start`, `end`), sample identifiers, and a column with Mendelian error counts (`n_mendelian_error` or `total_mendelian_error`).

  **Filtering arguments:**
  - `ID_col` — The name of the column containing sample identifiers.  
    It allows the function to determine how many unique individuals support each overlapping genomic region.
  - `error_threshold` (default = 100) — Maximum number of Mendelian errors allowed for a region to be considered in the recurrence analysis.  
    Regions exceeding this threshold are excluded from recurrence counting. These regions, which often span broad genomic segments, are not considered in the recurrence calculation since they could overlap with many smaller events and bias the results. Instead, they are retained in the output as non-recurrent events, under the assumption that they may represent genuine large-scale alterations rather than noise.
  - `min_support` (default = 3) — Minimum number of distinct samples required for a cluster of overlapping events to be considered recurrent.  
    This parameter can be adjusted depending on the cohort size — smaller values detect more shared regions, while higher thresholds focus on the most consistently recurrent events.
  - `max_dist` (default = 0.3) — Maximum distance threshold used to cut the hierarchical clustering dendrogram.  
    Distance between two events is defined as:  
    \deqn{1 - (overlap\_width / max(event widths))}  
    Smaller values enforce stricter overlap similarity.
  - `linkage` (default = "complete") — Linkage method used in agglomerative hierarchical clustering.  
    With *complete* linkage, the distance between two clusters is defined as the maximum distance between all pairs of events across clusters.

- `markRecurrentRegions()` — Annotates each event in a results table (from `calculateEvents()` or `collapseEvents()`) as recurrent or non-recurrent, depending on whether it sufficiently overlaps any region defined as recurrent.  
  Recurrence is determined by evaluating the fraction of the event length that overlaps a recurrent genomic region. Events are classified as recurrent only if this overlap fraction exceeds a user-defined threshold.

  The function returns the same data.frame provided as input, but with two additional columns:
  - `Recurrent` — Categorical variable indicating whether the event overlaps a recurrent region ("Yes" or "No").
  - `n_samples` — Number of unique samples supporting the recurrent region (as determined in `identifyRecurrentRegions()`).

  These annotations facilitate downstream filtering of events, allowing users to retain only non-recurrent regions, which are more likely to represent true UPD candidates rather than technical artifacts or common genomic polymorphisms.

In summary, these two functions can be combined to systematically identify and flag recurrent genomic regions, facilitating the filtering of potential technical artifacts or recurrent UPDs.

```{r}
recurrentRegions <- identifyRecurrentRegions(
  df = collapsed_all,
  ID_col = "ID",
  error_threshold = 100,
  min_support = 2,
  max_dist = 0.97
)

head(recurrentRegions)
annotatedEvents <- markRecurrentRegions(
  df = collapsed_all,
  recurrent_regions = recurrentRegions
)

head(annotatedEvents)
```

Alternatively, recurrent regions can be provided directly as an external BED file (e.g., `SFARI.bed`), 
converted into a `GRanges` object, and used in the same way. 

The `SFARI.bed` file was generated from our analysis of the **SFARI cohort**, 
where recurrent genomic intervals were identified across trios using `UPDhmm`. 
These regions represent loci frequently showing **artifactual or low-confidence UPD signals** 
in the SFARI dataset and can be used as a **mask** to improve specificity in other analyses.

```{r}

library(regioneR)
bed <- system.file(package = "UPDhmm", "extdata", "SFARI.bed")
bed_df <- read.table(
    bed,
    header = TRUE
)

bed_gr <- toGRanges(bed_df)

annotatedEvents <- markRecurrentRegions(
  df = collapsed_all,
  recurrent_regions = bed_gr
)

head(annotatedEvents)
```

Regions annotated as non-recurrent and passing all quality filters (low Mendelian error count and consistent read depth ratios) are the most likely true UPD candidates.
Conversely, highly recurrent or error-prone regions should be interpreted with caution, as they may reflect systematic technical artifacts or structural genomic complexity.


# 5. Visualizing UPD events across the genome

After identifying and postprocessing UPD events, a final and highly informative step is to visualize their genomic distribution.
Visualization helps to quickly assess the size, chromosomal location, and type (maternal or paternal; hetero- or isodisomy) of the detected events.

The `karyoploteR` package provides elegant genome-wide visualization tools that can be integrated directly with the UPDhmm results.
The following custom plotting function, `plotUPDKp()`, enables the visualization of UPD segments detected in a trio across all autosomes.
To visualize events across the genome, you can use the karyoploteR package:

```{r}
library(karyoploteR)
library(regioneR)

# Function to visualize UPD events across the genome
plotUPDKp <- function(updEvents) {
  #--------------------------------------------------------------
  # 1. Detect coordinate columns
  #--------------------------------------------------------------
  if (all(c("start", "end") %in% colnames(updEvents))) {
    start_col <- "start"
    end_col <- "end"
  } else if (all(c("min_start", "max_end") %in% colnames(updEvents))) {
    start_col <- "min_start"
    end_col <- "max_end"
  } else {
    stop("Input must contain either (start, end) or (min_start, max_end) columns.")
  }

  #--------------------------------------------------------------
  # 2. Ensure chromosome naming format (e.g. "chr3")
  #--------------------------------------------------------------
  updEvents$seqnames <- ifelse(grepl("^chr", updEvents$chromosome),
                               updEvents$seqnames,
                               paste0("chr", updEvents$chromosome))

  #--------------------------------------------------------------
  # 3. Helper function to safely create GRanges only if events exist
  #--------------------------------------------------------------
  to_gr_safe <- function(df, grp) {
    subset_df <- subset(df, group == grp)
    if (nrow(subset_df) > 0) {
      toGRanges(subset_df[, c("seqnames", start_col, end_col)])
    } else {
      NULL
    }
  }

  #--------------------------------------------------------------
  # 4. Separate UPD event types
  #--------------------------------------------------------------
  het_fat <- to_gr_safe(updEvents, "het_fat")
  het_mat <- to_gr_safe(updEvents, "het_mat")
  iso_fat <- to_gr_safe(updEvents, "iso_fat")
  iso_mat <- to_gr_safe(updEvents, "iso_mat")

  #--------------------------------------------------------------
  # 5. Create the genome ideogram
  #--------------------------------------------------------------
  kp <- plotKaryotype(genome = "hg19")

  #--------------------------------------------------------------
  # 6. Plot regions by inheritance type (if any exist)
  #--------------------------------------------------------------
  if (!is.null(het_fat)) kpPlotRegions(kp, het_fat, col = "#AAF593")
  if (!is.null(het_mat)) kpPlotRegions(kp, het_mat, col = "#FFB6C1")
  if (!is.null(iso_fat)) kpPlotRegions(kp, iso_fat, col = "#A6E5FC")
  if (!is.null(iso_mat)) kpPlotRegions(kp, iso_mat, col = "#E9B864")

  #--------------------------------------------------------------
  # 7. Add legend
  #--------------------------------------------------------------
  legend("topright",
         legend = c("Het_Fat", "Het_Mat", "Iso_Fat", "Iso_Mat"),
         fill = c("#AAF593", "#FFB6C1", "#A6E5FC", "#E9B864"))
}
# Example: visualize UPD events for the first trio
plotUPDKp(annotatedEvents)

```


# Summary

In summary, the `UPDhmm` package provides a complete workflow for the detection, characterization, and visualization of uniparental disomy (UPD) events from trio-based sequencing data.
Starting from raw VCF files, users can preprocess, model inheritance patterns through a Hidden Markov Model, and identify genomic regions consistent with UPD.
Postprocessing steps, including the collapsing of events and the identification of recurrent or artifact-prone regions, enable a refined interpretation of results and the prioritization of true UPD candidates.
Finally, genome-wide visualization with karyoploteR offers an intuitive way to inspect and report detected UPDs.

Altogether, `UPDhmm facilitates a standardized and reproducible approach for the study of UPD across large cohorts, integrating statistical detection, quality control, and interpretative visualization into a single coherent framework.


# Session Info

```{r}
sessionInfo()
```