---
title: "fastreeR Vignette"
author:
-   name: "Anestis Gkanogiannis"
    email: anestis@gkanogiannis.com
package: fastreeR
output:
    BiocStyle::html_document:
        toc: true
        toc_depth: 2
        number_sections: true
vignette: >
    %\VignetteIndexEntry{fastreeR}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
    warning = FALSE,
    message = FALSE,
    collapse = TRUE,
    comment = "#>"
)
```

# About fastreeR

`fastreeR` provides fast, low-memory routines for sample-level population
genetics on whole VCF or FASTA files: pairwise distances, phylogenetic trees,
and hierarchical clustering, without loading the data into R. All heavy
lifting is performed by a multi-threaded Java backend; the R package is a thin
wrapper that streams results back.

Typical use cases:

- Exploratory clustering of samples from a large multi-sample VCF.
- Scanning a VCF in windows to spot regions where the sample topology differs
  (introgression, selection, structural variation).
- k-mer based distances and phylogenies from many FASTA sequences.

## Function overview

| Function         | Input                | Output                                       | Notes |
|------------------|----------------------|----------------------------------------------|-------|
| `vcf2istats`     | VCF file             | `data.frame` of per-sample stats             | het / hom / missing counts and percentages |
| `vcf2dist`       | VCF file             | `dist` object (or list / `data.frame` in windowed mode) | cosine-type distance; supports windowed output |
| `vcf2tree`       | VCF file             | Newick string (or `data.frame` in windowed mode) | supports streaming bootstrap replicates |
| `vcf2clusters`   | VCF file             | `list(tree, clusters)`                       | `vcf2dist` + `dist2clusters` in one call |
| `fasta2dist`     | one or more FASTA    | `dist` object                                | d2_S k-mer dissimilarity |
| `dist2tree`      | `dist` object / file | Newick string                                | hierarchical clustering wrapper |
| `dist2clusters`  | `dist` object / file | `list(tree, clusters)`                       | dynamic tree cutting |
| `tree2clusters`  | Newick string / file | cluster assignment                           | dynamic tree cutting from a tree |

## Compressed VCF input

`vcf2dist`, `vcf2tree` and related functions accept either a plain `.vcf`
file or a gzip-compressed `.vcf.gz` file transparently — gzip inputs are
decompressed to a temporary file before being handed to the Java backend, as
illustrated by the bundled `samples.vcf.gz` used throughout this vignette.

# Installation

To install `fastreeR` package:
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("fastreeR")
```

# Preparation

## Allocate RAM and load required libraries

**No more GBs of RAM!** Only the distance matrix is kept in memory:

* `4 bytes x (#samples²) x #threads`
* Example: 1000 samples with 32 threads → **\~128MB RAM**

**VCF caching is minimal:**
Only **2 VCF lines per thread** are pre-cached.

* In the simple diploid case (e.g., `0/1`, `1|0`), each genotype requires \~4 characters (8 bytes).
* For 1000 samples and 32 threads, this adds up to **\~1MB RAM**.

JVM will need at least 64-128 MB in order to efficiently run.

**Total memory footprint: just a few hundred MB, even for large datasets.**

~~You should allocate minimum 10 bytes per sample per variant of RAM for the JVM.
The more RAM you allocate, the faster the execution will be (less pauses 
for garbage collection).~~

In order to allocate RAM, a special parameter needs to be passed while JVM 
initializes. JVM parameters can be passed by setting `java.parameters` option.
The `-Xmx` parameter, followed (without space) by an integer value and a 
letter, is used to tell JVM what is the maximum amount of heap RAM that it can
use. The letter in the parameter (uppercase or lowercase), indicates RAM units.

For example, parameters `-Xmx1024m` or `-Xmx1024M` or `-Xmx1g` or `-Xmx1G`, all
allocate 1 Gigabyte or 1024 Megabytes of maximum RAM for JVM.

```{r, eval=TRUE}
options(java.parameters = "-Xmx1G")
library(fastreeR)
library(utils)
library(ape)
library(stats)
library(grid)
library(BiocFileCache)
has_ggtree <- requireNamespace("ggtree", quietly = TRUE)
if (has_ggtree) suppressPackageStartupMessages(library(ggtree))
```

## Download sample vcf file

We download, in a temporary location, a small vcf file 
from 1K project, with around 150 samples and 100k variants (SNPs and INDELs).
We use `BiocFileCache` for this retrieval process 
so that it is not repeated needlessly.
If for any reason we cannot download, we use the small sample vcf from 
`fastreeR` package.
```{r, eval=TRUE}
bfc <- BiocFileCache::BiocFileCache(ask = FALSE)
tempVcfUrl <-
    paste0("https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/",
        "1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/",
        "supporting/related_samples/",
        "ALL.chrX.shapeit2_integrated_snvindels_v2a_related_samples_27022019.",
        "GRCh38.phased.vcf.gz")
tempVcf <- BiocFileCache::bfcquery(bfc,field = "rname", "tempVcf")$rpath[1]
if(is.na(tempVcf) || is.null(tempVcf)) {
    tryCatch(
    { tempVcf <- BiocFileCache::bfcadd(bfc,"tempVcf",fpath=tempVcfUrl)[[1]]
    },
    error=function(cond) {
        tempVcf <- system.file("extdata", "samples.vcf.gz", package="fastreeR")
    },
    warning=function(cond) {
        tempVcf <- system.file("extdata", "samples.vcf.gz", package="fastreeR")
    }
    )
}
if(!file.exists(tempVcf) ||  file.size(tempVcf) == 0L) {
    tempVcf <- system.file("extdata", "samples.vcf.gz", package="fastreeR")
}
```

## Download sample fasta files

We download, in temporary location, some small bacterial genomes.
We use `BiocFileCache` for this retrieval process 
so that it is not repeated needlessly.
If for any reason we cannot download, we use the small sample fasta from 
`fastreeR` package.
```{r, eval=TRUE}
tempFastasUrls <- c(
    #Mycobacterium liflandii
    paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/",
        "Mycobacterium_liflandii/latest_assembly_versions/",
        "GCF_000026445.2_ASM2644v2/GCF_000026445.2_ASM2644v2_genomic.fna.gz"),
    #Pelobacter propionicus
    paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/",
        "Pelobacter_propionicus/latest_assembly_versions/",
        "GCF_000015045.1_ASM1504v1/GCF_000015045.1_ASM1504v1_genomic.fna.gz"),
    #Rickettsia prowazekii
    paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/",
        "Rickettsia_prowazekii/latest_assembly_versions/",
        "GCF_000022785.1_ASM2278v1/GCF_000022785.1_ASM2278v1_genomic.fna.gz"),
    #Salmonella enterica
    paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/",
        "Salmonella_enterica/reference/",
        "GCF_000006945.2_ASM694v2/GCF_000006945.2_ASM694v2_genomic.fna.gz"),
    #Staphylococcus aureus
    paste0("https://ftp.ncbi.nih.gov/genomes/refseq/bacteria/",
        "Staphylococcus_aureus/reference/",
        "GCF_000013425.1_ASM1342v1/GCF_000013425.1_ASM1342v1_genomic.fna.gz")
)
tempFastas <- list()
fallback_fasta <- system.file("extdata", "samples.fasta.gz",
                              package = "fastreeR")
use_fallback <- FALSE
for (i in seq_len(5)) {
    tempFastas[[i]] <- BiocFileCache::bfcquery(bfc, field = "rname",
                                                paste0("temp_fasta", i))$rpath[1]
    if (is.na(tempFastas[[i]])) {
        path_i <- tryCatch(
            BiocFileCache::bfcadd(bfc, paste0("temp_fasta", i),
                                  fpath = tempFastasUrls[i])[[1]],
            error   = function(cond) NA_character_,
            warning = function(cond) NA_character_
        )
        tempFastas[[i]] <- path_i
    }
    if (is.na(tempFastas[[i]]) ||
        !file.exists(tempFastas[[i]]) ||
        file.size(tempFastas[[i]]) == 0L) {
        use_fallback <- TRUE
        break
    }
}
if (use_fallback) {
    tempFastas <- fallback_fasta
}
```

# Functions on vcf files

## Sample Statistics

```{r echo=TRUE, fig.cap="Sample statistics from vcf file", fig.wide=TRUE}
myVcfIstats <- fastreeR::vcf2istats(inputFile = tempVcf)
plot(myVcfIstats[,7:9])
```

## Calculate distances from vcf

The most time consuming process is calculating distances between samples.
Assign more processors in order to speed up this operation.
```{r, eval=TRUE}
myVcfDist <- fastreeR::vcf2dist(inputFile = tempVcf, threads = 1)
```

## Histogram of distances

```{r echo=TRUE, fig.cap="Histogram of distances from vcf file", fig.wide=TRUE}
graphics::hist(myVcfDist, breaks = 100, main=NULL, 
                                xlab = "Distance", xlim = c(0,max(myVcfDist)))
```
We note two distinct groups of distances. One around of 
distance value 0.05 and the second around distance value 0.065.

## Plot tree from `fastreeR::dist2tree`

Notice that the generated tree is ultrametric.
```{r echo=TRUE, fig.cap="Tree from vcf with fastreeR", fig.wide=TRUE}
myVcfTree <- fastreeR::dist2tree(inputDist = myVcfDist)
plot(ape::read.tree(text = myVcfTree), direction = "down", cex = 0.3)
ape::add.scale.bar()
ape::axisPhylo(side = 2)
```

Of course the same can be achieved directly from the vcf file, 
without calculating distances.
```{r echo=TRUE, fig.cap="Tree from vcf with fastreeR", fig.wide=TRUE}
myVcfTree <- fastreeR::vcf2tree(inputFile = tempVcf, threads = 1)
plot(ape::read.tree(text = myVcfTree), direction = "down", cex = 0.3)
ape::add.scale.bar()
ape::axisPhylo(side = 2)
```
As expected from the histogram of distances, two groups of samples also 
emerge in the tree. The two branches, one at height around 0.055 and the second 
around height 0.065, are clearly visible.

## Bootstrapping example

You can request streaming bootstrap replicates directly from the VCF source
by setting the `bootstrap` parameter. The Java backend will perform the
requested number of replicates and encode bootstrap support values at internal
nodes in the returned Newick string. The following example shows how to call
`vcf2tree` with bootstrapping and how to inspect the node support values using
`ape`.

### Bootstrap support explained

Setting the `bootstrap` parameter instructs the Java backend to resample
variants (SNP columns) and compute replicate trees. The per-node support is calculated as
the percentage of replicates that contain the same bipartition (standard
bootstrap). These support values are encoded in the Newick string returned by
`vcf2tree` and are accessible after parsing the Newick with
`ape::read.tree()` (they typically appear in `tree$node.label`).

### Interpretation guidance

As a rule of thumb, interpret bootstrap values roughly as:

- more than 90% : strong support
- 70-89% : moderate support
- < 70%  : weak support

These are heuristic guidelines and should be used with caution.

### Reproducibility note

Bootstrap resampling is stochastic and runs inside the Java backend, so
support values will vary slightly between runs. For deterministic downstream
analyses, save the produced Newick string to disk and re-use it rather than
re-running the replicates.

### Bootstrap example (small, runnable)

The chunk below runs a modest number of replicates so it is safe for the
vignette build. For production use, set `bt_reps` to 100-1000; runtime grows
roughly linearly with the number of replicates.

```{r eval=TRUE, fig.cap="Tree from vcf with fastreeR and bootstrap support (ape)", fig.wide=TRUE}
# Small number of replicates for vignette build speed.
# For production: bt_reps <- 200   # or 500-1000
bt_reps <- 10
myBootTree <- fastreeR::vcf2tree(inputFile = tempVcf, threads = 1, bootstrap = bt_reps)

# Parse with ape and inspect bootstrap support (stored in node.label)
tr <- ape::read.tree(text = myBootTree)
# robust parse: remove anything but digits and dot, then as.numeric
raw_lbls <- tr$node.label
node_support <- if (!is.null(raw_lbls)) {
  # turn "", NA or non-numeric into NA
  s <- gsub("[^0-9.]", "", raw_lbls)
  s[s == ""] <- NA
  as.numeric(s)
} else {
  numeric(0)
}
print(head(tr$node.label))
plot(tr, direction = "down", cex = 0.3)
if (length(node_support) > 0) {
  # round and show as integers, place without frames
  ape::nodelabels(text = round(node_support, 0),
                  cex = 0.7,
                  frame = "none",
                  adj = c(-0.2, 0.5))      # adjust to move labels slightly off-node

  # optional: color labels by support
  cols <- ifelse(node_support >= 90, "black",
                 ifelse(node_support >= 70, "orange", "red"))
  ape::nodelabels(text = round(node_support, 0), cex = 0.7, frame = "none", col = cols)
  
  # colour the branch behind each internal node
  bgcols <- ifelse(node_support >= 90, "lightgreen",
                   ifelse(node_support >= 70, "khaki", "lightpink"))
  ape::nodelabels(text = round(node_support, 0), cex = 0.7, frame = "circle", bg = bgcols, col = "black")
}
```

### Optional: nicer plotting with ggtree

If you have `ggtree` installed, you can produce a more polished plot and
annotate node supports.

```{r eval=has_ggtree, fig.cap="Tree from vcf with fastreeR and bootstrap support (ggtree)", fig.wide=TRUE}
  # internal node numbers are Ntip+1 : Ntip+Nnode
  ntips <- ape::Ntip(tr)
  nints <- ape::Nnode(tr)
  internal_nodes <- (ntips + 1):(ntips + nints)

  df_nodes <- data.frame(node = internal_nodes, support = node_support)

  # Create categorical support classes for coloring and define colors
  df_nodes$category <- cut(df_nodes$support,
                           breaks = c(-Inf, 69, 89, Inf),
                           labels = c("weak", "moderate", "strong"))

  fills <- c(strong = "lightgreen", moderate = "khaki", weak = "lightpink")
  cols <- c(strong = "black", moderate = "orange", weak = "red")

  p <- ggtree(tr) + geom_tiplab(size = 2)

  # Attach node support data to the tree plotting data and add colored points + labels
  p <- p %<+% df_nodes +
      ggtree::geom_point2(aes(subset = !isTip, fill = category), shape = 21,
                          color = "black", size = 3, show.legend = FALSE) +
      ggtree::geom_text2(aes(subset = !isTip, label = round(support, 0)),
                          hjust = -0.2, size = 2, show.legend = FALSE) +
      scale_fill_manual(values = fills)

  print(p)
```

### Command-line examples

Run from the Python CLI (local JVM memory allocation via `--mem`):

```bash
python fastreeR.py VCF2TREE -i input.vcf -o output_with_boot.nwk --threads 8 --bootstrap 100 --mem 1024
```

Or using Docker:

```bash
docker run --rm -v $(pwd):/data gkanogiannis/fastreer:latest \
    VCF2TREE -i /data/input.vcf -o /data/output_with_boot.nwk --threads 8 --bootstrap 100 --mem 1024
```

These commands produce a Newick tree file with bootstrap support values encoded
at internal nodes; parse it in R with `ape::read.tree()` to inspect `node.label`.

### JVM / rJava troubleshooting tips

If you encounter `rJava` initialization errors or out-of-memory issues when
calling `vcf2tree()` from R, set the JVM heap before loading the package, for
example:

```{r eval=FALSE}
# set JVM max heap to 2GB before loading fastreeR
options(java.parameters = '-Xmx2G')
library(fastreeR)
```

Also ensure Java 11+ is installed and on your PATH. On Windows, point R to
the correct Java installation (matching 64/32-bit R) if needed.

## Windowed VCF analysis

`vcf2dist` and `vcf2tree` can emit one distance matrix (or Newick tree) per
genomic window. This is useful for scanning a VCF for regions whose sample
topology differs from the genome-wide signal — e.g. candidate introgression
blocks, selective sweeps, or structural variants. Windows are tiled (the step
defaults to the window size) and never straddle chromosome boundaries.

Two window definitions are supported, mutually exclusive:

- `windowBp` — windows of N base pairs.
- `windowVariants` — windows of N consecutive variants.

Bootstrap replicates are not available in windowed mode.

### Windowed distance matrices

With the default `longFormat = FALSE`, `vcf2dist` returns a named list of
`dist` objects, one per window, keyed `"chrom:start-end"`.

```{r eval=TRUE}
win_dists <- fastreeR::vcf2dist(
    inputFile = tempVcf,
    threads = 1,
    windowVariants = 500
)
length(win_dists)
head(names(win_dists), 3)
```

Passing `longFormat = TRUE` returns a long-form `data.frame` of all pairwise
distances across all windows, which is convenient for `data.table` /
`ggplot2` downstream work:

```{r eval=TRUE}
win_long <- fastreeR::vcf2dist(
    inputFile = tempVcf,
    threads = 1,
    windowVariants = 500,
    longFormat = TRUE
)
head(win_long)
```

### Windowed trees

`vcf2tree` in windowed mode returns a `data.frame` with one row per window
and columns `chrom, start, end, nvariants, newick`.

```{r eval=TRUE, fig.cap="Per-window trees from vcf", fig.wide=TRUE}
win_trees <- fastreeR::vcf2tree(
    inputFile = tempVcf,
    threads = 1,
    windowVariants = 500
)
head(win_trees[, c("chrom", "start", "end", "nvariants")])

n_show <- min(3, nrow(win_trees))
if (n_show > 0) {
    op <- graphics::par(mfrow = c(1, n_show), mar = c(2, 1, 2, 1))
    for (k in seq_len(n_show)) {
        tr_k <- ape::read.tree(text = win_trees$newick[k])
        plot(tr_k, direction = "down", cex = 0.3,
             main = paste0(win_trees$chrom[k], ":",
                           win_trees$start[k], "-", win_trees$end[k]))
    }
    graphics::par(op)
}
```

## Plot tree from `stats::hclust`

For comparison, we generate a tree by using `stats` package and distances
calculated by `fastreeR`.
```{r echo=TRUE, fig.cap="Tree from vcf with stats::hclust", fig.wide=TRUE}
myVcfTreeStats <- stats::hclust(myVcfDist)
plot(myVcfTreeStats, ann = FALSE, cex = 0.3)
```
Although it does not initially look very similar, because it is not ultrametric,
it is indeed quite the same tree. We note again the two groups (two branches) 
of samples and the 4 samples, possibly clones, that they show very close 
distances between them.

## Hierarchical Clustering

We can identify the two groups of samples, apparent from the hierarchical tree,
by using `dist2clusters` 
or `vcf2clusters` 
or `tree2clusters`.
By playing a little with the `cutHeight` parameter, we find that a
value of `cutHeight=0.067` cuts the tree into two branches.
The first group contains 106 samples and the second 44.
```{r, eval=TRUE}
myVcfClust <- fastreeR::dist2clusters(inputDist = myVcfDist, cutHeight = 0.067)
if (length(myVcfClust) > 1) {
    tree <- myVcfClust[[1]]
    clusters <- myVcfClust[[2]]
    tree
    clusters
}
```

# Functions on fasta files

Similar analysis we can perform when we have samples represented as 
sequences in a fasta file.

## Calculate distances from fasta

Use of the downloaded sample fasta file :
```{r, eval=TRUE}
myFastaDist <- fastreeR::fasta2dist(tempFastas, kmer = 6)
```
Or use the provided by `fastreeR` fasta file of 48 bacterial RefSeq :
```{r, eval=FALSE}
myFastaDist <- fastreeR::fasta2dist(
    system.file("extdata", "samples.fasta.gz", package="fastreeR"), kmer = 6)
```


## Histogram of distances

```{r echo=TRUE, fig.cap="Histogram of distances from fasta file",fig.wide=TRUE}
graphics::hist(myFastaDist, breaks = 100, main=NULL, 
                                xlab="Distance", xlim = c(0,max(myFastaDist)))
```

## Plot tree from `fastreeR::dist2tree`

```{r echo=TRUE, fig.cap="Tree from fasta with fastreeR", fig.wide=TRUE}
myFastaTree <- fastreeR::dist2tree(inputDist = myFastaDist)
plot(ape::read.tree(text = myFastaTree), direction = "down", cex = 0.3)
ape::add.scale.bar()
ape::axisPhylo(side = 2)
```

## Plot tree from `stats::hclust`

```{r echo=TRUE, fig.cap="Tree from fasta with stats::hclust", fig.wide=TRUE}
myFastaTreeStats <- stats::hclust(myFastaDist)
plot(myFastaTreeStats, ann = FALSE, cex = 0.3)
```

# Session Info

```{r sessionInfo}
utils::sessionInfo()
```