---
title: "_igblastr_ overview"
author:
- name: Hervé Pagès
  affiliation: Fred Hutch Cancer Center, Seattle, WA
- name: Kellie MacPhee
  affiliation: Fred Hutch Cancer Center, Seattle, WA
- name: Ollivier Hyrien
  affiliation: Fred Hutch Cancer Center, Seattle, WA
date: "Compiled `r BiocStyle::doc_date()`; Modified 9 June 2026"
package: igblastr
vignette: |
  %\VignetteIndexEntry{igblastr overview}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
output:
  BiocStyle::html_document:
    toc_float: true
---
```{r setup, include=FALSE}
library(BiocStyle)
library(igblastr)
reset_germline_dbs()
```
# Introduction
The Immunoglobulin Basic Local Alignment Search Tool (IgBLAST) is a
specialized bioinformatics tool developed by the National Center for
Biotechnology Information (NCBI) for the analysis of B-cell receptor (BCR)
and T-cell receptor (TCR) sequences (Ye et al., 2013). IgBLAST performs
sequence alignment and annotation, with key outputs including germline
V, D, and J gene assignments; characterization of somatic hypermutations
introduced during affinity maturation; identification of
complementarity-determining regions (CDR1–CDR3), framework
regions (FWR1–FWR4), and isotype; and both nucleotide and protein-level
alignments. These outputs form the basis for many downstream analyses
of BCR and TCR repertoires, including clonotype identification, lineage
tracing, and repertoire feature characterization.
`r Biocpkg("igblastr")` is an R/Bioconductor package that provides functions
to conveniently install and use a local IgBLAST installation from within
R while offering additional BCR or TCR sequence annotation.
The package is designed to make it as easy as possible to run IgBLAST in R
by streamlining the installation of both IgBLAST and its associated germline
databases. In particular, these installations can be performed with a single
function call, do not require root access, and can persist across R sessions.
The main function in the package is `igblastn()`, a wrapper to the `igblastn`
_standalone executable_ included in IgBLAST. In addition to `igblastn()`,
the package provides several other features, including:
- A function (`install_igblast()`) to conveniently download and
install a pre-compiled IgBLAST from NCBI.
- A set of built-in germline databases from
[OGRDB](https://ogrdb.airr-community.org/), the AIRR Community’s
Open Germline Receptor Database.
- Functions to download and install germline databases from the
[IMGT/V-QUEST download site](https://www.imgt.org/download/V-QUEST/),
and to configure them for use with `igblastn()`.
- The `install_custom_germline_db()` function to install a germline database
from user-supplied germline gene allele sequences.
- Functions to compute, access, and manipulate IgBLAST _internal data_
and _auxiliary data_.
- A set of built-in constant region (C-region) databases from IMGT/V-QUEST.
- Utility functions to parse the results returned by `igblastn()`.
- A simple tool to visualize the annotated sequences in a browser.
- The `percent_mutation()` function to compute the percent mutation in
the V, D, J segments of a set of BCR or TCR sequences.
- Utility functions to download data from
[OAS](https://opig.stats.ox.ac.uk/webapps/oas/), the Observed
Antibody Space database, and prepare it for use with IgBLAST.
- Etc.
Useful links:
- IgBLAST is described at
- IgBLAST web interface:
- OGRDB: https://ogrdb.airr-community.org/
- IMGT/V-QUEST download site: https://www.imgt.org/download/V-QUEST/
Please use to report bugs,
provide feedback, request features (etc) about `r Biocpkg("igblastr")`.
# Install and load the package
Like any Bioconductor package, `r Biocpkg("igblastr")` should be installed
with `BiocManager::install()`:
```{r, eval=FALSE}
if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("igblastr")
```
`BiocManager::install()` will take care of installing the package dependencies
that are missing.
To load `r Biocpkg("igblastr")` into your R session, run:
```{r, message=FALSE}
library(igblastr)
```
# Install and check IgBLAST
## Install IgBLAST
If IgBLAST is already installed on your system, you can tell
`r Biocpkg("igblastr")` to use it by setting the environment variable
`IGBLAST_ROOT` to the path of your IgBLAST installation.
See `?IGBLAST_ROOT` for more information.
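As a minimal sketch, the variable can be set from within R before the package looks for IgBLAST. The path below is a hypothetical placeholder, not a real default location, and `?IGBLAST_ROOT` remains the authoritative reference on how and when the variable is read:

```{r, eval=FALSE}
## Hypothetical path: replace with the location of your own IgBLAST installation
Sys.setenv(IGBLAST_ROOT="/path/to/ncbi-igblast")
## Confirm that the variable is visible to the current R session
Sys.getenv("IGBLAST_ROOT")
```

Setting the variable in your `~/.Renviron` file instead would make it persist across R sessions.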
Otherwise, simply call `install_igblast()` to install the latest
version of IgBLAST. As of March 2025, NCBI provides pre-compiled
versions of IgBLAST for Linux, Windows, Intel Mac, and Apple Silicon Mac.
`install_igblast()` will automatically download the appropriate pre-compiled
version of IgBLAST for your platform from the NCBI FTP site, and install it
in a location that will be remembered across R sessions.
```{r}
if (!has_igblast())
    install_igblast()
```
See `?install_igblast` for more information.
Note that we use `has_igblast()` to avoid reinstalling IgBLAST if
`r Biocpkg("igblastr")` already has access to a working IgBLAST installation.
## Check IgBLAST
To display basic information about the IgBLAST installation used
by `r Biocpkg("igblastr")`, run:
```{r}
igblast_info()
```
# Install and select a germline database
The `r Biocpkg("igblastr")` package includes a FASTA file containing
8,437 paired heavy- and light-chain human antibody sequences (16,874
individual sequences) retrieved from OAS. These sequences will serve as
our _query sequences_, that is, the immunoglobulin (Ig) sequences that
will be analyzed in this vignette.
Before we can do so, the `igblastn` _standalone executable_ included
in IgBLAST -- and, by extension, our `igblastn()` function -- requires
access to human germline V, D, and J gene allele sequences. To be used
by `igblastn`, these sequences must first be organized into three separate
BLAST databases: the V-region, D-region, and J-region databases. Collectively,
we will refer to these three databases as the _germline database_.
## Built-in germline databases
The `r Biocpkg("igblastr")` package includes a set of built-in germline
databases for human and mouse that were obtained from the OGRDB database
(AIRR community). These can be listed with:
```{r}
list_germline_dbs(builtin.only=TRUE)
```
The last part of the database name indicates the date of the download in
YYYYMM format.
The `intdata` and `auxdata` columns indicate whether a database includes
its own annotations for the germline V alleles (`intdata` column)
and germline J alleles (`auxdata` column). More on this in the
"[Internal data and auxiliary data](#internal-data-and-auxiliary-data)"
section below.
If our query sequences are from humans, or one of the mouse strains listed
above, or rhesus monkey, then we can already select the appropriate database
with `use_germline_db()` and skip the subsection below.
## Install additional germline databases
AIRR/OGRDB and IMGT/V-QUEST are two providers of germline databases
that can be used with IgBLAST. If, for any reason, none of the built-in
AIRR/OGRDB germline databases is suitable (e.g. your query sequences
are not from human, mouse, or rhesus monkey), then you can
use `install_IMGT_germline_db()` to install additional germline
databases. Below, we show how to install the latest human germline
database from IMGT/V-QUEST.
### Install germline database from IMGT/V-QUEST
First, we list the most recent IMGT/V-QUEST releases:
```{r}
head(list_IMGT_releases())
```
The organisms included in release 202614-2 are:
```{r}
list_IMGT_organisms("202614-2")
```
Next, we install the human germline database from the latest IMGT/V-QUEST
release:
```{r, message=FALSE}
install_IMGT_germline_db("202614-2", organism="Homo sapiens")
```
See `?install_IMGT_germline_db` for more information.
### Install germline database from user-supplied gene allele sequences
The `r Biocpkg("igblastr")` package provides the `install_custom_germline_db()`
function to create and install a germline database from a set of
user-supplied FASTA files containing germline V/D/J gene allele sequences.
See `?install_custom_germline_db` for more information.
## Select the germline database to use with igblastn()
Finally, we select the newly installed germline database as the germline
database to use with `igblastn()`:
```{r}
use_germline_db("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL")
```
See `?use_germline_db` for more information.
To see the full list of cached germline dbs:
```{r}
list_germline_dbs()
```
Note that the asterisk (`*`) displayed at the far right of the output
from `list_germline_dbs()` indicates the currently selected germline
database (you may need to scroll horizontally to see the asterisk).
See `?list_germline_dbs` for more information.
# Internal data and auxiliary data
Note that the newly installed germline database
(`IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL`) includes
its own _internal data_ and _auxiliary data_, as reported in
the `intdata` and `auxdata` columns.
## What are the _internal data_ and _auxiliary data_?
The `intdata` and `auxdata` columns indicate whether a database
includes annotations for the germline V alleles (`intdata` column) and
germline J alleles (`auxdata` column). These annotations consist of
reporting the coding frame start position and FWR/CDR boundaries on
the V and J sequences. Note that the FWR/CDR boundaries on the V sequences
are also known as the V gene delineations. For the J sequences the only
FWR/CDR boundary is the CDR3/FWR4 boundary.
IgBLAST needs access to this information in order to annotate the BCR
or TCR sequences that it analyzes.
In IgBLAST's terminology, the annotations for the germline V and J
alleles are called _internal data_ and _auxiliary data_, respectively.
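To make this concrete, here is a toy illustration of how FWR/CDR boundaries delineate a V sequence. The sequence and the coordinates below are made up for the example and are not taken from any real germline database:

```{r}
## Toy 80-nt "V sequence" (made up for illustration)
toy_v_seq <- strrep("ACGT", 20)
## Made-up, contiguous region boundaries (1-based, inclusive)
toy_bounds <- data.frame(
    region=c("FWR1", "CDR1", "FWR2", "CDR2", "FWR3"),
    start=c(1, 26, 38, 55, 64),
    end=c(25, 37, 54, 63, 80)
)
## Cut the sequence into regions using the boundaries
toy_regions <- substring(toy_v_seq, toy_bounds$start, toy_bounds$end)
names(toy_regions) <- toy_bounds$region
## Length of each region, in nucleotides
nchar(toy_regions)
```

Because the boundaries are contiguous, concatenating the regions gives back the original sequence.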
## IgBLAST-provided internal/auxiliary data
IgBLAST provides _internal data_ and _auxiliary data_ for 5 organisms:
human, mouse, rabbit, rat, and rhesus monkey. We'll sometimes refer to
these organisms as IgBLAST organisms.
This data can be accessed with `load_intdata()` and `load_auxdata()`:
```{r}
head(load_intdata("human"))
head(load_auxdata("human"))
```
See `?load_intdata` and `?load_auxdata` for more information.
However, this data doesn't get updated on a regular basis by the NCBI
folks, and can be incomplete or out-of-sync with the germline V and J
alleles provided by AIRR/OGRDB or IMGT/V-QUEST.
## igblastr-generated internal/auxiliary data
Unlike the _internal data_ and _auxiliary data_ provided by IgBLAST, the
_internal data_ and _auxiliary data_ included in a germline database is
generated by `r Biocpkg("igblastr")`, and is guaranteed to be complete
and in sync with the germline V and J alleles in the database.
This data can also be accessed with `load_intdata()` and `load_auxdata()`:
```{r}
head(load_intdata("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))
head(load_auxdata("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))
```
See `?load_intdata` and `?load_auxdata` for more information.
### How is this data generated?
For the built-in OGRDB germline databases (the `_OGRDB.*` databases), the
_internal data_ and _auxiliary data_ is provided by AIRR/OGRDB.
For a germline database created with `install_IMGT_germline_db()` like
`IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL`:
- _internal data_: The V gene delineations are inferred from the gaps in
the nucleotide sequences of the V alleles provided by IMGT/V-QUEST.
- _auxiliary data_: The procedure to generate this data is described in
the `r Biocpkg("igblastr")` Wiki
[here](https://github.com/HyrienLab/igblastr/wiki/Auxiliary-data-in-igblastr#igblastr-generated-auxiliary-data).
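The idea of inferring delineations from gaps can be sketched as follows. This is only an illustration and not necessarily how `r Biocpkg("igblastr")` implements it. It assumes IMGT-gapped sequences in which `.` marks a gap and in which, following the IMGT unique numbering, FWR1, CDR1, FWR2, CDR2, and FWR3 end at gapped nucleotide columns 78, 114, 165, 195, and 312. The toy sequence and the helper `toy_segment()` are made up:

```{r}
## Region ends in gapped (IMGT-numbered) nucleotide columns (assumption)
gapped_ends <- c(FWR1=78, CDR1=114, FWR2=165, CDR2=195, FWR3=312)

## Hypothetical helper: 'n_nt' nucleotides followed by 'n_gap' gap characters
toy_segment <- function(n_nt, n_gap)
    paste0(strrep("A", n_nt), strrep(".", n_gap))

## Toy gapped V sequence: gaps in CDR1 (12), CDR2 (6), and FWR3 (3)
toy_gapped <- paste0(toy_segment(78, 0), toy_segment(24, 12),
                     toy_segment(51, 0), toy_segment(24, 6),
                     toy_segment(114, 3))

## For each gapped column, count the non-gap characters seen so far
toy_chars <- strsplit(toy_gapped, "")[[1]]
ungapped_pos <- cumsum(toy_chars != ".")

## Translate gapped boundaries into positions on the ungapped sequence
ungapped_ends <- ungapped_pos[gapped_ends]
ungapped_starts <- c(1L, head(ungapped_ends, -1L) + 1L)
data.frame(region=names(gapped_ends),
           start=ungapped_starts, end=ungapped_ends)
```

Removing the gaps shifts each boundary to the left by the number of gaps that precede it, which is what the cumulative count of non-gap characters captures.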
### How to use the igblastr-generated internal/auxiliary data with igblastr?
This happens automatically!
More precisely, if a germline database includes its own _internal data_
and _auxiliary data_, then the `igblastn()` function will use this data
instead of the _internal data_ and _auxiliary data_ provided by IgBLAST.
See documentation of the `custom_internal_data` and `auxiliary_data`
arguments in `?igblastn` for more information.
# Select a constant region database (optional)
The `igblastn` _standalone executable_ included in IgBLAST can also use
constant region (C-region) sequences for sequence annotation. As with the
germline V, D, and J gene allele sequences, the C-region sequences are
generally expected to originate from the same organism(s) as the query
sequences, and must likewise be formatted as a BLAST database. We will
refer to this database as the C-region database.
The `r Biocpkg("igblastr")` package includes built-in C-region databases
for human, mouse, rabbit, rat, and rhesus monkey, obtained from IMGT/V-QUEST.
The available databases can be listed using:
```{r}
list_c_region_dbs()
```
The last part of the database name indicates the date of the download in
YYYYMM format.
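As a small base R sketch, the download date can be recovered from a database name by taking its last dot-separated component. The name used below is the human C-region database mentioned in this vignette; treating the first day of the month as the date is an arbitrary choice made for the example:

```{r}
c_db_name <- "_IMGT.human.IGH+IGK+IGL.202605"
## Split on literal dots and keep the last component (the YYYYMM stamp)
name_parts <- strsplit(c_db_name, ".", fixed=TRUE)[[1]]
yyyymm <- name_parts[length(name_parts)]
## Convert to a Date, using the 1st of the month by convention
download_date <- as.Date(paste0(yyyymm, "01"), format="%Y%m%d")
format(download_date, "%B %Y")
```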
If your query sequences are from one of these organisms, you can
select the appropriate database using `use_c_region_db()`:
```{r}
use_c_region_db("_IMGT.human.IGH+IGK+IGL.202605")
```
Calling `list_c_region_dbs()` again should display an asterisk (`*`)
at the far right of the output, indicating the currently selected
C-region database.
See `?list_c_region_dbs` for more information.
# Use igblastn()
Now that we have selected the germline and C-region databases to use
with `igblastn()`, we are almost ready to call `igblastn()` to perform
the alignment.
## Prepare the _query sequences_
As mentioned earlier, the `r Biocpkg("igblastr")` package includes a
FASTA file containing 8,437 paired heavy and light chain human antibody
sequences retrieved from OAS. These will serve as our _query sequences_,
that is, as the set of BCR sequences that we will analyze with `igblastn()`.
To get the path to the query sequences, use:
```{r}
query <- system.file(package="igblastr", "extdata",
                     "BCR", "1279067_1_Paired_sequences.fasta.gz")
```
The `r Biocpkg("igblastr")` package also includes a JSON file containing
metadata associated with the query BCR sequences:
```{r}
json <- system.file(package="igblastr", "extdata",
                    "BCR", "1279067_1_Paired_All.json")
query_metadata <- jsonlite::fromJSON(json)
query_metadata
```
The 8,437 paired sequences come from memory B cells isolated from
peripheral blood mononuclear cell (PBMC) samples of a single human
donor (age 38) with no known disease or vaccination history.
The source for these sequences is the Jaffe et al. (2022) study; the
DOI link to the publication is provided above.
## Call igblastn() on the _query sequences_
Before calling `igblastn()`, we first check the selected databases by
calling `use_germline_db()` and `use_c_region_db()` with no arguments:
```{r}
use_germline_db()
use_c_region_db()
```
Now, let's call `igblastn()`. Since we are only interested in the best
V alignment for each query sequence, we set `num_alignments_V` to 1.
Analyzing this set of 16,874 BCR sequences may take around 3 minutes
on a standard laptop:
```{r}
AIRR_df <- igblastn(query, num_alignments_V=1)
```
By default, the result is an _AIRR-formatted_ tibble, that is, a tibble
with one row per query sequence and many columns:
```{r}
AIRR_df
```
The columns are standard "Rearrangement Schema" fields.
These fields are defined and documented by the AIRR Community at
https://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields
You can call `igbrowser()` on `AIRR_df` to visualize the annotated
sequences in a browser. For each sequence, the V, D, J, and C segments
will be shown as well as the FWR1-4 and CDR1-3 regions.
See `?igblastn` and `?igbrowser` for more information.
# Downstream analysis
In this section, we're going to show some examples of simple downstream
analyses that can be performed on the _AIRR-formatted_ tibble returned
by `igblastn()`.
## Distribution of percent mutation across BCR sequences
```{r, message=FALSE}
library(ggplot2)
```
One common analysis of AIRR format data is to examine the distribution of
percent mutation across BCR sequences. Here we analyze the percent mutation
in the V segments of each chain type (heavy, kappa, and lambda) at the
nucleotide level. Note that V percent mutation at the nucleotide level is
100 - `v_identity`:
```{r}
AIRR_df |>
    ggplot(aes(locus, 100 - v_identity)) +
    theme_bw(base_size=14) +
    geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
    geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
    ggtitle("Distribution of V percent mutation by locus at the nucleotide level") +
    xlab(NULL)
```
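The relationship between `v_identity` and percent mutation can also be checked numerically. The sketch below uses a small made-up data frame with the same `locus` and `v_identity` columns as the tibble returned by `igblastn()`. The values are invented for illustration only:

```{r}
## Made-up values mimicking the 'locus' and 'v_identity' AIRR columns
toy_df <- data.frame(
    locus=c("IGH", "IGH", "IGK", "IGL"),
    v_identity=c(92.5, 97.0, 98.5, 100)
)
## V percent mutation at the nucleotide level
toy_df$v_perc_mut <- 100 - toy_df$v_identity
## Median V percent mutation per locus
toy_summary <- aggregate(v_perc_mut ~ locus, data=toy_df, FUN=median)
toy_summary
```

The same two lines applied to `AIRR_df` would give the per-locus medians that the boxplots display.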
To do the same thing at the amino acid level, we first use the
`percent_mutation()` function to compute the percent mutation in the
V, D, J segments at the amino acid level:
```{r}
perc_mut_aa <- percent_mutation(AIRR_df, for.aa=TRUE)
head(perc_mut_aa)
```
See `?percent_mutation` for more information.
Then:
```{r}
perc_mut_aa |>
    ggplot(aes(locus, v_perc_mut_aa)) +
    theme_bw(base_size=14) +
    geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
    geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
    ggtitle("Distribution of V percent mutation by locus at the amino acid level") +
    xlab(NULL)
```
## Distribution of germline genes
```{r, message=FALSE}
library(dplyr)
library(scales)
```
Another common analysis is to investigate the distribution of germline
genes (e.g., V genes). In this case, we typically stratify the analysis
by locus or chain type.
```{r}
plot_gene_dist <- function(AIRR_df, loc) {
    df_v_gene <- AIRR_df |>
        filter(locus == loc) |>
        mutate(v_gene = allele2gene(v_call)) |>  # drop allele info
        group_by(v_gene) |>
        summarize(n = n(), .groups = "drop") |>
        mutate(frac = n / sum(n))
    df_v_gene |>
        ggplot(aes(frac, v_gene)) +
        theme_bw(base_size=13) +
        geom_col() +
        scale_x_continuous('Percent of sequences', labels = scales::percent) +
        ylab("Germline gene") +
        ggtitle(paste0(loc, "V gene prevalence"))
}
```
```{r, fig.height=8}
plot_gene_dist(AIRR_df, "IGH")
```
```{r, fig.height=5.9}
plot_gene_dist(AIRR_df, "IGK")
```
```{r, fig.height=5.3}
plot_gene_dist(AIRR_df, "IGL")
```
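For readers curious about what dropping the allele information amounts to, the sketch below mimics it in base R on a made-up vector of V calls. It assumes allele names of the form `gene*allele` and is not a substitute for `allele2gene()`, whose exact behavior (e.g. on ambiguous calls) may differ:

```{r}
## Made-up V calls for illustration
toy_v_call <- c("IGHV3-23*01", "IGHV3-23*04", "IGHV1-69*01", "IGHV3-23*01")
## Strip the '*' and everything after it to go from allele to gene
toy_v_gene <- sub("\\*.*$", "", toy_v_call)
## Fraction of sequences assigned to each gene
toy_frac <- prop.table(table(toy_v_gene))
toy_frac
```

This is the quantity that `plot_gene_dist()` places on the x-axis, computed there with `dplyr` verbs.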
## Lengths and motifs of CDR3 sequences
```{r, message=FALSE}
library(ggseqlogo)
```
A third category of analysis focuses on CDR3 sequences, including their
lengths and motifs, which are often visualized using sequence logo plots.
```{r, fig.height=4.5}
AIRR_df$cdr3_aa_length <- nchar(AIRR_df$cdr3_aa)
AIRR_df |>
    group_by(locus, cdr3_aa_length) |>
    summarize(n = n(), .groups = "drop") |>
    ggplot(aes(cdr3_aa_length, n)) +
    theme_bw(base_size=14) +
    facet_wrap(~locus) +
    geom_col() +
    ggtitle("Histograms of CDR3 length by locus")
```
```{r}
AIRR_df |>
    filter(locus == "IGK", cdr3_aa_length == 9) |>
    pull(cdr3_aa) |>
    ggseqlogo(method = "probability") +
    theme_bw(base_size=14) +
    ggtitle("Logo plot of kappa chain CDR3 sequences that are 9 AA long")
```
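A logo plot built with `method = "probability"` displays, at each position, the relative frequency of each amino acid. The base R sketch below computes such a position-by-residue frequency matrix on three made-up 9-AA sequences (they are not taken from the query data):

```{r}
## Three made-up CDR3-like sequences, all 9 AA long
toy_cdr3 <- c("CQQYNSYPT", "CQQSYSTPT", "CQQYGSSPT")
## One row per sequence, one column per position
aa_mat <- do.call(rbind, strsplit(toy_cdr3, ""))
aa_levels <- sort(unique(as.vector(aa_mat)))
## Relative frequency of each residue at each position
aa_freq <- sapply(seq_len(ncol(aa_mat)), function(j) {
    as.vector(table(factor(aa_mat[ , j], levels=aa_levels))) / nrow(aa_mat)
})
rownames(aa_freq) <- aa_levels
round(aa_freq, 2)
```

Each column of the matrix sums to 1, and fully conserved positions have a single entry equal to 1.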
# Advanced usage
## Passing additional arguments to igblastn()
The `igblastn` _standalone executable_ included in IgBLAST supports many
command line arguments. You can quickly list them with `igblastn_help()`
(or with `igblastn_help(TRUE)` to get an expanded list with more details --
see `?igblastn_help`):
```{r}
igblastn_help()
```
All these command line arguments can be passed to the `igblastn()`
function using the usual `argument_name=argument_value` syntax. For
example, command line argument `-num_threads` allows the user to leverage
IgBLAST built-in parallel computing capabilities. To use it in `igblastn()`,
set argument `num_threads` to the desired value e.g. by
calling `igblastn(query, num_threads=8)`.
Note that the Examples section in `?igblastn` provides more information
about using `igblastn()` in parallel.
## Restrict the search to a subset of user-supplied gene alleles
Some arguments of particular interest are the `germline_db_[VDJ]_seqidlist`
arguments. They allow restricting the search of the germline database to
a list of gene alleles supplied by the user. This list can be provided
as a character vector of gene allele identifiers (e.g. `IGHV3-23*01`,
`IGHV3-23*04`, etc.), or as the path to a file containing the identifiers
(one identifier per line). For example:
```{r, eval=FALSE}
V_alleles <- c("IGHV3-23*01", "IGHV3-23*04")
igblastn(query, germline_db_V_seqidlist=V_alleles)
```
If the gene alleles are stored in a file, say in
`path/to/my_V_gene_alleles.txt`, then use:
```{r, eval=FALSE}
igblastn(query, germline_db_V_seqidlist=file("path/to/my_V_gene_alleles.txt"))
```
Note that in this case, the path to the file containing the gene allele
identifiers must be wrapped in a call to `file()`.
See `?igblastn` for more information.
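If the identifiers only exist in your R session, one way to produce such a file is with `writeLines()`, which writes one element per line. The sketch below uses a temporary file, and the variable names are arbitrary:

```{r}
my_V_alleles <- c("IGHV3-23*01", "IGHV3-23*04")
## Write one identifier per line to a temporary text file
seqid_file <- tempfile(fileext=".txt")
writeLines(my_V_alleles, seqid_file)
## Check the content of the file
readLines(seqid_file)
## The file could then be passed to igblastn() with:
##   igblastn(query, germline_db_V_seqidlist=file(seqid_file))
```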
## A TCR analysis example
Even though the primary use case of NCBI IgBLAST is BCR analysis, it can also
be used for TCR sequence analysis, and so can the `r Biocpkg("igblastr")` package.
### Prepare the TCR _query sequences_
File `SRR11341217.fasta.gz` included in the package contains 10,875
human beta chain TCR transcripts that run from the 5' end of the reverse
transcription reaction to the beginning of the constant region.
See for more information
about this dataset.
```{r}
filename <- "SRR11341217.fasta.gz"
query <- system.file(package="igblastr", "extdata", "TCR", filename)
```
### Install a TCR germline database
To analyze this dataset with `igblastn()`, we need to install and select
a human TCR germline database. We can use `install_IMGT_germline_db()` with
the `tcr.db` argument set to `TRUE` for that. This will install a germline
database made of the human TCR germline sequences provided by IMGT/V-QUEST:
```{r, warning=FALSE, message=FALSE}
db_name <- install_IMGT_germline_db("202614-2", organism="Homo sapiens",
                                    tcr.db=TRUE)
```
See the new germline database in the list displayed by `list_germline_dbs()`:
```{r}
list_germline_dbs()
```
Note that:
- The name of this new germline database
(`IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD`) reflects the fact that
it contains germline gene alleles from the four T-cell receptor loci:
TRA (alpha chain), TRB (beta chain), TRG (gamma chain), and TRD (delta chain).
- This new germline database also includes its own _internal data_
and _auxiliary data_, as reported in the `intdata` and `auxdata` columns.
See `?install_IMGT_germline_db` and `?list_germline_dbs` for more information.
Let's select this new germline database as the germline database to use
with `igblastn()`:
```{r, message=FALSE}
use_germline_db(db_name)
```
See `?use_germline_db` for more information.
### Select a TCR constant region database (optional)
Use `list_c_region_dbs()` to list the C-region databases that are available:
```{r}
list_c_region_dbs()
```
For this analysis of human TCR sequences, we'll select
`_IMGT.human.TRA+TRB+TRG+TRD.202605` as the C-region database
to use with `igblastn()`:
```{r, message=FALSE}
use_c_region_db("_IMGT.human.TRA+TRB+TRG+TRD.202605")
```
See `?use_c_region_db` for more information.
### Call igblastn() on the TCR _query sequences_
Check the selected databases:
```{r}
use_germline_db()
use_c_region_db()
```
Call `igblastn()`:
```{r}
AIRR_df <- igblastn(query)
AIRR_df
```
Note that, when using `igblastn()` to analyze TCR sequences, we don't need
to specify the `ig_seqtype` argument like we would have to if we were using
the `igblastn` _standalone executable_ included in IgBLAST.
`igblastn()` will automatically set `ig_seqtype` to `"TCR"` based on the
name of the selected germline db. See documentation of the `ig_seqtype`
argument in `?igblastn` for more information.
# Future developments and session information
## Future developments
At the moment, the `r Biocpkg("igblastr")` package does not provide
access to the full functionality of the IgBLAST software. Most notably,
the `igblastp` _standalone executable_ included in IgBLAST has no
counterpart in `r Biocpkg("igblastr")`.
Some future developments include:
- Implement `igblastp()`, a wrapper to the `igblastp` _standalone executable_
included in IgBLAST for protein-level alignments.
- Add facilities to retrieve arbitrary germline databases from OGRDB,
the AIRR Community’s Open Germline Receptor Database.
## Session information
Here is the output of `sessionInfo()` on the system where this document
was compiled:
```{r}
sessionInfo()
```