The Immunoglobulin Basic Local Alignment Search Tool (IgBLAST) is a specialized bioinformatics tool developed by the National Center for Biotechnology Information (NCBI) for the analysis of B-cell receptor (BCR) and T-cell receptor (TCR) sequences (Ye et al., 2013). IgBLAST performs sequence alignment and annotation, with key outputs including germline V, D, and J gene assignments; characterization of somatic hypermutations introduced during affinity maturation; identification of complementarity-determining regions (CDR1–CDR3), framework regions (FWR1–FWR4), and isotype; and both nucleotide and protein-level alignments. These outputs form the basis for many downstream analyses of BCR and TCR repertoires, including clonotype identification, lineage tracing, and repertoire feature characterization.
igblastr is an R/Bioconductor package that provides functions to conveniently install and use a local IgBLAST installation from within R while offering additional BCR or TCR sequence annotation. The package is designed to make it as easy as possible to run IgBLAST in R by streamlining the installation of both IgBLAST and its associated germline databases. In particular, these installations can be performed with a single function call, do not require root access, and can persist across R sessions.
The main function in the package is igblastn(), a
wrapper to the igblastn standalone executable
included in IgBLAST. In addition to igblastn(), the package
provides several other features, including:
A function (install_igblast()) to conveniently
download and install a pre-compiled IgBLAST from NCBI.
A set of built-in germline databases from OGRDB, the AIRR Community’s Open Germline Receptor Database.
Functions to download and install germline databases from the IMGT/V-QUEST download
site, and to configure them for use with
igblastn().
The install_custom_germline_db() function to install
a germline database from user-supplied germline gene allele
sequences.
Functions to compute, access, and manipulate IgBLAST internal data and auxiliary data.
A set of built-in constant region (C-region) databases from IMGT/V-QUEST.
Utility functions to parse the results returned by
igblastn().
A simple tool to vizualize the annotated sequences in a browser.
The percent_mutation() function to compute the
percent mutation in the V, D, J segments of a set of BCR or TCR
sequences.
Utility functions to download data from OAS, the Observed Antibody Space database, and prepare it for use with IgBLAST.
Etc.
Useful links:
IgBLAST is described at https://pubmed.ncbi.nlm.nih.gov/23671333/
IgBLAST web interface: https://www.ncbi.nlm.nih.gov/igblast/
IMGT/V-QUEST download site: https://www.imgt.org/download/V-QUEST/
Please use https://github.com/HyrienLab/igblastr/issues to report bugs, provide feedback, request features (etc) about igblastr.
Like any Bioconductor package, igblastr
should be installed with BiocManager::install():
if (!require("BiocManager", quietly=TRUE))
install.packages("BiocManager")
BiocManager::install("igblastr")BiocManager::install() will take care of installing the
package dependencies that are missing.
To load igblastr into your R session, run:
If IgBLAST is already installed on your system, you can tell igblastr
to use it by setting the environment variable IGBLAST_ROOT
to the path of your IgBLAST installation. See ?IGBLAST_ROOT
for more information.
Otherwise, simply call install_igblast() to install the
latest version of IgBLAST. As of March 2025, NCBI provides pre-compiled
versions of IgBLAST for Linux, Windows, Intel Mac and Mac Silicon.
install_igblast() will automatically download the
appropriate pre-compiled version of IgBLAST for your platform from the
NCBI FTP site, and install it in a location that will be remembered
across R sessions.
## Installing IgBLAST at /github/home/.cache/R/igblastr/igblast_roots/1.22.0 ... ok
## IgBLAST 1.22.0 successfully installed.
See ?install_igblast for more information.
Note that we use has_igblast() to avoid reinstalling
IgBLAST if igblastr
already has access to a working IgBLAST installation.
To display basic information about the IgBLAST installation used by igblastr, run:
## igblast_root: /github/home/.cache/R/igblastr/igblast_roots/1.22.0
## igblast_build: igblast 1.22.0, build Oct 11 2023 16:06:20
## igblastn_version: 1.22.0
## makeblastdb_version: 2.15.0+
## OS/arch: Linux/x86_64
## igblast_organisms: human, mouse, rabbit, rat, rhesus_monkey
The igblastr package includes a FASTA file containing 8,437 paired heavy- and light-chain human antibody sequences (16,874 individual sequences) retrieved from OAS. These sequences will serve as our query sequences, that is, the immunoglobulin (Ig) sequences that analyzed in this vignette.
Before we can do so, the igblastn standalone
executable included in IgBLAST – and, by extension, our
igblastn() function – requires access to human germline V,
D, and J gene allele sequences. To be used by igblastn,
these sequences must first be organized into three separate BLAST
databases: the V-region, D-region, and J-region databases. Collectively,
we will refer to these three databases as the germline
database.
The igblastr package includes a set of built-in germline databases for human and mouse that were obtained from the OGRDB database (AIRR community). These can be listed with:
## db_name V D J intdata auxdata
## _OGRDB.human.IGH+IGK+IGL.202410 342 31 23 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202410.src 354 33 24 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202605 367 31 23 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202605.src 379 33 24 TRUE TRUE
## _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603 184 9 22 TRUE TRUE
## _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603 169 11 22 TRUE TRUE
## _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603 172 9 22 TRUE TRUE
## _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205 149 9 22 TRUE TRUE
## _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410 184 10 22 TRUE TRUE
## _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602 2294 72 39 TRUE TRUE
The last part of the database name indicates the date of the download in YYYYMM format.
The intdata and auxdata columns indicate
whether a database includes its own annotations for the germline V
alleles (intdata column) and germline J alleles
(auxdata column). More on this in the “Internal data and auxiliary
data” section below.
If our query sequences are from humans, or one of the mouse strains
listed above, or rhesus monkey, then we can already select the
appropriate database with use_germline_db() and skip the
subsection below.
AIRR/OGRDB and IMGT/V-QUEST are two providers of germline databases
that can be used with IgBLAST. If, for any reason, none of the built-in
AIRR/OGRDB germline databases is suitable (e.g. your query sequences are
not from human, mouse, or rhesus monkey), then you can use
install_IMGT_germline_db() to install additional germline
databases. Below, we show how to install the latest human germline
database from IMGT/V-QUEST.
First, we list the most recent IMGT/V-QUEST releases:
## [1] "202623-4" "202616-1" "202614-2" "202612-4" "202611-4" "202603-4"
The organisms included in release 202614-2 are:
## [1] "Aotus_nancymaae" "Bos_taurus"
## [3] "Camelus_dromedarius" "Canis_lupus_familiaris"
## [5] "Capra_hircus" "Chondrichthyes"
## [7] "Danio_rerio" "Equus_caballus"
## [9] "Felis_catus" "Gadus_morhua"
## [11] "Gallus_gallus" "Gorilla_gorilla_gorilla"
## [13] "Heterocephalus_glaber" "Homo_sapiens"
## [15] "Ictalurus_punctatus" "Lemur_catta"
## [17] "Macaca_fascicularis" "Macaca_mulatta"
## [19] "Mus_musculus" "Mus_musculus_C57BL6J"
## [21] "Mustela_putorius_furo" "Neogale_vison"
## [23] "Nonhuman_primates" "Oncorhynchus_mykiss"
## [25] "Ornithorhynchus_anatinus" "Oryctolagus_cuniculus"
## [27] "Ovis_aries" "Pan_troglodytes"
## [29] "Pongo_abelii" "Pongo_pygmaeus"
## [31] "Rattus_norvegicus" "Salmo_salar"
## [33] "Sus_scrofa" "Teleostei"
## [35] "Tursiops_truncatus" "Vicugna_pacos"
Next, we install the human germline database from the latest IMGT/V-QUEST release:
See ?install_IMGT_germline_db for more information.
The igblastr
package provides the install_custom_germline_db() function
to create and install a germline database from a set of user-supplied
FASTA files containing germline V/D/J gene allele sequences.
See ?install_custom_germline_db for more
information.
Finally, we select the newly installed germline database as the
germline database to use with igblastn():
## CONDITIONS OF USE AND LICENSE: The IMGT data is provided to the academic
## users and NPO's (Not for Profit Organization(s)) under the CC BY-NC-ND 4.0
## license. See https://creativecommons.org/licenses/by-nc-nd/4.0/. Any other
## use of IMGT material, from the private sector, needs a financial arrangement
## with CNRS.
##
## To suppress this message, use:
## suppressMessages(use_germline_db("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))
See ?use_germline_db for more information.
To see the full list of cached germline dbs:
## db_name V D J intdata auxdata
## _OGRDB.human.IGH+IGK+IGL.202410 342 31 23 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202410.src 354 33 24 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202605 367 31 23 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202605.src 379 33 24 TRUE TRUE
## _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603 184 9 22 TRUE TRUE
## _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603 169 11 22 TRUE TRUE
## _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603 172 9 22 TRUE TRUE
## _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205 149 9 22 TRUE TRUE
## _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410 184 10 22 TRUE TRUE
## _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602 2294 72 39 TRUE TRUE
## IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL 730 48 35 TRUE TRUE *
Note that the asterisk (*) displayed at the far right of
the output from list_germline_dbs() indicates the currently
selected germline database (you may need to scroll horizontally to see
the asterisk).
See ?list_germline_dbs for more information.
Note that the newly installed germline database
(IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL) includes its own
internal data and auxiliary data, as reported in the
intdata and auxdata columns.
The intdata and auxdata columns indicate
whether a database includes annotations for the germline V alleles
(intdata column) and germline J alleles
(auxdata column). These annotations consist of reporting
the coding frame start position and FWR/CDR boundaries on the V and J
sequences. Note that the FWR/CDR boundaries on the V sequences are also
known as the V gene delineations. For the J sequences the only FWR/CDR
boundary is the CDR3/FWR4 boundary.
When analyzing BCR or TCR sequences with IgBLAST, the latter needs access to this information in order to annotate the former. In IgBLAST’s terminology, the annotations for the germline V and J alleles are called internal data and auxiliary data, respectively.
IgBLAST provides internal data and auxiliary data for 5 organisms: human, mouse, rabbit, rat, and rhesus monkey. We’ll sometimes refer to these organisms as IgBLAST organisms.
This data can be accessed with load_intdata() and
load_auxdata():
## allele_name fwr1_start fwr1_end cdr1_start cdr1_end fwr2_start fwr2_end
## 1 A1 1 78 79 111 112 162
## 2 A10 1 78 79 96 97 147
## 3 A11 1 78 79 99 100 150
## 4 A14 1 78 79 96 97 147
## 5 A17 1 78 79 111 112 162
## 6 A18b 1 78 79 111 112 162
## cdr2_start cdr2_end fwr3_start fwr3_end chain_type coding_frame_start
## 1 163 171 172 279 VK 0
## 2 148 156 157 264 VK 0
## 3 151 159 160 267 VK 0
## 4 148 156 157 264 VK 0
## 5 163 171 172 279 VK 0
## 6 163 171 172 279 VK 0
## allele_name coding_frame_start chain_type cdr3_end extra_bps
## 1 IGHJ1*01 0 JH 17 1
## 2 IGHJ1P*01 2 JH NA NA
## 3 IGHJ2*01 1 JH 18 1
## 4 IGHJ2P*01 0 JH NA NA
## 5 IGHJ3*01 1 JH 15 1
## 6 IGHJ3*02 1 JH 15 1
See ?load_intdata and ?load_auxdata for
more information.
However, this data doesn’t get updated on a regular basis by the NCBI folks, and can be incomplete or out-of-sync with the germline V and J alleles provided by AIRR/OGRDB or IMGT/V-QUEST.
Unlike the internal data and auxiliary data provided by IgBLAST, the internal data and auxiliary data included in a germline database is generated by igblastr, and is guaranteed to be complete and in sync with the germline V and J alleles in the database.
This data can also be accessed with load_intdata() and
load_auxdata():
## allele_name fwr1_start fwr1_end cdr1_start cdr1_end fwr2_start fwr2_end
## 1 IGHV1-18*01 1 75 76 99 100 150
## 2 IGHV1-18*02 1 75 76 99 100 150
## 3 IGHV1-18*03 1 75 76 99 100 150
## 4 IGHV1-18*04 1 75 76 99 100 150
## 5 IGHV1-2*01 1 75 76 99 100 150
## 6 IGHV1-2*02 1 75 76 99 100 150
## cdr2_start cdr2_end fwr3_start fwr3_end chain_type coding_frame_start
## 1 151 174 175 288 VH 0
## 2 151 174 175 288 VH 0
## 3 151 174 175 288 VH 0
## 4 151 174 175 288 VH 0
## 5 151 174 175 288 VH 0
## 6 151 174 175 288 VH 0
## allele_name coding_frame_start chain_type cdr3_end extra_bps
## 1 IGHJ1*01 0 JH 17 1
## 2 IGHJ2*01 1 JH 18 1
## 3 IGHJ3*01 1 JH 15 1
## 4 IGHJ3*02 1 JH 15 1
## 5 IGHJ4*01 2 JH 13 1
## 6 IGHJ4*02 2 JH 13 1
See ?load_intdata and ?load_auxdata for
more information.
For the built-in OGRDB germline databases (the _OGRDB.*
databases), the internal data and auxiliary data is
provided by AIRR/OGRDB.
For a germline database created with
install_IMGT_germline_db() like
IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL:
This happens automatically!
More precisely, if a germline database includes its own internal
data and auxiliary data, then the igblastn()
function will use this data instead of the internal data and
auxiliary data provided by IgBLAST.
See documentation of the custom_internal_data and
auxiliary_data arguments in ?igblastn for more
information.
The igblastn standalone executable included in
IgBLAST can also use constant region (C-region) sequences for sequence
annotation. As with the germline V, D, and J gene allele sequences, the
C-region sequences are generally expected to originate from the same
organism(s) as the query sequences, and must likewise be formatted as a
BLAST database. We will refer to this database as the C-region
database.
The igblastr package includes built-in C-region databases for human, mouse, rabbit, rat, and rhesus monkey, obtained from IMGT/V-QUEST. The available databases can be listed using:
## db_name C
## _IMGT.human.IGH+IGK+IGL.202605 180
## _IMGT.human.TRA+TRB+TRG+TRD.202605 23
## _IMGT.mouse.IGH+IGK+IGL.202605 64
## _IMGT.mouse.TRA+TRB+TRG+TRD.202605 13
## _IMGT.rabbit.IGH+IGK+IGL.202605 50
## _IMGT.rabbit.TRA+TRB+TRG+TRD.202605 6
## _IMGT.rat.IGH+IGK+IGL.202605 25
## _IMGT.rhesus_monkey.IGH+IGK+IGL.202605 44
## _IMGT.rhesus_monkey.TRA+TRB+TRG+TRD.202605 10
The last part of the database name indicates the date of the download in YYYYMM format.
If your query sequences are from human, mouse, or rabbit, you can
select the appropriate database using
use_c_region_db():
## CONDITIONS OF USE AND LICENSE: The IMGT data is provided to the academic
## users and NPO's (Not for Profit Organization(s)) under the CC BY-NC-ND 4.0
## license. See https://creativecommons.org/licenses/by-nc-nd/4.0/. Any other
## use of IMGT material, from the private sector, needs a financial arrangement
## with CNRS.
##
## To suppress this message, use:
## suppressMessages(use_c_region_db("_IMGT.human.IGH+IGK+IGL.202605"))
Calling list_c_region_dbs() again should display an
asterisk (*) at the far right of the output, indicating the
currently selected C-region database.
See ?list_c_region_dbs for more information.
Now that we have selected the germline and C-region databases to use
with igblastn(), we are almost ready to call
igblastn() to perform the alignment.
As mentioned earlier, the igblastr
package includes a FASTA file containing 8,437 paired heavy and light
chain human antibody sequences retrieved from OAS. These will serve as
our query sequences, that is, as the set of BCR sequences that
we will analyse with igblastn().
To get the path to the query sequences, use:
The igblastr package also includes a JSON file containing metadata associated with the query BCR sequences:
json <- system.file(package="igblastr", "extdata",
"BCR", "1279067_1_Paired_All.json")
query_metadata <- jsonlite::fromJSON(json)
query_metadata## $Run
## [1] 1279067
##
## $Link
## [1] "https://doi.org/10.1038/s41586-022-05371-z"
##
## $Author
## [1] "Jaffe et al., 2022"
##
## $Species
## [1] "human"
##
## $Age
## [1] 38
##
## $BSource
## [1] "PBMC"
##
## $BType
## [1] "Memory-B-Cells"
##
## $Vaccine
## [1] "None"
##
## $Disease
## [1] "None"
##
## $Subject
## [1] "Donor-3"
##
## $Longitudinal
## [1] "no"
##
## $`Unique sequences`
## [1] 8437
##
## $Isotype
## [1] "All"
##
## $Chain
## [1] "Paired"
The 8,347 paired sequences come from memory B cells isolated from peripheral blood mononuclear cell (PBMC) samples of a single human donor (age 38) with no known disease or vaccination history. The source for these sequences is the Jaffe et al. (2022) study; the DOI link to the publication is provided above.
Before calling igblastn(), we first check the selected
databases by calling use_germline_db() and
use_c_region_db() with no arguments:
## [1] "IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"
## [1] "_IMGT.human.IGH+IGK+IGL.202605"
Now, let’s call igblastn(). Since we are only interested
in the best V alignment for each query sequence, we set
num_alignments_V to 1. Analyzing this set of 16,874 BCR
sequences may take up to around 3 min on a standard laptop:
By default, the result is an AIRR-formatted tibble, that is, a tibble with one row per query sequence and many columns:
## # A tibble: 16,874 × 111
## sequence_id sequence sequence_aa locus stop_codon vj_in_frame v_frameshift
## <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl>
## 1 heavy_chain_A… ACAACCA… QVQLVQSGSE… IGH FALSE TRUE FALSE
## 2 light_chain_A… TGAGCGC… QSVLTQPPSV… IGL FALSE TRUE FALSE
## 3 heavy_chain_A… ACTTTCT… QVQLQESGPG… IGH FALSE TRUE FALSE
## 4 light_chain_A… GAGGAGT… DIQMTQSPSS… IGK FALSE TRUE FALSE
## 5 heavy_chain_A… GGGATCA… QVQLVQSGAE… IGH FALSE TRUE FALSE
## 6 light_chain_A… GCTGGGG… QSALTQPPSA… IGL FALSE TRUE FALSE
## 7 heavy_chain_A… GGAGCCC… EVQLVESGGG… IGH FALSE TRUE FALSE
## 8 light_chain_A… GGGGGCT… QSALTQPASV… IGL FALSE TRUE FALSE
## 9 heavy_chain_A… ATACTTT… QVQLQESGPG… IGH FALSE TRUE FALSE
## 10 light_chain_A… GAGCTAC… DIVMTQSPDS… IGK FALSE TRUE FALSE
## # ℹ 16,864 more rows
## # ℹ 104 more variables: productive <lgl>, rev_comp <lgl>, complete_vdj <lgl>,
## # d_frame <lgl>, v_call <chr>, d_call <chr>, j_call <chr>, c_call <chr>,
## # sequence_alignment <chr>, germline_alignment <chr>,
## # sequence_alignment_aa <chr>, germline_alignment_aa <chr>,
## # v_alignment_start <int>, v_alignment_end <int>, d_alignment_start <int>,
## # d_alignment_end <int>, j_alignment_start <int>, j_alignment_end <int>, …
The columns are standard “Rearrangement Schema” fields. These fields are defined and documented by the AIRR Community at https://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields
You can call igbrowser() on AIRR_df to
visualize the annotated sequences in a browser. For each sequence, the
V, D, J, and C segments will be shown as well as the FWR1-4 and CDR1-3
regions.
See ?igblastn and ?igbrowser for more
information.
In this section, we’re going to show some examples of simple
downstream analyses that can be performed on the AIRR-formatted
tibble returned by igblastn().
One common analysis of AIRR format data is to examine the
distribution of percent mutation across BCR sequences. Here we analyze
the percent mutation in the V segments of each chain type (heavy, kappa,
and lambda) at the nucleotide level. Note that V percent mutation at the
nucleotide level is 100 - v_identity:
AIRR_df |>
ggplot(aes(locus, 100 - v_identity)) +
theme_bw(base_size=14) +
geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
ggtitle("Distribution of V percent mutation by locus at the nucleotide level") +
xlab(NULL)To do the same thing at the amino acid level, we first use the
percent_mutation() function to compute the percent mutation
in the V, D, J segments at the amino acid level:
## # A tibble: 6 × 5
## sequence_id locus v_perc_mut_aa d_perc_mut_aa j_perc_mut_aa
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 heavy_chain_AAACCTGAGAGTACCG-… IGH 16.3 0 0
## 2 light_chain_AAACCTGAGAGTACCG-… IGL 16.7 NA 16.7
## 3 heavy_chain_AAACCTGAGATGGCGT-… IGH 6.06 20 6.25
## 4 light_chain_AAACCTGAGATGGCGT-… IGK 7.37 NA 9.09
## 5 heavy_chain_AAACCTGAGCAGACTG-… IGH 11.3 37.5 6.67
## 6 light_chain_AAACCTGAGCAGACTG-… IGL 4.08 NA 0
See ?percent_mutation for more information.
Then:
perc_mut_aa |>
ggplot(aes(locus, v_perc_mut_aa)) +
theme_bw(base_size=14) +
geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
ggtitle("Distribution of V percent mutation by locus at the amino acid level") +
xlab(NULL)Another common analysis is to investigate the distribution of germline genes (e.g., V genes). In this case, we typically stratify the analysis by locus or chain type.
plot_gene_dist <- function(AIRR_df, loc) {
df_v_gene <- AIRR_df |>
filter(locus == loc) |>
mutate(v_gene = allele2gene(v_call)) |> # drop allele info
group_by(v_gene) |>
summarize(n = n(), .groups = "drop") |>
mutate(frac = n / sum(n))
df_v_gene |>
ggplot(aes(frac, v_gene)) +
theme_bw(base_size=13) +
geom_col() +
scale_x_continuous('Percent of sequences', labels = scales::percent) +
ylab("Germline gene") +
ggtitle(paste0(loc, "V gene prevalence"))
}A third category of analysis focuses on CDR3 sequences, including their lengths and motifs, which are often visualized using sequence logo plots.
AIRR_df$cdr3_aa_length <- nchar(AIRR_df$cdr3_aa)
AIRR_df |>
group_by(locus, cdr3_aa_length) |>
summarize(n = n(), .groups = "drop") |>
ggplot(aes(cdr3_aa_length, n)) +
theme_bw(base_size=14) +
facet_wrap(~locus) +
geom_col() +
ggtitle("Histograms of CDR3 length by locus")AIRR_df |>
filter(locus == "IGK", cdr3_aa_length == 9) |>
pull(cdr3_aa) |>
ggseqlogo(method = "probability") +
theme_bw(base_size=14) +
ggtitle("Logo plot of kappa chain CDR3 sequences that are 9 AA long")## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
## Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The igblastn standalone executable included in
IgBLAST supports many command line arguments. You can quickly list them
with igblastn_help() (or with
igblastn_help(TRUE) to get an expanded list with more
details – see ?igblastn_help):
## USAGE
## igblastn [-h] [-help] [-germline_db_V germline_database_name]
## [-num_alignments_V int_value] [-germline_db_V_seqidlist filename]
## [-germline_db_D germline_database_name] [-num_alignments_D int_value]
## [-germline_db_D_seqidlist filename]
## [-germline_db_J germline_database_name] [-num_alignments_J int_value]
## [-germline_db_J_seqidlist filename] [-num_alignments_C int_value]
## [-c_region_db constant_region_database_name]
## [-custom_internal_data filename] [-d_frame_data filename]
## [-auxiliary_data filename] [-min_D_match min_D_match]
## [-V_penalty V_penalty] [-D_penalty D_penalty] [-J_penalty J_penalty]
## [-num_clonotype num_clonotype] [-clonotype_out clonotype_out]
## [-allow_vdj_overlap] [-organism germline_origin]
## [-domain_system domain_system] [-ig_seqtype sequence_type]
## [-focus_on_V_segment] [-extend_align5end] [-extend_align3end]
## [-min_V_length Min_V_Length] [-min_J_length Min_J_Length]
## [-show_translation] [-db database_name] [-dbsize num_letters]
## [-entrez_query entrez_query] [-query input_file] [-sra accession]
## [-out output_file] [-evalue evalue] [-word_size int_value]
## [-gapopen open_penalty] [-gapextend extend_penalty] [-searchsp int_value]
## [-sum_stats bool_value] [-ungapped] [-lcase_masking] [-query_loc range]
## [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis]
## [-num_descriptions int_value] [-num_alignments int_value]
## [-line_length line_length] [-num_threads int_value] [-remote] [-version]
##
## DESCRIPTION
## BLAST for Ig and TCR sequences
##
## Use '-help' to print detailed descriptions of command line arguments
All these command line arguments can be passed to the
igblastn() function using the usual
argument_name=argument_value syntax. For example, command
line argument -num_threads allows the user to leverage
IgBLAST built-in parallel computing capabilities. To use it in
igblastn(), set argument num_threads to the
desired value e.g. by calling
igblastn(query, num_threads=8).
Note that the Examples section in ?igblastn provides
more information about using igblastn() in parallel.
Some arguments of particular interest are the
germline_db_[VDJ]_seqidlist arguments. They allow
restricting the search of the germline database to a list of gene
alleles supplied by the user. This list can be provided as a character
vector of gene allele identifiers (e.g. IGHV3-23*01,
IGHV3-23*04, etc..), or as the path to a file containing
the identifiers (one identifier per line). For example:
If the gene alleles are stored in a file, say in
path/to/my_V_gene_alleles.txt, then use:
Note that in this case, the path to the file containing the gene
alleles identifiers must be wrapped in a call to
file().
See ?igblastn for more information.
Even though NCBI IgBLAST primary use case is BCR analysis, it can also be used for TCR sequence analysis, and so does the igblastr package.
File SRR11341217.fasta.gz included in the package
contains 10,875 human beta chain TCR transcripts running from 5’ of
reverse transcription reaction to beginning of constant region. See https://www.ncbi.nlm.nih.gov/sra/SRX7944361 for more
information about this dataset.
To analyze this dataset with igblastn(), we need to
install and select a human TCR germline database. We can use
install_IMGT_germline_db() with the tcr.db
argument set to TRUE for that. This will install a germline
database made of the human TCR germline sequences provided by
IMGT/V-QUEST:
See the new germline database in the list displayed by
list_germline_dbs():
## db_name V D J intdata auxdata
## _OGRDB.human.IGH+IGK+IGL.202410 342 31 23 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202410.src 354 33 24 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202605 367 31 23 TRUE TRUE
## _OGRDB.human.IGH+IGK+IGL.202605.src 379 33 24 TRUE TRUE
## _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603 184 9 22 TRUE TRUE
## _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603 169 11 22 TRUE TRUE
## _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603 172 9 22 TRUE TRUE
## _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205 149 9 22 TRUE TRUE
## _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410 184 10 22 TRUE TRUE
## _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602 2294 72 39 TRUE TRUE
## IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL 730 48 35 TRUE TRUE *
## IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD 355 6 101 TRUE TRUE
Note that:
The name of this new germline database
(IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD) reflects the
fact that it contains germline gene alleles from the four T-cell
receptor loci: TRA (alpha chain), TRB (beta chain), TRG (gamma chain),
and TRD (delta chain).
This new germline database also includes its own internal
data and auxiliary data, as reported in the
intdata and auxdata columns.
See ?install_IMGT_germline_db and
?list_germline_dbs for more information.
Let’s select this new germline database as the germline database to
use with igblastn():
See ?use_germline_db for more information.
Use list_c_region_dbs() to list the C-region databases
that are available:
## db_name C
## _IMGT.human.IGH+IGK+IGL.202605 180 *
## _IMGT.human.TRA+TRB+TRG+TRD.202605 23
## _IMGT.mouse.IGH+IGK+IGL.202605 64
## _IMGT.mouse.TRA+TRB+TRG+TRD.202605 13
## _IMGT.rabbit.IGH+IGK+IGL.202605 50
## _IMGT.rabbit.TRA+TRB+TRG+TRD.202605 6
## _IMGT.rat.IGH+IGK+IGL.202605 25
## _IMGT.rhesus_monkey.IGH+IGK+IGL.202605 44
## _IMGT.rhesus_monkey.TRA+TRB+TRG+TRD.202605 10
For this analysis of human TCR sequences, we’ll select
_IMGT.human.TRA+TRB+TRG+TRD.202605 as the C-region database
to use with igblastn():
See ?use_c_region_db for more information.
Check the selected databases:
## [1] "IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD"
## [1] "_IMGT.human.TRA+TRB+TRG+TRD.202605"
Call igblastn():
## # A tibble: 10,875 × 111
## sequence_id sequence sequence_aa locus stop_codon vj_in_frame v_frameshift
## <chr> <chr> <chr> <chr> <lgl> <lgl> <lgl>
## 1 SRR11341217.1 GTCTCAG… VSDGYSVSRS… TRB FALSE TRUE FALSE
## 2 SRR11341217.2 AATGGGG… DLKNVFPP*A… TRB TRUE TRUE FALSE
## 3 SRR11341217.3 GAGTGGA… SGFVIDKFPI… TRB FALSE TRUE FALSE
## 4 SRR11341217.4 TCTTGAA… LEGFSAQLFP… TRB FALSE TRUE FALSE
## 5 SRR11341217.5 TTACCGA… DLKNVFPP*A… TRB TRUE FALSE FALSE
## 6 SRR11341217.6 GGGGTAC… GYSVSREKKE… TRB FALSE TRUE FALSE
## 7 SRR11341217.7 GCGTGAT… SAEFPKEGPS… TRB FALSE TRUE FALSE
## 8 SRR11341217.8 TGGAGAT… GDIAEGYSVS… TRB FALSE TRUE FALSE
## 9 SRR11341217.9 CACCCAC… DLKNVFPP*A… TRB TRUE FALSE FALSE
## 10 SRR11341217.10 TGATCGG… DRFSAERPKG… TRB FALSE TRUE FALSE
## # ℹ 10,865 more rows
## # ℹ 104 more variables: productive <lgl>, rev_comp <lgl>, complete_vdj <lgl>,
## # d_frame <lgl>, v_call <chr>, d_call <chr>, j_call <chr>, c_call <chr>,
## # sequence_alignment <chr>, germline_alignment <chr>,
## # sequence_alignment_aa <chr>, germline_alignment_aa <chr>,
## # v_alignment_start <int>, v_alignment_end <int>, d_alignment_start <int>,
## # d_alignment_end <int>, j_alignment_start <int>, j_alignment_end <int>, …
Note that, when using igblastn() to analyze TCR
sequences, we don’t need to specify the ig_seqtype argument
like we would have to if we were using the igblastn
standalone executable included in IgBLAST.
igblastn() will automatically set ig_seqtype
to "TCR" based on the name of the selected germline db. See
documentation of the ig_seqtype argument in
?igblastn for more information.
At the moment, the igblastr
package does not provide access to the full functionality of the IgBLAST
software. Most notably, the igblastp standalone
executable included included in IgBLAST has no counterpart in
igblastr.
Some future developments include:
Implement igblastp(), a wrapper to the
igblastp standalone executable included in IgBLAST
for protein-level alignments.
Add facilities to retrieve arbitrary germline databases from OGRDB, the AIRR Community’s Open Germline Receptor Database.
Here is the output of sessionInfo() on the system where
this document was compiled:
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats4 stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] ggseqlogo_0.2.2 scales_1.4.0 dplyr_1.2.1
## [4] ggplot2_4.0.3 igblastr_1.2.9 Biostrings_2.80.1
## [7] Seqinfo_1.2.0 XVector_0.52.0 IRanges_2.46.0
## [10] S4Vectors_0.50.1 BiocGenerics_0.58.1 generics_0.1.4
## [13] tibble_3.3.1 BiocStyle_2.40.0
##
## loaded via a namespace (and not attached):
## [1] utf8_1.2.6 sass_0.4.10 xml2_1.5.2
## [4] stringi_1.8.7 digest_0.6.39 magrittr_2.0.5
## [7] RColorBrewer_1.1-3 grid_4.6.0 evaluate_1.0.5
## [10] fastmap_1.2.0 R.oo_1.27.1 jsonlite_2.0.0
## [13] R.utils_2.13.0 GenomeInfoDb_1.48.0 BiocManager_1.30.27
## [16] httr_1.4.8 rvest_1.0.5 selectr_0.5-1
## [19] UCSC.utils_1.8.0 jquerylib_0.1.4 cli_3.6.6
## [22] rlang_1.2.0 crayon_1.5.3 R.methodsS3_1.8.2
## [25] withr_3.0.2 cachem_1.1.0 yaml_2.3.12
## [28] otel_0.2.0 tools_4.6.0 curl_7.1.0
## [31] buildtools_1.0.0 vctrs_0.7.3 R6_2.6.1
## [34] lifecycle_1.0.5 stringr_1.6.0 pkgconfig_2.0.3
## [37] gtable_0.3.6 pillar_1.11.1 bslib_0.11.0
## [40] glue_1.8.1 tidyselect_1.2.1 xfun_0.58
## [43] sys_3.4.3 knitr_1.51 farver_2.1.2
## [46] xtable_1.8-8 htmltools_0.5.9 labeling_0.4.3
## [49] rmarkdown_2.31 maketools_1.3.2 compiler_4.6.0
## [52] S7_0.2.2