igblastr overview

Introduction

The Immunoglobulin Basic Local Alignment Search Tool (IgBLAST) is a specialized bioinformatics tool developed by the National Center for Biotechnology Information (NCBI) for the analysis of B-cell receptor (BCR) and T-cell receptor (TCR) sequences (Ye et al., 2013). IgBLAST performs sequence alignment and annotation, with key outputs including germline V, D, and J gene assignments; characterization of somatic hypermutations introduced during affinity maturation; identification of complementarity-determining regions (CDR1–CDR3), framework regions (FWR1–FWR4), and isotype; and both nucleotide and protein-level alignments. These outputs form the basis for many downstream analyses of BCR and TCR repertoires, including clonotype identification, lineage tracing, and repertoire feature characterization.

igblastr is an R/Bioconductor package that provides functions to conveniently install and use a local IgBLAST installation from within R while offering additional BCR or TCR sequence annotation. The package is designed to make it as easy as possible to run IgBLAST in R by streamlining the installation of both IgBLAST and its associated germline databases. In particular, these installations can be performed with a single function call, do not require root access, and can persist across R sessions.

The main function in the package is igblastn(), a wrapper to the igblastn standalone executable included in IgBLAST. In addition to igblastn(), the package provides several other features, including:

  • A function (install_igblast()) to conveniently download and install a pre-compiled IgBLAST from NCBI.

  • A set of built-in germline databases from OGRDB, the AIRR Community’s Open Germline Receptor Database.

  • Functions to download and install germline databases from the IMGT/V-QUEST download site, and to configure them for use with igblastn().

  • The install_custom_germline_db() function to install a germline database from user-supplied germline gene allele sequences.

  • Functions to compute, access, and manipulate IgBLAST internal data and auxiliary data.

  • A set of built-in constant region (C-region) databases from IMGT/V-QUEST.

  • Utility functions to parse the results returned by igblastn().

  • A simple tool to vizualize the annotated sequences in a browser.

  • The percent_mutation() function to compute the percent mutation in the V, D, J segments of a set of BCR or TCR sequences.

  • Utility functions to download data from OAS, the Observed Antibody Space database, and prepare it for use with IgBLAST.

  • Etc.

Useful links:

Please use https://github.com/HyrienLab/igblastr/issues to report bugs, provide feedback, request features (etc) about igblastr.

Install and load the package

Like any Bioconductor package, igblastr should be installed with BiocManager::install():

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("igblastr")

BiocManager::install() will take care of installing the package dependencies that are missing.

To load igblastr into your R session, run:

library(igblastr)

Install and check IgBLAST

Install IgBLAST

If IgBLAST is already installed on your system, you can tell igblastr to use it by setting the environment variable IGBLAST_ROOT to the path of your IgBLAST installation. See ?IGBLAST_ROOT for more information.

Otherwise, simply call install_igblast() to install the latest version of IgBLAST. As of March 2025, NCBI provides pre-compiled versions of IgBLAST for Linux, Windows, Intel Mac and Mac Silicon. install_igblast() will automatically download the appropriate pre-compiled version of IgBLAST for your platform from the NCBI FTP site, and install it in a location that will be remembered across R sessions.

if (!has_igblast())
    install_igblast()
## Installing IgBLAST at /github/home/.cache/R/igblastr/igblast_roots/1.22.0 ... ok
## IgBLAST 1.22.0 successfully installed.

See ?install_igblast for more information.

Note that we use has_igblast() to avoid reinstalling IgBLAST if igblastr already has access to a working IgBLAST installation.

Check IgBLAST

To display basic information about the IgBLAST installation used by igblastr, run:

igblast_info()
## igblast_root:        /github/home/.cache/R/igblastr/igblast_roots/1.22.0
## igblast_build:       igblast 1.22.0, build Oct 11 2023 16:06:20
## igblastn_version:    1.22.0
## makeblastdb_version: 2.15.0+
## OS/arch:             Linux/x86_64
## igblast_organisms:   human, mouse, rabbit, rat, rhesus_monkey

Install and select a germline database

The igblastr package includes a FASTA file containing 8,437 paired heavy- and light-chain human antibody sequences (16,874 individual sequences) retrieved from OAS. These sequences will serve as our query sequences, that is, the immunoglobulin (Ig) sequences that analyzed in this vignette.

Before we can do so, the igblastn standalone executable included in IgBLAST – and, by extension, our igblastn() function – requires access to human germline V, D, and J gene allele sequences. To be used by igblastn, these sequences must first be organized into three separate BLAST databases: the V-region, D-region, and J-region databases. Collectively, we will refer to these three databases as the germline database.

Built-in germline databases

The igblastr package includes a set of built-in germline databases for human and mouse that were obtained from the OGRDB database (AIRR community). These can be listed with:

list_germline_dbs(builtin.only=TRUE)
##  db_name                                       V  D  J intdata auxdata
##  _OGRDB.human.IGH+IGK+IGL.202410             342 31 23    TRUE    TRUE
##  _OGRDB.human.IGH+IGK+IGL.202410.src         354 33 24    TRUE    TRUE
##  _OGRDB.human.IGH+IGK+IGL.202605             367 31 23    TRUE    TRUE
##  _OGRDB.human.IGH+IGK+IGL.202605.src         379 33 24    TRUE    TRUE
##  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603    184  9 22    TRUE    TRUE
##  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603   169 11 22    TRUE    TRUE
##  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603     172  9 22    TRUE    TRUE
##  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205  149  9 22    TRUE    TRUE
##  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410     184 10 22    TRUE    TRUE
##  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602    2294 72 39    TRUE    TRUE

The last part of the database name indicates the date of the download in YYYYMM format.

The intdata and auxdata columns indicate whether a database includes its own annotations for the germline V alleles (intdata column) and germline J alleles (auxdata column). More on this in the “Internal data and auxiliary data” section below.

If our query sequences are from humans, or one of the mouse strains listed above, or rhesus monkey, then we can already select the appropriate database with use_germline_db() and skip the subsection below.

Install additional germline databases

AIRR/OGRDB and IMGT/V-QUEST are two providers of germline databases that can be used with IgBLAST. If, for any reason, none of the built-in AIRR/OGRDB germline databases is suitable (e.g. your query sequences are not from human, mouse, or rhesus monkey), then you can use install_IMGT_germline_db() to install additional germline databases. Below, we show how to install the latest human germline database from IMGT/V-QUEST.

Install germline database from IMGT/V-QUEST

First, we list the most recent IMGT/V-QUEST releases:

head(list_IMGT_releases())
## [1] "202623-4" "202616-1" "202614-2" "202612-4" "202611-4" "202603-4"

The organisms included in release 202614-2 are:

list_IMGT_organisms("202614-2")
##  [1] "Aotus_nancymaae"          "Bos_taurus"              
##  [3] "Camelus_dromedarius"      "Canis_lupus_familiaris"  
##  [5] "Capra_hircus"             "Chondrichthyes"          
##  [7] "Danio_rerio"              "Equus_caballus"          
##  [9] "Felis_catus"              "Gadus_morhua"            
## [11] "Gallus_gallus"            "Gorilla_gorilla_gorilla" 
## [13] "Heterocephalus_glaber"    "Homo_sapiens"            
## [15] "Ictalurus_punctatus"      "Lemur_catta"             
## [17] "Macaca_fascicularis"      "Macaca_mulatta"          
## [19] "Mus_musculus"             "Mus_musculus_C57BL6J"    
## [21] "Mustela_putorius_furo"    "Neogale_vison"           
## [23] "Nonhuman_primates"        "Oncorhynchus_mykiss"     
## [25] "Ornithorhynchus_anatinus" "Oryctolagus_cuniculus"   
## [27] "Ovis_aries"               "Pan_troglodytes"         
## [29] "Pongo_abelii"             "Pongo_pygmaeus"          
## [31] "Rattus_norvegicus"        "Salmo_salar"             
## [33] "Sus_scrofa"               "Teleostei"               
## [35] "Tursiops_truncatus"       "Vicugna_pacos"

Next, we install the human germline database from the latest IMGT/V-QUEST release:

install_IMGT_germline_db("202614-2", organism="Homo sapiens")

See ?install_IMGT_germline_db for more information.

Install germline database from user-supplied gene allele sequences

The igblastr package provides the install_custom_germline_db() function to create and install a germline database from a set of user-supplied FASTA files containing germline V/D/J gene allele sequences.

See ?install_custom_germline_db for more information.

Select the germline database to use with igblastn()

Finally, we select the newly installed germline database as the germline database to use with igblastn():

use_germline_db("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL")
##   CONDITIONS OF USE AND LICENSE: The IMGT data is provided to the academic
##   users and NPO's (Not for Profit Organization(s)) under the CC BY-NC-ND 4.0
##   license. See https://creativecommons.org/licenses/by-nc-nd/4.0/. Any other
##   use of IMGT material, from the private sector, needs a financial arrangement
##   with CNRS.
## 
##   To suppress this message, use:
##     suppressMessages(use_germline_db("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))

See ?use_germline_db for more information.

To see the full list of cached germline dbs:

list_germline_dbs()
##  db_name                                       V  D  J intdata auxdata  
##  _OGRDB.human.IGH+IGK+IGL.202410             342 31 23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202410.src         354 33 24    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605             367 31 23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605.src         379 33 24    TRUE    TRUE  
##  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603    184  9 22    TRUE    TRUE  
##  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603   169 11 22    TRUE    TRUE  
##  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603     172  9 22    TRUE    TRUE  
##  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205  149  9 22    TRUE    TRUE  
##  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410     184 10 22    TRUE    TRUE  
##  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602    2294 72 39    TRUE    TRUE  
##  IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL      730 48 35    TRUE    TRUE *

Note that the asterisk (*) displayed at the far right of the output from list_germline_dbs() indicates the currently selected germline database (you may need to scroll horizontally to see the asterisk).

See ?list_germline_dbs for more information.

Internal data and auxiliary data

Note that the newly installed germline database (IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL) includes its own internal data and auxiliary data, as reported in the intdata and auxdata columns.

What are the internal data and auxiliary data?

The intdata and auxdata columns indicate whether a database includes annotations for the germline V alleles (intdata column) and germline J alleles (auxdata column). These annotations consist of reporting the coding frame start position and FWR/CDR boundaries on the V and J sequences. Note that the FWR/CDR boundaries on the V sequences are also known as the V gene delineations. For the J sequences the only FWR/CDR boundary is the CDR3/FWR4 boundary.

When analyzing BCR or TCR sequences with IgBLAST, the latter needs access to this information in order to annotate the former. In IgBLAST’s terminology, the annotations for the germline V and J alleles are called internal data and auxiliary data, respectively.

IgBLAST-provided internal/auxiliary data

IgBLAST provides internal data and auxiliary data for 5 organisms: human, mouse, rabbit, rat, and rhesus monkey. We’ll sometimes refer to these organisms as IgBLAST organisms.

This data can be accessed with load_intdata() and load_auxdata():

head(load_intdata("human"))
##   allele_name fwr1_start fwr1_end cdr1_start cdr1_end fwr2_start fwr2_end
## 1          A1          1       78         79      111        112      162
## 2         A10          1       78         79       96         97      147
## 3         A11          1       78         79       99        100      150
## 4         A14          1       78         79       96         97      147
## 5         A17          1       78         79      111        112      162
## 6        A18b          1       78         79      111        112      162
##   cdr2_start cdr2_end fwr3_start fwr3_end chain_type coding_frame_start
## 1        163      171        172      279         VK                  0
## 2        148      156        157      264         VK                  0
## 3        151      159        160      267         VK                  0
## 4        148      156        157      264         VK                  0
## 5        163      171        172      279         VK                  0
## 6        163      171        172      279         VK                  0
head(load_auxdata("human"))
##   allele_name coding_frame_start chain_type cdr3_end extra_bps
## 1    IGHJ1*01                  0         JH       17         1
## 2   IGHJ1P*01                  2         JH       NA        NA
## 3    IGHJ2*01                  1         JH       18         1
## 4   IGHJ2P*01                  0         JH       NA        NA
## 5    IGHJ3*01                  1         JH       15         1
## 6    IGHJ3*02                  1         JH       15         1

See ?load_intdata and ?load_auxdata for more information.

However, this data doesn’t get updated on a regular basis by the NCBI folks, and can be incomplete or out-of-sync with the germline V and J alleles provided by AIRR/OGRDB or IMGT/V-QUEST.

igblastr-generated internal/auxiliary data

Unlike the internal data and auxiliary data provided by IgBLAST, the internal data and auxiliary data included in a germline database is generated by igblastr, and is guaranteed to be complete and in sync with the germline V and J alleles in the database.

This data can also be accessed with load_intdata() and load_auxdata():

head(load_intdata("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))
##   allele_name fwr1_start fwr1_end cdr1_start cdr1_end fwr2_start fwr2_end
## 1 IGHV1-18*01          1       75         76       99        100      150
## 2 IGHV1-18*02          1       75         76       99        100      150
## 3 IGHV1-18*03          1       75         76       99        100      150
## 4 IGHV1-18*04          1       75         76       99        100      150
## 5  IGHV1-2*01          1       75         76       99        100      150
## 6  IGHV1-2*02          1       75         76       99        100      150
##   cdr2_start cdr2_end fwr3_start fwr3_end chain_type coding_frame_start
## 1        151      174        175      288         VH                  0
## 2        151      174        175      288         VH                  0
## 3        151      174        175      288         VH                  0
## 4        151      174        175      288         VH                  0
## 5        151      174        175      288         VH                  0
## 6        151      174        175      288         VH                  0
head(load_auxdata("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))
##   allele_name coding_frame_start chain_type cdr3_end extra_bps
## 1    IGHJ1*01                  0         JH       17         1
## 2    IGHJ2*01                  1         JH       18         1
## 3    IGHJ3*01                  1         JH       15         1
## 4    IGHJ3*02                  1         JH       15         1
## 5    IGHJ4*01                  2         JH       13         1
## 6    IGHJ4*02                  2         JH       13         1

See ?load_intdata and ?load_auxdata for more information.

How is this data generated?

For the built-in OGRDB germline databases (the _OGRDB.* databases), the internal data and auxiliary data is provided by AIRR/OGRDB.

For a germline database created with install_IMGT_germline_db() like IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL:

  • internal data: The V gene delineations are inferred from the gaps in the nucleotide sequences of the V alleles provided by IMGT/V-QUEST.

  • auxiliary data: The procedure to generate this data is described in the igblastr Wiki here.

How to use the igblastr-generated internal/auxiliary data with igblastr?

This happens automatically!

More precisely, if a germline database includes its own internal data and auxiliary data, then the igblastn() function will use this data instead of the internal data and auxiliary data provided by IgBLAST.

See documentation of the custom_internal_data and auxiliary_data arguments in ?igblastn for more information.

Select a constant region database (optional)

The igblastn standalone executable included in IgBLAST can also use constant region (C-region) sequences for sequence annotation. As with the germline V, D, and J gene allele sequences, the C-region sequences are generally expected to originate from the same organism(s) as the query sequences, and must likewise be formatted as a BLAST database. We will refer to this database as the C-region database.

The igblastr package includes built-in C-region databases for human, mouse, rabbit, rat, and rhesus monkey, obtained from IMGT/V-QUEST. The available databases can be listed using:

list_c_region_dbs()
##  db_name                                      C
##  _IMGT.human.IGH+IGK+IGL.202605             180
##  _IMGT.human.TRA+TRB+TRG+TRD.202605          23
##  _IMGT.mouse.IGH+IGK+IGL.202605              64
##  _IMGT.mouse.TRA+TRB+TRG+TRD.202605          13
##  _IMGT.rabbit.IGH+IGK+IGL.202605             50
##  _IMGT.rabbit.TRA+TRB+TRG+TRD.202605          6
##  _IMGT.rat.IGH+IGK+IGL.202605                25
##  _IMGT.rhesus_monkey.IGH+IGK+IGL.202605      44
##  _IMGT.rhesus_monkey.TRA+TRB+TRG+TRD.202605  10

The last part of the database name indicates the date of the download in YYYYMM format.

If your query sequences are from human, mouse, or rabbit, you can select the appropriate database using use_c_region_db():

use_c_region_db("_IMGT.human.IGH+IGK+IGL.202605")
##   CONDITIONS OF USE AND LICENSE: The IMGT data is provided to the academic
##   users and NPO's (Not for Profit Organization(s)) under the CC BY-NC-ND 4.0
##   license. See https://creativecommons.org/licenses/by-nc-nd/4.0/. Any other
##   use of IMGT material, from the private sector, needs a financial arrangement
##   with CNRS.
## 
##   To suppress this message, use:
##     suppressMessages(use_c_region_db("_IMGT.human.IGH+IGK+IGL.202605"))

Calling list_c_region_dbs() again should display an asterisk (*) at the far right of the output, indicating the currently selected C-region database.

See ?list_c_region_dbs for more information.

Use igblastn()

Now that we have selected the germline and C-region databases to use with igblastn(), we are almost ready to call igblastn() to perform the alignment.

Prepare the query sequences

As mentioned earlier, the igblastr package includes a FASTA file containing 8,437 paired heavy and light chain human antibody sequences retrieved from OAS. These will serve as our query sequences, that is, as the set of BCR sequences that we will analyse with igblastn().

To get the path to the query sequences, use:

query <- system.file(package="igblastr", "extdata",
                     "BCR", "1279067_1_Paired_sequences.fasta.gz")

The igblastr package also includes a JSON file containing metadata associated with the query BCR sequences:

json <- system.file(package="igblastr", "extdata",
                    "BCR", "1279067_1_Paired_All.json")
query_metadata <- jsonlite::fromJSON(json)
query_metadata
## $Run
## [1] 1279067
## 
## $Link
## [1] "https://doi.org/10.1038/s41586-022-05371-z"
## 
## $Author
## [1] "Jaffe et al., 2022"
## 
## $Species
## [1] "human"
## 
## $Age
## [1] 38
## 
## $BSource
## [1] "PBMC"
## 
## $BType
## [1] "Memory-B-Cells"
## 
## $Vaccine
## [1] "None"
## 
## $Disease
## [1] "None"
## 
## $Subject
## [1] "Donor-3"
## 
## $Longitudinal
## [1] "no"
## 
## $`Unique sequences`
## [1] 8437
## 
## $Isotype
## [1] "All"
## 
## $Chain
## [1] "Paired"

The 8,347 paired sequences come from memory B cells isolated from peripheral blood mononuclear cell (PBMC) samples of a single human donor (age 38) with no known disease or vaccination history. The source for these sequences is the Jaffe et al. (2022) study; the DOI link to the publication is provided above.

Call igblastn() on the query sequences

Before calling igblastn(), we first check the selected databases by calling use_germline_db() and use_c_region_db() with no arguments:

use_germline_db()
## [1] "IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"
use_c_region_db()
## [1] "_IMGT.human.IGH+IGK+IGL.202605"

Now, let’s call igblastn(). Since we are only interested in the best V alignment for each query sequence, we set num_alignments_V to 1. Analyzing this set of 16,874 BCR sequences may take up to around 3 min on a standard laptop:

AIRR_df <- igblastn(query, num_alignments_V=1)

By default, the result is an AIRR-formatted tibble, that is, a tibble with one row per query sequence and many columns:

AIRR_df
## # A tibble: 16,874 × 111
##    sequence_id    sequence sequence_aa locus stop_codon vj_in_frame v_frameshift
##    <chr>          <chr>    <chr>       <chr> <lgl>      <lgl>       <lgl>       
##  1 heavy_chain_A… ACAACCA… QVQLVQSGSE… IGH   FALSE      TRUE        FALSE       
##  2 light_chain_A… TGAGCGC… QSVLTQPPSV… IGL   FALSE      TRUE        FALSE       
##  3 heavy_chain_A… ACTTTCT… QVQLQESGPG… IGH   FALSE      TRUE        FALSE       
##  4 light_chain_A… GAGGAGT… DIQMTQSPSS… IGK   FALSE      TRUE        FALSE       
##  5 heavy_chain_A… GGGATCA… QVQLVQSGAE… IGH   FALSE      TRUE        FALSE       
##  6 light_chain_A… GCTGGGG… QSALTQPPSA… IGL   FALSE      TRUE        FALSE       
##  7 heavy_chain_A… GGAGCCC… EVQLVESGGG… IGH   FALSE      TRUE        FALSE       
##  8 light_chain_A… GGGGGCT… QSALTQPASV… IGL   FALSE      TRUE        FALSE       
##  9 heavy_chain_A… ATACTTT… QVQLQESGPG… IGH   FALSE      TRUE        FALSE       
## 10 light_chain_A… GAGCTAC… DIVMTQSPDS… IGK   FALSE      TRUE        FALSE       
## # ℹ 16,864 more rows
## # ℹ 104 more variables: productive <lgl>, rev_comp <lgl>, complete_vdj <lgl>,
## #   d_frame <lgl>, v_call <chr>, d_call <chr>, j_call <chr>, c_call <chr>,
## #   sequence_alignment <chr>, germline_alignment <chr>,
## #   sequence_alignment_aa <chr>, germline_alignment_aa <chr>,
## #   v_alignment_start <int>, v_alignment_end <int>, d_alignment_start <int>,
## #   d_alignment_end <int>, j_alignment_start <int>, j_alignment_end <int>, …

The columns are standard “Rearrangement Schema” fields. These fields are defined and documented by the AIRR Community at https://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields

You can call igbrowser() on AIRR_df to visualize the annotated sequences in a browser. For each sequence, the V, D, J, and C segments will be shown as well as the FWR1-4 and CDR1-3 regions.

See ?igblastn and ?igbrowser for more information.

Downstream analysis

In this section, we’re going to show some examples of simple downstream analyses that can be performed on the AIRR-formatted tibble returned by igblastn().

Distribution of percent mutation across BCR sequences

library(ggplot2)

One common analysis of AIRR format data is to examine the distribution of percent mutation across BCR sequences. Here we analyze the percent mutation in the V segments of each chain type (heavy, kappa, and lambda) at the nucleotide level. Note that V percent mutation at the nucleotide level is 100 - v_identity:

AIRR_df |>
    ggplot(aes(locus, 100 - v_identity)) +
    theme_bw(base_size=14) +
    geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
    geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
    ggtitle("Distribution of V percent mutation by locus at the nucleotide level") +
    xlab(NULL)

To do the same thing at the amino acid level, we first use the percent_mutation() function to compute the percent mutation in the V, D, J segments at the amino acid level:

perc_mut_aa <- percent_mutation(AIRR_df, for.aa=TRUE)
head(perc_mut_aa)
## # A tibble: 6 × 5
##   sequence_id                    locus v_perc_mut_aa d_perc_mut_aa j_perc_mut_aa
##   <chr>                          <chr>         <dbl>         <dbl>         <dbl>
## 1 heavy_chain_AAACCTGAGAGTACCG-… IGH           16.3            0            0   
## 2 light_chain_AAACCTGAGAGTACCG-… IGL           16.7           NA           16.7 
## 3 heavy_chain_AAACCTGAGATGGCGT-… IGH            6.06          20            6.25
## 4 light_chain_AAACCTGAGATGGCGT-… IGK            7.37          NA            9.09
## 5 heavy_chain_AAACCTGAGCAGACTG-… IGH           11.3           37.5          6.67
## 6 light_chain_AAACCTGAGCAGACTG-… IGL            4.08          NA            0

See ?percent_mutation for more information.

Then:

perc_mut_aa |>
    ggplot(aes(locus, v_perc_mut_aa)) +
    theme_bw(base_size=14) +
    geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
    geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
    ggtitle("Distribution of V percent mutation by locus at the amino acid level") +
    xlab(NULL)

Distribution of germline genes

library(dplyr)
library(scales)

Another common analysis is to investigate the distribution of germline genes (e.g., V genes). In this case, we typically stratify the analysis by locus or chain type.

plot_gene_dist <- function(AIRR_df, loc) {
    df_v_gene <- AIRR_df |>
        filter(locus == loc) |>
        mutate(v_gene = allele2gene(v_call)) |> # drop allele info
        group_by(v_gene) |>
        summarize(n = n(), .groups = "drop") |>
        mutate(frac = n / sum(n))
    df_v_gene |>
        ggplot(aes(frac, v_gene)) +
        theme_bw(base_size=13) +
        geom_col() +
        scale_x_continuous('Percent of sequences', labels = scales::percent) +
        ylab("Germline gene") +
        ggtitle(paste0(loc, "V gene prevalence"))
}
plot_gene_dist(AIRR_df, "IGH")

plot_gene_dist(AIRR_df, "IGK")

plot_gene_dist(AIRR_df, "IGL")

Lengths and motifs of CDR3 sequences

library(ggseqlogo)

A third category of analysis focuses on CDR3 sequences, including their lengths and motifs, which are often visualized using sequence logo plots.

AIRR_df$cdr3_aa_length <- nchar(AIRR_df$cdr3_aa)

AIRR_df |>
    group_by(locus, cdr3_aa_length) |>
    summarize(n = n(), .groups = "drop") |>
    ggplot(aes(cdr3_aa_length, n)) +
    theme_bw(base_size=14) +
    facet_wrap(~locus) +
    geom_col() +
    ggtitle("Histograms of CDR3 length by locus")

AIRR_df |>
    filter(locus == "IGK", cdr3_aa_length == 9) |>
    pull(cdr3_aa) |>
    ggseqlogo(method = "probability") +
    theme_bw(base_size=14) +
    ggtitle("Logo plot of kappa chain CDR3 sequences that are 9 AA long")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
##   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Advanced usage

Passing additional arguments to igblastn()

The igblastn standalone executable included in IgBLAST supports many command line arguments. You can quickly list them with igblastn_help() (or with igblastn_help(TRUE) to get an expanded list with more details – see ?igblastn_help):

igblastn_help()
## USAGE
##   igblastn [-h] [-help] [-germline_db_V germline_database_name]
##     [-num_alignments_V int_value] [-germline_db_V_seqidlist filename]
##     [-germline_db_D germline_database_name] [-num_alignments_D int_value]
##     [-germline_db_D_seqidlist filename]
##     [-germline_db_J germline_database_name] [-num_alignments_J int_value]
##     [-germline_db_J_seqidlist filename] [-num_alignments_C int_value]
##     [-c_region_db constant_region_database_name]
##     [-custom_internal_data filename] [-d_frame_data filename]
##     [-auxiliary_data filename] [-min_D_match min_D_match]
##     [-V_penalty V_penalty] [-D_penalty D_penalty] [-J_penalty J_penalty]
##     [-num_clonotype num_clonotype] [-clonotype_out clonotype_out]
##     [-allow_vdj_overlap] [-organism germline_origin]
##     [-domain_system domain_system] [-ig_seqtype sequence_type]
##     [-focus_on_V_segment] [-extend_align5end] [-extend_align3end]
##     [-min_V_length Min_V_Length] [-min_J_length Min_J_Length]
##     [-show_translation] [-db database_name] [-dbsize num_letters]
##     [-entrez_query entrez_query] [-query input_file] [-sra accession]
##     [-out output_file] [-evalue evalue] [-word_size int_value]
##     [-gapopen open_penalty] [-gapextend extend_penalty] [-searchsp int_value]
##     [-sum_stats bool_value] [-ungapped] [-lcase_masking] [-query_loc range]
##     [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis]
##     [-num_descriptions int_value] [-num_alignments int_value]
##     [-line_length line_length] [-num_threads int_value] [-remote] [-version]
## 
## DESCRIPTION
##    BLAST for Ig and TCR sequences
## 
## Use '-help' to print detailed descriptions of command line arguments

All these command line arguments can be passed to the igblastn() function using the usual argument_name=argument_value syntax. For example, command line argument -num_threads allows the user to leverage IgBLAST built-in parallel computing capabilities. To use it in igblastn(), set argument num_threads to the desired value e.g. by calling igblastn(query, num_threads=8).

Note that the Examples section in ?igblastn provides more information about using igblastn() in parallel.

Restrict the search to a subset of user-supplied gene alleles

Some arguments of particular interest are the germline_db_[VDJ]_seqidlist arguments. They allow restricting the search of the germline database to a list of gene alleles supplied by the user. This list can be provided as a character vector of gene allele identifiers (e.g. IGHV3-23*01, IGHV3-23*04, etc..), or as the path to a file containing the identifiers (one identifier per line). For example:

V_alleles <- c("IGHV3-23*01", "IGHV3-23*04")
igblastn(query, germline_db_V_seqidlist=V_alleles)

If the gene alleles are stored in a file, say in path/to/my_V_gene_alleles.txt, then use:

igblastn(query, germline_db_V_seqidlist=file("path/to/my_V_gene_alleles.txt"))

Note that in this case, the path to the file containing the gene alleles identifiers must be wrapped in a call to file().

See ?igblastn for more information.

A TCR analysis example

Even though NCBI IgBLAST primary use case is BCR analysis, it can also be used for TCR sequence analysis, and so does the igblastr package.

Prepare the TCR query sequences

File SRR11341217.fasta.gz included in the package contains 10,875 human beta chain TCR transcripts running from 5’ of reverse transcription reaction to beginning of constant region. See https://www.ncbi.nlm.nih.gov/sra/SRX7944361 for more information about this dataset.

filename <- "SRR11341217.fasta.gz"
query <- system.file(package="igblastr", "extdata", "TCR", filename)

Install a TCR germline database

To analyze this dataset with igblastn(), we need to install and select a human TCR germline database. We can use install_IMGT_germline_db() with the tcr.db argument set to TRUE for that. This will install a germline database made of the human TCR germline sequences provided by IMGT/V-QUEST:

db_name <- install_IMGT_germline_db("202614-2", organism="Homo sapiens",
                                    tcr.db=TRUE)

See the new germline database in the list displayed by list_germline_dbs():

list_germline_dbs()
##  db_name                                       V  D   J intdata auxdata  
##  _OGRDB.human.IGH+IGK+IGL.202410             342 31  23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202410.src         354 33  24    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605             367 31  23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605.src         379 33  24    TRUE    TRUE  
##  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603    184  9  22    TRUE    TRUE  
##  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603   169 11  22    TRUE    TRUE  
##  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603     172  9  22    TRUE    TRUE  
##  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205  149  9  22    TRUE    TRUE  
##  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410     184 10  22    TRUE    TRUE  
##  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602    2294 72  39    TRUE    TRUE  
##  IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL      730 48  35    TRUE    TRUE *
##  IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD  355  6 101    TRUE    TRUE

Note that:

  • The name of this new germline database (IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD) reflects the fact that it contains germline gene alleles from the four T-cell receptor loci: TRA (alpha chain), TRB (beta chain), TRG (gamma chain), and TRD (delta chain).

  • This new germline database also includes its own internal data and auxiliary data, as reported in the intdata and auxdata columns.

See ?install_IMGT_germline_db and ?list_germline_dbs for more information.

Let’s select this new germline database as the germline database to use with igblastn():

use_germline_db(db_name)

See ?use_germline_db for more information.

Select a TCR constant region database (optional)

Use list_c_region_dbs() to list the C-region databases that are available:

list_c_region_dbs()
##  db_name                                      C  
##  _IMGT.human.IGH+IGK+IGL.202605             180 *
##  _IMGT.human.TRA+TRB+TRG+TRD.202605          23  
##  _IMGT.mouse.IGH+IGK+IGL.202605              64  
##  _IMGT.mouse.TRA+TRB+TRG+TRD.202605          13  
##  _IMGT.rabbit.IGH+IGK+IGL.202605             50  
##  _IMGT.rabbit.TRA+TRB+TRG+TRD.202605          6  
##  _IMGT.rat.IGH+IGK+IGL.202605                25  
##  _IMGT.rhesus_monkey.IGH+IGK+IGL.202605      44  
##  _IMGT.rhesus_monkey.TRA+TRB+TRG+TRD.202605  10

For this analysis of human TCR sequences, we’ll select _IMGT.human.TRA+TRB+TRG+TRD.202605 as the C-region database to use with igblastn():

use_c_region_db("_IMGT.human.TRA+TRB+TRG+TRD.202605")

See ?use_c_region_db for more information.

Call igblastn() on the TCR query sequences

Check the selected databases:

use_germline_db()
## [1] "IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD"
use_c_region_db()
## [1] "_IMGT.human.TRA+TRB+TRG+TRD.202605"

Call igblastn():

AIRR_df <- igblastn(query)
AIRR_df
## # A tibble: 10,875 × 111
##    sequence_id    sequence sequence_aa locus stop_codon vj_in_frame v_frameshift
##    <chr>          <chr>    <chr>       <chr> <lgl>      <lgl>       <lgl>       
##  1 SRR11341217.1  GTCTCAG… VSDGYSVSRS… TRB   FALSE      TRUE        FALSE       
##  2 SRR11341217.2  AATGGGG… DLKNVFPP*A… TRB   TRUE       TRUE        FALSE       
##  3 SRR11341217.3  GAGTGGA… SGFVIDKFPI… TRB   FALSE      TRUE        FALSE       
##  4 SRR11341217.4  TCTTGAA… LEGFSAQLFP… TRB   FALSE      TRUE        FALSE       
##  5 SRR11341217.5  TTACCGA… DLKNVFPP*A… TRB   TRUE       FALSE       FALSE       
##  6 SRR11341217.6  GGGGTAC… GYSVSREKKE… TRB   FALSE      TRUE        FALSE       
##  7 SRR11341217.7  GCGTGAT… SAEFPKEGPS… TRB   FALSE      TRUE        FALSE       
##  8 SRR11341217.8  TGGAGAT… GDIAEGYSVS… TRB   FALSE      TRUE        FALSE       
##  9 SRR11341217.9  CACCCAC… DLKNVFPP*A… TRB   TRUE       FALSE       FALSE       
## 10 SRR11341217.10 TGATCGG… DRFSAERPKG… TRB   FALSE      TRUE        FALSE       
## # ℹ 10,865 more rows
## # ℹ 104 more variables: productive <lgl>, rev_comp <lgl>, complete_vdj <lgl>,
## #   d_frame <lgl>, v_call <chr>, d_call <chr>, j_call <chr>, c_call <chr>,
## #   sequence_alignment <chr>, germline_alignment <chr>,
## #   sequence_alignment_aa <chr>, germline_alignment_aa <chr>,
## #   v_alignment_start <int>, v_alignment_end <int>, d_alignment_start <int>,
## #   d_alignment_end <int>, j_alignment_start <int>, j_alignment_end <int>, …

Note that, when using igblastn() to analyze TCR sequences, we don’t need to specify the ig_seqtype argument like we would have to if we were using the igblastn standalone executable included in IgBLAST. igblastn() will automatically set ig_seqtype to "TCR" based on the name of the selected germline db. See documentation of the ig_seqtype argument in ?igblastn for more information.

Future developments and session information

Future developments

At the moment, the igblastr package does not provide access to the full functionality of the IgBLAST software. Most notably, the igblastp standalone executable included included in IgBLAST has no counterpart in igblastr.

Some future developments include:

  • Implement igblastp(), a wrapper to the igblastp standalone executable included in IgBLAST for protein-level alignments.

  • Add facilities to retrieve arbitrary germline databases from OGRDB, the AIRR Community’s Open Germline Receptor Database.

Session information

Here is the output of sessionInfo() on the system where this document was compiled:

sessionInfo()
## R version 4.6.0 (2026-04-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.4 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggseqlogo_0.2.2     scales_1.4.0        dplyr_1.2.1        
##  [4] ggplot2_4.0.3       igblastr_1.2.9      Biostrings_2.80.1  
##  [7] Seqinfo_1.2.0       XVector_0.52.0      IRanges_2.46.0     
## [10] S4Vectors_0.50.1    BiocGenerics_0.58.1 generics_0.1.4     
## [13] tibble_3.3.1        BiocStyle_2.40.0   
## 
## loaded via a namespace (and not attached):
##  [1] utf8_1.2.6          sass_0.4.10         xml2_1.5.2         
##  [4] stringi_1.8.7       digest_0.6.39       magrittr_2.0.5     
##  [7] RColorBrewer_1.1-3  grid_4.6.0          evaluate_1.0.5     
## [10] fastmap_1.2.0       R.oo_1.27.1         jsonlite_2.0.0     
## [13] R.utils_2.13.0      GenomeInfoDb_1.48.0 BiocManager_1.30.27
## [16] httr_1.4.8          rvest_1.0.5         selectr_0.5-1      
## [19] UCSC.utils_1.8.0    jquerylib_0.1.4     cli_3.6.6          
## [22] rlang_1.2.0         crayon_1.5.3        R.methodsS3_1.8.2  
## [25] withr_3.0.2         cachem_1.1.0        yaml_2.3.12        
## [28] otel_0.2.0          tools_4.6.0         curl_7.1.0         
## [31] buildtools_1.0.0    vctrs_0.7.3         R6_2.6.1           
## [34] lifecycle_1.0.5     stringr_1.6.0       pkgconfig_2.0.3    
## [37] gtable_0.3.6        pillar_1.11.1       bslib_0.11.0       
## [40] glue_1.8.1          tidyselect_1.2.1    xfun_0.58          
## [43] sys_3.4.3           knitr_1.51          farver_2.1.2       
## [46] xtable_1.8-8        htmltools_0.5.9     labeling_0.4.3     
## [49] rmarkdown_2.31      maketools_1.3.2     compiler_4.6.0     
## [52] S7_0.2.2