Introduction

The Immunoglobulin Basic Local Alignment Search Tool (IgBLAST) is a specialized bioinformatics tool developed by the National Center for Biotechnology Information (NCBI) for the analysis of B-cell receptor (BCR) and T-cell receptor (TCR) sequences (Ye et al., 2013). IgBLAST performs sequence alignment and annotation, with key outputs including germline V, D, and J gene assignments; characterization of somatic hypermutations introduced during affinity maturation; identification of complementarity-determining regions (CDR1–CDR3), framework regions (FWR1–FWR4), and isotype; and both nucleotide- and amino acid-level alignments. These outputs form the basis for many downstream analyses of BCR and TCR repertoires, including clonotype identification, lineage tracing, and repertoire feature characterization.

igblastr is an R/Bioconductor package that provides functions to conveniently install and use a local IgBLAST installation from within R. In addition to performing standard IgBLAST annotation, it offers additional BCR and TCR sequence annotation capabilities, returning all results in AIRR-style format.

The package is designed to make it as easy as possible to run IgBLAST in R by streamlining the installation of IgBLAST and IgBLAST-compatible germline databases. In particular, these installations can be performed with a single function call, do not require root access, and can persist across R sessions. Furthermore, when installing IgBLAST-compatible germline databases, the package will also automatically generate IgBLAST-compatible annotations for the V and J alleles included in the database. These automatic annotations can then be used as replacements for the internal data and auxiliary data provided by IgBLAST. Note that the latter can be outdated and/or incomplete, or simply missing when installing a germline database for a non-IgBLAST organism (IgBLAST only supports human, mouse, rabbit, rat, and rhesus monkey out of the box).

The main function in the package is igblastn(), a wrapper to the igblastn standalone executable included in IgBLAST. In addition to igblastn(), the package provides several other features, including:

The install_igblast() function to conveniently download and install a pre-compiled IgBLAST from NCBI.
A set of built-in IgBLAST-compatible germline databases from OGRDB, the AIRR Community’s Open Germline Receptor Database.
Functions to download and install IgBLAST-compatible germline databases from the IMGT/V-QUEST download site, and to configure them for use with igblastn().
The install_custom_germline_db() function to install an IgBLAST-compatible germline database from user-supplied germline gene allele sequences.
Functions to compute, access, and manipulate IgBLAST internal data and auxiliary data.
A set of built-in IgBLAST-compatible constant region (C-region) databases from IMGT/V-QUEST.
Utility functions to parse the results returned by igblastn().
The igbrowser() function to vizualize the annotated sequences in a browser.
The percent_mutation() function to compute the percent mutation in the V, D, J segments of a set of BCR or TCR sequences.
Utility functions to download data from OAS, the Observed Antibody Space database, and prepare it for use with IgBLAST.

Useful links:

IgBLAST is described at https://pubmed.ncbi.nlm.nih.gov/23671333/
IgBLAST web interface: https://www.ncbi.nlm.nih.gov/igblast/
OGRDB: https://ogrdb.airr-community.org/
IMGT/V-QUEST download site: https://www.imgt.org/download/V-QUEST/

Please use https://github.com/HyrienLab/igblastr/issues to report bugs, provide feedback, request features (etc) about igblastr.

Install and load the package

Like any Bioconductor package, igblastr should be installed with BiocManager::install():

if (!require("BiocManager", quietly=TRUE))
    install.packages("BiocManager")
BiocManager::install("igblastr")

BiocManager::install() will take care of installing the package dependencies that are missing.

To load igblastr into your R session, run:

library(igblastr)

Install and check IgBLAST

Install IgBLAST

If IgBLAST is already installed on your system, you can tell igblastr to use it by setting the environment variable IGBLAST_ROOT to the path of your IgBLAST installation. See ?IGBLAST_ROOT for more information.

Otherwise, simply call install_igblast() to install the latest version of IgBLAST. As of March 2025, NCBI provides pre-compiled versions of IgBLAST for Linux, Windows, Intel Mac and Mac Silicon. install_igblast() will automatically download the appropriate pre-compiled version of IgBLAST for your platform from the NCBI FTP site, and install it in a location that will be remembered across R sessions.

if (!has_igblast())
    install_igblast()

## Installing IgBLAST at /github/home/.cache/R/igblastr/igblast_roots/1.22.0 ... ok
## IgBLAST 1.22.0 successfully installed.

See ?install_igblast for more information.

Note that we use has_igblast() to avoid reinstalling IgBLAST if igblastr already has access to a working IgBLAST installation.

Check IgBLAST

To display basic information about the IgBLAST installation used by igblastr, run:

igblast_info()

## igblast_root:        /github/home/.cache/R/igblastr/igblast_roots/1.22.0
## igblast_build:       igblast 1.22.0, build Oct 11 2023 16:06:20
## igblastn_version:    1.22.0
## makeblastdb_version: 2.15.0+
## OS/arch:             Linux/x86_64
## igblast_organisms:   human, mouse, rabbit, rat, rhesus_monkey

Install and select an IgBLAST-compatible germline database

To analyze BCR or TCR sequences, IgBLAST aligns each of them to a reference set of germline V, D, and J gene allele sequences. Therefore, in order to run IgBLAST, we need access to an IgBLAST-compatible germline database against which the sequence to analyze will be aligned. Note that such database actually consists of three separate BLAST databases bundled together: the V-, D-, and J-region BLAST databases.

Built-in germline databases

The igblastr package includes a set of built-in IgBLAST-compatible germline databases for human, mouse, rainbow trout, and rhesus monkey that were obtained from the OGRDB database (AIRR community). To get the list of built-in germline databases, run:

list_germline_dbs(builtin.only=TRUE)

##  db_name                                       V  D  J intdata auxdata
##  _OGRDB.human.IGH+IGK+IGL.202410             342 31 23    TRUE    TRUE
##  _OGRDB.human.IGH+IGK+IGL.202410.src         354 33 24    TRUE    TRUE
##  _OGRDB.human.IGH+IGK+IGL.202605             367 31 23    TRUE    TRUE
##  _OGRDB.human.IGH+IGK+IGL.202605.src         379 33 24    TRUE    TRUE
##  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603    184  9 22    TRUE    TRUE
##  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603   169 11 22    TRUE    TRUE
##  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603     172  9 22    TRUE    TRUE
##  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205  149  9 22    TRUE    TRUE
##  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410     184 10 22    TRUE    TRUE
##  _OGRDB.rainbow_trout.IGH.202606              81 26 18    TRUE    TRUE
##  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602    2294 72 39    TRUE    TRUE

The last part of the database name indicates the date of the download in YYYYMM format.

The intdata and auxdata columns indicate whether a database includes its own annotations for the germline V alleles (intdata column) and germline J alleles (auxdata column). More on this in the “Internal data and auxiliary data” section below.

If the BCR sequences to analyze are from one of the OGRDB organisms then we can already select the appropriate database with use_germline_db() and skip the subsection below.

Install additional IgBLAST-compatible germline databases

AIRR/OGRDB and IMGT/V-QUEST are two providers of germline datasets that can be used to create IgBLAST-compatible germline databases. If, for any reason, none of the built-in AIRR/OGRDB germline databases is suitable (e.g., the sequences to analyze are not from human, mouse, rainbow trout, or rhesus monkey), then we can use install_IMGT_germline_db() to create and install additional IgBLAST-compatible germline databases. Below, we show how to create and install an IgBLAST-compatible germline database from germline V, D, and J gene sequences available on IMGT/V-QUEST download site.

Install an IgBLAST-compatible germline database from IMGT/V-QUEST

First, we list the most recent IMGT/V-QUEST releases:

head(list_IMGT_releases())

## [1] "202625-4" "202623-4" "202616-1" "202614-2" "202612-4" "202611-4"

The organisms included in release 202614-2 are:

list_IMGT_organisms("202614-2")

##  [1] "Aotus_nancymaae"          "Bos_taurus"              
##  [3] "Camelus_dromedarius"      "Canis_lupus_familiaris"  
##  [5] "Capra_hircus"             "Chondrichthyes"          
##  [7] "Danio_rerio"              "Equus_caballus"          
##  [9] "Felis_catus"              "Gadus_morhua"            
## [11] "Gallus_gallus"            "Gorilla_gorilla_gorilla" 
## [13] "Heterocephalus_glaber"    "Homo_sapiens"            
## [15] "Ictalurus_punctatus"      "Lemur_catta"             
## [17] "Macaca_fascicularis"      "Macaca_mulatta"          
## [19] "Mus_musculus"             "Mus_musculus_C57BL6J"    
## [21] "Mustela_putorius_furo"    "Neogale_vison"           
## [23] "Nonhuman_primates"        "Oncorhynchus_mykiss"     
## [25] "Ornithorhynchus_anatinus" "Oryctolagus_cuniculus"   
## [27] "Ovis_aries"               "Pan_troglodytes"         
## [29] "Pongo_abelii"             "Pongo_pygmaeus"          
## [31] "Rattus_norvegicus"        "Salmo_salar"             
## [33] "Sus_scrofa"               "Teleostei"               
## [35] "Tursiops_truncatus"       "Vicugna_pacos"

Next, we create and install an IgBLAST-compatible germline database from the human germline V, D, and J gene sequences available in IMGT/V-QUEST release 202614-2:

install_IMGT_germline_db("202614-2", organism="Homo sapiens")

See ?install_IMGT_germline_db for more information.

Install an IgBLAST-compatible germline database from user-supplied gene allele sequences

The igblastr package provides the install_custom_germline_db() function to create and install an IgBLAST-compatible germline database from a set of user-supplied FASTA files containing germline V/D/J gene allele nucleotide sequences.

See ?install_custom_germline_db for more information.

Select the germline database to use with igblastn()

Finally, we select the newly installed database as the germline database to use with igblastn(). For example, assuming that the sequences to analyze are human antibody sequences:

use_germline_db("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL")

##   CONDITIONS OF USE AND LICENSE: The IMGT data is provided to the academic
##   users and NPO's (Not for Profit Organization(s)) under the CC BY-NC-ND 4.0
##   license. See https://creativecommons.org/licenses/by-nc-nd/4.0/. Any other
##   use of IMGT material, from the private sector, needs a financial arrangement
##   with CNRS.
## 
##   To suppress this message, use:
##     suppressMessages(use_germline_db("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))

See ?use_germline_db for more information.

To see the full list of cached germline databases:

list_germline_dbs()

##  db_name                                       V  D  J intdata auxdata  
##  _OGRDB.human.IGH+IGK+IGL.202410             342 31 23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202410.src         354 33 24    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605             367 31 23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605.src         379 33 24    TRUE    TRUE  
##  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603    184  9 22    TRUE    TRUE  
##  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603   169 11 22    TRUE    TRUE  
##  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603     172  9 22    TRUE    TRUE  
##  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205  149  9 22    TRUE    TRUE  
##  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410     184 10 22    TRUE    TRUE  
##  _OGRDB.rainbow_trout.IGH.202606              81 26 18    TRUE    TRUE  
##  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602    2294 72 39    TRUE    TRUE  
##  IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL      730 48 35    TRUE    TRUE *

Note that the asterisk (*) displayed at the far right of the output from list_germline_dbs() indicates the currently selected germline database (you may need to scroll horizontally to see the asterisk).

See ?list_germline_dbs for more information.

Internal data and auxiliary data

Note that the newly installed germline database (IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL) includes its own internal data and auxiliary data, as reported in the intdata and auxdata columns.

What are the internal data and auxiliary data?

The intdata and auxdata columns indicate whether a database includes annotations for the germline V alleles (intdata column) and germline J alleles (auxdata column). These annotations consist of reporting the coding frame start position and FWR/CDR boundaries on the V and J sequences. Note that the FWR/CDR boundaries on the V sequences are also known as the V gene delineations. For the J sequences, the only FWR/CDR boundary is the CDR3/FWR4 boundary.

When analyzing BCR or TCR sequences with IgBLAST, the latter needs access to this information in order to annotate the former. In IgBLAST’s terminology, the annotations for the germline V and J alleles are called internal data and auxiliary data, respectively.

IgBLAST-provided internal/auxiliary data

IgBLAST provides internal data and auxiliary data for 5 organisms: human, mouse, rabbit, rat, and rhesus monkey. We’ll sometimes refer to these organisms as IgBLAST organisms.

This data can be accessed with load_intdata() and load_auxdata():

head(load_intdata("human"))

##   allele_name fwr1_start fwr1_end cdr1_start cdr1_end fwr2_start fwr2_end
## 1          A1          1       78         79      111        112      162
## 2         A10          1       78         79       96         97      147
## 3         A11          1       78         79       99        100      150
## 4         A14          1       78         79       96         97      147
## 5         A17          1       78         79      111        112      162
## 6        A18b          1       78         79      111        112      162
##   cdr2_start cdr2_end fwr3_start fwr3_end chain_type coding_frame_start
## 1        163      171        172      279         VK                  0
## 2        148      156        157      264         VK                  0
## 3        151      159        160      267         VK                  0
## 4        148      156        157      264         VK                  0
## 5        163      171        172      279         VK                  0
## 6        163      171        172      279         VK                  0

head(load_auxdata("human"))

##   allele_name coding_frame_start chain_type cdr3_end extra_bps
## 1    IGHJ1*01                  0         JH       17         1
## 2   IGHJ1P*01                  2         JH       NA        NA
## 3    IGHJ2*01                  1         JH       18         1
## 4   IGHJ2P*01                  0         JH       NA        NA
## 5    IGHJ3*01                  1         JH       15         1
## 6    IGHJ3*02                  1         JH       15         1

See ?load_intdata and ?load_auxdata for more information.

However, this data doesn’t get updated on a regular basis by the NCBI folks, and can be incomplete or out-of-sync with the germline V and J alleles provided by AIRR/OGRDB or IMGT/V-QUEST.

igblastr-generated internal/auxiliary data

Unlike the internal data and auxiliary data provided by IgBLAST, the internal data and auxiliary data included in a germline database is generated by igblastr, and is guaranteed to be complete and in sync with the germline V and J alleles in the database.

This data can also be accessed with load_intdata() and load_auxdata():

head(load_intdata("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))

##   allele_name fwr1_start fwr1_end cdr1_start cdr1_end fwr2_start fwr2_end
## 1 IGHV1-18*01          1       75         76       99        100      150
## 2 IGHV1-18*02          1       75         76       99        100      150
## 3 IGHV1-18*03          1       75         76       99        100      150
## 4 IGHV1-18*04          1       75         76       99        100      150
## 5  IGHV1-2*01          1       75         76       99        100      150
## 6  IGHV1-2*02          1       75         76       99        100      150
##   cdr2_start cdr2_end fwr3_start fwr3_end chain_type coding_frame_start
## 1        151      174        175      288         VH                  0
## 2        151      174        175      288         VH                  0
## 3        151      174        175      288         VH                  0
## 4        151      174        175      288         VH                  0
## 5        151      174        175      288         VH                  0
## 6        151      174        175      288         VH                  0

head(load_auxdata("IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"))

##   allele_name coding_frame_start chain_type cdr3_end extra_bps
## 1    IGHJ1*01                  0         JH       17         1
## 2    IGHJ2*01                  1         JH       18         1
## 3    IGHJ3*01                  1         JH       15         1
## 4    IGHJ3*02                  1         JH       15         1
## 5    IGHJ4*01                  2         JH       13         1
## 6    IGHJ4*02                  2         JH       13         1

See ?load_intdata and ?load_auxdata for more information.

How is this data generated?

For the built-in OGRDB germline databases (the _OGRDB.* databases), the internal data and auxiliary data is provided by AIRR/OGRDB.

For a germline database created with install_IMGT_germline_db() like IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL:

internal data: The V gene delineations are inferred from the gaps in the nucleotide sequences of the V alleles provided by IMGT/V-QUEST.
auxiliary data: The procedure to generate this data is described in the igblastr Wiki here.

How to use the igblastr-generated internal/auxiliary data with igblastr?

This happens automatically!

More precisely, if a germline database includes its own internal data and auxiliary data, then the igblastn() function will use this data instead of the internal data and auxiliary data provided by IgBLAST.

See documentation of the custom_internal_data and auxiliary_data arguments in ?igblastn for more information.

Select a constant region database (optional)

The igblastn standalone executable included in IgBLAST can also use constant region (C-region) sequences for sequence annotation. As with the germline V, D, and J gene allele sequences, the C-region sequences are expected to originate from the same organism as the sequences to analyze, and must likewise be formatted as a BLAST database. We will refer to this database as the C-region database.

The igblastr package includes built-in C-region databases for human, mouse, rabbit, rat, and rhesus monkey, obtained from IMGT/V-QUEST. The available databases can be listed using:

list_c_region_dbs()

##  db_name                                      C
##  _IMGT.human.IGH+IGK+IGL.202605             180
##  _IMGT.human.TRA+TRB+TRG+TRD.202605          23
##  _IMGT.mouse.IGH+IGK+IGL.202605              64
##  _IMGT.mouse.TRA+TRB+TRG+TRD.202605          13
##  _IMGT.rabbit.IGH+IGK+IGL.202605             50
##  _IMGT.rabbit.TRA+TRB+TRG+TRD.202605          6
##  _IMGT.rat.IGH+IGK+IGL.202605                25
##  _IMGT.rhesus_monkey.IGH+IGK+IGL.202605      44
##  _IMGT.rhesus_monkey.TRA+TRB+TRG+TRD.202605  10

The last part of the database name indicates the date of the download in YYYYMM format.

If the sequences to analyze are from one of the organisms listed above, we can select the appropriate C-region database using use_c_region_db(). For example, assuming that the sequences to analyze are human antibody sequences:

use_c_region_db("_IMGT.human.IGH+IGK+IGL.202605")

##   CONDITIONS OF USE AND LICENSE: The IMGT data is provided to the academic
##   users and NPO's (Not for Profit Organization(s)) under the CC BY-NC-ND 4.0
##   license. See https://creativecommons.org/licenses/by-nc-nd/4.0/. Any other
##   use of IMGT material, from the private sector, needs a financial arrangement
##   with CNRS.
## 
##   To suppress this message, use:
##     suppressMessages(use_c_region_db("_IMGT.human.IGH+IGK+IGL.202605"))

Calling list_c_region_dbs() again should display an asterisk (*) at the far right of the output, indicating the currently selected C-region database.

See ?list_c_region_dbs for more information.

Use igblastn()

Now that we have selected the germline and C-region databases to use with igblastn(), we are almost ready to call igblastn() to perform the alignment.

Prepare the query sequences

In the context of sequence analysis with IgBLAST, we refer to the set of sequences to analyze as the query sequences.

The igblastr package includes a FASTA file containing 8,437 paired heavy- and light-chain human antibody sequences (16,874 individual sequences) retrieved from OAS. This BCR sequencing dataset will serve as the query sequences that we analyze in this section.

To get the path to the query sequences, use:

query <- system.file(package="igblastr", "extdata",
                     "BCR", "1279067_1_Paired_sequences.fasta.gz")

The igblastr package also includes a JSON file containing metadata describing this BCR sequencing dataset:

json <- system.file(package="igblastr", "extdata",
                    "BCR", "1279067_1_Paired_All.json")
query_metadata <- jsonlite::fromJSON(json)
query_metadata

## $Run
## [1] 1279067
## 
## $Link
## [1] "https://doi.org/10.1038/s41586-022-05371-z"
## 
## $Author
## [1] "Jaffe et al., 2022"
## 
## $Species
## [1] "human"
## 
## $Age
## [1] 38
## 
## $BSource
## [1] "PBMC"
## 
## $BType
## [1] "Memory-B-Cells"
## 
## $Vaccine
## [1] "None"
## 
## $Disease
## [1] "None"
## 
## $Subject
## [1] "Donor-3"
## 
## $Longitudinal
## [1] "no"
## 
## $`Unique sequences`
## [1] 8437
## 
## $Isotype
## [1] "All"
## 
## $Chain
## [1] "Paired"

The 8,437 paired sequences come from memory B cells isolated from peripheral blood mononuclear cell (PBMC) samples of a single human donor (age 38) with no known disease or vaccination history. The source for these sequences is the Jaffe et al. (2022) study; the DOI link to the publication is provided above.

Call igblastn() on the query sequences

To analyze our query sequences, we’re going to use the igblastn() function. This is a wrapper to the igblastn standalone executable included in IgBLAST.

Before calling igblastn(), we first check the selected databases by calling use_germline_db() and use_c_region_db() with no arguments:

use_germline_db()

## [1] "IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"

use_c_region_db()

## [1] "_IMGT.human.IGH+IGK+IGL.202605"

Now, let’s call igblastn(). Since we are only interested in the best V alignment for each query sequence, we set num_alignments_V to 1. Analyzing this set of 16,874 BCR sequences may take up to around 3 min on a standard laptop:

AIRR_df <- igblastn(query, num_alignments_V=1)

By default, the result is an AIRR-formatted tibble, that is, a tibble with one row per query sequence and many columns:

AIRR_df

## # A tibble: 16,874 × 111
##    sequence_id    sequence sequence_aa locus stop_codon vj_in_frame v_frameshift
##    <chr>          <chr>    <chr>       <chr> <lgl>      <lgl>       <lgl>       
##  1 heavy_chain_A… ACAACCA… QVQLVQSGSE… IGH   FALSE      TRUE        FALSE       
##  2 light_chain_A… TGAGCGC… QSVLTQPPSV… IGL   FALSE      TRUE        FALSE       
##  3 heavy_chain_A… ACTTTCT… QVQLQESGPG… IGH   FALSE      TRUE        FALSE       
##  4 light_chain_A… GAGGAGT… DIQMTQSPSS… IGK   FALSE      TRUE        FALSE       
##  5 heavy_chain_A… GGGATCA… QVQLVQSGAE… IGH   FALSE      TRUE        FALSE       
##  6 light_chain_A… GCTGGGG… QSALTQPPSA… IGL   FALSE      TRUE        FALSE       
##  7 heavy_chain_A… GGAGCCC… EVQLVESGGG… IGH   FALSE      TRUE        FALSE       
##  8 light_chain_A… GGGGGCT… QSALTQPASV… IGL   FALSE      TRUE        FALSE       
##  9 heavy_chain_A… ATACTTT… QVQLQESGPG… IGH   FALSE      TRUE        FALSE       
## 10 light_chain_A… GAGCTAC… DIVMTQSPDS… IGK   FALSE      TRUE        FALSE       
## # ℹ 16,864 more rows
## # ℹ 104 more variables: productive <lgl>, rev_comp <lgl>, complete_vdj <lgl>,
## #   d_frame <lgl>, v_call <chr>, d_call <chr>, j_call <chr>, c_call <chr>,
## #   sequence_alignment <chr>, germline_alignment <chr>,
## #   sequence_alignment_aa <chr>, germline_alignment_aa <chr>,
## #   v_alignment_start <int>, v_alignment_end <int>, d_alignment_start <int>,
## #   d_alignment_end <int>, j_alignment_start <int>, j_alignment_end <int>, …

The columns are standard “Rearrangement Schema” fields. These fields are defined and documented by the AIRR Community at https://docs.airr-community.org/en/latest/datarep/rearrangements.html#fields

You can call igbrowser() on AIRR_df to visualize the annotated sequences in a browser. For each sequence, the V, D, J, and C segments will be shown as well as the FWR1-4 and CDR1-3 regions.

See ?igblastn and ?igbrowser for more information.

Downstream analysis

In this section, we’re going to show some examples of simple downstream analyses that can be performed on the AIRR-formatted tibble returned by igblastn().

Distribution of percent mutation across BCR sequences

library(ggplot2)

One common analysis of AIRR format data is to examine the distribution of percent mutation across BCR sequences. Here we analyze the percent mutation in the V segments of each chain type (heavy, kappa, and lambda) at the nucleotide level. Note that V percent mutation at the nucleotide level is 100 - v_identity:

AIRR_df |>
    ggplot(aes(locus, 100 - v_identity)) +
    theme_bw(base_size=14) +
    geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
    geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
    ggtitle("Distribution of V percent mutation by locus at the nucleotide level") +
    xlab(NULL)

To do the same thing at the amino acid level, we first use the percent_mutation() function to compute the percent mutation in the V, D, J segments at the amino acid level:

perc_mut_aa <- percent_mutation(AIRR_df, for.aa=TRUE)
head(perc_mut_aa)

## # A tibble: 6 × 5
##   sequence_id                    locus v_perc_mut_aa d_perc_mut_aa j_perc_mut_aa
##   <chr>                          <chr>         <dbl>         <dbl>         <dbl>
## 1 heavy_chain_AAACCTGAGAGTACCG-… IGH           16.3            0            0   
## 2 light_chain_AAACCTGAGAGTACCG-… IGL           16.7           NA           16.7 
## 3 heavy_chain_AAACCTGAGATGGCGT-… IGH            6.06          20            6.25
## 4 light_chain_AAACCTGAGATGGCGT-… IGK            7.37          NA            9.09
## 5 heavy_chain_AAACCTGAGCAGACTG-… IGH           11.3           37.5          6.67
## 6 light_chain_AAACCTGAGCAGACTG-… IGL            4.08          NA            0

See ?percent_mutation for more information.

Then:

perc_mut_aa |>
    ggplot(aes(locus, v_perc_mut_aa)) +
    theme_bw(base_size=14) +
    geom_point(position = position_jitter(width = 0.3), alpha = 0.1) +
    geom_boxplot(color = "blue", fill = NA, outliers = FALSE, alpha = 0.3) +
    ggtitle("Distribution of V percent mutation by locus at the amino acid level") +
    xlab(NULL)

Distribution of germline genes

library(dplyr)
library(scales)

Another common analysis is to investigate the distribution of germline genes (e.g., V genes). In this case, we typically stratify the analysis by locus or chain type.

plot_gene_dist <- function(AIRR_df, loc) {
    df_v_gene <- AIRR_df |>
        filter(locus == loc) |>
        mutate(v_gene = allele2gene(v_call)) |>  # drop allele info
        group_by(v_gene) |>
        summarize(n = n(), .groups = "drop") |>
        mutate(frac = n / sum(n))
    df_v_gene |>
        ggplot(aes(frac, v_gene)) +
        theme_bw(base_size=13) +
        geom_col() +
        scale_x_continuous('Percent of sequences', labels = scales::percent) +
        ylab("Germline gene") +
        ggtitle(paste0(loc, "V gene prevalence"))
}

plot_gene_dist(AIRR_df, "IGH")

plot_gene_dist(AIRR_df, "IGK")

plot_gene_dist(AIRR_df, "IGL")

Lengths and motifs of CDR3 sequences

library(ggseqlogo)

A third category of analysis focuses on CDR3 sequences, including their lengths and motifs, which are often visualized using sequence logo plots.

AIRR_df$cdr3_aa_length <- nchar(AIRR_df$cdr3_aa)

AIRR_df |>
    group_by(locus, cdr3_aa_length) |>
    summarize(n = n(), .groups = "drop") |>
    ggplot(aes(cdr3_aa_length, n)) +
    theme_bw(base_size=14) +
    facet_wrap(~locus) +
    geom_col() +
    ggtitle("Histograms of CDR3 length by locus")

AIRR_df |>
    filter(locus == "IGK", cdr3_aa_length == 9) |>
    pull(cdr3_aa) |>
    ggseqlogo(method = "probability") +
    theme_bw(base_size=14) +
    ggtitle("Logo plot of kappa chain CDR3 sequences that are 9 AA long")

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
##   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once per session.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Advanced usage

Passing additional arguments to igblastn()

The igblastn standalone executable included in IgBLAST supports many command line arguments. You can quickly list them with igblastn_help() (or with igblastn_help(TRUE) to get an expanded list with more details – see ?igblastn_help):

igblastn_help()

## USAGE
##   igblastn [-h] [-help] [-germline_db_V germline_database_name]
##     [-num_alignments_V int_value] [-germline_db_V_seqidlist filename]
##     [-germline_db_D germline_database_name] [-num_alignments_D int_value]
##     [-germline_db_D_seqidlist filename]
##     [-germline_db_J germline_database_name] [-num_alignments_J int_value]
##     [-germline_db_J_seqidlist filename] [-num_alignments_C int_value]
##     [-c_region_db constant_region_database_name]
##     [-custom_internal_data filename] [-d_frame_data filename]
##     [-auxiliary_data filename] [-min_D_match min_D_match]
##     [-V_penalty V_penalty] [-D_penalty D_penalty] [-J_penalty J_penalty]
##     [-num_clonotype num_clonotype] [-clonotype_out clonotype_out]
##     [-allow_vdj_overlap] [-organism germline_origin]
##     [-domain_system domain_system] [-ig_seqtype sequence_type]
##     [-focus_on_V_segment] [-extend_align5end] [-extend_align3end]
##     [-min_V_length Min_V_Length] [-min_J_length Min_J_Length]
##     [-show_translation] [-db database_name] [-dbsize num_letters]
##     [-entrez_query entrez_query] [-query input_file] [-sra accession]
##     [-out output_file] [-evalue evalue] [-word_size int_value]
##     [-gapopen open_penalty] [-gapextend extend_penalty] [-searchsp int_value]
##     [-sum_stats bool_value] [-ungapped] [-lcase_masking] [-query_loc range]
##     [-strand strand] [-parse_deflines] [-outfmt format] [-show_gis]
##     [-num_descriptions int_value] [-num_alignments int_value]
##     [-line_length line_length] [-num_threads int_value] [-remote] [-version]
## 
## DESCRIPTION
##    BLAST for Ig and TCR sequences
## 
## Use '-help' to print detailed descriptions of command line arguments

All these command line arguments can be passed to the igblastn() function using the usual argument_name=argument_value syntax. For example, command line argument -num_threads allows the user to leverage IgBLAST built-in parallel computing capabilities. To use it in igblastn(), set argument num_threads to the desired value e.g. by calling igblastn(query, num_threads=8).

Note that the Examples section in ?igblastn provides more information about using igblastn() in parallel.

Restrict the search to a subset of user-supplied gene alleles

Some arguments of particular interest are the germline_db_[VDJ]_seqidlist arguments. They allow restricting the search of the germline database to a list of gene alleles supplied by the user. This list can be provided as a character vector of gene allele identifiers (e.g., IGHV3-23*01, IGHV3-23*04, etc..), or as the path to a file containing the identifiers (one identifier per line). For example:

V_alleles <- c("IGHV3-23*01", "IGHV3-23*04")
igblastn(query, germline_db_V_seqidlist=V_alleles)

If the gene alleles are stored in a file, say in path/to/my_V_gene_alleles.txt, then use:

igblastn(query, germline_db_V_seqidlist=file("path/to/my_V_gene_alleles.txt"))

Note that in this case, the path to the file containing the gene alleles identifiers must be wrapped in a call to file().

See ?igblastn for more information.

Combine germline databases from two different organisms

The combine_germline_dbs() function can be used to combine two existing germline databases, typically (but not necessarily) from two different organisms. This can be useful to analyze BCR sequences from a hybrid organism like a humanized mouse.

To illustrate this, we’re going to analyze BCR sequences from KyDab, the Kymouse Antibody Database: https://kydab.naturalantibody.com/. In Kymab’s Intelliselect Transgenic mouse (Kymouse), a complete set of human variable (V), diversity (D), and junction (J) genes of the IGH locus as well as the V and J genes of the Igλ and Igκ loci were inserted at the sites of the endogenous mouse loci.

Prepare the query sequences

We’re going to analyze the Kymouse heavy-chain BCR sequences from the Influenza #1 Project from KyDab. The 3,950 heavy-chain BCR sequences from this project are stored in the Influenza-KV7015_heavy.fasta file included in igblastr:

fasta_file <- "Influenza-KV7015_heavy.fasta"
query <- system.file(package="igblastr", "extdata",
                     "BCR", "KyDab", fasta_file)

Create the hybrid germline database

For this analysis, we create a hybrid germline database by combining germline databases from human and mouse. Note that allele names are not unique across species, so we add suffixes to the human and mouse allele names to avoid name clashes (e.g., IGHV2-5*01_Hs vs. IGHV2-5*01_Mm):

db_name1 <- "IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL"
db_name2 <- install_IMGT_germline_db("202614-2", "Mus_musculus")

## Germline db IMGT-202614-2.Mus_musculus.IGH+IGK+IGL successfully installed in
## igblastr's persistent cache.

## Call use_germline_db("IMGT-202614-2.Mus_musculus.IGH+IGK+IGL") to select it as
## the germline db to use with igblastn(). Use list_germline_dbs() to list all the
## germline dbs currently installed in the cache (see '?list_germline_dbs').

db_name <- "comb.IMGT-202614-2.human+mouse.IGH+IGK+IGL"
combine_germline_dbs(db_name, db_name1, db_name2,
                     suffix1="_Hs", suffix2="_Mm")

Select the new hybrid germline database as the database to use with igblastn():

use_germline_db(db_name)

list_germline_dbs()

##  db_name                                       V   D  J intdata auxdata  
##  _OGRDB.human.IGH+IGK+IGL.202410             342  31 23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202410.src         354  33 24    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605             367  31 23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605.src         379  33 24    TRUE    TRUE  
##  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603    184   9 22    TRUE    TRUE  
##  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603   169  11 22    TRUE    TRUE  
##  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603     172   9 22    TRUE    TRUE  
##  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205  149   9 22    TRUE    TRUE  
##  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410     184  10 22    TRUE    TRUE  
##  _OGRDB.rainbow_trout.IGH.202606              81  26 18    TRUE    TRUE  
##  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602    2294  72 39    TRUE    TRUE  
##  IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL      730  48 35    TRUE    TRUE  
##  IMGT-202614-2.Mus_musculus.IGH+IGK+IGL      865  61 27    TRUE    TRUE  
##  comb.IMGT-202614-2.human+mouse.IGH+IGK+IGL 1595 109 62    TRUE    TRUE *

Note that the Kymouse BCR sequences from KyDab project Influenza #1 are not long enough to reach into the C regions so we’re not using any C-region database for this analysis.

Call igblastn() on the query sequences

AIRR_df <- igblastn(query, num_alignments_V=1,
                           num_alignments_D=1,
                           num_alignments_J=1)

A quick look at columns `v_call`, `d_call`, and `j_call`

In the context of this vignette, we’re only going to verify that the Kymouse heavy-chain sequences align to human V, D, and J alleles:

# extract_suffix() assumes suffixes of length 3, e.g. "_Hs"/"_Mm":
extract_suffix <- function(x) {nc <- nchar(x); factor(substr(x, nc-2, nc), levels=c("_Hs", "_Mm"))}

table(extract_suffix(AIRR_df$v_call))

## 
##  _Hs  _Mm 
## 3950    0

table(extract_suffix(AIRR_df$d_call))

## 
##  _Hs  _Mm 
## 3638  206

table(extract_suffix(AIRR_df$j_call))

## 
##  _Hs  _Mm 
## 3950    0

These counts are plotted in Figure 1 below.

Number of human vs. mouse V/D/J allele assignments<br>for the 3,950 Kymouse heavy-chain sequences<br>in Influenza-KV7015_heavy.fasta

Number of human vs. mouse V/D/J allele assignments
for the 3,950 Kymouse heavy-chain sequences
in Influenza-KV7015_heavy.fasta

As expected, all Kymouse heavy-chain sequences align to human V and J alleles. Most also align to human D alleles, with only a small fraction (less than 6%) aligning to mouse D alleles instead.

A TCR analysis example

Even though NCBI IgBLAST primary use case is BCR analysis, it can also be used for TCR sequence analysis, and so does the igblastr package.

Prepare the TCR query sequences

File SRR11341217.fasta.gz included in the package contains 10,875 human beta chain TCR transcripts running from 5’ of reverse transcription reaction to beginning of constant region. See https://www.ncbi.nlm.nih.gov/sra/SRX7944361 for more information about this dataset.

filename <- "SRR11341217.fasta.gz"
query <- system.file(package="igblastr", "extdata", "TCR", filename)

Install a TCR germline database

To analyze this dataset with igblastn(), we need to install and select a human TCR germline database. We can use install_IMGT_germline_db() with the tcr.db argument set to TRUE for that. This will install a germline database made of the human TCR germline sequences provided by IMGT/V-QUEST:

db_name <- install_IMGT_germline_db("202614-2", organism="Homo sapiens",
                                    tcr.db=TRUE)

See the new germline database in the list displayed by list_germline_dbs():

list_germline_dbs()

##  db_name                                       V   D   J intdata auxdata  
##  _OGRDB.human.IGH+IGK+IGL.202410             342  31  23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202410.src         354  33  24    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605             367  31  23    TRUE    TRUE  
##  _OGRDB.human.IGH+IGK+IGL.202605.src         379  33  24    TRUE    TRUE  
##  _OGRDB.mouse.CAST_EiJ.IGH+IGK+IGL.202603    184   9  22    TRUE    TRUE  
##  _OGRDB.mouse.LEWES_EiJ.IGH+IGK+IGL.202603   169  11  22    TRUE    TRUE  
##  _OGRDB.mouse.MSM_MsJ.IGH+IGK+IGL.202603     172   9  22    TRUE    TRUE  
##  _OGRDB.mouse.NOD_ShiLtJ.IGH+IGK+IGL.202205  149   9  22    TRUE    TRUE  
##  _OGRDB.mouse.PWD_PhJ.IGH+IGK+IGL.202410     184  10  22    TRUE    TRUE  
##  _OGRDB.rainbow_trout.IGH.202606              81  26  18    TRUE    TRUE  
##  _OGRDB.rhesus_monkey.IGH+IGK+IGL.202602    2294  72  39    TRUE    TRUE  
##  IMGT-202614-2.Homo_sapiens.IGH+IGK+IGL      730  48  35    TRUE    TRUE  
##  IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD  355   6 101    TRUE    TRUE  
##  IMGT-202614-2.Mus_musculus.IGH+IGK+IGL      865  61  27    TRUE    TRUE  
##  comb.IMGT-202614-2.human+mouse.IGH+IGK+IGL 1595 109  62    TRUE    TRUE *

Note that:

The name of this new germline database (IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD) reflects the fact that it contains germline gene alleles from the four T-cell receptor loci: TRA (alpha chain), TRB (beta chain), TRG (gamma chain), and TRD (delta chain).
This new germline database also includes its own internal data and auxiliary data, as reported in the intdata and auxdata columns.

See ?install_IMGT_germline_db and ?list_germline_dbs for more information.

Let’s select this new database as the germline database to use with igblastn():

use_germline_db(db_name)

See ?use_germline_db for more information.

Select a TCR constant region database (optional)

Use list_c_region_dbs() to list the C-region databases that are available:

list_c_region_dbs()

##  db_name                                      C  
##  _IMGT.human.IGH+IGK+IGL.202605             180 *
##  _IMGT.human.TRA+TRB+TRG+TRD.202605          23  
##  _IMGT.mouse.IGH+IGK+IGL.202605              64  
##  _IMGT.mouse.TRA+TRB+TRG+TRD.202605          13  
##  _IMGT.rabbit.IGH+IGK+IGL.202605             50  
##  _IMGT.rabbit.TRA+TRB+TRG+TRD.202605          6  
##  _IMGT.rat.IGH+IGK+IGL.202605                25  
##  _IMGT.rhesus_monkey.IGH+IGK+IGL.202605      44  
##  _IMGT.rhesus_monkey.TRA+TRB+TRG+TRD.202605  10

For this analysis of human TCR sequences, we’ll select _IMGT.human.TRA+TRB+TRG+TRD.202605 as the C-region database to use with igblastn():

use_c_region_db("_IMGT.human.TRA+TRB+TRG+TRD.202605")

See ?use_c_region_db for more information.

Call igblastn() on the TCR query sequences

Check the selected databases:

use_germline_db()

## [1] "IMGT-202614-2.Homo_sapiens.TRA+TRB+TRG+TRD"

use_c_region_db()

## [1] "_IMGT.human.TRA+TRB+TRG+TRD.202605"

Call igblastn():

AIRR_df <- igblastn(query)
AIRR_df

## # A tibble: 10,875 × 111
##    sequence_id    sequence sequence_aa locus stop_codon vj_in_frame v_frameshift
##    <chr>          <chr>    <chr>       <chr> <lgl>      <lgl>       <lgl>       
##  1 SRR11341217.1  GTCTCAG… VSDGYSVSRS… TRB   FALSE      TRUE        FALSE       
##  2 SRR11341217.2  AATGGGG… DLKNVFPP*A… TRB   TRUE       TRUE        FALSE       
##  3 SRR11341217.3  GAGTGGA… SGFVIDKFPI… TRB   FALSE      TRUE        FALSE       
##  4 SRR11341217.4  TCTTGAA… LEGFSAQLFP… TRB   FALSE      TRUE        FALSE       
##  5 SRR11341217.5  TTACCGA… DLKNVFPP*A… TRB   TRUE       FALSE       FALSE       
##  6 SRR11341217.6  GGGGTAC… GYSVSREKKE… TRB   FALSE      TRUE        FALSE       
##  7 SRR11341217.7  GCGTGAT… SAEFPKEGPS… TRB   FALSE      TRUE        FALSE       
##  8 SRR11341217.8  TGGAGAT… GDIAEGYSVS… TRB   FALSE      TRUE        FALSE       
##  9 SRR11341217.9  CACCCAC… DLKNVFPP*A… TRB   TRUE       FALSE       FALSE       
## 10 SRR11341217.10 TGATCGG… DRFSAERPKG… TRB   FALSE      TRUE        FALSE       
## # ℹ 10,865 more rows
## # ℹ 104 more variables: productive <lgl>, rev_comp <lgl>, complete_vdj <lgl>,
## #   d_frame <lgl>, v_call <chr>, d_call <chr>, j_call <chr>, c_call <chr>,
## #   sequence_alignment <chr>, germline_alignment <chr>,
## #   sequence_alignment_aa <chr>, germline_alignment_aa <chr>,
## #   v_alignment_start <int>, v_alignment_end <int>, d_alignment_start <int>,
## #   d_alignment_end <int>, j_alignment_start <int>, j_alignment_end <int>, …

Note that, when using igblastn() to analyze TCR sequences, we don’t need to specify the ig_seqtype argument like we would have to if we were using the igblastn standalone executable included in IgBLAST. igblastn() will automatically set ig_seqtype to "TCR" based on the name of the selected germline database. See documentation of the ig_seqtype argument in ?igblastn for more information.

Future developments and session information

Future developments

At the moment, the igblastr package does not provide access to the full functionality of the IgBLAST software. Most notably, the igblastp standalone executable included in IgBLAST has no counterpart in igblastr.

Some future developments include:

Implement igblastp(), a wrapper to the igblastp standalone executable included in IgBLAST for protein-level alignments.
Add facilities to retrieve arbitrary germline databases from OGRDB, the AIRR Community’s Open Germline Receptor Database.

Session information

Here are the output of igblast_info() and sessionInfo() on the system where this document was compiled.

igblast_info():

igblast_info()

## igblast_root:        /github/home/.cache/R/igblastr/igblast_roots/1.22.0
## igblast_build:       igblast 1.22.0, build Oct 11 2023 16:06:20
## igblastn_version:    1.22.0
## makeblastdb_version: 2.15.0+
## OS/arch:             Linux/x86_64
## igblast_organisms:   human, mouse, rabbit, rat, rhesus_monkey

sessionInfo():

sessionInfo()

## R version 4.6.1 (2026-06-24)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 26.04 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.32.so;  LAPACK version 3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: Etc/UTC
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] ggseqlogo_0.2.2     scales_1.4.0        dplyr_1.2.1        
##  [4] ggplot2_4.0.3       igblastr_1.2.16     Biostrings_2.80.1  
##  [7] Seqinfo_1.2.0       XVector_0.52.0      IRanges_2.46.0     
## [10] S4Vectors_0.50.1    BiocGenerics_0.58.1 generics_0.1.4     
## [13] tibble_3.3.1        BiocStyle_2.40.0   
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.10         utf8_1.2.6          xml2_1.6.0         
##  [4] digest_0.6.39       magrittr_2.0.5      RColorBrewer_1.1-3 
##  [7] grid_4.6.1          evaluate_1.0.5      fastmap_1.2.0      
## [10] R.oo_1.27.1         jsonlite_2.0.0      R.utils_2.13.0     
## [13] GenomeInfoDb_1.48.0 BiocManager_1.30.27 httr_1.4.8         
## [16] rvest_1.0.5         selectr_0.6-0       UCSC.utils_1.8.0   
## [19] jquerylib_0.1.4     cli_3.6.6           rlang_1.3.0        
## [22] crayon_1.5.3        R.methodsS3_1.8.2   withr_3.0.3        
## [25] cachem_1.1.0        yaml_2.3.12         otel_0.2.0         
## [28] tools_4.6.1         curl_7.1.0          buildtools_1.0.0   
## [31] vctrs_0.7.3         R6_2.6.1            lifecycle_1.0.5    
## [34] pkgconfig_2.0.3     gtable_0.3.6        pillar_1.11.1      
## [37] bslib_0.11.0        glue_1.8.1          tidyselect_1.2.1   
## [40] xfun_0.60           sys_3.4.3           knitr_1.51         
## [43] farver_2.1.2        xtable_1.8-8        htmltools_0.5.9    
## [46] labeling_0.4.3      rmarkdown_2.31      maketools_1.3.2    
## [49] compiler_4.6.1      S7_0.2.2

igblastr overview

Introduction

Install and load the package

Install and check IgBLAST

Install IgBLAST

Check IgBLAST

Install and select an IgBLAST-compatible germline database

Built-in germline databases

Install additional IgBLAST-compatible germline databases

Install an IgBLAST-compatible germline database from IMGT/V-QUEST

Install an IgBLAST-compatible germline database from user-supplied gene allele sequences

Select the germline database to use with igblastn()

Internal data and auxiliary data

What are the internal data and auxiliary data?

IgBLAST-provided internal/auxiliary data

igblastr-generated internal/auxiliary data

How is this data generated?

How to use the igblastr-generated internal/auxiliary data with igblastr?

Select a constant region database (optional)

Use igblastn()

Prepare the query sequences

Call igblastn() on the query sequences

Downstream analysis

Distribution of percent mutation across BCR sequences

Distribution of germline genes

Lengths and motifs of CDR3 sequences

Advanced usage

Passing additional arguments to igblastn()

Restrict the search to a subset of user-supplied gene alleles

Combine germline databases from two different organisms

Prepare the query sequences

Create the hybrid germline database

Call igblastn() on the query sequences

A quick look at columns v_call, d_call, and j_call

A TCR analysis example

Prepare the TCR query sequences

Install a TCR germline database

Select a TCR constant region database (optional)

Call igblastn() on the TCR query sequences

Future developments and session information

Future developments

Session information

A quick look at columns `v_call`, `d_call`, and `j_call`