The Basic Local Alignment Search Tool (BLAST) is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. BLAST is one of the most widely used tools in bioinformatics; it can be applied to different problems or projects in a myriad ways.
- How BLAST works
- The command line version of BLAST
- Types of BLAST search
- E-value and Bit-score
- Creating a BLAST database using
- Creating local BLAST database from Swiss-Prot
- Searching against a BLAST nucleotide database using
- Video demonstration
- See also
There are two main steps in BLAST:
- A list of “words” (sets of characters/residues) of length k is created for the query sequence. By default, k = 3 for amino acid sequences, and k = 11 for nucleotide sequences.
- An alignment is made for database (subject) sequences that share many words with the query sequence. This is a local alignment in which only High-scoring Segment Pairs (HSPs) are reported. In other words, BLAST finds islands of similarity between sequences.
BLAST can be used online, or through the command line. Most biologists are familiar with NCBI’s web application for BLAST. If you use this web application regularly, the command line BLAST program is worth your consideration. The command line version of BLAST has several advantages over the web version:
- BLAST on the command line can be used to run local searches, i.e. searches which use files that are on your computer, instead of files that are on an NCBI database.
- BLAST searches on the command line can be made more specific by adding additional arguments.
- BLAST searches carried out on the command line can be automated, and incorporated into larger scripts.
- The command line BLAST program can output search results in various structured text formats.
The command line version of BLAST can be downloaded via conda using the following command:
$ conda install -c bioconda blast
This program is included in the bioinfo-notebook conda environment.
There are five main types of BLAST search:
- BLASTp searches a protein database with a protein query sequence.
- BLASTn searches a nucleic acid database with nucleic acid query sequence.
- BLASTx searches a protein database with nucleic acid query sequence, which is translated into an amino acid sequence.
- tBLASTx searches a nucleic acid database with nucleic acid query sequence. In this case, both the database (subject) sequences and query sequence are translated into amino acid sequences.
- tBLASTn searches a nucleic acid database with protein query sequence. In this case, the nucleic acid database is translated into a set of amino acid sequences.
While the type of query and subject sequences required for each of these BLAST searches differs, the command line arguments that can be used for these BLAST searches are interchangeable.
Two important variables when interpreting BLAST results are E-value and bit-score. These are both derived from the raw alignment score (S), which is based on the number of residues (i.e. individual amino/nucleic acids) that two sequences have in common. The more identical residues that two sequences have at the same position in an alignment, the higher the alignment score.
- Bit-score (S’) is the raw alignment score (S) normalised with respect to the scoring system used for the alignment.
- E-value or Expectation value is the number of different alignments with scores equivalent to or better than S that is expected to occur in a database search by chance. The lower the E value, the more significant the score and the alignment. An exact match between query and subject sequences results in an E-value of zero.
While bit-scores are comparable between searches, as they are normalised, they do not take the size of the database into account. E-values, however, do account for the size of the database. The lower the E-value and the higher the bit-score, the better the BLAST result.
To search against a set of nucleotide or amino acid sequences using BLAST, a database must be created. This can be done using the
$ makeblastdb -dbtype prot/nucl -in input_file -out database_name
In this command…
-dbtypespecifies the type of sequences used to create the database. For amino acid (protein) sequences,
protis used (“
-dbtype prot”). For nucleic acid sequences,
nuclis used (“
-inis used to specify the input file. The database created can be used to search against the sequences in this file.
-outis used to name the database that will be created from the input file.
In this video, the FASTA amino acid sequences of Swiss-Prot are downloaded, and a BLAST protein database is created from these sequences using
makeblastdb. UniProtKB/Swiss-Prot is a manually annotated, non-redundant protein sequence database. As it is well-annotated and curated, the Swiss-Prot database gives informative results when searched locally using
blastx. The link used in the
wget command is copied and pasted from the UniProt downloads page. The compressed FASTA sequences of the Swiss-Prot database on
These FASTA amino acid sequences are compressed into a
.gz (gzip) file. Before using the
makeblastdb command, this FASTA file is uncompressed using
.fasta. Once the FASTA file is downloaded and uncompressed,
makeblastdb is used to create a BLAST protein database of the amino acid sequences in this FASTA file. This BLAST protein database is named
swissprot, and consists of three binary files.
Once the BLAST protein database is created,
blastx can be used to search sequences against it. This database can be selected using the argument
blastx (the path to the
swissprot database will need to be given if the command is run from a different directory).
blastn is used for searching nucleotide databases with a nucleotide query.
$ blastn -query query_file.fna -db nucl_database_name -out results_file.tsv -outfmt 6 -evalue x -max_hsps y -num_threads n
In this command…
-queryis used to select the FASTA nucleic acids file you want to search against the BLAST database (the
-dbis used to select the BLAST nucleotide database you want to search against (
-outis used to direct the results to an output file (
-outfmtis used to specify how this results file should be formatted. In this case, as
6, the results will be written to a file as tab-separated values: this is why
-evalueis used to set an E-value threshold (
x). Results which have an E-value greater than this threshold will not be written to the results file.
-max_hspsis used to set a High-scoring Segment Pairs (HSPs) threshold (
y). When given, no more than
yHSPs (alignments) for each query-subject pair will be written to the results file.
-num_threadsis used to set the number (
n) of threads/processors to use (default 1).
The last two arguments given in this command-
-max_hsps- are optional, but they are useful as they allow the results to be filtered before being written to the file. Using these arguments will result in more specific results, and will reduce the need to manually filter results later.
These BLAST results are taken from the video demonstration and are in BLAST output format 6.
gi|242120357|gb|FJ461870.1| NC_001144.5 93.252 163 11 0 196 358 454921 454759 7.57e-63 241 gi|242120357|gb|FJ461870.1| NC_001144.5 93.252 163 11 0 196 358 464058 463896 7.57e-63 241 gi|242120357|gb|FJ461870.1| CP036478.1 93.252 163 11 0 196 358 454829 454667 7.57e-63 241 gi|242120357|gb|FJ461870.1| CP036478.1 93.252 163 11 0 196 358 463966 463804 7.57e-63 241 gi|242120357|gb|FJ461870.1| CP024006.1 93.252 163 11 0 196 358 453978 453816 7.57e-63 241
These results are tab-separated values, meaning each column in the results is separated by a
Tab character. These columns always appear in the same order:
query_id subject_id per_identity aln_length mismatches gap_openings q_start q_end s_start s_end e-value bit_score
In this format…
query_idis the FASTA header of the sequence being searched against the database (the query sequence).
subject_idis the FASTA header of the sequence in the database that the query sequence has been aligned to (the subject sequence).
per_identityis the percentage identity- the extent to which the query and subject sequences have the same residues at the same positions.
aln_lengthis the alignment length.
mismatchesis the number of mismatches.
gap_openingsis the number of gap openings in the alignment.
q_startis the start of the alignment in the query sequence.
q_endis the end of the alignment in the query sequence.
s_startis the start of the alignment in the subject sequence.
s_endis the end of the alignment in the subject sequence.
e_valueis the expect value (E-value) for the alignment.
bit_scoreis the bit-score of the alignment.
All BLAST output formats above 4 (i.e.
--outfmt > 4) use this tabular layout, formatted in different ways. For example,
--outfmt 10 gives the same information in a comma-separated values (
.csv) file instead of a tab-separated values (
In this demonstration,
makeblastdb is used to create a BLAST database from the file
S_cere_genomes.fna. This FASTA nucleic acids (
.fna) file was created by concatenating the following Saccharomyces cerevisiae genome assemblies, which were downloaded from NCBI: GCA_003086655.1, GCA_003709285.1 and GCA_004328465.1.
blastn is then used to query
23S_rRNA_gene.fna against this database. This file is a copy of the Scutellospora fulgida isolate NC303A 25S ribosomal RNA gene from NCBI.
tblastn is also used to query
YPK1.faa against this database multiple times. This FASTA amino acid (
.faa) file is a copy of the serine/threonine-protein kinase YPK1 from UniProt. This search is carried out multiple times with additional parameters: the flag
-evalueis used to set an E-value threshold, and the flag
-max_hsps is used to set a maximum number of High-scoring Segment Pairs (HSPs).
The results from these BLAST searches are written to tab-separated values (
.tsv) files. This output format is specified with the flag
- File formats used in bioinformatics
- Introduction to the command line
- NCBI’s web application for BLAST