BLAST
The Basic Local Alignment Search Tool (BLAST) is an algorithm and program for comparing primary biological sequence information, such as the amino-acid sequences of proteins or the nucleotides of DNA and/or RNA sequences. BLAST is one of the most widely used tools in bioinformatics; it can be applied to different problems or projects in a myriad ways.
Contents
- How BLAST works
- The command line version of BLAST
- Types of BLAST search
- E-value and Bit-score
- Creating a BLAST database using
makeblastdb
- Creating local BLAST database from Swiss-Prot
- Searching against a BLAST nucleotide database using
blastn
- BLAST
-outfmt 6
results - Video demonstration
- See also
- References
How BLAST works
There are two main steps in BLAST:
- A list of “words” (sets of characters/residues) of length k is created for the query sequence. By default, k = 3 for amino acid sequences, and k = 11 for nucleotide sequences.
- An alignment is made for database (subject) sequences that share many words with the query sequence. This is a local alignment in which only High-scoring Segment Pairs (HSPs) are reported. In other words, BLAST finds islands of similarity between sequences.
The command line version of BLAST
BLAST can be used online, or through the command line. Most biologists are familiar with NCBI’s web application for BLAST. If you use this web application regularly, the command line BLAST program is worth your consideration. The command line version of BLAST has several advantages over the web version:
- BLAST on the command line can be used to run local searches, i.e. searches which use files that are on your computer, instead of files that are on an NCBI database.
- BLAST searches on the command line can be made more specific by adding additional arguments.
- BLAST searches carried out on the command line can be automated, and incorporated into larger scripts.
- The command line BLAST program can output search results in various structured text formats.
The command line version of BLAST can be downloaded via conda using the following command:
$ conda install -c bioconda blast
This program is included in the bioinfo-notebook conda environment.
Types of BLAST search
There are five main types of BLAST search:
- BLASTp searches a protein database with a protein query sequence.
- BLASTn searches a nucleic acid database with nucleic acid query sequence.
- BLASTx searches a protein database with nucleic acid query sequence, which is translated into an amino acid sequence.
- tBLASTx searches a nucleic acid database with nucleic acid query sequence. In this case, both the database (subject) sequences and query sequence are translated into amino acid sequences.
- tBLASTn searches a nucleic acid database with protein query sequence. In this case, the nucleic acid database is translated into a set of amino acid sequences.
While the type of query and subject sequences required for each of these BLAST searches differs, the command line arguments that can be used for these BLAST searches are interchangeable.
E-value and Bit-score
Two important variables when interpreting BLAST results are E-value and bit-score. These are both derived from the raw alignment score (S), which is based on the number of residues (i.e. individual amino/nucleic acids) that two sequences have in common. The more identical residues that two sequences have at the same position in an alignment, the higher the alignment score.
- Bit-score (S’) is the raw alignment score (S) normalised with respect to the scoring system used for the alignment.
- E-value or Expectation value is the number of different alignments with scores equivalent to or better than S that is expected to occur in a database search by chance. The lower the E value, the more significant the score and the alignment. An exact match between query and subject sequences results in an E-value of zero.
While bit-scores are comparable between searches, as they are normalised, they do not take the size of the database into account. E-values, however, do account for the size of the database. The lower the E-value and the higher the bit-score, the better the BLAST result.
Creating a BLAST database using makeblastdb
To search against a set of nucleotide or amino acid sequences using BLAST, a database must be created. This can be done using the makeblastdb
command.
$ makeblastdb -dbtype prot/nucl -in input_file -out database_name
In this command…
-dbtype
specifies the type of sequences used to create the database. For amino acid (protein) sequences,prot
is used (“-dbtype prot
”). For nucleic acid sequences,nucl
is used (“-dbtype nucl
”).-in
is used to specify the input file. The database created can be used to search against the sequences in this file.-out
is used to name the database that will be created from the input file.
Downloading Swiss-Prot FASTA sequences and creating a BLAST protein database
In this video, the FASTA amino acid sequences of Swiss-Prot are downloaded, and a BLAST protein database is created from these sequences using makeblastdb
. UniProtKB/Swiss-Prot is a manually annotated, non-redundant protein sequence database. As it is well-annotated and curated, the Swiss-Prot database gives informative results when searched locally using blastp
and blastx
. The link used in the wget
command is copied and pasted from the UniProt downloads page. The compressed FASTA sequences of the Swiss-Prot database on ftp.uniprot.org
.
These FASTA amino acid sequences are compressed into a .gz
(gzip) file. Before using the makeblastdb
command, this FASTA file is uncompressed using gunzip
, turning uniprot_sprot
.fasta.gz
into uniprot_sprot
.fasta
. Once the FASTA file is downloaded and uncompressed, makeblastdb
is used to create a BLAST protein database of the amino acid sequences in this FASTA file. This BLAST protein database is named swissprot
, and consists of three binary files.
Once the BLAST protein database is created, blastp
and blastx
can be used to search sequences against it. This database can be selected using the argument -db swissprot
with blastp
or blastx
(the path to the swissprot
database will need to be given if the command is run from a different directory).
Searching against a BLAST nucleotide database using blastn
The program blastn
is used for searching nucleotide databases with a nucleotide query.
$ blastn -query query_file.fna -db nucl_database_name -out results_file.tsv -outfmt 6 -evalue x -max_hsps y -num_threads n
In this command…
-query
is used to select the FASTA nucleic acids file you want to search against the BLAST database (thequery_file.fna
).-db
is used to select the BLAST nucleotide database you want to search against (nucl_database_name
).-out
is used to direct the results to an output file (results_file.tsv
).-outfmt
is used to specify how this results file should be formatted. In this case, as-outfmt
is6
, the results will be written to a file as tab-separated values: this is whyresults_file.tsv
has a.tsv
extension.-evalue
is used to set an E-value threshold (x
). Results which have an E-value greater than this threshold will not be written to the results file.-max_hsps
is used to set a High-scoring Segment Pairs (HSPs) threshold (y
). When given, no more thany
HSPs (alignments) for each query-subject pair will be written to the results file.-num_threads
is used to set the number (n
) of threads/processors to use (default 1).
The last two arguments given in this command- -evalue
and -max_hsps
- are optional, but they are useful as they allow the results to be filtered before being written to the file. Using these arguments will result in more specific results, and will reduce the need to manually filter results later.
BLAST -outfmt 6
results
These BLAST results are taken from the video demonstration and are in BLAST output format 6.
gi|242120357|gb|FJ461870.1| NC_001144.5 93.252 163 11 0 196 358 454921 454759 7.57e-63 241
gi|242120357|gb|FJ461870.1| NC_001144.5 93.252 163 11 0 196 358 464058 463896 7.57e-63 241
gi|242120357|gb|FJ461870.1| CP036478.1 93.252 163 11 0 196 358 454829 454667 7.57e-63 241
gi|242120357|gb|FJ461870.1| CP036478.1 93.252 163 11 0 196 358 463966 463804 7.57e-63 241
gi|242120357|gb|FJ461870.1| CP024006.1 93.252 163 11 0 196 358 453978 453816 7.57e-63 241
These results are tab-separated values, meaning each column in the results is separated by a Tab
character. These columns always appear in the same order:
query_id subject_id per_identity aln_length mismatches gap_openings q_start q_end s_start s_end e-value bit_score
In this format…
query_id
is the FASTA header of the sequence being searched against the database (the query sequence).subject_id
is the FASTA header of the sequence in the database that the query sequence has been aligned to (the subject sequence).per_identity
is the percentage identity- the extent to which the query and subject sequences have the same residues at the same positions.aln_length
is the alignment length.mismatches
is the number of mismatches.gap_openings
is the number of gap openings in the alignment.q_start
is the start of the alignment in the query sequence.q_end
is the end of the alignment in the query sequence.s_start
is the start of the alignment in the subject sequence.s_end
is the end of the alignment in the subject sequence.e_value
is the expect value (E-value) for the alignment.bit_score
is the bit-score of the alignment.
All BLAST output formats above 4 (i.e. --outfmt > 4
) use this tabular layout, formatted in different ways. For example, --outfmt 10
gives the same information in a comma-separated values (.csv
) file instead of a tab-separated values (.tsv
) file.
Video demonstration
In this demonstration, makeblastdb
is used to create a BLAST database from the file S_cere_genomes.fna
. This FASTA nucleic acids (.fna
) file was created by concatenating the following Saccharomyces cerevisiae genome assemblies, which were downloaded from NCBI: GCA_003086655.1, GCA_003709285.1 and GCA_004328465.1.
The program blastn
is then used to query 23S_rRNA_gene.fna
against this database. This file is a copy of the Scutellospora fulgida isolate NC303A 25S ribosomal RNA gene from NCBI.
The program tblastn
is also used to query YPK1.faa
against this database multiple times. This FASTA amino acid (.faa
) file is a copy of the serine/threonine-protein kinase YPK1 from UniProt. This search is carried out multiple times with additional parameters: the flag -evalue
is used to set an E-value threshold, and the flag -max_hsps
is used to set a maximum number of High-scoring Segment Pairs (HSPs).
The results from these BLAST searches are written to tab-separated values (.tsv
) files. This output format is specified with the flag -outfmt 6
.
See also
- File formats used in bioinformatics
- Introduction to the command line
- conda
- NCBI’s web application for BLAST