
Genome annotation SwissProt CDS.sh

genome_annotation_SwissProt_CDS.sh is a bash script that annotates the coding sequences (CDS) in a given genome assembly. It uses BLAST and MGKit, which are included in the bioinfo-notebook conda environment.

Usage

genome_annotation_SwissProt_CDS.sh [-h|--help] [-d|--demo] [-i|--input] 
 [-l|--log -p|--processors n -e|--email] 
 
 A script to annotate proteins in a genome assembly, using BLASTx with
 UniProtKB/Swiss-Prot.
 
 When run with the argument '-d' or '--demo' this script...
 
 	 1. Downloads a Saccharomyces cerevisiae S288C genome assembly, and 
 	 the UniProtKB/Swiss-Prot amino acid sequences. 
 	 2. Creates a BLAST database from the downloaded Swiss-Prot sequences,
 	 and searches the S. cerevisiae genome against it using BLASTx with an
 	 E-value threshold of 1e-100. 
 	 3. Filters the BLASTx results, removing results with less than 90%
 	 identity.
 	 4. Creates a genome annotation GFF file from these BLASTx results.
 	 5. Adds information to the genome annotation from UniProt (protein
 	 names, KEGG ortholog information, EC numbers, etc.).
 
 The end result ('S_cere.gff') is an annotation of the coding sequences (CDS) 
 in the S. cerevisiae genome that are described in UniProtKB/Swiss-Prot. 
 
 This script can also be run with the argument '-i' or '--input', which is used
 to specify a FASTA nucleotide file (.fasta or .fna) to annotate, instead of
 the demo sequence. The end result is also an annotation of the CDS in the input
 sequence based on UniProtKB/Swiss-Prot, called '<input>.gff'. 
 
 This script should be called from the 'bioinfo-notebook/' directory. The
 programs required for this script are in the 'bioinfo-notebook' conda 
 environment (bioinfo-notebook/envs/bioinfo-notebook.yml or 
 bioinfo-notebook/envs/bioinfo-notebook.txt). 
 If the input file is not in the 'bioinfo-notebook/data/' directory, the full 
 file path should be given.
 
 arguments: 
 	 -h | --help		 show this help text and exit 
 	 -i | --input		 name of input FASTA nucleotide file to annotate 
 	 -d | --demo		 run the script with demonstration inputs
 
 optional arguments:
 	 -l | --log		 redirect terminal output to a log file 
 	 -p | --processors	 set the number (n) of processors to use
 				 (default: 1) 
 	 -e | --email		 contact email for UniProt queries
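
To illustrate the options listed above, the following sketch assembles two typical invocations as strings and prints them without running anything. The 'scripts/' location of the script and the file name 'my_assembly.fasta' are assumptions made for this example; they are not stated on this page.

```shell
# Assumed location of the script relative to bioinfo-notebook/ (hypothetical path).
script="scripts/genome_annotation_SwissProt_CDS.sh"

# Demo run: use 4 processors and redirect terminal output to a log file.
demo_cmd="bash $script --demo --processors 4 --log"

# Annotate a user-supplied assembly. "my_assembly.fasta" is a hypothetical
# file assumed to be in bioinfo-notebook/data/; otherwise give the full path.
input_cmd="bash $script --input my_assembly.fasta --processors 4"

# Print the commands rather than executing them.
echo "$demo_cmd"
echo "$input_cmd"
```

To actually run one of these, copy the printed command into a terminal opened in the 'bioinfo-notebook/' directory, with the 'bioinfo-notebook' conda environment activated.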

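The identity filter described in step 3 of the help text can be sketched as follows. This is a minimal illustration, assuming the BLASTx results are in BLAST tabular format ('-outfmt 6'), where the third column is the percent identity. It is not necessarily how the script itself does the filtering. The function name 'filter_by_identity' and the example rows are made up for this sketch.

```shell
# Keep only BLAST tabular hits (-outfmt 6) whose percent identity is at
# least the given threshold. Column 3 of this format is pident.
filter_by_identity() {
  awk -v min_ident="$1" '$3 + 0 >= min_ident + 0'
}

# In a real run, rows like these might come from a command such as
#   blastx -query genome.fna -db swissprot_db -evalue 1e-100 -outfmt 6 -num_threads 4
# (file and database names are hypothetical).
# The rows below are invented for illustration only.
# Columns: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
example_hits='chrI protA 98.5 400 6 0 100 1299 1 400 0.0 800
chrI protB 72.0 350 98 0 5000 6049 1 350 1e-120 420
chrII protC 90.0 300 30 0 200 1099 1 300 1e-150 520'

# Apply the 90% identity threshold mentioned in the help text.
filtered=$(printf '%s\n' "$example_hits" | filter_by_identity 90)
printf '%s\n' "$filtered"
```

With the made-up rows above, the protB hit (72% identity) is dropped, while the protA and protC hits are kept. A hit at exactly 90% passes, because the help text only removes results with less than 90% identity.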
See also