annotating_snps.R is an R script that cross-references genome assembly annotations (GFF files) with VCF files containing SNPs called from sequencing reads aligned against those assemblies. If a SNP falls within, or upstream of, an annotated genome feature (start codon, stop codon, CDS, etc.), the script returns that feature along with the SNP. For this script to work, the two files need to use the same sequence names: e.g. if the first sequence in the VCF is called “chrI”, there should be a corresponding sequence called “chrI” in the GFF file.
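Because mismatched sequence names are easy to miss, it can help to compare them before running the script. The following is a minimal sketch in base R, not part of annotating_snps.R: the helper names `seq_names` and `compare_seq_names` and the example files are hypothetical, and it assumes both files are tab-separated with the sequence name in the first column and comment or header lines starting with `#`.

```r
# Hypothetical helper: collect the unique sequence names (column 1) of a
# tab-separated GFF or VCF file, skipping empty lines and "#" lines.
seq_names <- function(path) {
  lines <- readLines(path)
  lines <- lines[nzchar(lines) & !startsWith(lines, "#")]
  unique(vapply(strsplit(lines, "\t", fixed = TRUE), `[`, character(1), 1))
}

# Hypothetical helper: report names present in one file but not the other.
compare_seq_names <- function(gff_path, vcf_path) {
  gff <- seq_names(gff_path)
  vcf <- seq_names(vcf_path)
  list(only_in_vcf = setdiff(vcf, gff), only_in_gff = setdiff(gff, vcf))
}

# Tiny made-up files that illustrate a mismatch ("chrII" vs "chr2").
gff_file <- tempfile(fileext = ".gff")
vcf_file <- tempfile(fileext = ".vcf")
writeLines(c("##gff-version 3",
             "chrI\tSGD\tCDS\t100\t400\t.\t+\t0\tID=cds1",
             "chrII\tSGD\tCDS\t50\t90\t.\t-\t0\tID=cds2"), gff_file)
writeLines(c("##fileformat=VCFv4.2",
             "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
             "chrI\t150\t.\tA\tG\t50\tPASS\tDP=20",
             "chr2\t60\t.\tC\tT\t45\tPASS\tDP=18"), vcf_file)

mismatch <- compare_seq_names(gff_file, vcf_file)
print(mismatch)
```

If both elements of the returned list are empty, the two files use the same set of sequence names.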
To use this script, variables need to be defined on lines 28 to 32 of the script:
- The GFF file name should be assigned to the variable
- The VCF file name should be assigned to the variable
- The VCF and GFF files should be in the directory
- The number of lines in the VCF file header should be specified in the VCF_header.int variable. This is the number of lines that begin with # in the VCF file.
- The variable upstream.int sets how far upstream of an annotated feature a SNP can be and still be annotated. Set it to 0 if you do not want upstream SNPs to be considered. Setting it to 1000 means that SNPs up to 1,000 bases (1 kb) upstream of a feature will be annotated.
- The variable ‘output_name’ is used to specify the name of the output file, which should end in ‘.tsv’ as it will be a tab-separated values text file.
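As an illustration of two of these settings, the sketch below counts the header lines of a VCF file and shows one way an upstream window could be applied. The variable names VCF_header.int, upstream.int and output_name come from the list above; the helper functions, the example file and the example values are hypothetical. In particular, the strand-aware window is an assumption: how annotating_snps.R itself applies upstream.int is not described here.

```r
# Count the lines that begin with "#" in a VCF file. In a valid VCF these
# lines only occur at the top of the file, so this equals the header length.
count_vcf_header <- function(vcf_path) {
  sum(startsWith(readLines(vcf_path), "#"))
}

# Assumed logic, not taken from the script: a SNP is a hit if it lies within
# the feature or up to `upstream` bases before the feature's 5' end.
# On the "-" strand the 5' end is the larger coordinate.
in_or_upstream <- function(pos, start, end, strand, upstream) {
  lo <- ifelse(strand == "-", start, start - upstream)
  hi <- ifelse(strand == "-", end + upstream, end)
  pos >= lo & pos <= hi
}

# Tiny made-up VCF with two header lines.
vcf_file <- tempfile(fileext = ".vcf")
writeLines(c("##fileformat=VCFv4.2",
             "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
             "chrI\t150\t.\tA\tG\t50\tPASS\tDP=20"), vcf_file)

VCF_header.int <- count_vcf_header(vcf_file)
upstream.int   <- 1000
output_name    <- "annotated_snps.tsv"  # example name only

# A SNP 500 bases before a "+" strand feature spanning 1000..2000.
hit <- in_or_upstream(500, 1000, 2000, "+", upstream.int)
```

With upstream.int set to 0, the same function only reports SNPs that fall inside the feature.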
The .tsv files created by this script combine columns from the GFF and VCF formats, as follows:
- sequence: The name of the sequence where the feature is located.
- source: Keyword identifying the source of the feature, like a program (e.g. Augustus) or an organization (e.g. SGD).
- feature: The feature type name, like exon. In a well-structured GFF file, all child features follow their parents in a single block (so all exons of a transcript are placed after their parent transcript feature line and before any other parent transcript line).
- start: Genomic start of the feature, with a 1-base offset.
- end: Genomic end of the feature, with a 1-base offset.
- score: Numeric value that generally indicates the confidence of the source in the annotated feature. A value of . (a dot) is used to define a null value.
- strand: Single character that indicates the strand of the feature; it can assume the values of + (positive), - (negative), or . (undetermined).
- phase: Phase of coding sequence (CDS) features, indicating where the feature starts in relation to the reading frame. It can be either one of 0, 1 or 2 (for CDS features) or . (for everything else).
- attributes: All the other information pertaining to this feature. The format, structure and content of this field is the one which varies the most between GFF formats.
- POS: The 1-based position of the variation on the given sequence.
- REF: The reference base (or bases in the case of an indel) at the given position on the given reference sequence.
- ALT: The list of alternative alleles at this position.
- QUAL: A quality score associated with the inference of the given alleles.
- FILTER: A flag indicating which of a given set of filters the variation has passed.
- INFO: An extensible list of key-value pairs (fields) describing the variation. Multiple fields are separated by semicolons, with optional values in the format: <key>=<data>[,data]
- SAMPLE: For each (optional) sample described in the file, values are given for the fields listed in FORMAT. If multiple samples have been aligned to the reference sequence, each sample will have its own column.
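Once an output file exists, it can be loaded back into R for filtering. The sketch below is illustrative only: the two-row file is made up, `parse_info` is a hypothetical helper, and it assumes the output has a header row with the column names listed above (in a real file the sample column may carry the sample's own name). Columns are read as character so that R does not convert a base such as T into the logical value TRUE.

```r
# Made-up two-row output file using the column names listed above.
cols <- c("sequence", "source", "feature", "start", "end", "score", "strand",
          "phase", "attributes", "POS", "REF", "ALT", "QUAL", "FILTER",
          "INFO", "SAMPLE")
rows <- c(
  paste(c("chrI", "SGD", "CDS", "100", "400", ".", "+", "0", "ID=cds1",
          "150", "A", "G", "50", "PASS", "DP=20;AF=0.5;DB", "0/1"),
        collapse = "\t"),
  paste(c("chrI", "SGD", "start_codon", "100", "102", ".", "+", ".", "ID=sc1",
          "101", "T", "C", "60", "PASS", "DP=30", "0/1"),
        collapse = "\t")
)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c(paste(cols, collapse = "\t"), rows), tsv_file)

# Read every column as text; convert numeric columns explicitly when needed.
snps <- read.delim(tsv_file, colClasses = "character", quote = "")
snps$POS <- as.integer(snps$POS)

# Keep only SNPs annotated against CDS features.
cds_snps <- snps[snps$feature == "CDS", ]

# Hypothetical helper: split one INFO string into a named list.
# Flag fields without "=" (e.g. DB) get the value NA.
parse_info <- function(info) {
  fields <- strsplit(info, ";", fixed = TRUE)[[1]]
  keys <- sub("=.*$", "", fields)
  values <- ifelse(grepl("=", fields, fixed = TRUE),
                   sub("^[^=]*=", "", fields),
                   NA_character_)
  setNames(as.list(values), keys)
}

info <- parse_info(cds_snps$INFO[1])
```

Values in the parsed list stay as text, so a field such as DP needs `as.numeric()` before it is used in calculations.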