Bioinformatics Notebook


This project provides introductions to various bioinformatics tools with short guides, video demonstrations, and scripts that tie these tools together. The documents in this project can be read locally in a plain-text editor, or viewed online at https://rnnh.github.io/bioinfo-notebook/. If you are not familiar with using programs from the command line, begin with the page “Introduction to the command line”. If you have any questions, or spot any mistakes, please submit an issue on GitHub.

Pipeline examples
Contents
Installation instructions
Repository structure

Pipeline examples

These bioinformatics pipelines can be carried out using scripts and tools described in this project. Input files for some of these scripts can be specified in the command line; other scripts will need to be altered to fit the given input data.

SNP analysis

FASTQ reads from whole genome sequencing (WGS) can be assembled using SPAdes.
Sequencing reads can be aligned to this assembled genome using bowtie2.
The script snp_calling.sh aligns sequencing reads to an assembled genome and detects single nucleotide polymorphisms (SNPs). This will produce a Variant Call Format (VCF) file.
The proteins in the assembled reference genome (the genome to which the reads are aligned) can be annotated using genome_annotation_SwissProt_CDS.sh.
The genome annotation GFF file can be cross-referenced with the VCF file using annotating_snps.R. This will produce an annotated SNP format file.
Annotated SNP format files can be cross-referenced using annotated_snps_filter.R. For two annotated SNP files, this script will produce a file with annotated SNPs unique to the first file, and a file with annotated SNPs unique to the second file.
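The last step above is, in essence, a set difference between two files. The sketch below illustrates that idea with standard UNIX tools. It is not the logic of annotated_snps_filter.R. It assumes (hypothetically) that each annotated SNP is one line of text, that the files have no header row, and that identical SNPs are written identically in both files. The function name and output file names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: write the lines unique to each of two annotated SNP tables.
# Assumptions (not taken from the project): one SNP per line, no header row.

unique_snps() {
    # $1 = first SNP file, $2 = second SNP file, $3 = output prefix
    tmp_a=$(mktemp) && tmp_b=$(mktemp) || return 1

    # comm needs sorted input; LC_ALL=C keeps sort and comm consistent
    LC_ALL=C sort -u "$1" > "$tmp_a"
    LC_ALL=C sort -u "$2" > "$tmp_b"

    # -23 keeps lines found only in the first file,
    # -13 keeps lines found only in the second file
    LC_ALL=C comm -23 "$tmp_a" "$tmp_b" > "${3}_unique_to_first.tsv"
    LC_ALL=C comm -13 "$tmp_a" "$tmp_b" > "${3}_unique_to_second.tsv"

    rm -f "$tmp_a" "$tmp_b"
}
```

For example, unique_snps sample_A.tsv sample_B.tsv comparison would write comparison_unique_to_first.tsv and comparison_unique_to_second.tsv (all file names here are placeholders).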

RNA-seq analysis

fastq-dump_to_featureCounts.sh can be used to download RNA-seq reads from NCBI’s Sequence Read Archive (SRA), align them to a reference genome, and count the reads assigned to each annotated feature. This script uses fastq-dump or fasterq-dump to download the sequencing reads as FASTQ, aligns them to a reference FASTA nucleotide file, and uses featureCounts to count the aligned reads per annotated feature.
Running fastq-dump_to_featureCounts.sh will produce feature count tables. These feature count tables can be combined using combining_featCount_tables.py.
These combined feature count tables can be used for differential expression (DE) analysis. An example DE analysis script is included in this project: DE_analysis_edgeR_script.R. This script uses the R programming language with the edgeR library.
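To make the table-combining step above concrete, here is a minimal sketch using standard UNIX tools. It is not the logic of combining_featCount_tables.py. It assumes the usual featureCounts layout for a single sample: a leading comment line starting with #, a header row, the feature ID in column 1 and the read count in column 7, all tab-separated. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: combine the count columns of two featureCounts tables.
# Assumed layout: '#' comment line, header row, Geneid in column 1,
# read count in column 7, tab-separated.

combine_counts() {
    # $1, $2 = featureCounts tables; the combined table goes to stdout
    tab=$(printf '\t')
    tmp1=$(mktemp) && tmp2=$(mktemp) || return 1

    # Drop the comment and header lines, keep ID and count, sort by ID
    grep -v '^#' "$1" | tail -n +2 | cut -f 1,7 | LC_ALL=C sort -t "$tab" -k 1,1 > "$tmp1"
    grep -v '^#' "$2" | tail -n +2 | cut -f 1,7 | LC_ALL=C sort -t "$tab" -k 1,1 > "$tmp2"

    # New header: one count column per input file
    printf 'Geneid\t%s\t%s\n' "$(basename "$1")" "$(basename "$2")"

    # Inner join on the feature ID: only IDs present in both tables are kept
    LC_ALL=C join -t "$tab" "$tmp1" "$tmp2"

    rm -f "$tmp1" "$tmp2"
}
```

Because this is an inner join, a feature missing from either table is dropped from the output. That is acceptable when both tables come from the same annotation file.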

Detecting orthologs between genomes

Augustus can be used to predict genes from FASTA nucleotide files.
Once the FASTA amino acid sequences have been extracted from the Augustus annotations, you can search for orthologs using OrthoFinder.
To find a specific gene of interest, search the amino acid sequences of the predicted genes using BLAST.
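The extraction step mentioned above can be sketched with awk. This assumes Augustus was run with protein output enabled, so that its GFF output contains comment lines of the form "# start gene g1" and "# protein sequence = [...]", with long sequences continued on following "# " lines. It also assumes one transcript per gene, so the gene ID is used as the FASTA header. The function name is hypothetical, and the sketch is not a replacement for the extraction helper distributed with Augustus.

```shell
#!/bin/sh
# Sketch only: pull protein sequences out of Augustus GFF output as FASTA.
# Assumes "# start gene <id>" and "# protein sequence = [...]" comment
# lines, and a single transcript per gene.

augustus_proteins_to_fasta() {
    # $1 = Augustus GFF output; FASTA is written to stdout
    awk '
        # Remember the ID of the gene currently being read
        /^# start gene / { gene = $4 }

        # A protein block opens with "[" and may span several lines
        /^# protein sequence = \[/ {
            in_protein = 1
            seq = ""
            sub(/^# protein sequence = \[/, "")
        }

        in_protein {
            sub(/^# /, "")            # strip the comment prefix
            if (sub(/\]$/, "")) {     # closing bracket ends the block
                in_protein = 0
                print ">" gene "\n" seq $0
            } else {
                seq = seq $0
            }
        }
    ' "$1"
}
```

The resulting FASTA amino acid file (for example, augustus_proteins_to_fasta predictions.gff > predictions.faa, with placeholder file names) is the kind of input expected in the next two steps.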

Contents

1. General guides

2. Program guides

3. Scripts

Installation instructions

After following these instructions, there will be a copy of the bioinfo-notebook GitHub repo on your system in the ~/bioinfo-notebook/ directory. This means there will be a copy of all the documents and scripts in this project on your computer. If you are using Linux and run the Linux setup script, the bioinfo-notebook virtual environment (which includes the majority of the command line programs covered in this project) will also be installed using conda.

1. This project is written for use on a UNIX-like operating system (Linux, or macOS Mojave or later). If you are using a Windows operating system, begin with the pages on setting up Ubuntu (a Linux operating system), docs/wsl.md and docs/ubuntu_virtualbox.md.

Once you have an Ubuntu system set up, run the following command to update the lists of available software:

$ sudo apt-get update # Updates lists of software that can be installed

2. Run the following command in your home directory (~) to download this project:

$ git clone https://github.com/rnnh/bioinfo-notebook.git

3. If you are using Linux, run the Linux setup script with this command after downloading the project:

$ bash ~/bioinfo-notebook/scripts/linux_setup.sh
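As an optional sanity check after the steps above, the following sketch confirms that a few files listed under "Repository structure" are present in the cloned directory. The function name is hypothetical and the choice of files is only illustrative. It does not verify that the conda environment was created.

```shell
#!/bin/sh
# Sketch only: check that a bioinfo-notebook clone contains a few
# expected files (paths taken from the repository structure listing).

check_bioinfo_notebook() {
    # $1 = path to the clone (defaults to ~/bioinfo-notebook)
    repo=${1:-"$HOME/bioinfo-notebook"}
    status=0
    for path in README.md scripts/linux_setup.sh envs/bioinfo-notebook.yml; do
        if [ ! -f "$repo/$path" ]; then
            echo "missing: $repo/$path" >&2
            status=1
        fi
    done
    # 0 if every file was found, 1 otherwise
    return "$status"
}
```

Running check_bioinfo_notebook with no arguments checks ~/bioinfo-notebook and prints any missing paths to standard error.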

Video demonstration of installation

Repository structure

bioinfo-notebook/
├── assets/
│   └── bioinfo-notebook_logo.svg
├── data/
│   ├── blastx_SwissProt_example_nucleotide_sequence.fasta.tsv
│   ├── blastx_SwissProt_S_cere.tsv
│   ├── design_table.csv
│   ├── example_genome_annotation.gtf
│   ├── example_nucleotide_sequence.fasta
│   └── featCounts_S_cere_20200331.csv
├── docs/
│   ├── annotated_snps_filter.md
│   ├── annotating_snps.md
│   ├── augustus.md
│   ├── blast.md
│   ├── bowtie2.md
│   ├── bowtie.md
│   ├── cl_intro.md
│   ├── cl_solutions.md
│   ├── combining_featCount_tables.md
│   ├── conda.md
│   ├── DE_analysis_edgeR_script.md
│   ├── DE_analysis_edgeR_script.pdf
│   ├── fasterq-dump.md
│   ├── fastq-dump.md
│   ├── fastq-dump_to_featureCounts.md
│   ├── featureCounts.md
│   ├── file_formats.md
│   ├── genome_annotation_SwissProt_CDS.md
│   ├── htseq-count.md
│   ├── linux_setup.md
│   ├── orthofinder.md
│   ├── part1.md    # Navigation page for website
│   ├── part2.md    # Navigation page for website
│   ├── part3.md    # Navigation page for website
│   ├── report_an_issue.md
│   ├── samtools.md
│   ├── sgRNAcas9.md
│   ├── snp_calling.md
│   ├── SPAdes.md
│   ├── ubuntu_virtualbox.md
│   ├── UniProt_downloader.md
│   └── wsl.md
├── envs/            # conda environment files
│   ├── augustus.yml            # environment for Augustus
│   ├── bioinfo-notebook.txt
│   ├── bioinfo-notebook.yml
│   ├── orthofinder.yml         # environment for OrthoFinder
│   └── sgRNAcas9.yml           # environment for sgRNAcas9
├── scripts/
│   ├── annotated_snps_filter.R
│   ├── annotating_snps.R
│   ├── combining_featCount_tables.py
│   ├── DE_analysis_edgeR_script.R
│   ├── fastq-dump_to_featureCounts.sh
│   ├── genome_annotation_SwissProt_CDS.sh
│   ├── linux_setup.sh
│   ├── snp_calling.sh
│   └── UniProt_downloader.sh
├── _config.yml     # Configures github.io project website
├── .gitignore
├── LICENSE
├── README.md
└── .travis.yml     # Configures Travis CI testing for GitHub repo