Algorithms and Bioinformatics: Code Repository for applied bioinformatics

Adapt from prof. Istvan Albert with modification: This page is a collection of most of the examples shown during lectures.

Lecture 30 - Using R

# the data frame is a data representation format
annot = read.delim("annot.gff", comment.char="#", header=FALSE)

# show the names associated with the columns
names(annot)

# you can look at these columns
head(annot$V1)

# you may also assign your own names to these columns
names(annot) <- -="" 10="" 1:2="" 1="" 1kb="" access="" annot="" are="" at="" attr="" attributes="" automatic="" by="" c="" can="" chrom="" class="" columns="" cool="" do="" element="" end="" executed="" features="" functional="" get="" head="" indexing="" investigate="" is="" len="" lenghty="" let="" long.feat="annot[" longer="" look="" loopless="" max.print="100)" maximal="" mode="" name="" now="" null="" objectanno="" objects="" of="" operations="" options="" or="" output="" phase="" print="" programming="" rid="" rows="" s="" set="" slice="" source="" start="" strand="" tail="" than="" that="" the="" things="" this="" to="" too="" truth="" type="" typeof="" value="" values="" vector="" vectorization="" with="" you=""> 1000, ]

# the same coudl be done in more explici fashion
long.feat = subset(annot, len > 1000)

# you can group by multiple conditions
long.gene = subset(annot, (type=="gene" & len > 1000))

# save the output into a table
write.table(long.gene, "long.genes.txt", sep="\t", quote=FALSE, col.names=FALSE)

#
# typical R code will look like this
#
# function broadcasting with the various versions of apply
# see http://nsaunders.wordpress.com/2010/08/20/a-brief-introduction-to-apply-in-r/
#
by(annot[, c("len")], annot$type, mean)

# install ggplot2
install.packages("ggplot2")

# load the plotting library
library(ggplot2)

# the diamonds dataset comes with the library
# it is a test dataset with many dimensions
# save the diamonds dataset into a file
write.table(diamonds, "diamonds.txt", sep="\t", quote=FALSE)

# default plot
qplot(carat, price, data=diamonds)

# reduce overplotting with transparency
qplot(carat, price, data=diamonds, alpha=I(1/20))

# a third dimension used to set plot symbols
qplot(carat, price, data=diamonds, shape=cut)

# or we could set the colors
qplot(carat, price, data=diamonds, color=cut)

# or we could set the colors or both
qplot(carat, price, data=diamonds, color=cut, shape=cut)

# you can override the default plot types
qplot(color, data=diamonds, geom="bar")

# perform operations
qplot(color, carat/price, data=diamonds, geom="boxplot")

# jitter plot with transparency
qplot(color, carat/price, data=diamonds, geom="jitter", alpha=I(1/20))

# multiplotting
qplot(price, carat, data=diamonds, facets=color ~ ., binwidth=0.1)

# large scale multiplot with facets and colors
qplot(carat, price, data=diamonds, facets=cut ~ clarity, color=color, alpha=I(1/20))

# let's visualize our own data now

# keep only some type of features
keep = c("gene", "binding_site", "long_terminal_repeat" ,"tRNA")
short = subset(annot, annot$type %in% keep)
qplot(len, data=short, facets = type ~ ., xlim=c(0, 2000))

Lecture 29 - Usage tips

#
# this is how you can loop over variables
# note how the variable substitution works
#
for name in one two three hello world;
do
    echo $name
    echo $name! and $name.1.gff
    echo "---"
done

#
# add the entire path to tophat and cufflinks to your PATH variable
#
# the path may be slightly different if you are using a Linux system
#
export PATH=$PATH:~/src/tophat-2.0.10.OSX_x86_64/:~/src/bowtie2-2.1.0/

Lecture 28 - RNA-Seq (part 1)

# making a transcriptome

# install both tophat and cufflinks and link them both to the bin folder

# link both bowtie2, and other supporting programs to the bin folder
ln -s ~/src/bowtie2-2.1.0/bowtie2 ~/bin
ln -s ~/src/bowtie2-2.1.0/bowtie2-build ~/bin
ln -s ~/src/bowtie2-2.1.0/bowtie2-align ~/bin
ln -s ~/src/bowtie2-2.1.0/bowtie2-inspect ~/bin

# link tophat
ln -s ~/src/tophat-2.0.10.OSX_x86_64/tophat ~/bin

# add cufflinks to the bin
ln -s ~/src/cufflinks-2.1.1.OSX_x86_64/cufflinks ~/bin
ln -s ~/src/cufflinks-2.1.1.OSX_x86_64/gffread ~/bin

# use bedtools getfasta it you have a 12 column bed file
bedtools getfasta -fi ~/refs/yeast/sc.fa -bed demo-gene.bed -split -fo  trans.fa

# or you can run the gffread tool from cufflinks
gffread -g ~/refs/yeast/sc.fa demo-gene.gff -w trans.fa

# make a small demo reference folder for this lecture
mkdir demoref
mv trans.fa demoref

# simulate reads from the demo reference
# note that we simulate short reads and short fragments,

# play around with this to see the effect of it on the alignments
wgsim -N 10000 -1 35 -2 35 -d 100 demoref/trans.fa r1.fq r2.fq

# align the simulated dataset
bash aln.sh ~/refs/yeast/sc.fa r1.fq r2.fq

# as you increase the insert size the short transcript cannot evenly accomodate
# the fragments
wgsim -N 10000 -1 35 -2 35 -d 500 demoref/trans.fa r1.fq r2.fq

# align and visualize the results
tophat -G demo-gene.gtf  ~/refs/yeast/sc.fa r1.fq r2.fq

# let's install the FLUX simulator
# download and install the FLUX simulator
ln -s ~/src/flux-simulator-1.2.1/bin/flux-simulator ~/bin

#
# to use the simulator the genome needs to be split by chromosome
# one file per chromosome and the file must be named by the chromosome
#
# you have several options perhaps extracting into a file via bedtools getfasta one chromosome at a time
# or keeping certain lines from files

# where does chromosome 2 start
mkdir fasta
cat -n ~/refs/yeast/sc.fa | grep chrII | head -1

# take the chromosome
head -2879 < ~/refs/yeast/sc.fa > fasta/chrI.fa

# see what files are there
ls fasta

# run the flux simulator
flux-simulator -l -x -s -p simulate.par --force

# process the simulated reads
tophat -G demo-gene.gtf  ~/refs/yeast/sc.fa simulate.fastq

# index the sam file
samtools index tophat_out/accepted_hits.bam

Lecture 27 - Chip-Seq peak calling

# get the datafile, unpack and align
curl -O http://www.personal.psu.edu/iua1/courses/files/2013/data/lec26.tar.gz
tar xzvf lec26.tar.gz
bwa mem ~/refs/yeast/sc.fa long.fq | samtools view -Sb - > temp.bam
samtools sort temp.bam long
samtools index long.bam

# get MACS,
# the people that wrote MACS do not quite understand how Python installation
# works so you need to work around that
cd ~/src/
curl -OL https://github.com/downloads/taoliu/MACS/MACS-1.4.2-1.tar.gz
tar xzvf MACS-1.4.2-1.tar.gz
cd MACS-1.4.2

# install into a local directory
python setup.py install --prefix local

# add that local directory to your python path
export PYTHONPATH=$PYTHONPATH:~/src/MACS-1.4.2/local/lib/python2.7/site-packages/

# link the macs14 executable to the bin
ln -s ~/src/MACS-1.4.2/local/bin/macs14 ~/bin/macs14

# run macs to look at the output
~/bin/macs14

# get SISSRS
curl -O http://dir.nhlbi.nih.gov/papers/lmi/epigenomes/sissrs/sissrs_v1.4.tar.gz
tar zxvf sissrs_v1.4.tar.gz
mv sissrs.pl ~/bin

# get genetrack
cd ~/src
git clone git@github.com:ialbert/chipexo.git

# onto peak predictions

# run with macs14
macs14 -t long.bam --name long-macs

# run with sissrs
# turn the bam file into bed intervals
~/bin/bedtools bamtobed -i long.bam > long.bed
~/bin/sissrs.pl -i long.bed -s 15000000 -o long-sissrs.txt
cat long-sissrs.txt | grep "chr" > long-sissrs.bed

# run with genetrack
python ~/src/chipexo/genetrack/genetrack.py long.bed -v > long-genetrack.gff

# create a bedgraph to see the underlying probabilities
python ~/src/chipexo/genetrack/genetrack.py long.bed -b  > long-genetrack.gff

Lecture 26 - Chip-Seq data analysis

#
# download and unpack the Chip-Seq data
#
curl -O http://www.personal.psu.edu/iua1/courses/files/2013/data/lec26.tar.gz
tar zxvf lec26.tar.gz

#
# investigate the files, lines, what encoding are they in?
#
wc -l *.fq
head long.fq

#
# perform the alignments for both short and long dataset
#
bwa mem ~/refs/yeast/sc.fa long.fq > long.sam
samtools view -Sb long.sam >tmp.bam
samtools sort tmp.bam long
samtools index long.bam

#
# visualize it in IGV, what do we see, well, trouble of course
# seems that there is a strand bias, let's deduplicate
#
samtools rmdup -s long.bam long-dedup.bam
samtools index long-dedup.bam

#
# now look at the file again, looks better
# the position deterimination is much better,
# but we altered the occupancy for sure, highly covered locations will
# most certainly show higher natural duplicates, and we just normalized those away
# use this to determine locations but not occupancy
#

#
# well, let's do some peak calling
# the poor man's peak predictor is a simple awk script
# that starts a peak when the coverage is higher than
# a certain value and stops ends the peak when the coverage is below a value
#
samtools depth long-dedup.bam | awk -f poor-mans-peak-pred.awk > long-peaks.gff

#
# investigate the peaks (intervals that you get), create intervals for midpoints only
#

# fragment size estimation, install the bioawk-tools package
git clone git@github.com:ialbert/bioawk-tools.git


# I'll go back to the data folder and create a shortcut there

cd ~/work/lec26
ln -s ~/src/bioawk-tools/chipfrag.awk .

# this will produce the number of overlapping 5' ends in the data
cat long.bed | awk -f chipfrag.awk

# we can shift the + starts and see how many match then
cat long.bed | awk -v shift=100 -f chipfrag.awk

# to run it on multiple shifts
for i in $(seq 20 5 200); do awk -v shift=$i -f chipfrag.awk long.bed; done

# put the results into a file and visualize them

#
# poor man's peak predictor
#
# processes a samtools depth output
# and produces GFF intervals for the regions of high coverage
#
BEGIN {
    MINCOV = 10
    inside = 0
    print " ##gff-version 3"
}

$3 >= MINCOV {
    if (inside == 0){
        chrom  = $1
        start  = $2
        inside = 1
    }
}

$3 < MINCOV {
    if (inside == 1){
        end = $2
        inside = 0
        # create a new attribute to show the midpoint as ID
        mid = int((start + end)/2)
        attr = sprintf("ID=P%d",mid)
        print chrom, "ppred", "peak", start, end, ".", ".", ".", attr
    }
}

Lecture 25 - Metagenomic classification

#
# download the classifier from sourceforge unzip it then install it into src
#
# http://sourceforge.net/projects/rdp-classifier/
#
#
mv ~/Downloads/rdp_classifier_2.6 ~/src

# read more about the tool

# see how it works
java -Xmx1g -jar ~/src/rdp_classifier_2.6/dist/classifier.jar

# you can create a shortcut to it
alias rdp='java -Xmx1g -jar ~/src/rdp_classifier_2.6/dist/classifier.jar'

# run the shortcut
rdp

# run the classification, this takes about 15 seconds
rdp classify meta.fa -f fixrank -o sequences.txt -h meta.fa.rdp.txt

# investigate the output of the rdp-class.txt file
more meta.fa.rdp.txt

#
# We can also run a different classification this time with blastn
#
#
# download the 16S microbial sequences as a blast database
#

# create a storage folder
cd ~/refs
mkdir 16S
cd 16S

# these are precompiled blast databases
curl -O ftp://ftp.ncbi.nlm.nih.gov/blast/db/16SMicrobial.tar.gz
tar zxvf 16SMicrobial.tar.gz

# now go to the lecture folder
mkdir ~/work/lec25
cd ~/work/lec25
curl -O http://www.personal.psu.edu/iua1/courses/files/2013/data/meta.tar.gz
gunzip meta.fa.gz

# we will now classify this with two different methods

# first aling with blastn against the
# this took about 7 minutes on my system
# you can also get the results from the course webpage
time blastn -db ~/refs/16S/16SMicrobial -query meta.fa -outfmt 6  > meta.fa.blastn

#
# you can now load this file into MEGAN and compare the results of the classifications
# with both methods
#

Lecture 24 - Genome assembly

#
# install the velvet and minia tools
#
cd ~/src
git clone https://github.com/jmarshall/velvet.git
cd velvet

# you could just type make but we'll customize the compilation
# to allow for larger kmer sizes
make 'MAXKMERLENGTH=71'

#
# velvet is consist of two tools, computing kmers with a certain hash size
# assembling the kmers that were computed
#
# link the executables
ln -s ~/src/velvet/velveth ~/bin
ln -s ~/src/velvet/velvetg ~/bin

# let's also install quast, this will require the following
sudo easy_install pip
sudo pip install matplotlib

# download quast
mv ~/Downloads/quast-2.2.tar.gz ~/src/
cd ~/src

# quast comes as a python program and all it needs is to be unpacked
tar zxvf quast-2.2.tar.gz

# now next we'll create a smaller yeastgenome with only the first three chromosomes
# this is to be able to get high enough coverage of the data and
# be able to run the assembler in minutes

# at what line does the fourth chromosome start
cat -n ~/refs/yeast/sc.fa | grep chrIV

# get the lines right before that
head -17004 ~/refs/yeast/sc.fa > mini.fa

# simulate 100,000 (or more reads from this data)
wgsim -N 100000 mini.fa r1.fq r2.fq > mutations.txt

# create a single end data to compare to paired end performance
# s.fq will contain the data as single end reads
cat r1.fq r2.fq > s.fq

# assemlby is a two step process, compute hashes for single end
~/bin/velveth s31 31 -fastq -short s.fq

# compute the actual assembly
~/bin/velvetg s31 -exp_cov auto

# we can repeat the same process with the same data but this time as paired end
~/bin/velveth p31 31 -fastq -shortPaired -separate r1.fq r2.fq

# and run the graph assembly again
~/bin/velvetg p31 -exp_cov auto

# let's compare the assemblies
~/src/quast-2.2/quast.py s31/contigs.fa p31/contigs.fa -R mini.fa

# now open the HTML file quast_results/latest/report.html
# to view the results of quast

Lecture 23 - Interval intersect

#
# we've done this before but just a quick reminder
#
#
# get the feature file
#
curl -o sc.gff http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff

#
# keep only the genes, and also simplify the attributes
# to only contain the first attribute pair
#
# remeber that awk needs to join by tab
# usually this means that awk=awk -v "OFS=\t" is set
#
cat sc.gff | awk ' $3=="gene" { split($9,x,";"); $9=x[1]; print $0 }' > genes.gff

# this above looks good but it does not have the first header line now you could do
# the following
head -1 sc.gff > genes.gff
cat sc.gff | awk ' $3=="gene" { split($9,x,";"); $9=x[1]; print $0 }' >< genes.gff

#
# or if you are more adventurous run it as a single command line
# the parentheses make the left side evaluate first before putting into the genes.gff
# the && symbol is the AND shell operator that causes both sides to evaluate
#
(head -1 sc.gff && cat sc.gff | awk ' $3=="gene" { split($9,x,";"); $9=x[1]; print $0 }') > genes.gff

# view the content of genes.gff
head genes.gff

# intersect alignment to create another bam file
~/bin/bedtools intersect -abam r1.fq.bam -b genes.gff > genomic.bam
~/bin/samtools index genomic.bam

# now view the genomic.bam file in IGV
# note how it only contains alignments that cover certain regions

# you can also use the -v (like grep) to give you the opposite of intersect
~/bin/bedtools intersect -abam r1.fq.bam -b genes.gff -v > non-genomic.bam
~/bin/samtools index non-genomic.bam

# turn the BAM file into BED file
~/bin/bedtools bamtobed -i r1.fq.bam > r1.fq.bed

# inspect the file
# bamtobed gives you a few options as to what to place inside the value column
head r1.fq.bed

# compute nucleotide composition for each interval
~/bin/bedtools nuc -fi ~/refs/yeast/sc.fa -bed genes.gff | more

# groupby is a powerful utility
# for it to work the data must be sorted by the grouping column
# what is the average length of the features by type in sc.gff
#
# for example average gene lenght is X
#
# get the rows that have 9 columns (valid GFF)
cat sc.gff | awk 'NF == 9 { print $0 } ' > features.gff

# place the lenght of the read into the value column of the GFF (6th column)
cat features.gff | awk ' { $6=$5-$4+1; print $0 } ' > temp.gff

# now sort by the column we try to group by (column 3)
sort -k 3 temp.gff > sizes.gff

# now perform the groupby operation
~/bin/bedtools groupby -i sizes.gff -g 3 -c 6 -ops mean | head

# in lecture 20 wgsim generated a mutations.txt file
# transform that into gff
# make a one base GFF file out of them where the place the refernce base into the source co
cat mutations.txt | awk ' { print $1, "wgsim", "variant", $2, $2, ".", $5, ".", "ref="$3 ";alt="$4 }' > mutations.gff

# mutations that overlap with snps
~/bin/bedtools intersect -a mutations.gff -b snps.vcf

# during subtractions the file order is essential
~/bin/bedtools subtract -a mutations.gff -b snps.vcf -A | wc -l
~/bin/bedtools subtract -a snps.vcf -b mutations.gff -A | wc -l

Lecture 22 - Interval analysis, using bedtools

#
# install bedtools
#
# download the code
curl https://bedtools.googlecode.com/files/BEDTools.v2.17.0.tar.gz > ~/src/BEDTools.tar.gz
#
# unpack and compile the repository
cd ~/src
tar xzvf BEDTools.tar.gz
cd bedtools-2.17.0
make
# link to the bin folder
ln ~/src/bedtools-2.17.0/bin/bedtools ~/bin/bedtools

# test that bedtools works properly
~/bin/bedtools

# let's expand an interval stored in say demo.bed (see file below)

~/bin/bedtools slop

# looks like we need a genome file, it even gives you a nice example

#
# the many ways to generate a chromosome/size file
#

# this will load the genome into memory, only do it for small genomes
cat ~/refs/yeast/sc.fa | ~/bin/bioawk -c fastx ' { print $name, length($seq) } '

# samtools faidx also creates a small index file
# this of course may take a while but you most likely already have done this
~/bin/samtools faidx ~/refs/yeast/sc.fa
cat ~/refs/yeast/sc.fa.fai | cut -f1,2

# we can do it with blastdbcmd
# this requires that you build a blast database, that too may take a while
~/bin/blastdbcmd -db ~/refs/yeast/sc_full -entry 'all' -outfmt "%a %l"

# ok but that does not put in tabs and won't recognize \t either, sigh, tab problems again?
# how about this, use the tr command transl
~/bin/blastdbcmd -db ~/refs/yeast/sc_full -entry 'all' -outfmt "%a %l" | tr ' ' '\t'

# or you could look at the header of an alignment file and extract from there
# but that needs a little work with tr
~/bin/samtools view -h ../lec19/r1.fq.bam | head -100 | grep "@SQ" | cut -f2,3 | tr -d "SN:" | tr -d "LN:"

# ok, let's get back to slop
# extend the interval on both sides
~/bin/bedtools slop -i demo.bed -g genome.txt -b 10

# you can extend on left and right non-directionally
$ ~/bin/bedtools slop -i demo.bed -g genome.txt -l 10 -r 0

# or in a strand specific manner
$ ~/bin/bedtools slop -i demo.bed -g genome.txt -l 10 -r 0 -s

# it is best if one also visualizes what the tool does
$ ~/bin/bedtools flank -i demo.bed -g genome.txt -l 10 -r 0 -s

# generate the intervals that are not covered by the file
~/bin/bedtools complement -i demo.gff -g genome.txt

# extract the sequence for the intervals in the file
~/bin/bedtools getfasta -bed demo.gff -fi ~/refs/yeast/sc.fa -fo output.fa

# the file is a regular fasta file
more output.fa

The BED file:

chrI    100 200 one 0   +
chrI    300 400 two 0   -

The GFF file:

chrI    .   one 101 200 0   +   .   .
chrI    .   two 301 400 0   -   .   .

Lecture 21 - Interval data formats: BED and GFF

same information represented in three different formats:
GTF:

chrI    demo    exon    1000    2000    .   +   .   gene_id "G1"; transcript_id "T1";
chrI    demo    exon    3000    4000    .   +   .   gene_id "G1"; transcript_id "T1";
chrI    demo    exon    5000    6000    .   +   .   gene_id "G1"; transcript_id "T1";
chrI    demo    exon    7000    8000    .   +   .   gene_id "G1"; transcript_id "T1";
chrI    demo    exon    1000    2000    .   +   .   gene_id "G1"; transcript_id "T2";
chrI    demo    exon    3000    4000    .   +   .   gene_id "G1"; transcript_id "T2";
chrI    demo    exon    7000    8000    .   +   .   gene_id "G1"; transcript_id "T2";
chrI    demo    exon    1000    2000    .   +   .   gene_id "G1"; transcript_id "T3";
chrI    demo    exon    5000    6000    .   +   .   gene_id "G1"; transcript_id "T3";
chrI    demo    exon    7000    8000    .   +   .   gene_id "G1"; transcript_id "T3";

GFF3:

##gff-version 3
chrI    demo    exon    1000    2000    .   +   .   Parent=T1,T2,T3;
chrI    demo    exon    3000    4000    .   +   .   Parent=T1,T2;
chrI    demo    exon    5000    6000    .   +   .   Parent=T1,T3;
chrI    demo    exon    7000    8000    .   +   .   Parent=T1,T2,T3;

BED:

chrI    999 8000    T1  0   +   999 8000    0,0,255 4   1000,1000,1000,1000 0,2000,4000,6000
chrI    999 8000    T2  0   +   999 8000    0,0,255 3   1000,1000,1000  0,2000,6000
chrI    999 8000    T3  0   +   999 8000    0,0,255 3   1000,1000,1000  0,4000,6000

Lecture 20 - SNP calling with freebayes, VCF format

# install the FreeBayes snp caller

# homepage https://github.com/ekg/freebayes

# as per documentation the git clone should be recursive
git clone --recursive git://github.com/ekg/freebayes.git

# as per documentation check out the stable version
git checkout c993c5c07e7673

# the tool uses cmake instread of make so install that as well
# on the Mac use:
brew install cmake

# on Ubuntu Linux find the cmake tool and install it with apt-get
sudo apt-get install cmake

# now switch to the freebayes and make it
cd freebayes
make
ln -s ~/src/freebayes/bin/freebayes ~/bin

# look at the paramters it generates
~/bin/freebayes

# generate SNPS with freebayes
~/bin/freebayes -f ~/refs/yeast/sc.fa r1.fq.bam > r1.bayes.vcf

# generate SNPS with samtools
~/bin/samtools mpileup -uDf ~/refs/yeast/sc.fa r1.fq.bam > r1.samtools.bcf
~/bin/bcftools view -gv r1.samtools.bcf > r1.samtools.vcf

# index with igvtools, this can also be performed in IGV
~/bin/igvtools index r1.bayes.vcf
~/bin/igvtools index r1.samtools.vcf

Lecture 19 - SNP calling with samtools, VCF format


# investigate samtools mpileup
samtools mpileup

# this will generate a pileup output for the alignment files
~/bin/samtools mpileup r1.fq.bam | head

# samtools computes the genoytpe likelihoods and puts them into a bcf file
# we will limit it to chrI so that it finishes quickly
samtools mpileup -uDf ~/refs/yeast/sc.fa -r chrI r1.fq.bam > r1.bcf

#
# bcftools calls the variants and infers other statistics
#
# link the bcftools into the bin folder
ln -s ~/src/samtools-0.1.19/bcftools/bcftools ~/bin/bcftools

# investigate what bcftools does
~/bin/bcftools

# investigate what the bcf view does
~/bin/bcftools view

# view the bcf file, this file contains calls for every base!
~/bin/bcftools view r1.bcf | more

# calls the genotypes and output only variant calls, the output is VCF
~/bin/bcftools view -gv r1.bcf | more

# store the snps in a VCF file
~/bin/bcftools view -gv r1.bcf > snps.vcf

# you will need to index the vcf file to dipsplay in IGV
# the index is IGV specific use the Tools->Run igvtools-Index
# or you can also do it from command line
# after installing the igv tools package

~/bin/igvtools index snps.vcf

Lecture 18 - Read duplication/genome visualization

# download and install wgsim from
# https://github.com/lh3/wgsim

# removing duplicated reads with samtools

# tired of adding ~/bin to every command?
# add the bin folder to your search PATH
# this only needs to be done once

# on a Mac
echo "export PATH=$PATH:~/bin" >> ~/.profile
source ~/.profile

# on Linux
echo "export PATH=$PATH:~/bin" >> ~/.bashrc
source ~/.bashrc

# look at the options of rmdup
samtools rmdup

# perform the duplicate removal
# will print statistics on the process
samtools rmdup r1.fq.bam r1.unique.bam

# there is a samtools specific indexing as well
# this indexing is required for samtools operations
# that access the reference
samtools faidx ~/refs/yeast/sc.fa

# pileup output
# see mpileup manual:
# see 5 Things to know about mpileup -> http://massgenomics.org/2012/03/5-things-to-know-about-samtools-mpileup.html
samtools mpileup -f ~/refs/yeast/sc.fa r1.fq.bam | head

# samtools comes with a simple text based genome viewer
samtools tview

# run it on the alignment
samtools tview r1.fq.bam ~/refs/yeast/sc.fa

# dot: . means exact match on the forward strand
# comma: , means exact match on reverse strand

Lecture 17 - Align paired end read

# download and install wgsim from
# https://github.com/lh3/wgsim

cd ~/src
git clone https://github.com/lh3/wgsim
cd wgsim

# you forgot the Makefile Heng!, look into the README
gcc -g -O2 -Wall -o wgsim wgsim.c -lz -lm

# link to your bin folder
ln -s ~/src/wgsim/wgsim ~/bin/wgsim

# run wgsim to look at its options

# generate 1 million paired end reads
~/bin/wgsim ~/refs/yeast/sc.fa r1.fq r2.fq > mutations.txt

# modify your aligner script to take paired end reads
bash aln.sh ~/refs/yeast/sc.fa r1.fq r2.fq

# evaluate the mapping
~/bin/samtools idxstats r1.fq.bam

# look at the flag statistics
~/bin/samtools flagstat r1.fq.bam

# how many reads have pairs
~/bin/samtools view -c -f 1 r1.fq.bam

# how many reads that have pairs are mapped
~/bin/samtools view -c -f 1 -F 4 r1.fq.bam

# read has a pair it is mapped but the mate is unmapped
~/bin/samtools view -c -f 9 -F 4  r1.fq.bam

# read has a pair it is mapped on forward strand but the mate is unmapped
~/bin/samtools view -c -f 9 -F 20  r1.fq.bam

The script that performs the alignment aln.sh:

# takes index name and query from command line
# produces a SAM alignment

# bail out on errors
set -ue

# first parameter is the reference genome
REF=$1

# second two parameters are the paired reads
READ1=$2
READ2=$3

# now set the output file names based on the query name
SAM=$READ1.sam

BAM=$READ1

# perform the alignments and generate two alignment files
~/bin/bwa aln $REF $READ1 > S1.sai
~/bin/bwa aln $REF $READ2 > S2.sai

# format the alignments as a paired end SAM file
#~/bin/bwa samse $REF S1.sai $READ1 > $SAM

~/bin/bwa sampe $REF S1.sai S2.sai $READ1 $READ2 > $SAM

# transform the SAM file to BAM
~/bin/samtools view -Sb $SAM > temp.bam

# sort the samfile
# try not to use -f parameter
~/bin/samtools sort temp.bam $BAM

# index the BAM file
~/bin/samtools index $BAM

# cleanup
rm -f *.sai *.sam temp.bam

echo "Finished ref=$REF, read1=$READ1, read2=$READ2, bam=$BAM"

Lecture 16 - Creating BAM file, using samtools

# download and install samtools 
git clone https://github.com/samtools/samtools.git

# in your src folder
cd samtools
make
ln -s ~/src/samtools/samtools ~/bin/samtools

# all of Heng Li's tool will tell you what they do
# when you just run them
~/bin/samtools

# if you need help on just one samtools command
# just expand the tool command line with it
~/bin/samtools view

# rename (move) your data
mv lec15.fq data.fq

# run our aligner from the previous lecture
bash aln1.sh ~/refs/yeast/sc.fa data.fq

# this creates a SAM file

# all your alignment files should be kept as
# sorted and indexed BAM files
# by default samtools expects to see a BAM file
# then produces a viewable SAM file from it

# to create a BAM file from a SAM file
# we put the output into a temporary file because we
# need to also sort it and the sorted file will be our final BAM file
~/bin/samtools view -Sb data.fq.sam > tempfile.bam

# now sort the BAM file and put the result into a different file
~/bin/samtools sort -f tempfile.bam data.bam

# finally index the BAM file
~/bin/samtools index data.bam

# let's delete the temporary bam file
rm -f tempfile.bam

# look at the files
# you will need to see both bam file and a bam.bai file
ls

# add these steps into your alignment script and call it aln.sh

# now let's play with samtools

# how many alignments does the BAM data have
~/bin/samtools view -c data.fq.bam

# show the first few alignments
~/bin/samtools view data.fq.bam | head

# how many unaligned reads
~/bin/samtools view -c -f 4 data.fq.bam

# how many proper alignments
~/bin/samtools view -c -F 4 data.fq.bam

# how many reads align on the reverse strand
~/bin/samtools view -c -f 16 data.fq.bam

# how many reads align on the forward strand
~/bin/samtools view -c -F 16 data.fq.bam

# how many reads have the minimum mapping quality of 1 (aka align uniquely)
~/bin/samtools view -c -q 1 data.fq.bam

# how many reads on the reverse strand have the minimum mapping quality of 1
~/bin/samtools view -c -q 1 -F 16 data.fq.bam

# flag statistics
~/bin/samtools flagstat data.fq.bam

# mapping statistics
~/bin/samtools idxstats data.fq.bam

# depth of coverage per base
~/bin/samtools depth data.fq.bam | head

# query a bam file
# how many reads fall within the region of 1000 to 2000 on chromosome V
~/bin/samtools view -c data.fq.bam chrV:1000-2000

# show the alignments that fall within the region of 1000 to 2000 on chromosome V
~/bin/samtools view data.fq.bam chrV:1000-2000

# how many alignments that fall in the region of 1000 to 2000 on chromosome V
# are on the reverse strand and align uniquely (as labeled by bwa)
~/bin/samtools view -c -f 16 -q 1 data.fq.bam chrV:1000-2000

The script that performs the alignment aln.sh:

# takes index name and query from command line
# produces a SAM alignment

# bail out on errors
set -ue

# first parameter is the reference genome
REF=$1

# second parameter is the query
QUERY=$2

# now set the output file names based on the query name
SAI=$QUERY.sai
SAM=$QUERY.sam
TMP=tmp.bam
BAM=$QUERY.bam

# perform the alignments via bwa
~/bin/bwa aln $REF $QUERY > $SAI
~/bin/bwa samse $REF $SAI $QUERY > $SAM

# transform the SAM file to BAM
~/bin/samtools view -Sb $SAM > $TMP

# sort the samfile
~/bin/samtools sort -f $TMP $BAM

# index the BAM file
~/bin/samtools index $BAM

echo "Finished ref=$REF, query=$QUERY, bam=$BAM"

Lecture 15 - the SAM (Sequence Alignment Map) format


# cut out the first line of the yeast genome
head -2 ~/refs/yeast/sc.fa

# from this create a fastq file that has the following
# an exact match a mismatch, a deletion and a reverse complement

# the simplest way to reverse complement a fastq record is via seqtk
~/bin/seqtk seq -r samtest.fq

# run the alignment
sh aln1.sh ~/refs/yeast/sc.fa samtest.fq

# look the files it has created
ls

# cut out certain columns
cat samtest.fq.sam | cut -f 1-3

# skip the SAM headers
cat samtest.fq.sam | grep -v "@SQ" | cut -f 1-3

# bioawk has SAM aware behavior
# it can recognize the following columns
# qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual
cat samtest.fq.sam | ~/bin/bioawk -c sam '{print $qname, $flag, $cigar}'

# or you can filter for flags (although there are better ways to do that)
# select records where the flag has the bit 2^4 (16) set
cat samtest.fq.sam | ~/bin/bioawk -c sam ' and($flag,16) {print $qname, $flag, $cigar}'

The input query file:

@exact
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTA
+
2222222222222222222222222222222222222222222222222222222222222222222222222
@mismatch10AG
CCACACCACGCCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTA
+
2222222222222222222222222222222222222222222222222222222222222222222222222
@delete10A
CCACACCACCCCACACACCCACACACCACACCACACACCACACCACACCCACACACACACATCCTAACACTA
+
222222222222222222222222222222222222222222222222222222222222222222222222
@rev-comp
AAAAAAAAAAATAGTGTTAGGATGTGTGTGTGTGGGTGTGGTGTGGTGTGTGGTGTGGTGTGTGGGTGTGTGGGTGTGGTGTGG
+
222222222222222222222222222222222222222222222222222222222222222222222222222222222222

The script that performs the alignment aln1.sh:

# takes index name and query from command line
# produces a SAM alignment

# bail out on errors
set -ue

# first parameter is the reference genome
REF=$1

# second parameter is the query
QUERY=$2

# now set the output file names based on the query name
SAI=$QUERY.sai
SAM=$QUERY.sam

# perform the alignments via bwa
~/bin/bwa aln $REF $QUERY > $SAI
~/bin/bwa samse $REF $SAI $QUERY > $SAM

Lecture 14 - introduction to short read mapping, bwa

# download and copy the archive to ~src
# then compile and link it

# the code is compressed with bzip2
tar jxf bwa-0.7.5a.tar.bz2
cd bwa-0.7.5a
make
ln -s ~/src/bwa-0.7.5a/bwa ~/bin/bwa

# look the output of bwa
~/bin/bwa

# to get a detailed help on usage look at
man ~/src/bwa-0.7.5a/bwa.1

# also read the manual at: http://bio-bwa.sourceforge.net/bwa.shtml

# using copy/paste and/or other methods create an input FastQ file
# that contains one line of sequence from the yeast genome
head -2 ~/refs/yeast/sc.fa

# we assume that you put the result into query.fq
cat query.fq

# create an alignment with the aln method
~/bin/bwa aln ~/refs/yeast/sc.fa query.fq > query.sai

# format the *.sai alignment into a SAM file (sequence alignment map)
# you can format via samse (SAM single end) or sampe (SAM paired end)
~/bin/bwa samse ~/refs/yeast/sc.fa query.sai query.fq > query.sam

# view the resulting alignment
more query.sam

# use the mem algorithm we skip the intermediary alignment file generation
~/bin/bwa mem ~/refs/yeast/sc.fa query.fq > query.sam

# turn a fastq file into a fasta file
~/bin/seqtk seq -A query.fq > query.fa

# here is how you would run blastn on the same query
~/bin/blastn -db ~/refs/yeast/sc -query query.fa -dust no

Lecture 13 - shell scripts, using flash and trimmomatic

# find your most duplicated sequences
cat data_R1.fq | ~/bin/bioawk -c fastx '{print substr($seq, 1, 30) }' | sort | uniq -c | sort -gr | head

# blast this against NR (nonredunant database)

#
# install and link both Trimmomatic and Flash into your ~/bin/folder
#
# see the links below for sourcecode
#
# you may use the wrapper script for Trimmomatic (see below) or launch it
# from the command line

# these two should produce the same output
java -jar ~/src/Trimmomatic-0.30/trimmomatic-0.30.jar

# versus
~/bin/trimmomatic

# ok, here naming gets a little crazy, each pair will have paired and unpaired output
# the unpaired file is the one where the pair did not survive the process

# come up with your own naming conventions
~/bin/trimmomatic PE -phred33 data_R1.fq data_R2.fq a1.fq a1.U.fq a2.fq a2.U.fq ILLUMINACLIP:polyA.fa:2:3:3 MINLEN:36 SLIDINGWINDOW:4:25

# now let's cut the adapter
~/bin/trimmomatic PE -phred33 a1.fq a2.fq b1.fq b1.U.fq b2.fq b2.U.fq ILLUMINACLIP:adapter.fa:2:3:3 MINLEN:36 SLIDINGWINDOW:4:25

# estimate the number of sequences in each file
wc -l *.fq

# merge the fragments with flash
~/bin/flash b1.fq b2.fq

# generate quality control reports for the results
~/bin/fastqc out.extendedFrags.fastq
~/bin/fastqc out.notCombined_1.fastq
~/bin/fastqc out.notCombined_2.fastq

A pipeline that can be run from the command line


# this makes the script stop on
# missing variables or an error
set -ue

# take input from command line
FIRST_PAIR=$1
SECOND_PAIR=$2

# collect commands into a single file
~/bin/fastqc $FIRST_PAIR
~/bin/fastqc $SECOND_PAIR

A wrapper script around Trimmomatic. Add this file into the ~/bin folder and make it executable with chmod +x ~/bin/trimm. The purpose of this is to simplify the way we run the tool:

#!/bin/bash

#
# this is a "wrapper" script that aims to make running trimmomatic a little more sane
#

# for this program to work it needs to be placed in the ~/bin folder
# then it needs to be made executable
# chmod +x ~/bin/trimmomatic

# the $@ symbol passes all command line parameters as they were listed
java -jar ~/src/Trimmomatic-0.30/trimmomatic-0.30.jar $@

Lecture 12 - Paired end reads, more quality control

# note: you may need to install a number of extra libraries on linux:
sudo apt-get install build-essential ncurses-dev byacc zlib1g-dev python-dev git

# onto bioawk
# visit the repository for information: https://github.com/lh3/bioawk

# clone the bioawk repository
git clone https://github.com/lh3/bioawk.git
cd bioawk
make

# link bioawk to your bin
ln -s ~/src/bioawk/bioawk ~/bin/bioawk

# note how bioawk creates 'magic' variables $name, $seq, $qual
# show the read names
cat sample1.fq | ~/bin/bioawk -c fastx '{ print $name }' | head

# show the read sequences
cat sample1.fq | ~/bin/bioawk -c fastx '{ print $seq }' | head

# the equivalent of this with regular grepping would be
# but this only works if each sequence occupies only one line
cat sample1.fq | awk ' NR%4 == 2 { print $1 }' | head

# moreover bioawk seamlessly works on fasta and fastq files
cat ~/refs/yeast/sc.fa | ~/bin/bioawk -c fastx '{ print $name }' | head

# install the Trimmomatic tool
cd ~/src
curl http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.30.zip -o Trimmomatic-0.30.zip
unzip Trimmomatic-0.30.zip
rm -f Trimmomatic-0.30.zip

# make an alias file to ease the run of this tool
# add it to your .profile or .bashrc
alias trimm='java -jar ~/src/Trimmomatic-0.30/trimmomatic-0.30.jar'

# you will need to create FASTA files that contain the adapters
# the tool comes with some examples, see the actual content of the files below

# now let's run the tool, it is a multistep process
trimm SE -phred33 data_R1.fq step1.fq ILLUMINACLIP:polyA.fa:2:3:3 MINLEN:36 SLIDINGWINDOW:4:25

# now let's run the tool
trimm SE -phred33 step1.fq step2.fq ILLUMINACLIP:adapter.fa:2:3:3 MINLEN:36 SLIDINGWINDOW:4:25

# etc, you can run more steps, after each run fastqc to evaluate

content of polyA.fa:

>polyA
AAAAAAAA

The content of adapter.fa

>trueseq
AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC

Lecture 11 - Illumina adapters, quality control


# On a Mac install HomeBrew, this only needs to be done once see http://brew.sh/
#
ruby -e "$(curl -fsSL https://raw.github.com/mxcl/homebrew/go)"

# on a Mac install git using HomeBrew
brew install git

# on linux you may need to use your package manager
# to install git and other required dependencies
# sudo means "superuser do"
sudo apt-get install build-essential zlib1g-dev python-dev git

# clone the seqtk repository from github: https://github.com/lh3/seqtk
#
# go to the src folder
cd ~/src

# clone the repository
git clone https://github.com/lh3/seqtk.git

# switch to the repository and make it
cd seqtk
make

# link to bin
ln -s ~/src/seqtk/seqtk ~/bin/seqtk

# test that is works, note how many secondary commands it has
~/bin/seqtk

# now let's install cutadapt, it is a python based package
# two python installers easy_install and pip
# I lik pip, easy_install is usually already present
# why not use easy_install to install pip ... kind of crazy, ah well it is
sudo easy_install pip

# now we have pip, test it
pip

# install cutadapt, depending on the setup you may need to use sudo
sudo pip install cutadapt

# check that it works
cutadapt -h

# now onto actually doing something useful with these
# let's cut some adapters, check the fastqc output for sample1.fq
# what does it say for overrepresented sequences,
# note how an A needs to be added to the adapter as that is the overhang added to the sequence
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC sample1.fq > sample1.cut.fq

# now run a second fastqc report
# do you see any differences
~/bin/fastqc sample1.cut.fq

# let's run a quality trimming on the second dataset
~/bin/seqtk trimfq sample2.fq > sample2.trim.fq

# generate another fastqc output and compare to original
~/bin/fastqc sample2.trim.fq

Lecture 10 - Compressing, archiving data, FastQC reports

# get and uncompress the archive
curl http://www.personal.psu.edu/iua1/courses/files/2013/data/lecture-10.tar.gz -o lecture-10.tar.gz

# look at the file sizes
ls -lh

# compress the files
gzip sample1.fq

# look at the file sizes again, text files compress very well
ls -lh

# uncompress the file
gunzip sample1.fq.gz

# you may compress/uncompress multiple files at the same time
gzip *.fq

# gzcat allows you to list the contents of a compressed file
gzcat sample1.fq.gz | grep "TAAAT" | head

# uncompress them both
gunzip *.fq.gz

# this will create an archive
tar czvf lecture-10.tar.gz *.fq

# this wil unpack the archive
tar xzvf lecture-10.tar.gz

# now download and link the fastqc executable
# there is a slight mistake in the distribution, the file is not executable, fix that
chmod +x ~/src/FastQC/fastqc

# link it to the bin folder
ln -s ~/src/FastQC/fastqc ~/bin/fastqc

# run the program
~/bin/fastqc -h

# this is how to run on a single sample
~/bin/fastqc sample1.fq

#
# important flags include
#
# --nogroup
# --khmer
# --threads
~/bin/fastqc sample1.fq sample2.fq  -k 8  --nogroup

# you can always investigate for patterns for yourself
# find the GGAAG pattern followed by up to 10 of any base {1,10} and followed by the end of line ($)
cat sample1.fq | egrep "(GGAAG)(.){1,10}$" --color=always | head

Lecture 9 - FastQ encoding

# calc shortcut courtesy of TA Nick Stoler

# on a Mac
echo 'calc () { python -c "from math import *; print $1"; }' >> ~/.profile

# on Linux
echo 'calc () { python -c "from math import *; print $1"; }' >> ~/.bashrc

# this will become active on opening a new window or by
source ~/.profile

# the following two commands are now equivalent
python -c "print 1+1"
calc 1+1

# in general you will need to put statements within single quotes
calc '1 + 1'

# getting the ordinal value of a character
calc 'ord("A")'

# getting the character value of an integer
calc 'chr(65)'

# find the Phred score of A on the Sanger scale +33
calc 'ord("A") - 33'

# find the letter value of a Phred score on the Sanger scale +33
calc 'chr(32 + 33)'

# what is the error probabity of the best score on the Sanger scale
calc '10 ** - ((ord("I") - 33) / 10.0)'

# what is the error probabity of the worst score on the Sanger scale
calc '10 ** - ((ord("!") - 33) / 10.0)'

Lecture 8 - Tuning BLAST searches

# we assume you downloaded the test sequences that are now in the data folder
# these datasets have been obtained from the book BLAST by Ian Korf, Mark Yandell, Joseph Bedell
# but have been renamed to better indicate what they contain

# make a directory for the references
mkdir refs

# create a database for the expressed sequence tags
~/bin/makeblastdb -in data/est.collection.nuc.fa -out refs/est.collection -dbtype nucl -parse_seqids

# find out some information about our blast database
~/bin/blastdbcmd -db refs/est.collection -info

# let's figure out how to get an E-value from a Bit Score
#
# E = m * n * 2 ^ -S
#
# m = database size, n=query size, S=bitscore

# create a custom output to evaluate the E-values relative to alignments
~/bin/blastn -db refs/est.collection -task blastn  -query data/est.single-read.nuc.fa -outfmt '7 qseqid qlen sseqid slen bitscore length evalue qseq' | head

# with the information above let's check
echo "" | awk '{ E = 673456 * 322 * 2 ^-578; print E } '

# blast search strategies

# megablast is the default search strategy
~/bin/blastn -db refs/est.collection -task megablast -query data/est.single-read.nuc.fa -outfmt 7

# blastn is a slower, it will also produce many more results
~/bin/blastn -db refs/est.collection -task blastn  -query data/est.single-read.nuc.fa -outfmt 7

# lets verify a location that aligns on the reverse strand
~/bin/blastn -db refs/est.collection -task blastn  -query data/est.single-read.nuc.fa  | grep "Minus" -B 6 -A 5 | head -11

# look at the output and identify the locations for the Subj alignment
~/bin/blastdbcmd -db refs/est.collection -entry "AU109564" -range "118-141" -strand minus

# generate a protein blast database
~/bin/makeblastdb -in data/globins.collection.prot.fa -dbtype prot -out refs/globins.collection -parse_seqids

# BLAST Database creation error: Error: Duplicate seq_ids are found: LCL|Q17154
# really, let's see?
cat data/globins.collection.prot.fa | grep Q17154

# ok so how many duplicated sequences ids might be there?
cat data/globins.collection.prot.fa | grep ">" | awk '{ split($0, x, " "); print x[1] } ' | sort | uniq -c | sort -g

# holy smokes that is a lot of duplicates, now we need to write another program to fix it
# thanks a bunch BLAST book ...
# use the commands below or just not pass the -parse_seqids option
#
cat data/globins.collection.prot.fa | awk -f uniq-fasta.awk > data/globins.collection.prot.uniq.fa


# now we can finally create the database they way we wanted to
~/bin/makeblastdb -in data/globins.collection.prot.uniq.fa -dbtype prot -out refs/globins.collection -parse_seqids

# create protein alignment report
~/bin/blastp -db refs/globins.collection -query data/gamma2_globin.prot.fa | more

# blastx, nucleotide vs
~/bin/blastx -db refs/globins.collection -query data/fugu_globin.nuc.fa | more

# protein vs nucleotides
~/bin/tblastn -db refs/est.collection -query data/gamma2_globin.prot.fa | more

# create a database for the chicken globins
~/bin/makeblastdb -in data/chicken_globin.nuc.fa -dbtype nucl -out refs/chicken_globin

# run the tblastx
~/bin/tblastx -db refs/chicken_globin -query data/fugu_globin.nuc.fa | more

# pairwise alignment of two sequences, used to be called bl2seq
#
# compare the results of the following three alignments
#

# blastn, megablast
~/bin/blastn -query data/chicken_globin.nuc.fa -subject data/fugu_globin.nuc.fa

# blastn, blastn
~/bin/blastn -query data/chicken_globin.nuc.fa -subject data/fugu_globin.nuc.fa -task blastn

# tblastx
/bin/tblastx -query data/chicken_globin.nuc.fa -subject data/fugu_globin.nuc.fa

turns out we needed a method to keep unqiue entries from a fasta file:

# this pattern matches only at the beginning of each fasta record
/^>/ {

    # has the current id been seen before?
    not_seen = ! storage[$1]

    # well it is seen now so add it to the storage
    storage[$1] = 1
}

# this pattern matches all the time
{
    if (not_seen) print $0
}

Lecture 7 - Useful BLAST commands

# passing the -parse_seqids parameter creates a database with more information in it

# look at what files are created when you run
~/bin/makeblastdb -in ~/refs/yeast/sc.fa -dbtype nucl -out ~/refs/yeast/sc_simple

# versus running
~/bin/makeblastdb -in ~/refs/yeast/sc.fa -dbtype nucl -out ~/refs/yeast/sc_full -parse_seqids

# you now have two differently formatted blast databases all supporting the same sequence
# look at the files and verify they are where supposed to
ls ~/refs/yeast

# make a small query
cat ~/refs/yeast/sc.fa | head -3 > query.fa

# run two blast searches with each database
~/bin/blastn -db ~/refs/yeast/sc_simple -query query.fa

# run two blast searches with each database, can you spot the difference?
~/bin/blastn -db ~/refs/yeast/sc_full -query query.fa

# get help on the tools
~/bin/blastdbcmd -help

# blastdbcmd will work only on databases created with the -parse_seqids option
~/bin/blastdbcmd -db ~/refs/yeast/sc_full -entry 'all' -outfmt "%a" | head -3

# we could solve a previous homework on computing the genome size with this
~/bin/blastdbcmd -db ~/refs/yeast/sc_full -entry 'all' -outfmt "%l" | awk '{s+=$1; print s}'

# extract regions from your sequences
~/bin/blastdbcmd -db ~/refs/yeast/sc_full -entry 'all' -range '1-10'

# fetch the range for just one sequence
~/bin/blastdbcmd -db ~/refs/yeast/sc_full -entry 'chrV' -range '100-110'

# fetch multiple entries based on information in a file
~/bin/blastdbcmd -db ~/refs/yeast/sc_full -entry_batch ids.txt

# now make a smaller query file
cat sc.fa | head -2 > mini.fa

# blast the query
~/bin/blastn -db ~/refs/yeast/sc_full -query mini.fa -outfmt 7

# why did we get no hits?, low complexity filter
~/bin/blastn -db ~/refs/yeast/sc_full -query mini.fa -dust no -outfmt 7

# there can be additional formatting options
~/bin/blastn -db ~/refs/yeast/sc_full -query mini.fa -dust no -outfmt '7 qseqid sseqid mismatch'

# to include the standard plus additional options use std
~/bin/blastn -db ~/refs/yeast/sc_full -query mini.fa -dust no -outfmt '7 std send'

input file for the -entry_batch parameter:

chrI   100-110   plus
chrI   100-110   minus

blastdbcmd formatting:

-outfmt 
   Output format, where the available format specifiers are:
        %f means sequence in FASTA format
        %s means sequence data (without defline)
        %a means accession
        %g means gi
        %o means ordinal id (OID)
        %i means sequence id
        %t means sequence title
        %l means sequence length
        %h means sequence hash value
        %T means taxid
        %e means membership integer
        %L means common taxonomic name
        %S means scientific name
        %P means PIG
        %m means sequence masking data.
           Masking data will be displayed as a series of 'N-M' values
           separated by ';' or the word 'none' if none are available.
    If '%f' is specified, all other format specifiers are ignored.
    For every format except '%f', each line of output will correspond
    to a sequence.
   Default = `%f'
    * Incompatible with:  info, list, recursive, remove_redundant_dbs,
   list_outfmt, show_blastdb_search_path

Lecture 6 - Installing and Running BLAST

# download the blast executable from the NCBI site
# ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/LATEST
# unpack and move it to ~/src

# link the executables into the bin folder
cd ~/bin
ln -s ~/src/ncbi-blast-2.2.28+/bin/blastn blastn

# the alternative is to extend the system PATH variable to
# contain the new path, you can add this to your profile
export PATH=$PATH:~/src/ncbi-blast-2.2.28+/bin/

# get the data, the -o flag stores the result in the sc.gff
curl http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff -o sc.gff

# how many lines in this file
wc -l sc.gff

# at what line does the sequence start in this file
cat -n sc.gff | grep ">" | head -1

# how long does the tail need to be
echo '174984-22995' | bc -l

# looks like 151989, are we off by one? I guess

# we need to include one more line
# check that the file is indeed cut at the first FASTA record
cat sc.gff | tail -151990 | head -1

# put it into a separate file, we are going to use this file
cat sc.gff | tail -151990 > sc.fa

# collect your reference files in a separate directory
# when we build indices it also creates a large numbe of other files that would
# just litter your current working directory
mkdir -p ~/refs/yeast
cp sc.fa ~/refs/yeast

# create the blast database, we'll call this target
$ ~/bin/makeblastdb -in ~/refs/yeast/sc.fa -dbtype nucl -out ~/refs/yeast/sc

# whoa, look at all the new files it has created
ls -l ~/refs/yeast/

# the name of the blast database does not need to match the reference
$ ~/bin/makeblastdb -in ~/refs/yeast/sc.fa -dbtype nucl -out ~/refs/yeast/hello-kitty

# take a look what happened
ls -l ~/refs/yeast/

# create a small dataset, we'll call this query
cat sc.fa | head -3 > query.fa

# if one were using the >> redirection it will append to the file
# this below generates a file with two records
cat sc.fa | head -3 >> double.fa
cat sc.fa | head -3 >> double.fa

# to get help on blast tools pass the -h (short help) or --help (long help)
blastn -h

# long version of help
blastn --help

# run a blastn search of your query against the database
~/bin/blastn -db ~/refs/yeast/sc -query query.fa > results.txt

# investigate the output
more results.txt

# you may reformat the output to a tabular form
~/bin/blastn -db ~/refs/yeast/sc -query query.fa -outfmt 7 | head

Lecture 5 - String Matching

# get the FASTA file from the course webpage
curl http://www.personal.psu.edu/iua1/courses/files/2013/data/lec4.fa -o lec4.fa

# you can search your sequences for a pattern
# this only works correctly if the sequences are all on one line
cat lec4.fa | grep "TATATA" --color=always | head -3

# 'extended' grep, the . means anything
# three As followed by two anything, followed by three As
cat lec4.fa | egrep "AAA..AAA" --color=always | head -1

# look at the output and try to identify what has been matched and why below
# TGGAGTGAAAAGGAAAGACGGTCAGGAAATAGGTCGTGCTTTTGTATGGAAGGTCACCGAGGAAGTCACGGCTAACTACGTGCC

# match AAA followed by T or C followed by T or C followed by AAA
cat lec4.fa | egrep "AAA[TC][TC]AAA" --color=always | head -1

# match one AAA followed by one or more T or C followed by AAA
cat lec4.fa | egrep "AAA[TC]+AAA" --color=always | head -1

# match an A followed by three Ts
cat lec4.fa | egrep 'AT{3}' --color=always | head -1

# match three ATs in a row
cat lec4.fa | egrep "(AT){3}" --color=always | head -1

# match one ore more ATG followed by two Cs
cat lec4.fa | egrep '(ATG)+C{2}' --color=always | head -1

Lecture 4 - FASTA/FASTQ files

# get the FASTA file from the course webpage
#curl http://www.personal.psu.edu/iua1/courses/files/2013/data/lec4.fa -o lec4.fa

# how many lines in the file
cat lec4.fa | wc -l

# count the number of sequences
cat lec4.fa | grep ">" | wc -l

# look at a few lines
head lec4.fa

# there is a nice little calculator called bc
# you can also invoke bc on its own, use bc -l, to exit type ctrl-D or quit
echo "1+1" | bc -l

# you can search your sequences for a pattern
# this only works correctly if the sequences are all on one line
cat lec4.fa | grep "TATATA" | head

# we can use awk to compute the lengths of each sequence
cat lec4.fa | grep -v ">" | awk '{ print length($1) } ' | head

# what is the longest and/or shortest sequence?
cat lec4.fa | grep -v ">" | awk '{ print length($1) } ' | sort -gr | head

# how many sequences of a certain length are there
cat lec4.fa | grep -v ">" | awk '{ print length($1) } ' | sort -n | uniq -c | head

# sorting the length inversely gives you the longest sequences first
cat lec4.fa | grep -v ">" | awk '{ print length } ' | sort -nr | uniq -c | head

# we are told that these groups of sequences may have the same start
cat lec4.fa | grep -v ">" | awk '{ print substr($1, 1, 8) } ' | head

# produce a count for all the subsequences
cat lec4.fa | grep -v ">" | awk '{ print substr($1, 1, 8) } ' | sort | uniq -c | sort -gr | head

The NCBI Fasta description formats:

GenBank                           gi|gi-number|gb|accession|locus
EMBL Data Library                 gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan       gi|gi-number|dbj|accession|locus
NBRF PIR                          pir||entry
Protein Research Foundation       prf||name
SWISS-PROT                        sp|accession|name
Brookhaven Protein Data Bank (1)  pdb|entry|chain
Brookhaven Protein Data Bank (2)  entry:chain|PDBID|CHAIN|SEQUENCE
Patents                           pat|country|number
GenInfo Backbone Id               bbs|number
General database identifier       gnl|database|identifier
NCBI Reference Sequence           ref|accession|locus
Local Sequence identifier         lcl|identifier

Lecture 3 - Awk Basics

# write pipelines with clear inputs and outputs
cat features.gff | awk '{ print $3 }' | head -3

# the above is equivalent to
cat features.gff | cut -f 3 | head -3

# take the genes and show the last 9th column for each
cat features.gff | awk ' $3 == "gene" { print $9 }' | head -3

# you can put the above program into a separate file and run it that way
cat features.gff | awk -f genefilter.awk | head -3

# we now build an awk script to simplify our file
cat features.gff | awk -f namesplit.awk | head -3

# create your simplified gff file
cat features.gff | awk -f namesplit.awk > short.gff

# check your file
head -3 short.gff

# find the length of genes
cat short.gff | awk ' { size = $5 - $4 + 1; print $3, size; } ' | head -3

# find the total lenght of genes with BEGIN/END patterns
cat short.gff | awk -f totalsize.awk

The genefilter.awk file:

$3 == "gene" { print $9 }

The namesplit.awk file:

$3 == "gene" {

    # split column 9 by the ";" and place the result into x
    split($9, x, ";")

    # rename the second element for easier handling
    pair = x[2]

    # split the pair again by the "=" sign and place it into y
    split(pair, y, "=")

    # assign this for easier use
    name = y[2]

    # place the name into the second columnd of the file, place for the attributes
    print $1, $2, name, $4, $5, $6, $7, $8, "."
}

The totalsize.awk file:

BEGIN {
    size = 0
}

{
    curr = $5 - $4 + 1
    size = size + curr
}

END {
    print "Total size is", size
}

Lecture 2 - Unix Basics

# every line that starts with the # sign is a comment that won't be executed!

# what folder are we in, change to a different folder if necessary
pwd

# get the data, the -o flag stores the result in the sc.gff
curl http://downloads.yeastgenome.org/curation/chromosomal_feature/saccharomyces_cerevisiae.gff -o sc.gff

# check to see what files we have here
ls

# check to see how many lines does the file have, -l flag restricts to line numbers only
wc -l sc.gff

# look at the first ten lines
head sc.gff

# look at the last ten lines
tail sc.gff

# page through the file (your paging program may be called less instead, joke: more is less)
# press space to go to the next page, press ESC to exit the pager
more sc.gff

# find lines that match a pattern
grep YAL060W sc.gff

# redirect output into a new file
grep YAL060W sc.gff > match.gff

# how many lines does the new file have
wc -l match.gff

# pipe (connect) streams into one-another with the | character

# how many lines of the file matche the word gene
cat sc.gff | grep gene | wc -l

# how many genes on chromosome 6
cat sc.gff | grep genes | grep VI | wc -l

# how many non-dubious genes on chromosome 6, the -v flag makes grep match the opposite
cat sc.gff | grep genes | grep VI | grep  -v Dubious | wc -l

# This file contains both the features but also the sequence.
# Most tools can deal with GFF files that only contain features.
# Break the file into two sections, before the word FASTA and after the word FASTA

# adding a -n to the cat command makes it print out the line numbers
cat -n sc.gff | head

# Find out at what line does the the word FASTA appear
cat -n sc.gff | grep FASTA

# Take a note of that number and make the head command take that many lines
# while you are at it get rid of the comments in the file to make a purely tabular file
head -22994 sc.gff | grep -v '#' > features.gff

# from now on we'll be using only the features.gff file

# cut out the type column, that is column 3
cut -f 3 features.gff | head

# you can sort these values as well
cut -f 3 features.gff | sort | head

# the uniq command combines consecutively identical values
cut -f 3 features.gff | sort | uniq -c | head

Bash Customization

To customize your bash enviroment edit your the file named .profile (on a Mac) or .bashrc (on Linux).
Make sure tab-completion works. Never type full filenames. Instead type them partially then then tap the TAB, if nothing seems to happen double tap TAB to see all options.
Getting a pretty prompt (yes, it does look crazy):

export PS1='\[\e]0;\w\a\]\n\[\e[32m\]\u@\h \[\e[33m\]\w\[\e[0m\]\n\$ '

Adding the bin folder to your system path:

export PATH=$PATH:~/bin/

Useful Constructs

Numbering lines in a file pass the -n flag to cat (conCATenate):

cat -n filename.txt

# calc shortcut courtesy of TA Nick Stoler

# on a Mac
echo 'calc () { python -c "from math import *; print $1"; }' >> ~/.profile

# on Linux
echo 'calc () { python -c "from math import *; print $1"; }' >> ~/.bashrc

# you can then do
calc "1 + 100"

Miscellaneous

The %20 symbol is the HTML encoded space character.

IUPAC codes:

A   Adenine
C   Cytosine
G   Guanine
T   (or U)  Thymine (or Uracil)
R   A or G
Y   C or T
S   G or C
W   A or T
K   G or T
M   A or C
B   C or G or T
D   A or G or T
H   A or C or T
V   A or C or G
N   any base
. or -  gap

Note

Some examples use what some may call a useless cat (for amusing examples read the Useless Use of Cat Award). We don't quite agree with the sentinment with respect of their uselessness. Pipelines that clearly show the input on the left and output on the right are easier to read and understand plus are more re-usable.

Go back to Applied Bioinformatics, BMMB 597D, Fall 2013

Created by Istvan Albert • Last updated on Thursday, December 12, 2013 • Site powered by PyBlue

Algorithms and Bioinformatics

February 08, 2014

Code Repository for applied bioinformatics