Introduction

The process of variant calling consists of finding variants that may be related to a particular population or trait. VCFs are standard formats for displaying sequence variation, such as SNPs, indels, and structural variants.

Variant call format (VCF) Header line:

#CHROM	Chromosome
POS	Co-ordinate - The start coordinate of the variant. the genome coordinate of the first base in the variant. Within a chromosome, VCF records are sorted in order of increasing position.
ID	Identifier: a semicolon-separated list of marker identifiers.
REF	Reference allele - The reference allele is whatever is found in the reference genome and is expressed as a sequence of one or more A/C/G/T nucleotides (e.g. “A” or “AAC”)
ALT	Alternative allele - The alternative allele is the allele found in the sample you are studying and is expressed as a sequence of one or more A/C/G/T nucleotides (e.g. “A” or AAC”). If there is more than one alternate alleles, the field should be a comma-separated list of alternate alleles.
QUAL	Alternative allele - The alternative allele is the allele found in the sample you are studying and is expressed as a sequence of one or more A/C/G/T nucleotides (e.g. “A” or “AAC”). If there is more than one alternate alleles, the field should be a comma-separated list of alternate alleles.
FILTER	Pass/fail - If it passed quality filters. Either “PASS” or a semicolon-separated list of failed quality control filters.
INFO	Further information - Allows you to provide further information on the variants. Keys in the INFO field can be defined in header lines above the table.
FORMAT	Information about the following columns - The GT in the FORMAT column tells us to expect genotypes in the following columns.
Other/optional	Individual identifier (optional) - The previous column told us to expect to see genotypes here. The genotype is in the form 0\|1, where 0 indicates the reference allele and 1 indicates the alternative allele, i.e it is heterozygous. The vertical pipe \| indicates that the genotype is phased, and is used to indicate which chromosome the alleles are on. If this is a slash /rather than a vertical pipe, it means we don’t know which chromosome they are on.

Methods

Files needed: AOC1.bam, chr7.fa, SRR11498039.vcf

Using the Ubuntu terminal we can convert BAM files into VCF files:

Install the required packages:

conda conda install -c bioconda Samtools

conda install -c bioconda freebayes

conda install -c bioconda bcftools

Run comand to convert BAM into VCF
- Using freebayes, run:

freebayes -f chr7.fa AOC1.bam >AOC1.vcf

-   Using Bcftools, run:

bcftools mpileup -Ob -o AOC1_BCFtools.vcf -f chr7.fa AOC1.bam

Part 1 - Extracting Variants from a BAM file into VCF format and Analysing using Ensembl VEP and the NCBI Genome browser

1. Set working directory

setwd("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/6.BGA - Bioinformatics and Genome Analysis/week 3 - DNA-Seq - Detecting variants in sequencing data/PRACTICAL")
getwd()

2. Install packages and load libraries

if (!requireNamespace("BiocManager", quietly = TRUE)) 
  install.packages("BiocManager)
BiocManager::install("VariantAnnotation")
BiocManager::install("Rsamtools")

install.packages("vcfR")

library("VariantAnnotation")
library("Rsubread")
library("Rsamtools")
library("vcfR")
library("R.filesets")

3. Store the file path of the BAM file and Chr7 reference FASTA file as string objects in R:

BAM1 <- "AOC1.bam"
refGenomeFile = "chr7.fa"

4. Create a .bai index of the BAM file in your working directory:

Bamindex <- indexBam(BAM1)

4. Call the variants in the BAM file into VCF format and write it to a .VCF file in the working directory

exactSNP(BAM1, isBAM = TRUE,  refGenomeFile, SNPAnnotationFile = NULL, outputFile = paste0(BAM1, '.VCF'))


        ==========     _____ _    _ ____  _____  ______          _____  
        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 2.8.2

//============================= exactSNP setting =============================\\
||                                                                            ||
||                   Input file : AOC1.bam (BAM)                              ||
||                  Output file : AOC1.bam.VCF                                ||
||             Reference genome : chr7.fa                                     ||
||                    Temp path :                                             ||
||                                                                            ||
||                      Threads : 1                                           ||
||         Min supporting reads : 1                                           ||
|| Min pct. of supporting reads : 0.0%                                        ||
||       Min base quality score : 13                                          ||
||      Number of trimmed bases : 3                                           ||
||                                                                            ||
||               Q value cutoff : 10E-12.0                                    ||
||          P value upper bound : 0.00500                                     ||
||        Flanking windows size : 5                                           ||
||                                                                            ||
\\============================================================================//

//====================== Running (04-May-2022 16:43:38) ======================\\
||                                                                            ||
|| Split BAM file into .//temp-snps-010484-592EE877B03F-* ...                 ||
|| processed block chr7@15000000 by thread 1/1 [block number=0/1]             ||
|| processed block chr7@30000000 by thread 1/1 [block number=0/1]             ||
|| processed block chr7@45000000 by thread 1/1 [block number=0/1]             ||
|| processed block chr7@60000000 by thread 1/1 [block number=0/1]             ||
|| processed block chr7@75000000 by thread 1/1 [block number=0/1]             ||
|| processed block chr7@90000000 by thread 1/1 [block number=0/1]             ||
|| processed block chr7@105000000 by thread 1/1 [block number=0/1]            ||
|| processed block chr7@120000000 by thread 1/1 [block number=0/1]            ||
|| processed block chr7@135000000 by thread 1/1 [block number=0/1]            ||
|| processed block chr7@150000000 by thread 1/1 [block number=0/1]            ||
WARNING: Chromosome is shorter than expected: chr7
|| processed block chr7@159345788 by thread 1/1 [block number=1/1]            ||
||                                                                            ||
||                          Completed successfully.                           ||
||                                                                            ||
\\============================================================================//

//================================= Summary ==================================\\
||                                                                            ||
||              Processed reads : 7930                                        ||
||                Reported SNPs : 0                                           ||
||              Reported indels : 5                                           ||
||                                                                            ||
||                 Running time : 0.2 minutes                                 ||
||                                                                            ||
\\============================================================================//

5. Read the newly generated VCF file into RStudio and check the header

fl <- "AOC1.bam.VCF"
VCF1 <- readVcf(fl, refGenomeFile)
header(VCF1)

class: VCFHeader 
samples(0):
meta(2): comment fileformat
fixed(1): FILTER
info(6): DP BGMM ... INDEL SR
geno(0):

6. Compress the VCF file using bgzip and then create a VCF file index.

AOCzip <- bgzip(fl)

indexTabix(AOCzip, format = "vcf")

[1] "AOC1.bam.VCF.bgz.tbi"

This file “AOC1.bam.VCF.bgz” can now be uploaded to Emsemble VEP or to Genome Data Viewer (Options > Upload) to compare our findings to those from the VEP and see if we discovered new variants.

Part 2 - Annotating and filtering genetic variants using the VariantAnnotation package

1. Download reference genome with download.file function

#Specify URL where file is stored
url <- 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz'

#Specify the name you want to give to the downloaded file 
destfile <- 'output_ref'

#Apply download.file function only if the file is not downloaded yet:
if( ! file.exists('output_ref') ) {
  hg38 <- getFile()
  } else { 
  hg38 <- 'download.file(url, destfile)'
}

The reference genome is downloaded to the working directory and its name is “output_ref”.

2. Read the VCF file and reference genome with readvcf function

# Set your vcf file as an object:
srr <-'SRR11498039.vcf'

#Set the reference genome as an object
hg38 <- 'output_ref'

#Use readvcr() function to read your vcr file object and the reference genome file object together. Data are parsed into a VCF object with readVcf.
srr_vcf <- readVcf(srr, hg38)


#View VCF file
srr_vcf

class: CollapsedVCF 
dim: 550924 1 
rowRanges(vcf):
  GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
  DataFrame with 17 columns: INDEL, IDV, IMF, DP, VDB, RPB, MQB, B...
info(header(vcf)):
         Number Type    Description                                  
   INDEL 0      Flag    Indicates that the variant is an INDEL.      
   IDV   1      Integer Maximum number of reads supporting an indel  
   IMF   1      Float   Maximum fraction of reads supporting an indel
   DP    1      Integer Raw read depth                               
   VDB   1      Float   Variant Distance Bias for filtering splice...
   RPB   1      Float   Mann-Whitney U test of Read Position Bias ...
   MQB   1      Float   Mann-Whitney U test of Mapping Quality Bia...
   BQB   1      Float   Mann-Whitney U test of Base Quality Bias (...
   MQSB  1      Float   Mann-Whitney U test of Mapping Quality vs ...
   SGB   1      Float   Segregation based metric.                    
   MQ0F  1      Float   Fraction of MQ0 reads (smaller is better)    
   ICB   1      Float   Inbreeding Coefficient Binomial test (bigg...
   HOB   1      Float   Bias in the number of HOMs number (smaller...
   AC    A      Integer Allele count in genotypes for each ALT all...
   AN    1      Integer Total number of alleles in called genotypes  
   DP4   4      Integer Number of high-quality ref-forward , ref-r...
   MQ    1      Integer Average mapping quality                      
geno(vcf):
  List of length 2: GT, PL
geno(header(vcf)):
      Number Type    Description                              
   GT 1      String  Genotype                                 
   PL G      Integer List of Phred-scaled genotype likelihoods

This is a vcf class with a dimension (dim) of 550924 variants in 1 sample. - RwRanges(vcf) shows the GRanges with 5 metadata columns as: paramRangeID, REF, ALT, QUAL, FILTER - The info(vcf) slot shows the # of data frames (DFrame) – in this case 17 columns including: INDEL, IDV, IMF, DP, VDB, RPB, MQB, BQB, MQSB, SGB, MQ0F - The “geno” field shows the number of variables, which in this case are 2 variables: GT and PL

3. Examine header information

#Header information can be extracted from the VCF with header().
header(srr_vcf)

class: VCFHeader 
samples(1): SRR11498039.bam
meta(5): fileformat reference samtoolsCommand samtoolsVersion
  contig
fixed(2): FILTER ALT
info(17): INDEL IDV ... DP4 MQ
geno(2): GT PL

There is only one sample in this vcf file, 5 fields of meta information, 17 info fields and 2 geno fields.

Data can be further extracted using the named accessors:

samples(header(srr_vcf))

[1] "SRR11498039.bam"

There is 1 sample with the name “SRR11498039.bam”.

geno(header(srr_vcf))

DataFrame with 2 rows and 3 columns
        Number        Type            Description
   <character> <character>            <character>
GT           1      String               Genotype
PL           G     Integer List of Phred-scaled..

info(header(srr_vcf))

DataFrame with 17 rows and 3 columns
           Number        Type            Description
      <character> <character>            <character>
INDEL           0        Flag Indicates that the v..
IDV             1     Integer Maximum number of re..
IMF             1       Float Maximum fraction of ..
DP              1     Integer         Raw read depth
VDB             1       Float Variant Distance Bia..
...           ...         ...                    ...
HOB             1       Float Bias in the number o..
AC              A     Integer Allele count in geno..
AN              1     Integer Total number of alle..
DP4             4     Integer Number of high-quali..
MQ              1     Integer Average mapping qual..

info(srr_vcf)

DataFrame with 550924 rows and 17 columns
                               INDEL       IDV       IMF        DP
                           <logical> <integer> <numeric> <integer>
chr1:13613_T/A                 FALSE        NA        NA         1
chr1:13684_C/T                 FALSE        NA        NA         1
chr1:13813_T/G                 FALSE        NA        NA         1
chr1:13838_C/T                 FALSE        NA        NA         1
chr1:14522_G/A                 FALSE        NA        NA        42
...                              ...       ...       ...       ...
chrUn_GL000218v1:87538_T/C     FALSE        NA        NA         2
chrUn_GL000218v1:87674_C/T     FALSE        NA        NA         3
chrUn_GL000218v1:87869_A/G     FALSE        NA        NA         3
chrUn_GL000218v1:88759_C/G     FALSE        NA        NA        10
chrUn_GL000218v1:88948_T/C     FALSE        NA        NA        11
                                  VDB        RPB       MQB       BQB
                            <numeric>  <numeric> <numeric> <numeric>
chr1:13613_T/A                     NA         NA        NA        NA
chr1:13684_C/T                     NA         NA        NA        NA
chr1:13813_T/G                     NA         NA        NA        NA
chr1:13838_C/T                     NA         NA        NA        NA
chr1:14522_G/A              0.0618676 0.00524668   0.26525  0.904692
...                               ...        ...       ...       ...
chrUn_GL000218v1:87538_T/C 0.76000000         NA        NA        NA
chrUn_GL000218v1:87674_C/T 0.94000000         NA        NA        NA
chrUn_GL000218v1:87869_A/G 0.98000000         NA        NA        NA
chrUn_GL000218v1:88759_C/G 0.00106101          0         1  1.000000
chrUn_GL000218v1:88948_T/C 0.22000000          1         1  0.888889
                                MQSB       SGB      MQ0F       ICB
                           <numeric> <numeric> <numeric> <numeric>
chr1:13613_T/A                    NA -0.379885         0        NA
chr1:13684_C/T                    NA -0.379885         0        NA
chr1:13813_T/G                    NA -0.379885         0        NA
chr1:13838_C/T                    NA -0.379885         0        NA
chr1:14522_G/A                    NA -0.651104         0         1
...                              ...       ...       ...       ...
chrUn_GL000218v1:87538_T/C        NA -0.453602         0        NA
chrUn_GL000218v1:87674_C/T        NA -0.453602         0        NA
chrUn_GL000218v1:87869_A/G        NA -0.453602         0        NA
chrUn_GL000218v1:88759_C/G  0.916482 -0.651104         0         1
chrUn_GL000218v1:88948_T/C  0.964642 -0.453602         0         1
                                 HOB            AC        AN
                           <numeric> <IntegerList> <integer>
chr1:13613_T/A                    NA             2         2
chr1:13684_C/T                    NA             2         2
chr1:13813_T/G                    NA             2         2
chr1:13838_C/T                    NA             2         2
chr1:14522_G/A                   0.5             1         2
...                              ...           ...       ...
chrUn_GL000218v1:87538_T/C        NA             2         2
chrUn_GL000218v1:87674_C/T        NA             2         2
chrUn_GL000218v1:87869_A/G        NA             2         2
chrUn_GL000218v1:88759_C/G       0.5             1         2
chrUn_GL000218v1:88948_T/C       0.5             1         2
                                     DP4        MQ
                           <IntegerList> <integer>
chr1:13613_T/A                 0,0,1,...        50
chr1:13684_C/T                 0,0,1,...        50
chr1:13813_T/G                 0,0,0,...        50
chr1:13838_C/T                 0,0,0,...        50
chr1:14522_G/A                30,0,8,...        30
...                                  ...       ...
chrUn_GL000218v1:87538_T/C     0,0,0,...        50
chrUn_GL000218v1:87674_C/T     0,0,2,...        26
chrUn_GL000218v1:87869_A/G     0,0,2,...        26
chrUn_GL000218v1:88759_C/G     2,0,5,...        50
chrUn_GL000218v1:88948_T/C     3,6,1,...        50

VCF file columns CHROM, POS, and ID are represented by Granges, and their information can be found with rowRanges(vcf):

rowRanges(srr_vcf)

GRanges object with 550924 ranges and 5 metadata columns:
                                     seqnames    ranges strand |
                                        <Rle> <IRanges>  <Rle> |
              chr1:13613_T/A             chr1     13613      * |
              chr1:13684_C/T             chr1     13684      * |
              chr1:13813_T/G             chr1     13813      * |
              chr1:13838_C/T             chr1     13838      * |
              chr1:14522_G/A             chr1     14522      * |
                         ...              ...       ...    ... .
  chrUn_GL000218v1:87538_T/C chrUn_GL000218v1     87538      * |
  chrUn_GL000218v1:87674_C/T chrUn_GL000218v1     87674      * |
  chrUn_GL000218v1:87869_A/G chrUn_GL000218v1     87869      * |
  chrUn_GL000218v1:88759_C/G chrUn_GL000218v1     88759      * |
  chrUn_GL000218v1:88948_T/C chrUn_GL000218v1     88948      * |
                             paramRangeID            REF
                                 <factor> <DNAStringSet>
              chr1:13613_T/A           NA              T
              chr1:13684_C/T           NA              C
              chr1:13813_T/G           NA              T
              chr1:13838_C/T           NA              C
              chr1:14522_G/A           NA              G
                         ...          ...            ...
  chrUn_GL000218v1:87538_T/C           NA              T
  chrUn_GL000218v1:87674_C/T           NA              C
  chrUn_GL000218v1:87869_A/G           NA              A
  chrUn_GL000218v1:88759_C/G           NA              C
  chrUn_GL000218v1:88948_T/C           NA              T
                                            ALT      QUAL      FILTER
                             <DNAStringSetList> <numeric> <character>
              chr1:13613_T/A                  A   8.13869           .
              chr1:13684_C/T                  T   8.13869           .
              chr1:13813_T/G                  G   8.13869           .
              chr1:13838_C/T                  T   8.13869           .
              chr1:14522_G/A                  A  32.54720           .
                         ...                ...       ...         ...
  chrUn_GL000218v1:87538_T/C                  C  37.41520           .
  chrUn_GL000218v1:87674_C/T                  T  22.43910           .
  chrUn_GL000218v1:87869_A/G                  G   9.88514           .
  chrUn_GL000218v1:88759_C/G                  G 161.00000           .
  chrUn_GL000218v1:88948_T/C                  C   7.12033           .
  -------
  seqinfo: 195 sequences from output_ref genome

4. Genomic positions

To find information from the CHROM, POS, and ID fields of the VCF file, examine rowRanges:

head(rowRanges(srr_vcf), 3)

GRanges object with 3 ranges and 5 metadata columns:
                 seqnames    ranges strand | paramRangeID
                    <Rle> <IRanges>  <Rle> |     <factor>
  chr1:13613_T/A     chr1     13613      * |           NA
  chr1:13684_C/T     chr1     13684      * |           NA
  chr1:13813_T/G     chr1     13813      * |           NA
                            REF                ALT      QUAL
                 <DNAStringSet> <DNAStringSetList> <numeric>
  chr1:13613_T/A              T                  A   8.13869
  chr1:13684_C/T              C                  T   8.13869
  chr1:13813_T/G              T                  G   8.13869
                      FILTER
                 <character>
  chr1:13613_T/A           .
  chr1:13684_C/T           .
  chr1:13813_T/G           .
  -------
  seqinfo: 195 sequences from output_ref genome

Find the reference and alternative alele:

#reference alele from position 1 to 5
ref(srr_vcf)[1:5]

DNAStringSet object of length 5:
    width seq
[1]     1 T
[2]     1 C
[3]     1 T
[4]     1 C
[5]     1 G

#alternative alele from position 1 to 5:
alt(srr_vcf)[1:5]

DNAStringSetList of length 5
[[1]] A
[[2]] T
[[3]] G
[[4]] T
[[5]] A

ALT is stored as DNAStringSetList and allows for multiple alternate aleles per variant

5. Find specific data

Genetype information obtained from the FORMAT fields is parsed into the geno slot. Each sample has distinct data and may have multiple values within it. As a result, the data is structured into matrices or arrays with rows representing variants and columns representing samples. A multidimensional array indicates multiple values per sample.

geno (srr_vcf)

List of length 2
names(2): GT PL

sapply (geno(srr_vcf), class)

     GT       PL      
[1,] "matrix" "matrix"
[2,] "array"  "array"

In our vcf sample all variables are matrices.

Now we’ll take a look at the List of Phred-scaled genotype likelihoods (PL) variable as an example:

#header provides the variable definition and type
geno(header(srr_vcf))['PL',]

DataFrame with 1 row and 3 columns
        Number        Type            Description
   <character> <character>            <character>
PL           G     Integer List of Phred-scaled..

header() provides the variable definition and type for the variable.

PL<- geno(srr_vcf)$PL

dim(PL)

[1] 550924      1

We can see the dimention of the RO variable is a 550924x1 so it is unidimentional because we have only one sample.

PL[1:3,]

$`chr1:13613_T/A`
[1] 37  3  0

$`chr1:13684_C/T`
[1] 37  3  0

$`chr1:13813_T/G`
[1] 37  3  0

If we had more samples we could compute a five number summary of the minimum, lower-hinge (first quartile), median, upper-hinge (third quartile) and maximum using fivenum():

fivenum(PL)

We could also check the distribution of zero values:

length(which(PL==0))/length(PL)

And the distribution of non-zero values:

hist(PL[PL != 0], main='RO non-zero values', xlab='RO')

The info data are unique to the variant and the same across samples. All info variables are represented in a single DataFrame

info(srr_vcf)[1:4, 1:5]

DataFrame with 4 rows and 5 columns
                   INDEL       IDV       IMF        DP       VDB
               <logical> <integer> <numeric> <integer> <numeric>
chr1:13613_T/A     FALSE        NA        NA         1        NA
chr1:13684_C/T     FALSE        NA        NA         1        NA
chr1:13813_T/G     FALSE        NA        NA         1        NA
chr1:13838_C/T     FALSE        NA        NA         1        NA

6. Compare quality measures between novel (not in dbSNP) and known (in dbSNP) variants and the variant type present in the file.

Variants with membership in dbSNP can be identified by using the appropriate SNPlocs package:

BiocManager::install("SNPlocs.Hsapiens.dbSNP151.GRCh38")
library("SNPlocs.Hsapiens.dbSNP151.GRCh38")
library("BSgenome")

Find which SNPlocs are installed:

installed.SNPs()

[1] "SNPlocs.Hsapiens.dbSNP151.GRCh38"

Find available SNPlocs:

available.SNPs()

 [1] "SNPlocs.Hsapiens.dbSNP.20101109"     
 [2] "SNPlocs.Hsapiens.dbSNP.20120608"     
 [3] "SNPlocs.Hsapiens.dbSNP141.GRCh38"    
 [4] "SNPlocs.Hsapiens.dbSNP142.GRCh37"    
 [5] "SNPlocs.Hsapiens.dbSNP144.GRCh37"    
 [6] "SNPlocs.Hsapiens.dbSNP144.GRCh38"    
 [7] "SNPlocs.Hsapiens.dbSNP149.GRCh38"    
 [8] "SNPlocs.Hsapiens.dbSNP150.GRCh38"    
 [9] "SNPlocs.Hsapiens.dbSNP151.GRCh38"    
[10] "XtraSNPlocs.Hsapiens.dbSNP141.GRCh38"
[11] "XtraSNPlocs.Hsapiens.dbSNP144.GRCh37"
[12] "XtraSNPlocs.Hsapiens.dbSNP144.GRCh38"

snps <- SNPlocs.Hsapiens.dbSNP151.GRCh38
snpcount(snps)

       1        2        3        4        5        6        7 
46715974 50260448 41156733 39527627 37124420 34574891 33014286 
       8        9       10       11       12       13       14 
31658425 26136032 27600374 28401195 27346775 20231351 18399702 
      15       16       17       18       19       20       21 
17225735 19105548 16670331 16041550 12679024 13172762  7854326 
      22        X        Y       MT 
 8156852 23036227   442035     2239

When working with large VCF files it may be more efficient to read in subsets of the data. This can be accomplished by selecting genomic coordinates (ranges) or by specific fields from the VCF file.

More info on Variant Annotations: https://www.bioconductor.org/packages/devel/bioc/vignettes/VariantAnnotation/inst/doc/VariantAnnotation.pdf

Variant Call Analysis