The process of variant calling consists of finding variants that may be related to a particular population or trait. VCFs are standard formats for displaying sequence variation, such as SNPs, indels, and structural variants.
Variant call format (VCF) Header line:
#CHROM | Chromosome |
---|---|
POS | Co-ordinate - The start coordinate of the variant. the genome coordinate of the first base in the variant. Within a chromosome, VCF records are sorted in order of increasing position. |
ID | Identifier: a semicolon-separated list of marker identifiers. |
REF | Reference allele - The reference allele is whatever is found in the reference genome and is expressed as a sequence of one or more A/C/G/T nucleotides (e.g. “A” or “AAC”) |
ALT | Alternative allele - The alternative allele is the allele found in the sample you are studying and is expressed as a sequence of one or more A/C/G/T nucleotides (e.g. “A” or AAC”). If there is more than one alternate alleles, the field should be a comma-separated list of alternate alleles. |
QUAL | Alternative allele - The alternative allele is the allele found in the sample you are studying and is expressed as a sequence of one or more A/C/G/T nucleotides (e.g. “A” or “AAC”). If there is more than one alternate alleles, the field should be a comma-separated list of alternate alleles. |
FILTER | Pass/fail - If it passed quality filters. Either “PASS” or a semicolon-separated list of failed quality control filters. |
INFO | Further information - Allows you to provide further information on the variants. Keys in the INFO field can be defined in header lines above the table. |
FORMAT | Information about the following columns - The GT in the FORMAT column tells us to expect genotypes in the following columns. |
Other/optional | Individual identifier (optional) - The previous column told us to expect to see genotypes here. The genotype is in the form 0|1, where 0 indicates the reference allele and 1 indicates the alternative allele, i.e it is heterozygous. The vertical pipe | indicates that the genotype is phased, and is used to indicate which chromosome the alleles are on. If this is a slash /rather than a vertical pipe, it means we don’t know which chromosome they are on. |
Files needed: AOC1.bam, chr7.fa, SRR11498039.vcf
Using the Ubuntu terminal we can convert BAM files into VCF files:
conda conda install -c bioconda Samtools
conda install -c bioconda freebayes
conda install -c bioconda bcftools
Run comand to convert BAM into VCF
freebayes -f chr7.fa AOC1.bam >AOC1.vcf
- Using Bcftools, run:
bcftools mpileup -Ob -o AOC1_BCFtools.vcf -f chr7.fa AOC1.bam
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager)
BiocManager::install("VariantAnnotation")
BiocManager::install("Rsamtools")
install.packages("vcfR")
library("VariantAnnotation")
library("Rsubread")
library("Rsamtools")
library("vcfR")
library("R.filesets")
BAM1 <- "AOC1.bam"
refGenomeFile = "chr7.fa"
Bamindex <- indexBam(BAM1)
exactSNP(BAM1, isBAM = TRUE, refGenomeFile, SNPAnnotationFile = NULL, outputFile = paste0(BAM1, '.VCF'))
========== _____ _ _ ____ _____ ______ _____
===== / ____| | | | _ \| __ \| ____| /\ | __ \
===== | (___ | | | | |_) | |__) | |__ / \ | | | |
==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
Rsubread 2.8.2
//============================= exactSNP setting =============================\\
|| ||
|| Input file : AOC1.bam (BAM) ||
|| Output file : AOC1.bam.VCF ||
|| Reference genome : chr7.fa ||
|| Temp path : ||
|| ||
|| Threads : 1 ||
|| Min supporting reads : 1 ||
|| Min pct. of supporting reads : 0.0% ||
|| Min base quality score : 13 ||
|| Number of trimmed bases : 3 ||
|| ||
|| Q value cutoff : 10E-12.0 ||
|| P value upper bound : 0.00500 ||
|| Flanking windows size : 5 ||
|| ||
\\============================================================================//
//====================== Running (04-May-2022 16:43:38) ======================\\
|| ||
|| Split BAM file into .//temp-snps-010484-592EE877B03F-* ... ||
|| processed block chr7@15000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@30000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@45000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@60000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@75000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@90000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@105000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@120000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@135000000 by thread 1/1 [block number=0/1] ||
|| processed block chr7@150000000 by thread 1/1 [block number=0/1] ||
WARNING: Chromosome is shorter than expected: chr7
|| processed block chr7@159345788 by thread 1/1 [block number=1/1] ||
|| ||
|| Completed successfully. ||
|| ||
\\============================================================================//
//================================= Summary ==================================\\
|| ||
|| Processed reads : 7930 ||
|| Reported SNPs : 0 ||
|| Reported indels : 5 ||
|| ||
|| Running time : 0.2 minutes ||
|| ||
\\============================================================================//
class: VCFHeader
samples(0):
meta(2): comment fileformat
fixed(1): FILTER
info(6): DP BGMM ... INDEL SR
geno(0):
AOCzip <- bgzip(fl)
indexTabix(AOCzip, format = "vcf")
[1] "AOC1.bam.VCF.bgz.tbi"
This file “AOC1.bam.VCF.bgz” can now be uploaded to Emsemble VEP or to Genome Data Viewer (Options > Upload) to compare our findings to those from the VEP and see if we discovered new variants.
#Specify URL where file is stored
url <- 'https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.28_GRCh38.p13/GCA_000001405.28_GRCh38.p13_genomic.fna.gz'
#Specify the name you want to give to the downloaded file
destfile <- 'output_ref'
#Apply download.file function only if the file is not downloaded yet:
if( ! file.exists('output_ref') ) {
hg38 <- getFile()
} else {
hg38 <- 'download.file(url, destfile)'
}
The reference genome is downloaded to the working directory and its name is “output_ref”.
# Set your vcf file as an object:
srr <-'SRR11498039.vcf'
#Set the reference genome as an object
hg38 <- 'output_ref'
#Use readvcr() function to read your vcr file object and the reference genome file object together. Data are parsed into a VCF object with readVcf.
srr_vcf <- readVcf(srr, hg38)
#View VCF file
srr_vcf
class: CollapsedVCF
dim: 550924 1
rowRanges(vcf):
GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
info(vcf):
DataFrame with 17 columns: INDEL, IDV, IMF, DP, VDB, RPB, MQB, B...
info(header(vcf)):
Number Type Description
INDEL 0 Flag Indicates that the variant is an INDEL.
IDV 1 Integer Maximum number of reads supporting an indel
IMF 1 Float Maximum fraction of reads supporting an indel
DP 1 Integer Raw read depth
VDB 1 Float Variant Distance Bias for filtering splice...
RPB 1 Float Mann-Whitney U test of Read Position Bias ...
MQB 1 Float Mann-Whitney U test of Mapping Quality Bia...
BQB 1 Float Mann-Whitney U test of Base Quality Bias (...
MQSB 1 Float Mann-Whitney U test of Mapping Quality vs ...
SGB 1 Float Segregation based metric.
MQ0F 1 Float Fraction of MQ0 reads (smaller is better)
ICB 1 Float Inbreeding Coefficient Binomial test (bigg...
HOB 1 Float Bias in the number of HOMs number (smaller...
AC A Integer Allele count in genotypes for each ALT all...
AN 1 Integer Total number of alleles in called genotypes
DP4 4 Integer Number of high-quality ref-forward , ref-r...
MQ 1 Integer Average mapping quality
geno(vcf):
List of length 2: GT, PL
geno(header(vcf)):
Number Type Description
GT 1 String Genotype
PL G Integer List of Phred-scaled genotype likelihoods
This is a vcf class with a dimension (dim) of 550924 variants in 1 sample. - RwRanges(vcf) shows the GRanges with 5 metadata columns as: paramRangeID, REF, ALT, QUAL, FILTER - The info(vcf) slot shows the # of data frames (DFrame) – in this case 17 columns including: INDEL, IDV, IMF, DP, VDB, RPB, MQB, BQB, MQSB, SGB, MQ0F - The “geno” field shows the number of variables, which in this case are 2 variables: GT and PL
#Header information can be extracted from the VCF with header().
header(srr_vcf)
class: VCFHeader
samples(1): SRR11498039.bam
meta(5): fileformat reference samtoolsCommand samtoolsVersion
contig
fixed(2): FILTER ALT
info(17): INDEL IDV ... DP4 MQ
geno(2): GT PL
There is only one sample in this vcf file, 5 fields of meta information, 17 info fields and 2 geno fields.
Data can be further extracted using the named accessors:
There is 1 sample with the name “SRR11498039.bam”.
DataFrame with 2 rows and 3 columns
Number Type Description
<character> <character> <character>
GT 1 String Genotype
PL G Integer List of Phred-scaled..
DataFrame with 17 rows and 3 columns
Number Type Description
<character> <character> <character>
INDEL 0 Flag Indicates that the v..
IDV 1 Integer Maximum number of re..
IMF 1 Float Maximum fraction of ..
DP 1 Integer Raw read depth
VDB 1 Float Variant Distance Bia..
... ... ... ...
HOB 1 Float Bias in the number o..
AC A Integer Allele count in geno..
AN 1 Integer Total number of alle..
DP4 4 Integer Number of high-quali..
MQ 1 Integer Average mapping qual..
info(srr_vcf)
DataFrame with 550924 rows and 17 columns
INDEL IDV IMF DP
<logical> <integer> <numeric> <integer>
chr1:13613_T/A FALSE NA NA 1
chr1:13684_C/T FALSE NA NA 1
chr1:13813_T/G FALSE NA NA 1
chr1:13838_C/T FALSE NA NA 1
chr1:14522_G/A FALSE NA NA 42
... ... ... ... ...
chrUn_GL000218v1:87538_T/C FALSE NA NA 2
chrUn_GL000218v1:87674_C/T FALSE NA NA 3
chrUn_GL000218v1:87869_A/G FALSE NA NA 3
chrUn_GL000218v1:88759_C/G FALSE NA NA 10
chrUn_GL000218v1:88948_T/C FALSE NA NA 11
VDB RPB MQB BQB
<numeric> <numeric> <numeric> <numeric>
chr1:13613_T/A NA NA NA NA
chr1:13684_C/T NA NA NA NA
chr1:13813_T/G NA NA NA NA
chr1:13838_C/T NA NA NA NA
chr1:14522_G/A 0.0618676 0.00524668 0.26525 0.904692
... ... ... ... ...
chrUn_GL000218v1:87538_T/C 0.76000000 NA NA NA
chrUn_GL000218v1:87674_C/T 0.94000000 NA NA NA
chrUn_GL000218v1:87869_A/G 0.98000000 NA NA NA
chrUn_GL000218v1:88759_C/G 0.00106101 0 1 1.000000
chrUn_GL000218v1:88948_T/C 0.22000000 1 1 0.888889
MQSB SGB MQ0F ICB
<numeric> <numeric> <numeric> <numeric>
chr1:13613_T/A NA -0.379885 0 NA
chr1:13684_C/T NA -0.379885 0 NA
chr1:13813_T/G NA -0.379885 0 NA
chr1:13838_C/T NA -0.379885 0 NA
chr1:14522_G/A NA -0.651104 0 1
... ... ... ... ...
chrUn_GL000218v1:87538_T/C NA -0.453602 0 NA
chrUn_GL000218v1:87674_C/T NA -0.453602 0 NA
chrUn_GL000218v1:87869_A/G NA -0.453602 0 NA
chrUn_GL000218v1:88759_C/G 0.916482 -0.651104 0 1
chrUn_GL000218v1:88948_T/C 0.964642 -0.453602 0 1
HOB AC AN
<numeric> <IntegerList> <integer>
chr1:13613_T/A NA 2 2
chr1:13684_C/T NA 2 2
chr1:13813_T/G NA 2 2
chr1:13838_C/T NA 2 2
chr1:14522_G/A 0.5 1 2
... ... ... ...
chrUn_GL000218v1:87538_T/C NA 2 2
chrUn_GL000218v1:87674_C/T NA 2 2
chrUn_GL000218v1:87869_A/G NA 2 2
chrUn_GL000218v1:88759_C/G 0.5 1 2
chrUn_GL000218v1:88948_T/C 0.5 1 2
DP4 MQ
<IntegerList> <integer>
chr1:13613_T/A 0,0,1,... 50
chr1:13684_C/T 0,0,1,... 50
chr1:13813_T/G 0,0,0,... 50
chr1:13838_C/T 0,0,0,... 50
chr1:14522_G/A 30,0,8,... 30
... ... ...
chrUn_GL000218v1:87538_T/C 0,0,0,... 50
chrUn_GL000218v1:87674_C/T 0,0,2,... 26
chrUn_GL000218v1:87869_A/G 0,0,2,... 26
chrUn_GL000218v1:88759_C/G 2,0,5,... 50
chrUn_GL000218v1:88948_T/C 3,6,1,... 50
VCF file columns CHROM, POS, and ID are represented by Granges, and their information can be found with rowRanges(vcf):
rowRanges(srr_vcf)
GRanges object with 550924 ranges and 5 metadata columns:
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
chr1:13613_T/A chr1 13613 * |
chr1:13684_C/T chr1 13684 * |
chr1:13813_T/G chr1 13813 * |
chr1:13838_C/T chr1 13838 * |
chr1:14522_G/A chr1 14522 * |
... ... ... ... .
chrUn_GL000218v1:87538_T/C chrUn_GL000218v1 87538 * |
chrUn_GL000218v1:87674_C/T chrUn_GL000218v1 87674 * |
chrUn_GL000218v1:87869_A/G chrUn_GL000218v1 87869 * |
chrUn_GL000218v1:88759_C/G chrUn_GL000218v1 88759 * |
chrUn_GL000218v1:88948_T/C chrUn_GL000218v1 88948 * |
paramRangeID REF
<factor> <DNAStringSet>
chr1:13613_T/A NA T
chr1:13684_C/T NA C
chr1:13813_T/G NA T
chr1:13838_C/T NA C
chr1:14522_G/A NA G
... ... ...
chrUn_GL000218v1:87538_T/C NA T
chrUn_GL000218v1:87674_C/T NA C
chrUn_GL000218v1:87869_A/G NA A
chrUn_GL000218v1:88759_C/G NA C
chrUn_GL000218v1:88948_T/C NA T
ALT QUAL FILTER
<DNAStringSetList> <numeric> <character>
chr1:13613_T/A A 8.13869 .
chr1:13684_C/T T 8.13869 .
chr1:13813_T/G G 8.13869 .
chr1:13838_C/T T 8.13869 .
chr1:14522_G/A A 32.54720 .
... ... ... ...
chrUn_GL000218v1:87538_T/C C 37.41520 .
chrUn_GL000218v1:87674_C/T T 22.43910 .
chrUn_GL000218v1:87869_A/G G 9.88514 .
chrUn_GL000218v1:88759_C/G G 161.00000 .
chrUn_GL000218v1:88948_T/C C 7.12033 .
-------
seqinfo: 195 sequences from output_ref genome
To find information from the CHROM, POS, and ID fields of the VCF file, examine rowRanges:
GRanges object with 3 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
chr1:13613_T/A chr1 13613 * | NA
chr1:13684_C/T chr1 13684 * | NA
chr1:13813_T/G chr1 13813 * | NA
REF ALT QUAL
<DNAStringSet> <DNAStringSetList> <numeric>
chr1:13613_T/A T A 8.13869
chr1:13684_C/T C T 8.13869
chr1:13813_T/G T G 8.13869
FILTER
<character>
chr1:13613_T/A .
chr1:13684_C/T .
chr1:13813_T/G .
-------
seqinfo: 195 sequences from output_ref genome
Find the reference and alternative alele:
#reference alele from position 1 to 5
ref(srr_vcf)[1:5]
DNAStringSet object of length 5:
width seq
[1] 1 T
[2] 1 C
[3] 1 T
[4] 1 C
[5] 1 G
#alternative alele from position 1 to 5:
alt(srr_vcf)[1:5]
DNAStringSetList of length 5
[[1]] A
[[2]] T
[[3]] G
[[4]] T
[[5]] A
ALT is stored as DNAStringSetList and allows for multiple alternate aleles per variant
Genetype information obtained from the FORMAT fields is parsed into the geno slot. Each sample has distinct data and may have multiple values within it. As a result, the data is structured into matrices or arrays with rows representing variants and columns representing samples. A multidimensional array indicates multiple values per sample.
In our vcf sample all variables are matrices.
Now we’ll take a look at the List of Phred-scaled genotype likelihoods (PL) variable as an example:
DataFrame with 1 row and 3 columns
Number Type Description
<character> <character> <character>
PL G Integer List of Phred-scaled..
header() provides the variable definition and type for the variable.
We can see the dimention of the RO variable is a 550924x1 so it is unidimentional because we have only one sample.
PL[1:3,]
$`chr1:13613_T/A`
[1] 37 3 0
$`chr1:13684_C/T`
[1] 37 3 0
$`chr1:13813_T/G`
[1] 37 3 0
If we had more samples we could compute a five number summary of the minimum, lower-hinge (first quartile), median, upper-hinge (third quartile) and maximum using fivenum():
fivenum(PL)
We could also check the distribution of zero values:
And the distribution of non-zero values:
hist(PL[PL != 0], main='RO non-zero values', xlab='RO')
The info data are unique to the variant and the same across samples. All info variables are represented in a single DataFrame
info(srr_vcf)[1:4, 1:5]
DataFrame with 4 rows and 5 columns
INDEL IDV IMF DP VDB
<logical> <integer> <numeric> <integer> <numeric>
chr1:13613_T/A FALSE NA NA 1 NA
chr1:13684_C/T FALSE NA NA 1 NA
chr1:13813_T/G FALSE NA NA 1 NA
chr1:13838_C/T FALSE NA NA 1 NA
Variants with membership in dbSNP can be identified by using the appropriate SNPlocs package:
BiocManager::install("SNPlocs.Hsapiens.dbSNP151.GRCh38")
library("SNPlocs.Hsapiens.dbSNP151.GRCh38")
library("BSgenome")
Find which SNPlocs are installed:
[1] "SNPlocs.Hsapiens.dbSNP151.GRCh38"
Find available SNPlocs:
[1] "SNPlocs.Hsapiens.dbSNP.20101109"
[2] "SNPlocs.Hsapiens.dbSNP.20120608"
[3] "SNPlocs.Hsapiens.dbSNP141.GRCh38"
[4] "SNPlocs.Hsapiens.dbSNP142.GRCh37"
[5] "SNPlocs.Hsapiens.dbSNP144.GRCh37"
[6] "SNPlocs.Hsapiens.dbSNP144.GRCh38"
[7] "SNPlocs.Hsapiens.dbSNP149.GRCh38"
[8] "SNPlocs.Hsapiens.dbSNP150.GRCh38"
[9] "SNPlocs.Hsapiens.dbSNP151.GRCh38"
[10] "XtraSNPlocs.Hsapiens.dbSNP141.GRCh38"
[11] "XtraSNPlocs.Hsapiens.dbSNP144.GRCh37"
[12] "XtraSNPlocs.Hsapiens.dbSNP144.GRCh38"
snps <- SNPlocs.Hsapiens.dbSNP151.GRCh38
snpcount(snps)
1 2 3 4 5 6 7
46715974 50260448 41156733 39527627 37124420 34574891 33014286
8 9 10 11 12 13 14
31658425 26136032 27600374 28401195 27346775 20231351 18399702
15 16 17 18 19 20 21
17225735 19105548 16670331 16041550 12679024 13172762 7854326
22 X Y MT
8156852 23036227 442035 2239
When working with large VCF files it may be more efficient to read in subsets of the data. This can be accomplished by selecting genomic coordinates (ranges) or by specific fields from the VCF file.
More info on Variant Annotations: https://www.bioconductor.org/packages/devel/bioc/vignettes/VariantAnnotation/inst/doc/VariantAnnotation.pdf