Biopython Sequence Annotation

Beatriz Manso

1. Set Working directory and install required libraries

The "nbconvert" tool allows you to convert a Jupyter .ipynb notebook document file into another static format including HTML, LaTeX, PDF, Markdown, reStructuredText, and more. nbconvert can also add productivity to your workflow when used to execute notebooks programmatically.

2. The SeqRecord object

Creating a SeqRecord:

Pass the id, name and description to the initialization function, but if not they will be set as strings indicating they are unknown, and can be modified subsequently:

Pass the ID - AC12345:

Pass the description:

Get the sequence:

Including an identifier is very important if you want to output your SeqRecord to a file. You would normally include this when creating the object:

3. Annotating sequences

.annotations[] is used for any miscellaneous annotations that doesn’t fit under one of the other more specific attributes. Adding annotations is easy, and just involves dealing directly with the annotation dictionary:

letter_annotations[] is another dictionary that lets you assign any Python sequence (i.e. a string, list or tuple) which has the same length as the sequence:

Translate the sequence:

4. SeqRecord objects from FASTA files

NC 005816.fna - Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1 (from NCBI)

Have a look at the key attributes of this SeqRecord individually – starting with the seq attribute which gives you a Seq object:

Identifiers and description:

The first word of the FASTA record’s title line is used for both the id and name attributes. The whole title line (after removing the greater than symbol) is used for the record description.

The /db_xref qualifier serves as a vehicle for linking DNA sequence records to other external databases:

5. SeqRecord objects from GenBank files

Now we will look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, but this time as a GenBank file.

Get the sequence:

GenBank files don’t have any per-letter annotations:

All the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list.

6. Feature, location and position objects

SeqFeature objects

Biopython SeqFeature class attempts to encapsulate as much of the information about the sequence as possible. The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you’ll probably have an easier time grasping the structure of the Biopython classes.

Positions and locations

Position – This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions. Location – A location is region of sequence bounded by some positions. For instance 5..20 (i. ‍e. ‍5 to 20) is a location.

Fuzzy Positions

There are several types of fuzzy positions, so we have five classes do deal with them:

- ExactPosition - Represents a position which is specified as exact along the sequence. This is represented as just a number, and you can get the position by looking at the position attribute of the object.

- BeforePosition - Represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like "<13", signifying that the real position is located somewhere less than 13. To get the specified upper boundary, look at the position attribute of the object.

- AfterPosition - Contrary to BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as ">13", and like BeforePosition, you get the boundary number by looking at the position attribute of the object.

- WithinPosition - Occasionally used for GenBank/EMBL locations, this class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as ‘(1.5)’, to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.

- OneOfPosition - Occasionally used for GenBank/EMBL locations, this class deals with a position where several possible values exist, for instance you could use this if the start codon was unclear and there where two candidates for the start of the gene. Alternatively, that might be handled explicitly as two related gene features.

- UnknownPosition - This class deals with a position of unknown location. This is not used in GenBank/EMBL, but corresponds to the ‘?’ feature coordinate used in UniProt.

Create a location with fuzzy end points:

Access the fuzzy start and end positions using the start and end attributes of the location:

If you just want numbers, they are actually subclasses of integers so should work like integers:

For older versions of Biopython ask for the nofuzzy_start and nofuzzy_end attributes of the location which are plain integers:

You can just pass in numbers to the FeaturePosition constructors, and you’ll get back out ExactPosition objects:

Location testing

Python keyword in with a SeqFeature or location object lets us see if the base/residue for a parent coordinate is within the feature/location or not.

Suppose you have a SNP of interest and you want to know which features this SNP is within, and lets suppose this SNP is at index 4350 (Python counting!). Here is a simple brute force solution where we just check all the features one by one in a loop:

gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons – they do not cover any introns.

Sequence described by a feature or location

A SeqFeature or location object doesn’t directly contain a sequence, instead the location describes how to get this from the parent sequence. For example consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(6..18), like this:

You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement. If you are using Biopython 1.59 or later, the feature location’s start and end are integer like so this works

SeqFeature object has an extract method to take care of all this (and since Biopython 1.78 can handle trans-splicing by supplying a dictionary of referenced sequences):

The length of a SeqFeature or location matches that of the region of sequence it describes.

For simple FeatureLocation objects the length is just the difference between the start and end positions. However, for a CompoundLocation the length is the sum of the constituent regions.

Comparison

What happens when you try to compare these “identical” records?

A single equal mark is used to assign a value to a variable, whereas two consecutive equal marks is used to check whether 2 expressions give the same value.

The format() method

format() method of the SeqRecord class gives a string containing your record formatted using one of the output file formats supported by Bio.SeqIO, such as FASTA:

This format method takes a single mandatory argument, a lower case string which is supported by Bio.SeqIO as an output format. However, some of the file formats Bio.SeqIO can write to require more than one record (typically the case for multiple sequence alignment formats), and thus won’t work via this format() method.

Slicing a SeqRecord

Slice a SeqRecord, to give you a new SeqRecord covering just part of the sequence. What is important here is that any per-letter annotations are also sliced, and any features which fall completely within the new sequence are preserved (with their locations adjusted).

From looking at the file you can work out that these are the twelfth and thirteenth entries in the file, so in Python zero-based counting they are entries 11 and 12 in the features list:

Slice this parent record from 4300 to 4800 (enough to include the pim gene/CDS), and see how many features we get:

Our sub-record just has two features, the gene and CDS entries for YP_pPCP05:

The locations have been adjusted to reflect the new parent sequence!

While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotation it is impossible to know if this still applies to the sub-sequence or not. To avoid guessing, with the exception of the molecule type, the .annotations and .dbxrefs are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.

You may wish to preserve other entries like the organism? Beware of copying the entire annotations dictionary as in this case your partial sequence is no longer circular DNA - it is now linear:

The same point could be made about the record id, name and description, but for practicality these are preserved:

This illustrates the problem nicely though, our new sub-record is not the complete sequence of the plasmid, so the description is wrong! Let’s fix this and then view the sub-record as a reduced GenBank file using the format method described before

Adding SeqRecord objects

Add SeqRecord objects together, giving a new SeqRecord. What is important here is that any common per-letter annotations are also added, all the features are preserved (with their locations adjusted), and any other common annotation is also kept (like the id, name and description).

We can make a new edited record by first slicing the SeqRecord before and after

You can make this shorter with just:

Example with features, we’ll use a GenBank file. Suppose you have a circular genome:

You can shift the origin like this:

SeqRecord slicing step is cautious in what annotation it preserves (erroneously propagating annotation can cause major problems). If you want to keep the database cross references or the annotations dictionary, this must be done explicitly:

Reverse-complementing SeqRecord objects

SeqRecord object’s reverse_complement method takes a number of optional arguments corresponding to properties of the record. Setting these arguments to True means copy the old values, while False means drop the old values and use the default value. You can alternatively provide the new desired value instead.

Here we take the reverse complement and specify a new identifier – but notice how most of the annotation is dropped (but not the features):