BLAST and MSA in Biopython

Beatriz Manso

BLAST (Basic Local Alignment Search Tool) is a bioinformatics tool designed to help us understand the relatioship between any query sequences by providing rapid sequence comparison of a query sequence against a database. Depending on the experiment, you can design use BLASTP, BLASTN, DELTABLAST, MMHER etc.

This practical provides the learning materials to develop the skills for writing CLI scripts to integrate BLAST searches into your pipelines.

We will use these four objects in Bio.SearchIO to: read, parse, index, or index_db.

1. Query Result

The QueryResult object represents a single search query and contains zero or more Hit objects.

Check how many items (hits) a QueryResult has:

Retrieve items (hits) from a QueryResult using the slice notation:

To retrieve multiple hits, you can slice QueryResult objects using the slice notation as well.

In this case, the slice will return a new QueryResult object containing only the sliced hits:

To retrieve hits using the hit’s ID. This is particularly useful if you know a given hit ID exists within a search query results:

Get a full list of Hit objects using hits and a full list of Hit IDs using hit_keys:

Check whether a particular hit is present in the query results: use "in" keyword

Check whether a hit is present is not enough; and show the rank of the hit: use the index method

Sort the results using QueryResult.sort based on each hit’s full sequence length.

Using hit_filter to filter out Hit objects that only have one HSP:

To filter and map your QueryResult objects you can use the hit_filter, hsp_filter and map, QueryResult methods. These methods filter its QueryResult object either on its Hit objects or HSP objects. The map, QueryResult objects provide the hit_map and hsp_map methods. These methods apply a given function to all hits or HSPs in a QueryResult object, respectively. The method accepts a callback function that checks whether a given Hit object passes the condition you set as True or False.

Use map methods to return a modified Hit or HSP object (depending on whether you’re using hit_map or hsp_map).

Example: use hit_map to rename the hit IDs:

2. Hit

Hit objects represent all query results from a single database entry. They are the second-level container in the Bio.SearchIO object. They are contained by QueryResult objects, but the Hit object in turn contains HSP objects.

a) BLAST search:

b) BLAT search:

The previously BLAT search provided one hit with 17 HSPs.

There are some differences with the BLAST:

• The e-value and bit score column values. As BLAT HSPs do not have evalues and bit scores, the display defaults to ‘?’.

• The span values is meant to display the complete alignment length, which consists of all residues and any gaps that may be present. The PSL format do not have this information readily available and Bio.SearchIO does not attempt to try guess what it is, so we get a ‘?’ similar to the e-value and bit score columns.

Use Hit objects to iterate and return one HSP object:

Use len on a Hit to see how many HSP objects it has:

Use the slice notation on Hit objects, to retrieve single HSP or multiple HSP objects.

3.HSP

HSP (high-scoring pair) represents region(s) in the hit sequence that contains significant alignment(s) to the query sequence. It contains the actual match between your query sequence and a database entry. As this match is determined by the sequence search tool’s algorithms, the HSP object contains the bulk of the statistics computed by the search tool. This also makes the distinction between HSP objects from different search tools more apparent compared to the differences you’ve seen in QueryResult or Hit objects.

a) BAST for HSP

Similar to QueryResult and Hit, print on an HSP shows its general details:

a) the e-value and bitscore.

b) HSP fragments.

c) the query and hit sequence alignment

b) Use dot notation to access each result:

c) Access other HSP object properties to probe the result, examples:

d) Calculate number of gaps and identical residues

These details are format-specific; they may not be present in other formats.

use .dict.keys() for a quick list of what’s available:

e) query and hit attributes of HSP are are SeqRecord objects

f) Print HSP object as an alignment property with MultipleSeqAlignment object:

4. HSPFragment

HSPFragment represents a single, contiguous match between the query and hit sequences. Consider it the core of the object model and search result, since it is the presence of these fragments that determine whether your search have results or not.

HSPFragment objects were written with to be as compact as possible. In most cases, they only contain attributes directly related to sequences: strands, reading frames, molecule types, coordinates,the sequences themselves, and their IDs and descriptions.

HSPFragment objects from the BLAST search:

5. Reading search output files

Use read and parse (submodules like Bio.SeqIO or Bio.AlignIO)

Requirement: you need to supply the search output file name and the file format name.

a) Bio.SearchIO.read is used for reading search output files – this

returns QueryResult object.

b) Bio.SearchIO.parse is used for reading search output files with any number of queries. Returning a generator object that yields a QueryResult object in each iteration.

6.Large search output files with indexing

For large file you can use Bio.SearchIO.index or Bio.SearchIO.index_db to index the files.

a) Use index with just the filename and format name:

b) Use with the format-specific keyword argument:

c) Use the key_function argument

7. Writing and converting search output files

Occasionally, you may need to manipulate search results from an output file and rewrite it to a new file. You can use the write function in Bio.SearchIO to do this. It takes a QueryResult objects, the output filename to write to. It returns a four-item tuple, which denotes the number or QueryResult, Hit, HSP, and HSPFragment objects that were written.

Use convert function: