Biopython Sequence Input/Output

Beatriz Manso

Import requried packages

Read in Sequence File using "parse" instead of "read"

Both require specification of the file type, eg. "fasta", "genbank".

Iterating over the records in a sequence file

The next() function is useful for obtaining only the first record in a multi-record file, resulting in the rest of the records in the file being ignored.

Getting a list of the records in a sequence file

Bio.SeqIO.parse() gives you a SeqRecord iterator, and you get the records one by one.

Often you need to be able to access the records in any order. The Python list data type is perfect for this, and we can turn the record iterator into a list of SeqRecord objects using the built-in Python function list() like so:

Extracting data

Get the species from a fasta file:

Modifying data

Note, if you want to change the way FASTA is output when written to a file, then you should modify both the id and description attributes. To ensure the correct behaviour, it is best to include the id plus a space at the start of the desired description:

Parsing sequences from compressed files

Instead of using a filename to parse a sequence data from a file, you can give Bio.SeqIO a handle. In this section we’ll use handles to parse sequence from compressed files.

Calculate the total length of the sequences in a multiple record GenBank file using a generator expression:

Here we use a file handle instead, using the with statement to close the handle automatically:

Use Python’s gzip module to open compressed files for reading - which gives us a handle object:

Parsing sequences from the net

Now we will use a different handle, an internet network connection, to use Entrez EFetch for downloading and parsing sequences from NCBI.

Let's access data for the Homo sapiens amine oxidase copper containing 1 (AOC1), transcript variant 1 from NCBI.

The file can be obtained from NCBI as a GenBank file:

If we use the parse() function instead of read() we can fetch multiple records.

Fetch the records for human, cotton and mouse AOC1:

Sequence files as Dictionaries

Using a Genbank file (AOC1seq.gb) containing sequence data for AOC1 from P. pastoris (a yeast), S. miltiorrhiza (red sage) and G hirsutum (cotton) index the records and access them like a database using the Python dictionary data type:

To view the contents of all of the records at once, the values can be listed:

Access a single record using the keys and manipulate the object:

Repeat the process above for the FASTA file:

Sequence files as Dictionaries – Indexed files

Bio.SeqIO.to_dict() is very flexible but as it keeps all the data in memory, it limits the file size that can be worked with based on your computer’s RAM.

When working with larger files, use Bio.SeqIO.index(), which still returns a dictionary like object, and does not keep everything in memory. But rather records where each record is within the file and returns the specific record when indicated.

To make a list of the indexed record keys:

Print the FASTA sequence of a specificed indexed record (gi|932245638|gb|KP300003.1|):

Getting the raw data for a record

The dictionary-like object from Bio.SeqIO.index() gives you each entry as a SeqRecord object. However, it is sometimes useful to be able to get the original raw data straight from the file. For this use the get_raw() method which takes a single argument (the record identifier) and returns a bytes string (extracted from the file without modification).

A motivating example is extracting a subset of a records from a large file where either Bio.SeqIO.write() does not (yet) support the output file format (e.g. the plain text SwissProt file format) or where you need to preserve the text exactly (e.g. GenBank or EMBL output from Biopython does not yet preserve every last bit of annotation).

Let’s suppose you have download the whole of UniProt in the plain text SwissPort file format from their FTP site and uncompressed it as the file uniprot_sprot.dat, and you want to extract just a few records from it:

Sequence files as Dictionaries – Database indexed files

Index the viral "gbvrl" GenBank files as follows:

The compressed file can be accessed the same way:

Writing Sequence Files

Bio.SeqIO.parse() has been used for sequence input (reading files) Bio.SeqIO.write() is for sequence output (writing files)

Create a few SeqRecord objects

With the list of SeqRecord objects, FASTA format file will be written.

Converting between sequence file formats

Getting your SeqRecord objects as formatted strings

Low level FASTA and FASTQ parsers

Working with the low-level SimpleFastaParser or FastqGeneralIterator is often more practical than Bio.SeqIO.parse when dealing with large high-throughput FASTA or FASTQ sequencing files where speed matters. As noted in the introduction to this chapter, the file-format neutral Bio.SeqIO interface has the overhead of creating many objects even for simple formats like FASTA.

When parsing FASTA files, internally Bio.SeqIO.parse() calls the low-level SimpleFastaParser with the file handle. You can use this directly - it iterates over the file handle returning each record as a tuple of two strings, the title line (everything after the > character) and the sequence (as a plain string):