import os
os.chdir("C:/Users/manso/OneDrive - University of West London/MSc Bioinformatics - UWL/3.DSB - Data Science for Bioinformatics/Practice/DSB W6")
! pip install markdown --user
! pip install nbconvert
import markdown
html = markdown.markdown('/SequenceAnnotation.ipynb')
print(html)
Requirement already satisfied: markdown in c:\users\manso\appdata\roaming\python\python38\site-packages (3.3.7) Requirement already satisfied: importlib-metadata>=4.4; python_version < "3.10" in c:\users\manso\appdata\roaming\python\python38\site-packages (from markdown) (4.11.4) Requirement already satisfied: zipp>=0.5 in c:\programdata\anaconda3\lib\site-packages (from importlib-metadata>=4.4; python_version < "3.10"->markdown) (3.4.0) Requirement already satisfied: nbconvert in c:\programdata\anaconda3\lib\site-packages (6.0.7) Requirement already satisfied: nbformat>=4.4 in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (5.0.8) Requirement already satisfied: mistune<2,>=0.8.1 in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (0.8.4) Requirement already satisfied: defusedxml in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (0.6.0) Requirement already satisfied: nbclient<0.6.0,>=0.5.0 in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (0.5.1) Requirement already satisfied: pandocfilters>=1.4.1 in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (1.4.3) Requirement already satisfied: traitlets>=4.2 in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (5.0.5) Requirement already satisfied: entrypoints>=0.2.2 in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (0.3) Requirement already satisfied: jinja2>=2.4 in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (2.11.2) Requirement already satisfied: bleach in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (3.2.1) Requirement already satisfied: testpath in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (0.4.4) Requirement already satisfied: jupyter-core in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (4.6.3) Requirement already satisfied: jupyterlab-pygments in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (0.1.2) Requirement already satisfied: pygments>=2.4.1 in c:\programdata\anaconda3\lib\site-packages (from nbconvert) (2.7.2) Requirement already satisfied: ipython-genutils in c:\programdata\anaconda3\lib\site-packages (from nbformat>=4.4->nbconvert) (0.2.0) Requirement already satisfied: jsonschema!=2.5.0,>=2.4 in c:\programdata\anaconda3\lib\site-packages (from nbformat>=4.4->nbconvert) (3.2.0) Requirement already satisfied: jupyter-client>=6.1.5 in c:\programdata\anaconda3\lib\site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert) (6.1.7) Requirement already satisfied: nest-asyncio in c:\programdata\anaconda3\lib\site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert) (1.4.2) Requirement already satisfied: async-generator in c:\programdata\anaconda3\lib\site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert) (1.10) Requirement already satisfied: MarkupSafe>=0.23 in c:\users\manso\appdata\roaming\python\python38\site-packages (from jinja2>=2.4->nbconvert) (2.0.1) Requirement already satisfied: packaging in c:\programdata\anaconda3\lib\site-packages (from bleach->nbconvert) (20.4) Requirement already satisfied: six>=1.9.0 in c:\programdata\anaconda3\lib\site-packages (from bleach->nbconvert) (1.15.0) Requirement already satisfied: webencodings in c:\programdata\anaconda3\lib\site-packages (from bleach->nbconvert) (0.5.1) Requirement already satisfied: pywin32>=1.0; sys_platform == "win32" in c:\programdata\anaconda3\lib\site-packages (from jupyter-core->nbconvert) (227) Requirement already satisfied: attrs>=17.4.0 in c:\programdata\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert) (20.3.0) Requirement already satisfied: setuptools in c:\programdata\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert) (50.3.1.post20201107) Requirement already satisfied: pyrsistent>=0.14.0 in c:\programdata\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert) (0.17.3) Requirement already satisfied: python-dateutil>=2.1 in c:\programdata\anaconda3\lib\site-packages (from jupyter-client>=6.1.5->nbclient<0.6.0,>=0.5.0->nbconvert) (2.8.1) Requirement already satisfied: tornado>=4.1 in c:\programdata\anaconda3\lib\site-packages (from jupyter-client>=6.1.5->nbclient<0.6.0,>=0.5.0->nbconvert) (6.0.4) Requirement already satisfied: pyzmq>=13 in c:\programdata\anaconda3\lib\site-packages (from jupyter-client>=6.1.5->nbclient<0.6.0,>=0.5.0->nbconvert) (19.0.2) Requirement already satisfied: pyparsing>=2.0.2 in c:\programdata\anaconda3\lib\site-packages (from packaging->bleach->nbconvert) (2.4.7) <p>/SequenceAnnotation.ipynb</p>
The "nbconvert" tool allows you to convert a Jupyter .ipynb notebook document file into another static format including HTML, LaTeX, PDF, Markdown, reStructuredText, and more. nbconvert can also add productivity to your workflow when used to execute notebooks programmatically.
# "!" sends the command to the shell
! jupyter nbconvert --to html /SequenceAnnotation.ipynb
This application is used to convert notebook files (*.ipynb) to various other formats. WARNING: THE COMMANDLINE INTERFACE MAY CHANGE IN FUTURE RELEASES. Options ======= The options below are convenience aliases to configurable class-options, as listed in the "Equivalent to" description-line of the aliases. To see all configurable class-options for some <cmd>, use: <cmd> --help-all --debug set log level to logging.DEBUG (maximize logging output) Equivalent to: [--Application.log_level=10] --generate-config generate default config file Equivalent to: [--JupyterApp.generate_config=True] -y Answer yes to any questions instead of prompting. Equivalent to: [--JupyterApp.answer_yes=True] --execute Execute the notebook prior to export. Equivalent to: [--ExecutePreprocessor.enabled=True] --allow-errors Continue notebook execution even if one of the cells throws an error and include the error message in the cell output (the default behaviour is to abort conversion). This flag is only relevant if '--execute' was specified, too. Equivalent to: [--ExecutePreprocessor.allow_errors=True] --stdin read a single notebook file from stdin. Write the resulting notebook with default basename 'notebook.*' Equivalent to: [--NbConvertApp.from_stdin=True] --stdout Write notebook output to stdout instead of files. Equivalent to: [--NbConvertApp.writer_class=StdoutWriter] --inplace Run nbconvert in place, overwriting the existing notebook (only relevant when converting to notebook format) Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory=] --clear-output Clear output of current file and save in place, overwriting the existing notebook. Equivalent to: [--NbConvertApp.use_output_suffix=False --NbConvertApp.export_format=notebook --FilesWriter.build_directory= --ClearOutputPreprocessor.enabled=True] --no-prompt Exclude input and output prompts from converted document. Equivalent to: [--TemplateExporter.exclude_input_prompt=True --TemplateExporter.exclude_output_prompt=True] --no-input Exclude input cells and output prompts from converted document. This mode is ideal for generating code-free reports. Equivalent to: [--TemplateExporter.exclude_output_prompt=True --TemplateExporter.exclude_input=True] --allow-chromium-download Whether to allow downloading chromium if no suitable version is found on the system. Equivalent to: [--WebPDFExporter.allow_chromium_download=True] --log-level=<Enum> Set the log level by value or name. Choices: any of [0, 10, 20, 30, 40, 50, 'DEBUG', 'INFO', 'WARN', 'ERROR', 'CRITICAL'] Default: 30 Equivalent to: [--Application.log_level] --config=<Unicode> Full path of a config file. Default: '' Equivalent to: [--JupyterApp.config_file] --to=<Unicode> The export format to be used, either one of the built-in formats ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'rst', 'script', 'slides', 'webpdf'] or a dotted object name that
[NbConvertApp] WARNING | pattern '/SequenceAnnotation.ipynb' matched no files
represents the import path for an `Exporter` class Default: '' Equivalent to: [--NbConvertApp.export_format] --template=<Unicode> Name of the template to use Default: '' Equivalent to: [--TemplateExporter.template_name] --template-file=<Unicode> Name of the template file to use Default: None Equivalent to: [--TemplateExporter.template_file] --writer=<DottedObjectName> Writer class used to write the results of the conversion Default: 'FilesWriter' Equivalent to: [--NbConvertApp.writer_class] --post=<DottedOrNone> PostProcessor class used to write the results of the conversion Default: '' Equivalent to: [--NbConvertApp.postprocessor_class] --output=<Unicode> overwrite base name use for output files. can only be used when converting one notebook at a time. Default: '' Equivalent to: [--NbConvertApp.output_base] --output-dir=<Unicode> Directory to write output(s) to. Defaults to output to the directory of each notebook. To recover previous default behaviour (outputting to the current working directory) use . as the flag value. Default: '' Equivalent to: [--FilesWriter.build_directory] --reveal-prefix=<Unicode> The URL prefix for reveal.js (version 3.x). This defaults to the reveal CDN, but can be any url pointing to a copy of reveal.js. For speaker notes to work, this must be a relative path to a local copy of reveal.js: e.g., "reveal.js". If a relative path is given, it must be a subdirectory of the current directory (from which the server is run). See the usage documentation (https://nbconvert.readthedocs.io/en/latest/usage.html#reveal-js-html- slideshow) for more details. Default: '' Equivalent to: [--SlidesExporter.reveal_url_prefix] --nbformat=<Enum> The nbformat version to write. Use this to downgrade notebooks. Choices: any of [1, 2, 3, 4] Default: 4 Equivalent to: [--NotebookExporter.nbformat_version] Examples -------- The simplest way to use nbconvert is > jupyter nbconvert mynotebook.ipynb --to html Options include ['asciidoc', 'custom', 'html', 'latex', 'markdown', 'notebook', 'pdf', 'python', 'rst', 'script', 'slides', 'webpdf']. > jupyter nbconvert --to latex mynotebook.ipynb Both HTML and LaTeX support multiple output templates. LaTeX includes 'base', 'article' and 'report'. HTML includes 'basic' and 'full'. You can specify the flavor of the format used. > jupyter nbconvert --to html --template lab mynotebook.ipynb You can also pipe the output to stdout, rather than a file > jupyter nbconvert mynotebook.ipynb --stdout PDF is generated via latex > jupyter nbconvert mynotebook.ipynb --to pdf You can get (and serve) a Reveal.js-powered slideshow > jupyter nbconvert myslides.ipynb --to slides --post serve Multiple notebooks can be given at the command line in a couple of different ways: > jupyter nbconvert notebook*.ipynb > jupyter nbconvert notebook1.ipynb notebook2.ipynb or you can specify the notebooks list in a config file, containing:: c.NbConvertApp.notebooks = ["my_notebook.ipynb"] > jupyter nbconvert --config mycfg.py To see all available configurables, use `--help-all`.
!pip install -U notebook-as-pdf
!pip install biopython
Collecting notebook-as-pdf Using cached notebook_as_pdf-0.5.0-py3-none-any.whl (6.5 kB) Requirement already satisfied, skipping upgrade: PyPDF2 in c:\programdata\anaconda3\lib\site-packages (from notebook-as-pdf) (2.3.1) Collecting pyppeteer Using cached pyppeteer-1.0.2-py3-none-any.whl (83 kB) Requirement already satisfied, skipping upgrade: nbconvert in c:\programdata\anaconda3\lib\site-packages (from notebook-as-pdf) (6.0.7) Requirement already satisfied, skipping upgrade: typing-extensions; python_version < "3.10" in c:\programdata\anaconda3\lib\site-packages (from PyPDF2->notebook-as-pdf) (3.7.4.3) Requirement already satisfied, skipping upgrade: pyee<9.0.0,>=8.1.0 in c:\programdata\anaconda3\lib\site-packages (from pyppeteer->notebook-as-pdf) (8.2.2) Requirement already satisfied, skipping upgrade: websockets<11.0,>=10.0 in c:\programdata\anaconda3\lib\site-packages (from pyppeteer->notebook-as-pdf) (10.3) Collecting certifi>=2021 Using cached certifi-2022.6.15-py3-none-any.whl (160 kB) Requirement already satisfied, skipping upgrade: importlib-metadata>=1.4 in c:\users\manso\appdata\roaming\python\python38\site-packages (from pyppeteer->notebook-as-pdf) (4.11.4) Requirement already satisfied, skipping upgrade: tqdm<5.0.0,>=4.42.1 in c:\programdata\anaconda3\lib\site-packages (from pyppeteer->notebook-as-pdf) (4.50.2) Requirement already satisfied, skipping upgrade: appdirs<2.0.0,>=1.4.3 in c:\programdata\anaconda3\lib\site-packages (from pyppeteer->notebook-as-pdf) (1.4.4) Requirement already satisfied, skipping upgrade: urllib3<2.0.0,>=1.25.8 in c:\programdata\anaconda3\lib\site-packages (from pyppeteer->notebook-as-pdf) (1.25.11) Requirement already satisfied, skipping upgrade: nbclient<0.6.0,>=0.5.0 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (0.5.1) Requirement already satisfied, skipping upgrade: pandocfilters>=1.4.1 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (1.4.3) Requirement already satisfied, skipping upgrade: defusedxml in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (0.6.0) Requirement already satisfied, skipping upgrade: pygments>=2.4.1 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (2.7.2) Requirement already satisfied, skipping upgrade: jupyter-core in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (4.6.3) Requirement already satisfied, skipping upgrade: testpath in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (0.4.4) Requirement already satisfied, skipping upgrade: traitlets>=4.2 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (5.0.5) Requirement already satisfied, skipping upgrade: bleach in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (3.2.1) Requirement already satisfied, skipping upgrade: mistune<2,>=0.8.1 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (0.8.4) Requirement already satisfied, skipping upgrade: jupyterlab-pygments in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (0.1.2) Requirement already satisfied, skipping upgrade: jinja2>=2.4 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (2.11.2) Requirement already satisfied, skipping upgrade: entrypoints>=0.2.2 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (0.3) Requirement already satisfied, skipping upgrade: nbformat>=4.4 in c:\programdata\anaconda3\lib\site-packages (from nbconvert->notebook-as-pdf) (5.0.8) Requirement already satisfied, skipping upgrade: zipp>=0.5 in c:\programdata\anaconda3\lib\site-packages (from importlib-metadata>=1.4->pyppeteer->notebook-as-pdf) (3.4.0) Requirement already satisfied, skipping upgrade: async-generator in c:\programdata\anaconda3\lib\site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook-as-pdf) (1.10) Requirement already satisfied, skipping upgrade: jupyter-client>=6.1.5 in c:\programdata\anaconda3\lib\site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook-as-pdf) (6.1.7) Requirement already satisfied, skipping upgrade: nest-asyncio in c:\programdata\anaconda3\lib\site-packages (from nbclient<0.6.0,>=0.5.0->nbconvert->notebook-as-pdf) (1.4.2) Requirement already satisfied, skipping upgrade: pywin32>=1.0; sys_platform == "win32" in c:\programdata\anaconda3\lib\site-packages (from jupyter-core->nbconvert->notebook-as-pdf) (227) Requirement already satisfied, skipping upgrade: ipython-genutils in c:\programdata\anaconda3\lib\site-packages (from traitlets>=4.2->nbconvert->notebook-as-pdf) (0.2.0) Requirement already satisfied, skipping upgrade: webencodings in c:\programdata\anaconda3\lib\site-packages (from bleach->nbconvert->notebook-as-pdf) (0.5.1) Requirement already satisfied, skipping upgrade: packaging in c:\programdata\anaconda3\lib\site-packages (from bleach->nbconvert->notebook-as-pdf) (20.4) Requirement already satisfied, skipping upgrade: six>=1.9.0 in c:\programdata\anaconda3\lib\site-packages (from bleach->nbconvert->notebook-as-pdf) (1.15.0) Requirement already satisfied, skipping upgrade: MarkupSafe>=0.23 in c:\users\manso\appdata\roaming\python\python38\site-packages (from jinja2>=2.4->nbconvert->notebook-as-pdf) (2.0.1) Requirement already satisfied, skipping upgrade: jsonschema!=2.5.0,>=2.4 in c:\programdata\anaconda3\lib\site-packages (from nbformat>=4.4->nbconvert->notebook-as-pdf) (3.2.0) Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in c:\programdata\anaconda3\lib\site-packages (from jupyter-client>=6.1.5->nbclient<0.6.0,>=0.5.0->nbconvert->notebook-as-pdf) (2.8.1) Requirement already satisfied, skipping upgrade: tornado>=4.1 in c:\programdata\anaconda3\lib\site-packages (from jupyter-client>=6.1.5->nbclient<0.6.0,>=0.5.0->nbconvert->notebook-as-pdf) (6.0.4) Requirement already satisfied, skipping upgrade: pyzmq>=13 in c:\programdata\anaconda3\lib\site-packages (from jupyter-client>=6.1.5->nbclient<0.6.0,>=0.5.0->nbconvert->notebook-as-pdf) (19.0.2) Requirement already satisfied, skipping upgrade: pyparsing>=2.0.2 in c:\programdata\anaconda3\lib\site-packages (from packaging->bleach->nbconvert->notebook-as-pdf) (2.4.7) Requirement already satisfied, skipping upgrade: pyrsistent>=0.14.0 in c:\programdata\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert->notebook-as-pdf) (0.17.3) Requirement already satisfied, skipping upgrade: setuptools in c:\programdata\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert->notebook-as-pdf) (50.3.1.post20201107) Requirement already satisfied, skipping upgrade: attrs>=17.4.0 in c:\programdata\anaconda3\lib\site-packages (from jsonschema!=2.5.0,>=2.4->nbformat>=4.4->nbconvert->notebook-as-pdf) (20.3.0) Installing collected packages: certifi, pyppeteer, notebook-as-pdf Attempting uninstall: certifi Found existing installation: certifi 2020.6.20 Uninstalling certifi-2020.6.20:
ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'c:\\programdata\\anaconda3\\lib\\site-packages\\certifi-2020.6.20-py3.6.egg-info\\dependency_links.txt' Consider using the `--user` option or check the permissions.
Requirement already satisfied: biopython in c:\programdata\anaconda3\lib\site-packages (1.79) Requirement already satisfied: numpy in c:\programdata\anaconda3\lib\site-packages (from biopython) (1.19.2)
from Bio.SeqRecord import SeqRecord
help(SeqRecord)
Help on class SeqRecord in module Bio.SeqRecord: class SeqRecord(builtins.object) | SeqRecord(seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None) | | A SeqRecord object holds a sequence and information about it. | | Main attributes: | - id - Identifier such as a locus tag (string) | - seq - The sequence itself (Seq object or similar) | | Additional attributes: | - name - Sequence name, e.g. gene name (string) | - description - Additional text (string) | - dbxrefs - List of database cross references (list of strings) | - features - Any (sub)features defined (list of SeqFeature objects) | - annotations - Further information about the whole sequence (dictionary). | Most entries are strings, or lists of strings. | - letter_annotations - Per letter/symbol annotation (restricted | dictionary). This holds Python sequences (lists, strings | or tuples) whose length matches that of the sequence. | A typical use would be to hold a list of integers | representing sequencing quality scores, or a string | representing the secondary structure. | | You will typically use Bio.SeqIO to read in sequences from files as | SeqRecord objects. However, you may want to create your own SeqRecord | objects directly (see the __init__ method for further details): | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"), | ... id="YP_025292.1", name="HokC", | ... description="toxic membrane protein") | >>> print(record) | ID: YP_025292.1 | Name: HokC | Description: toxic membrane protein | Number of features: 0 | Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF') | | If you want to save SeqRecord objects to a sequence file, use Bio.SeqIO | for this. For the special case where you want the SeqRecord turned into | a string in a particular file format there is a format method which uses | Bio.SeqIO internally: | | >>> print(record.format("fasta")) | >YP_025292.1 toxic membrane protein | MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF | <BLANKLINE> | | You can also do things like slicing a SeqRecord, checking its length, etc | | >>> len(record) | 44 | >>> edited = record[:10] + record[11:] | >>> print(edited.seq) | MKQHKAMIVAIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF | >>> print(record.seq) | MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF | | Methods defined here: | | __add__(self, other) | Add another sequence or string to this sequence. | | The other sequence can be a SeqRecord object, a Seq object (or | similar, e.g. a MutableSeq) or a plain Python string. If you add | a plain string or a Seq (like) object, the new SeqRecord will simply | have this appended to the existing data. However, any per letter | annotation will be lost: | | >>> from Bio import SeqIO | >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa") | >>> print("%s %s" % (record.id, record.seq)) | slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | >>> print(list(record.letter_annotations)) | ['solexa_quality'] | | >>> new = record + "ACT" | >>> print("%s %s" % (new.id, new.seq)) | slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNNACT | >>> print(list(new.letter_annotations)) | [] | | The new record will attempt to combine the annotation, but for any | ambiguities (e.g. different names) it defaults to omitting that | annotation. | | >>> from Bio import SeqIO | >>> with open("GenBank/pBAD30.gb") as handle: | ... plasmid = SeqIO.read(handle, "gb") | >>> print("%s %i" % (plasmid.id, len(plasmid))) | pBAD30 4923 | | Now let's cut the plasmid into two pieces, and join them back up the | other way round (i.e. shift the starting point on this plasmid, have | a look at the annotated features in the original file to see why this | particular split point might make sense): | | >>> left = plasmid[:3765] | >>> right = plasmid[3765:] | >>> new = right + left | >>> print("%s %i" % (new.id, len(new))) | pBAD30 4923 | >>> str(new.seq) == str(right.seq + left.seq) | True | >>> len(new.features) == len(left.features) + len(right.features) | True | | When we add the left and right SeqRecord objects, their annotation | is all consistent, so it is all conserved in the new SeqRecord: | | >>> new.id == left.id == right.id == plasmid.id | True | >>> new.name == left.name == right.name == plasmid.name | True | >>> new.description == plasmid.description | True | >>> new.annotations == left.annotations == right.annotations | True | >>> new.letter_annotations == plasmid.letter_annotations | True | >>> new.dbxrefs == left.dbxrefs == right.dbxrefs | True | | However, we should point out that when we sliced the SeqRecord, | any annotations dictionary or dbxrefs list entries were lost. | You can explicitly copy them like this: | | >>> new.annotations = plasmid.annotations.copy() | >>> new.dbxrefs = plasmid.dbxrefs[:] | | __bool__(self) | Boolean value of an instance of this class (True). | | This behaviour is for backwards compatibility, since until the | __len__ method was added, a SeqRecord always evaluated as True. | | Note that in comparison, a Seq object will evaluate to False if it | has a zero length sequence. | | WARNING: The SeqRecord may in future evaluate to False when its | sequence is of zero length (in order to better match the Seq | object behaviour)! | | __contains__(self, char) | Implement the 'in' keyword, searches the sequence. | | e.g. | | >>> from Bio import SeqIO | >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta") | >>> "GAATTC" in record | False | >>> "AAA" in record | True | | This essentially acts as a proxy for using "in" on the sequence: | | >>> "GAATTC" in record.seq | False | >>> "AAA" in record.seq | True | | Note that you can also use Seq objects as the query, | | >>> from Bio.Seq import Seq | >>> Seq("AAA") in record | True | | See also the Seq object's __contains__ method. | | __eq__(self, other) | Define the equal-to operand (not implemented). | | __format__(self, format_spec) | Return the record as a string in the specified file format. | | This method supports the Python format() function and f-strings. | The format_spec should be a lower case string supported by | Bio.SeqIO as a text output file format. Requesting a binary file | format raises a ValueError. e.g. | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"), | ... id="YP_025292.1", name="HokC", | ... description="toxic membrane protein") | ... | >>> format(record, "fasta") | '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n' | >>> print(f"Here is {record.id} in FASTA format:\n{record:fasta}") | Here is YP_025292.1 in FASTA format: | >YP_025292.1 toxic membrane protein | MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF | <BLANKLINE> | | See also the SeqRecord's format() method. | | __ge__(self, other) | Define the greater-than-or-equal-to operand (not implemented). | | __getitem__(self, index) | Return a sub-sequence or an individual letter. | | Slicing, e.g. my_record[5:10], returns a new SeqRecord for | that sub-sequence with some annotation preserved as follows: | | * The name, id and description are kept as-is. | * Any per-letter-annotations are sliced to match the requested | sub-sequence. | * Unless a stride is used, all those features which fall fully | within the subsequence are included (with their locations | adjusted accordingly). If you want to preserve any truncated | features (e.g. GenBank/EMBL source features), you must | explicitly add them to the new SeqRecord yourself. | * With the exception of any molecule type, the annotations | dictionary and the dbxrefs list are not used for the new | SeqRecord, as in general they may not apply to the | subsequence. If you want to preserve them, you must explicitly | copy them to the new SeqRecord yourself. | | Using an integer index, e.g. my_record[5] is shorthand for | extracting that letter from the sequence, my_record.seq[5]. | | For example, consider this short protein and its secondary | structure as encoded by the PDB (e.g. H for alpha helices), | plus a simple feature for its histidine self phosphorylation | site: | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> from Bio.SeqFeature import SeqFeature, FeatureLocation | >>> rec = SeqRecord(Seq("MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLAT" | ... "EMMSEQDGYLAESINKDIEECNAIIEQFIDYLR"), | ... id="1JOY", name="EnvZ", | ... description="Homodimeric domain of EnvZ from E. coli") | >>> rec.letter_annotations["secondary_structure"] = " S SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT " | >>> rec.features.append(SeqFeature(FeatureLocation(20, 21), | ... type = "Site")) | | Now let's have a quick look at the full record, | | >>> print(rec) | ID: 1JOY | Name: EnvZ | Description: Homodimeric domain of EnvZ from E. coli | Number of features: 1 | Per letter annotation for: secondary_structure | Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR') | >>> rec.letter_annotations["secondary_structure"] | ' S SSSSSSHHHHHTTTHHHHHHHHHHHHHHHHHHHHHHTHHHHHHHHHHHHHHHHHHHHHTT ' | >>> print(rec.features[0].location) | [20:21] | | Now let's take a sub sequence, here chosen as the first (fractured) | alpha helix which includes the histidine phosphorylation site: | | >>> sub = rec[11:41] | >>> print(sub) | ID: 1JOY | Name: EnvZ | Description: Homodimeric domain of EnvZ from E. coli | Number of features: 1 | Per letter annotation for: secondary_structure | Seq('RTLLMAGVSHDLRTPLTRIRLATEMMSEQD') | >>> sub.letter_annotations["secondary_structure"] | 'HHHHHTTTHHHHHHHHHHHHHHHHHHHHHH' | >>> print(sub.features[0].location) | [9:10] | | You can also of course omit the start or end values, for | example to get the first ten letters only: | | >>> print(rec[:10]) | ID: 1JOY | Name: EnvZ | Description: Homodimeric domain of EnvZ from E. coli | Number of features: 0 | Per letter annotation for: secondary_structure | Seq('MAAGVKQLAD') | | Or for the last ten letters: | | >>> print(rec[-10:]) | ID: 1JOY | Name: EnvZ | Description: Homodimeric domain of EnvZ from E. coli | Number of features: 0 | Per letter annotation for: secondary_structure | Seq('IIEQFIDYLR') | | If you omit both, then you get a copy of the original record (although | lacking the annotations and dbxrefs): | | >>> print(rec[:]) | ID: 1JOY | Name: EnvZ | Description: Homodimeric domain of EnvZ from E. coli | Number of features: 1 | Per letter annotation for: secondary_structure | Seq('MAAGVKQLADDRTLLMAGVSHDLRTPLTRIRLATEMMSEQDGYLAESINKDIEE...YLR') | | Finally, indexing with a simple integer is shorthand for pulling out | that letter from the sequence directly: | | >>> rec[5] | 'K' | >>> rec.seq[5] | 'K' | | __gt__(self, other) | Define the greater-than operand (not implemented). | | __init__(self, seq, id='<unknown id>', name='<unknown name>', description='<unknown description>', dbxrefs=None, features=None, annotations=None, letter_annotations=None) | Create a SeqRecord. | | Arguments: | - seq - Sequence, required (Seq or MutableSeq) | - id - Sequence identifier, recommended (string) | - name - Sequence name, optional (string) | - description - Sequence description, optional (string) | - dbxrefs - Database cross references, optional (list of strings) | - features - Any (sub)features, optional (list of SeqFeature objects) | - annotations - Dictionary of annotations for the whole sequence | - letter_annotations - Dictionary of per-letter-annotations, values | should be strings, list or tuples of the same length as the full | sequence. | | You will typically use Bio.SeqIO to read in sequences from files as | SeqRecord objects. However, you may want to create your own SeqRecord | objects directly. | | Note that while an id is optional, we strongly recommend you supply a | unique id string for each record. This is especially important | if you wish to write your sequences to a file. | | You can create a 'blank' SeqRecord object, and then populate the | attributes later. | | __iter__(self) | Iterate over the letters in the sequence. | | For example, using Bio.SeqIO to read in a protein FASTA file: | | >>> from Bio import SeqIO | >>> record = SeqIO.read("Fasta/loveliesbleeding.pro", "fasta") | >>> for amino in record: | ... print(amino) | ... if amino == "L": break | X | A | G | L | >>> print(record.seq[3]) | L | | This is just a shortcut for iterating over the sequence directly: | | >>> for amino in record.seq: | ... print(amino) | ... if amino == "L": break | X | A | G | L | >>> print(record.seq[3]) | L | | Note that this does not facilitate iteration together with any | per-letter-annotation. However, you can achieve that using the | python zip function on the record (or its sequence) and the relevant | per-letter-annotation: | | >>> from Bio import SeqIO | >>> rec = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa") | >>> print("%s %s" % (rec.id, rec.seq)) | slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | >>> print(list(rec.letter_annotations)) | ['solexa_quality'] | >>> for nuc, qual in zip(rec, rec.letter_annotations["solexa_quality"]): | ... if qual > 35: | ... print("%s %i" % (nuc, qual)) | A 40 | C 39 | G 38 | T 37 | A 36 | | You may agree that using zip(rec.seq, ...) is more explicit than using | zip(rec, ...) as shown above. | | __le___(self, other) | Define the less-than-or-equal-to operand (not implemented). | | __len__(self) | Return the length of the sequence. | | For example, using Bio.SeqIO to read in a FASTA nucleotide file: | | >>> from Bio import SeqIO | >>> record = SeqIO.read("Fasta/sweetpea.nu", "fasta") | >>> len(record) | 309 | >>> len(record.seq) | 309 | | __lt__(self, other) | Define the less-than operand (not implemented). | | __ne__(self, other) | Define the not-equal-to operand (not implemented). | | __radd__(self, other) | Add another sequence or string to this sequence (from the left). | | This method handles adding a Seq object (or similar, e.g. MutableSeq) | or a plain Python string (on the left) to a SeqRecord (on the right). | See the __add__ method for more details, but for example: | | >>> from Bio import SeqIO | >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa") | >>> print("%s %s" % (record.id, record.seq)) | slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | >>> print(list(record.letter_annotations)) | ['solexa_quality'] | | >>> new = "ACT" + record | >>> print("%s %s" % (new.id, new.seq)) | slxa_0001_1_0001_01 ACTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | >>> print(list(new.letter_annotations)) | [] | | __repr__(self) | Return a concise summary of the record for debugging (string). | | The python built in function repr works by calling the object's ___repr__ | method. e.g. | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> rec = SeqRecord(Seq("MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKAT" | ... "GEMKEQTEWHRVVLFGKLAEVASEYLRKGSQVYIEGQLRTRKWTDQ" | ... "SGQDRYTTEVVVNVGGTMQMLGGRQGGGAPAGGNIGGGQPQGGWGQ" | ... "PQQPQGGNQFSGGAQSRPQQSAPAAPSNEPPMDFDDDIPF"), | ... id="NP_418483.1", name="b4059", | ... description="ssDNA-binding protein", | ... dbxrefs=["ASAP:13298", "GI:16131885", "GeneID:948570"]) | >>> print(repr(rec)) | SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF'), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570']) | | At the python prompt you can also use this shorthand: | | >>> rec | SeqRecord(seq=Seq('MASRGVNKVILVGNLGQDPEVRYMPNGGAVANITLATSESWRDKATGEMKEQTE...IPF'), id='NP_418483.1', name='b4059', description='ssDNA-binding protein', dbxrefs=['ASAP:13298', 'GI:16131885', 'GeneID:948570']) | | Note that long sequences are shown truncated. Also note that any | annotations, letter_annotations and features are not shown (as they | would lead to a very long string). | | __str__(self) | Return a human readable summary of the record and its annotation (string). | | The python built in function str works by calling the object's ___str__ | method. e.g. | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"), | ... id="YP_025292.1", name="HokC", | ... description="toxic membrane protein, small") | >>> print(str(record)) | ID: YP_025292.1 | Name: HokC | Description: toxic membrane protein, small | Number of features: 0 | Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF') | | In this example you don't actually need to call str explicity, as the | print command does this automatically: | | >>> print(record) | ID: YP_025292.1 | Name: HokC | Description: toxic membrane protein, small | Number of features: 0 | Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF') | | Note that long sequences are shown truncated. | | format(self, format) | Return the record as a string in the specified file format. | | The format should be a lower case string supported as an output | format by Bio.SeqIO, which is used to turn the SeqRecord into a | string. e.g. | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF"), | ... id="YP_025292.1", name="HokC", | ... description="toxic membrane protein") | >>> record.format("fasta") | '>YP_025292.1 toxic membrane protein\nMKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF\n' | >>> print(record.format("fasta")) | >YP_025292.1 toxic membrane protein | MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF | <BLANKLINE> | | The Python print function automatically appends a new line, meaning | in this example a blank line is shown. If you look at the string | representation you can see there is a trailing new line (shown as | slash n) which is important when writing to a file or if | concatenating multiple sequence strings together. | | Note that this method will NOT work on every possible file format | supported by Bio.SeqIO (e.g. some are for multiple sequences only, | and binary formats are not supported). | | lower(self) | Return a copy of the record with a lower case sequence. | | All the annotation is preserved unchanged. e.g. | | >>> from Bio import SeqIO | >>> record = SeqIO.read("Fasta/aster.pro", "fasta") | >>> print(record.format("fasta")) | >gi|3298468|dbj|BAA31520.1| SAMIPF | GGHVNPAVTFGAFVGGNITLLRGIVYIIAQLLGSTVACLLLKFVTNDMAVGVFSLSAGVG | VTNALVFEIVMTFGLVYTVYATAIDPKKGSLGTIAPIAIGFIVGANI | <BLANKLINE> | >>> print(record.lower().format("fasta")) | >gi|3298468|dbj|BAA31520.1| SAMIPF | gghvnpavtfgafvggnitllrgivyiiaqllgstvaclllkfvtndmavgvfslsagvg | vtnalvfeivmtfglvytvyataidpkkgslgtiapiaigfivgani | <BLANKLINE> | | To take a more annotation rich example, | | >>> from Bio import SeqIO | >>> old = SeqIO.read("EMBL/TRBG361.embl", "embl") | >>> len(old.features) | 3 | >>> new = old.lower() | >>> len(old.features) == len(new.features) | True | >>> old.annotations["organism"] == new.annotations["organism"] | True | >>> old.dbxrefs == new.dbxrefs | True | | reverse_complement(self, id=False, name=False, description=False, features=True, annotations=False, letter_annotations=True, dbxrefs=False) | Return new SeqRecord with reverse complement sequence. | | By default the new record does NOT preserve the sequence identifier, | name, description, general annotation or database cross-references - | these are unlikely to apply to the reversed sequence. | | You can specify the returned record's id, name and description as | strings, or True to keep that of the parent, or False for a default. | | You can specify the returned record's features with a list of | SeqFeature objects, or True to keep that of the parent, or False to | omit them. The default is to keep the original features (with the | strand and locations adjusted). | | You can also specify both the returned record's annotations and | letter_annotations as dictionaries, True to keep that of the parent, | or False to omit them. The default is to keep the original | annotations (with the letter annotations reversed). | | To show what happens to the pre-letter annotations, consider an | example Solexa variant FASTQ file with a single entry, which we'll | read in as a SeqRecord: | | >>> from Bio import SeqIO | >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa") | >>> print("%s %s" % (record.id, record.seq)) | slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | >>> print(list(record.letter_annotations)) | ['solexa_quality'] | >>> print(record.letter_annotations["solexa_quality"]) | [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] | | Now take the reverse complement, here we explicitly give a new | identifier (the old identifier with a suffix): | | >>> rc_record = record.reverse_complement(id=record.id + "_rc") | >>> print("%s %s" % (rc_record.id, rc_record.seq)) | slxa_0001_1_0001_01_rc NNNNNNACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT | | Notice that the per-letter-annotations have also been reversed, | although this may not be appropriate for all cases. | | >>> print(rc_record.letter_annotations["solexa_quality"]) | [-5, -4, -3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40] | | Now for the features, we need a different example. Parsing a GenBank | file is probably the easiest way to get an nice example with features | in it... | | >>> from Bio import SeqIO | >>> with open("GenBank/pBAD30.gb") as handle: | ... plasmid = SeqIO.read(handle, "gb") | >>> print("%s %i" % (plasmid.id, len(plasmid))) | pBAD30 4923 | >>> plasmid.seq | Seq('GCTAGCGGAGTGTATACTGGCTTACTATGTTGGCACTGATGAGGGTGTCAGTGA...ATG') | >>> len(plasmid.features) | 13 | | Now, let's take the reverse complement of this whole plasmid: | | >>> rc_plasmid = plasmid.reverse_complement(id=plasmid.id+"_rc") | >>> print("%s %i" % (rc_plasmid.id, len(rc_plasmid))) | pBAD30_rc 4923 | >>> rc_plasmid.seq | Seq('CATGGGCAAATATTATACGCAAGGCGACAAGGTGCTGATGCCGCTGGCGATTCA...AGC') | >>> len(rc_plasmid.features) | 13 | | Let's compare the first CDS feature - it has gone from being the | second feature (index 1) to the second last feature (index -2), its | strand has changed, and the location switched round. | | >>> print(plasmid.features[1]) | type: CDS | location: [1081:1960](-) | qualifiers: | Key: label, Value: ['araC'] | Key: note, Value: ['araC regulator of the arabinose BAD promoter'] | Key: vntifkey, Value: ['4'] | <BLANKLINE> | >>> print(rc_plasmid.features[-2]) | type: CDS | location: [2963:3842](+) | qualifiers: | Key: label, Value: ['araC'] | Key: note, Value: ['araC regulator of the arabinose BAD promoter'] | Key: vntifkey, Value: ['4'] | <BLANKLINE> | | You can check this new location, based on the length of the plasmid: | | >>> len(plasmid) - 1081 | 3842 | >>> len(plasmid) - 1960 | 2963 | | Note that if the SeqFeature annotation includes any strand specific | information (e.g. base changes for a SNP), this information is not | amended, and would need correction after the reverse complement. | | Note trying to reverse complement a protein SeqRecord raises an | exception: | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> protein_rec = SeqRecord(Seq("MAIVMGR"), id="Test", | ... annotations={"molecule_type": "protein"}) | >>> protein_rec.reverse_complement() | Traceback (most recent call last): | ... | ValueError: Proteins do not have complements! | | If you have RNA without any U bases, it must be annotated as RNA | otherwise it will be treated as DNA by default with A mapped to T: | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> rna1 = SeqRecord(Seq("ACG"), id="Test") | >>> rna2 = SeqRecord(Seq("ACG"), id="Test", annotations={"molecule_type": "RNA"}) | >>> print(rna1.reverse_complement(id="RC", description="unk").format("fasta")) | >RC unk | CGT | <BLANKLINE> | >>> print(rna2.reverse_complement(id="RC", description="RNA").format("fasta")) | >RC RNA | CGU | <BLANKLINE> | | Also note you can reverse complement a SeqRecord using a MutableSeq: | | >>> from Bio.Seq import MutableSeq | >>> from Bio.SeqRecord import SeqRecord | >>> rec = SeqRecord(MutableSeq("ACGT"), id="Test") | >>> rec.seq[0] = "T" | >>> print("%s %s" % (rec.id, rec.seq)) | Test TCGT | >>> rc = rec.reverse_complement(id=True) | >>> print("%s %s" % (rc.id, rc.seq)) | Test ACGA | | translate(self, table='Standard', stop_symbol='*', to_stop=False, cds=False, gap=None, id=False, name=False, description=False, features=False, annotations=False, letter_annotations=False, dbxrefs=False) | Return new SeqRecord with translated sequence. | | This calls the record's .seq.translate() method (which describes | the translation related arguments, like table for the genetic code), | | By default the new record does NOT preserve the sequence identifier, | name, description, general annotation or database cross-references - | these are unlikely to apply to the translated sequence. | | You can specify the returned record's id, name and description as | strings, or True to keep that of the parent, or False for a default. | | You can specify the returned record's features with a list of | SeqFeature objects, or False (default) to omit them. | | You can also specify both the returned record's annotations and | letter_annotations as dictionaries, True to keep that of the parent | (annotations only), or False (default) to omit them. | | e.g. Loading a FASTA gene and translating it, | | >>> from Bio import SeqIO | >>> gene_record = SeqIO.read("Fasta/sweetpea.nu", "fasta") | >>> print(gene_record.format("fasta")) | >gi|3176602|gb|U78617.1|LOU78617 Lathyrus odoratus phytochrome A (PHYA) gene, partial cds | CAGGCTGCGCGGTTTCTATTTATGAAGAACAAGGTCCGTATGATAGTTGATTGTCATGCA | AAACATGTGAAGGTTCTTCAAGACGAAAAACTCCCATTTGATTTGACTCTGTGCGGTTCG | ACCTTAAGAGCTCCACATAGTTGCCATTTGCAGTACATGGCTAACATGGATTCAATTGCT | TCATTGGTTATGGCAGTGGTCGTCAATGACAGCGATGAAGATGGAGATAGCCGTGACGCA | GTTCTACCACAAAAGAAAAAGAGACTTTGGGGTTTGGTAGTTTGTCATAACACTACTCCG | AGGTTTGTT | <BLANKLINE> | | And now translating the record, specifying the new ID and description: | | >>> protein_record = gene_record.translate(table=11, | ... id="phya", | ... description="translation") | >>> print(protein_record.format("fasta")) | >phya translation | QAARFLFMKNKVRMIVDCHAKHVKVLQDEKLPFDLTLCGSTLRAPHSCHLQYMANMDSIA | SLVMAVVVNDSDEDGDSRDAVLPQKKKRLWGLVVCHNTTPRFV | <BLANKLINE> | | upper(self) | Return a copy of the record with an upper case sequence. | | All the annotation is preserved unchanged. e.g. | | >>> from Bio.Seq import Seq | >>> from Bio.SeqRecord import SeqRecord | >>> record = SeqRecord(Seq("acgtACGT"), id="Test", | ... description = "Made up for this example") | >>> record.letter_annotations["phred_quality"] = [1, 2, 3, 4, 5, 6, 7, 8] | >>> print(record.upper().format("fastq")) | @Test Made up for this example | ACGTACGT | + | "#$%&'() | <BLANKLINE> | | Naturally, there is a matching lower method: | | >>> print(record.lower().format("fastq")) | @Test Made up for this example | acgtacgt | + | "#$%&'() | <BLANKLINE> | | ---------------------------------------------------------------------- | Data descriptors defined here: | | __dict__ | dictionary for instance variables (if defined) | | __weakref__ | list of weak references to the object (if defined) | | letter_annotations | Dictionary of per-letter-annotation for the sequence. | | For example, this can hold quality scores used in FASTQ or QUAL files. | Consider this example using Bio.SeqIO to read in an example Solexa | variant FASTQ file as a SeqRecord: | | >>> from Bio import SeqIO | >>> record = SeqIO.read("Quality/solexa_faked.fastq", "fastq-solexa") | >>> print("%s %s" % (record.id, record.seq)) | slxa_0001_1_0001_01 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTNNNNNN | >>> print(list(record.letter_annotations)) | ['solexa_quality'] | >>> print(record.letter_annotations["solexa_quality"]) | [40, 39, 38, 37, 36, 35, 34, 33, 32, 31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5] | | The letter_annotations get sliced automatically if you slice the | parent SeqRecord, for example taking the last ten bases: | | >>> sub_record = record[-10:] | >>> print("%s %s" % (sub_record.id, sub_record.seq)) | slxa_0001_1_0001_01 ACGTNNNNNN | >>> print(sub_record.letter_annotations["solexa_quality"]) | [4, 3, 2, 1, 0, -1, -2, -3, -4, -5] | | Any python sequence (i.e. list, tuple or string) can be recorded in | the SeqRecord's letter_annotations dictionary as long as the length | matches that of the SeqRecord's sequence. e.g. | | >>> len(sub_record.letter_annotations) | 1 | >>> sub_record.letter_annotations["dummy"] = "abcdefghij" | >>> len(sub_record.letter_annotations) | 2 | | You can delete entries from the letter_annotations dictionary as usual: | | >>> del sub_record.letter_annotations["solexa_quality"] | >>> sub_record.letter_annotations | {'dummy': 'abcdefghij'} | | You can completely clear the dictionary easily as follows: | | >>> sub_record.letter_annotations = {} | >>> sub_record.letter_annotations | {} | | Note that if replacing the record's sequence with a sequence of a | different length you must first clear the letter_annotations dict. | | seq | The sequence itself, as a Seq or MutableSeq object. | | ---------------------------------------------------------------------- | Data and other attributes defined here: | | __hash__ = None
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
simple_seq = Seq("GATCAGGATTAGGCC")
simple_seq_r = SeqRecord(simple_seq)
simple_seq_r.id
'<unknown id>'
simple_seq_r.id = "AC12345"
simple_seq_r.description = "not a real sequence"
print(simple_seq_r.description)
not a real sequence
simple_seq_r.seq
Seq('GATCAGGATTAGGCC')
from Bio.Seq import Seq
simple_seq = Seq("GATC")
from Bio.SeqRecord import SeqRecord
simple_seq_r = SeqRecord(simple_seq, id="AC12345")
.annotations[] is used for any miscellaneous annotations that doesn’t fit under one of the other more specific attributes. Adding annotations is easy, and just involves dealing directly with the annotation dictionary:
simple_seq_r.annotations["evidence"] = "None."
print(simple_seq_r.annotations)
{'evidence': 'None.'}
letter_annotations[] is another dictionary that lets you assign any Python sequence (i.e. a string, list or tuple) which has the same length as the sequence:
simple_seq_r.letter_annotations["phred_quality"] = [40, 38, 30, 30]
print(simple_seq_r.letter_annotations)
{'phred_quality': [40, 38, 30, 30]}
print(simple_seq_r.letter_annotations["phred_quality"])
[40, 38, 30, 30]
translated_seq = simple_seq_r.seq.translate()
print(translated_seq)
D
C:\ProgramData\Anaconda3\lib\site-packages\Bio\Seq.py:2979: BiopythonWarning: Partial codon, len(sequence) not a multiple of three. Explicitly trim the sequence or add trailing N before translation. This may become an error in future. warnings.warn(
NC 005816.fna - Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1 (from NCBI)
#https://raw.githubusercontent.com/biopython/biopython/master/Tests/GenBank/NC_005816.fna
from Bio import SeqIO
recordN = SeqIO.read("NC_005816.fna", "fasta")
recordN
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|', description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
#https://www.ncbi.nlm.nih.gov/nuccore/V01325.1?report=fasta
from Bio import SeqIO
record = SeqIO.read("myFasta.fasta", "fasta")
record
SeqRecord(seq=Seq('AATTCCGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCAG...TAT'), id='V01325.1', name='V01325.1', description='V01325.1 Yeast (S. carlsbergensis) genes for 5.8S rRNA and 26S rRNA with intergenic region', dbxrefs=[])
Have a look at the key attributes of this SeqRecord individually – starting with the seq attribute which gives you a Seq object:
record.seq
Seq('AATTCCGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCTTGGTATTCCAG...TAT')
recordN.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG')
recordN.id
'gi|45478711|ref|NC_005816.1|'
record.name
'V01325.1'
record.description
'V01325.1 Yeast (S. carlsbergensis) genes for 5.8S rRNA and 26S rRNA with intergenic region'
The first word of the FASTA record’s title line is used for both the id and name attributes. The whole title line (after removing the greater than symbol) is used for the record description.
recordN.dbxrefs
[]
Now we will look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, but this time as a GenBank file.
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")
record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])
record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG')
record.id
'NC_005816.1'
record.name
'NC_005816'
record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
record.letter_annotations
{}
len(record.annotations)
13
record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'
record.dbxrefs
['Project:58037']
All the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list.
len(record.features)
41
Biopython SeqFeature class attempts to encapsulate as much of the information about the sequence as possible. The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you’ll probably have an easier time grasping the structure of the Biopython classes.
Position – This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions. Location – A location is region of sequence bounded by some positions. For instance 5..20 (i. e. 5 to 20) is a location.
There are several types of fuzzy positions, so we have five classes do deal with them:
- ExactPosition - Represents a position which is specified as exact along the sequence. This is represented as just a number, and you can get the position by looking at the position attribute of the object.
- BeforePosition - Represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like "<13", signifying that the real position is located somewhere less than 13. To get the specified upper boundary, look at the position attribute of the object.
- AfterPosition - Contrary to BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as ">13", and like BeforePosition, you get the boundary number by looking at the position attribute of the object.
- WithinPosition - Occasionally used for GenBank/EMBL locations, this class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as ‘(1.5)’, to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.
- OneOfPosition - Occasionally used for GenBank/EMBL locations, this class deals with a position where several possible values exist, for instance you could use this if the start codon was unclear and there where two candidates for the start of the gene. Alternatively, that might be handled explicitly as two related gene features.
- UnknownPosition - This class deals with a position of unknown location. This is not used in GenBank/EMBL, but corresponds to the ‘?’ feature coordinate used in UniProt.
from Bio import SeqFeature
start_pos = SeqFeature.AfterPosition(5)
end_pos = SeqFeature.BetweenPosition(9, left=8, right=9)
my_location = SeqFeature.FeatureLocation(start_pos, end_pos)
print(my_location)
[>5:(8^9)]
my_location.start
AfterPosition(5)
print(my_location.start)
>5
my_location.end
BetweenPosition(9, left=8, right=9)
print(my_location.end)
(8^9)
int(my_location.start)
5
int(my_location.end)
9
For older versions of Biopython ask for the nofuzzy_start and nofuzzy_end attributes of the location which are plain integers:
my_location.nofuzzy_start
5
my_location.nofuzzy_end
9
exact_location = SeqFeature.FeatureLocation(5, 9)
print(exact_location)
[5:9]
exact_location.start
ExactPosition(5)
int(exact_location.start)
5
exact_location.nofuzzy_start
5
Python keyword in with a SeqFeature or location object lets us see if the base/residue for a parent coordinate is within the feature/location or not.
Suppose you have a SNP of interest and you want to know which features this SNP is within, and lets suppose this SNP is at index 4350 (Python counting!). Here is a simple brute force solution where we just check all the features one by one in a loop:
from Bio import SeqIO
my_snp = 4350
record = SeqIO.read("NC_005816.gb", "genbank")
for feature in record.features:
if my_snp in feature:
print("%s %s" % (feature.type, feature.qualifiers.get("db_xref")))
source ['taxon:229193'] gene ['GeneID:2767712'] CDS ['GI:45478716', 'GeneID:2767712']
gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons – they do not cover any introns.
A SeqFeature or location object doesn’t directly contain a sequence, instead the location describes how to get this from the parent sequence. For example consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(6..18), like this:
from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
seq = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)
You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement. If you are using Biopython 1.59 or later, the feature location’s start and end are integer like so this works
feature_seq = seq[feature.location.start:feature.location.end].reverse_complement()
print(feature_seq)
AGCCTTTGCCGTC
feature_seq = feature.extract(seq)
print(feature_seq)
AGCCTTTGCCGTC
The length of a SeqFeature or location matches that of the region of sequence it describes.
print(len(feature_seq))
13
print(len(feature))
13
print(len(feature.location))
13
For simple FeatureLocation objects the length is just the difference between the start and end positions. However, for a CompoundLocation the length is the sum of the constituent regions.
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
record1 = SeqRecord(Seq("ACGT"), id="test")
record2 = SeqRecord(Seq("ACGT"), id="test")
What happens when you try to compare these “identical” records?
record1 = record2
record1.id == record2.id
True
record1.seq == record2.seq
True
A single equal mark is used to assign a value to a variable, whereas two consecutive equal marks is used to check whether 2 expressions give the same value.
format() method of the SeqRecord class gives a string containing your record formatted using one of the output file formats supported by Bio.SeqIO, such as FASTA:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
record = SeqRecord(
Seq(
"MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
"SSAC"
),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]",
)
print(record.format("fasta"))
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus] MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM SSAC
This format method takes a single mandatory argument, a lower case string which is supported by Bio.SeqIO as an output format. However, some of the file formats Bio.SeqIO can write to require more than one record (typically the case for multiple sequence alignment formats), and thus won’t work via this format() method.
Slice a SeqRecord, to give you a new SeqRecord covering just part of the sequence. What is important here is that any per-letter annotations are also sliced, and any features which fall completely within the new sequence are preserved (with their locations adjusted).
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")
record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])
len(record)
9609
len(record.features)
41
From looking at the file you can work out that these are the twelfth and thirteenth entries in the file, so in Python zero-based counting they are entries 11 and 12 in the features list:
print(record.features[20])
type: gene location: [4342:4780](+) qualifiers: Key: db_xref, Value: ['GeneID:2767712'] Key: gene, Value: ['pim'] Key: locus_tag, Value: ['YP_pPCP05']
print(record.features[21])
# coding DNA sequence - CDS
type: CDS location: [4342:4780](+) qualifiers: Key: codon_start, Value: ['1'] Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712'] Key: gene, Value: ['pim'] Key: locus_tag, Value: ['YP_pPCP05'] Key: note, Value: ['similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)'] Key: product, Value: ['pesticin immunity protein'] Key: protein_id, Value: ['NP_995571.1'] Key: transl_table, Value: ['11'] Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFSLIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH']
sub_record = record[4300:4800]
sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA'), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
len(sub_record)
500
len(sub_record.features)
2
Our sub-record just has two features, the gene and CDS entries for YP_pPCP05:
print(sub_record.features[0])
type: gene location: [42:480](+) qualifiers: Key: db_xref, Value: ['GeneID:2767712'] Key: gene, Value: ['pim'] Key: locus_tag, Value: ['YP_pPCP05']
print(sub_record.features[1])
type: CDS location: [42:480](+) qualifiers: Key: codon_start, Value: ['1'] Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712'] Key: gene, Value: ['pim'] Key: locus_tag, Value: ['YP_pPCP05'] Key: note, Value: ['similar to many previously sequenced pesticin immunity protein entries of Yersinia pestis plasmid pPCP, e.g. gi| 16082683|,ref|NP_395230.1| (NC_003132) , gi|1200166|emb|CAA90861.1| (Z54145 ) , gi|1488655| emb|CAA63439.1| (X92856) , gi|2996219|gb|AAC62543.1| (AF053945) , and gi|5763814|emb|CAB531 67.1| (AL109969)'] Key: product, Value: ['pesticin immunity protein'] Key: protein_id, Value: ['NP_995571.1'] Key: transl_table, Value: ['11'] Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSHGIYGKQTTFKQTEFTNIKSNTKKHIALINKDNSWMISLKILGIKRDEYTVCFEDFSLIRPPTYVAIHPLLIKKVKSGNFIVVKEIKKSIPGCTVYYH']
The locations have been adjusted to reflect the new parent sequence!
While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotation it is impossible to know if this still applies to the sub-sequence or not. To avoid guessing, with the exception of the molecule type, the .annotations and .dbxrefs are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.
sub_record.annotations
{'molecule_type': 'DNA'}
sub_record.dbxrefs
[]
You may wish to preserve other entries like the organism? Beware of copying the entire annotations dictionary as in this case your partial sequence is no longer circular DNA - it is now linear:
sub_record.annotations['topology'] = 'linear'
The same point could be made about the record id, name and description, but for practicality these are preserved:
sub_record.id
'NC_005816.1'
sub_record.name
'NC_005816'
sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'
This illustrates the problem nicely though, our new sub-record is not the complete sequence of the plasmid, so the description is wrong! Let’s fix this and then view the sub-record as a reduced GenBank file using the format method described before
sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial"
print(sub_record.format("genbank")[:200] + "...")
LOCUS NC_005816 500 bp DNA linear UNK 01-JAN-1980 DEFINITION Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial. ACCESSION NC_005816 VERSION NC_0058...
Add SeqRecord objects together, giving a new SeqRecord. What is important here is that any common per-letter annotations are also added, all the features are preserved (with their locations adjusted), and any other common annotation is also kept (like the id, name and description).
from Bio import SeqIO
record = next(SeqIO.parse("GSM461182_untreat_single_subset.fastq", "fastq"))
len(record)
75
print(record.seq)
CGAGCATTGCAATCAGATGGACGANCNNNNNNNNNNNNNNNNNNNNNNTNTNTTTGGCCTGGTTNNNNNNNANNN
print(record.letter_annotations["phred_quality"])
[31, 20, 32, 29, 33, 33, 34, 34, 34, 34, 33, 34, 33, 33, 33, 32, 29, 29, 2, 2, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0]
left = record[:26]
print(left.seq)
CGAGCATTGCAATCAGATGGACGANC
print(left.letter_annotations["phred_quality"])
[31, 20, 32, 29, 33, 33, 34, 34, 34, 34, 33, 34, 33, 33, 33, 32, 29, 29, 2, 2, 2, 2, 2, 2, 0, 2]
We can make a new edited record by first slicing the SeqRecord before and after
right = record[21:]
print(right.seq)
CGANCNNNNNNNNNNNNNNNNNNNNNNTNTNTTTGGCCTGGTTNNNNNNNANNN
print(right.letter_annotations["phred_quality"])
[2, 2, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0]
edited = left + right
len(edited)
80
print(edited.seq)
CGAGCATTGCAATCAGATGGACGANCCGANCNNNNNNNNNNNNNNNNNNNNNNTNTNTTTGGCCTGGTTNNNNNNNANNN
print(edited.letter_annotations["phred_quality"])
[31, 20, 32, 29, 33, 33, 34, 34, 34, 34, 33, 34, 33, 33, 33, 32, 29, 29, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 0, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0]
You can make this shorter with just:
edited = record[:20] + record[21:]
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")
record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG'), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=['Project:58037'])
len(record)
9609
len(record.features)
41
record.dbxrefs
['Project:58037']
record.annotations.keys()
dict_keys(['molecule_type', 'topology', 'data_file_division', 'date', 'accessions', 'sequence_version', 'gi', 'keywords', 'source', 'organism', 'taxonomy', 'references', 'comment'])
You can shift the origin like this:
shifted = record[2000:] + record[:2000]
shifted
SeqRecord(seq=Seq('GATACGCAGTCATATTTTTTACACAATTCTCTAATCCCGACAAGGTCGTAGGTC...GGA'), id='NC_005816.1', name='NC_005816', description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence', dbxrefs=[])
len(shifted)
9609
len(shifted.features)
40
shifted.dbxrefs
[]
shifted.annotations.keys()
dict_keys(['molecule_type'])
SeqRecord slicing step is cautious in what annotation it preserves (erroneously propagating annotation can cause major problems). If you want to keep the database cross references or the annotations dictionary, this must be done explicitly:
shifted.dbxrefs = record.dbxrefs[:]
shifted.annotations = record.annotations.copy()
shifted.dbxrefs
['Project:58037']
shifted.annotations.keys()
dict_keys(['molecule_type', 'topology', 'data_file_division', 'date', 'accessions', 'sequence_version', 'gi', 'keywords', 'source', 'organism', 'taxonomy', 'references', 'comment'])
SeqRecord object’s reverse_complement method takes a number of optional arguments corresponding to properties of the record. Setting these arguments to True means copy the old values, while False means drop the old values and use the default value. You can alternatively provide the new desired value instead.
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")
print("%s %i %i %i %i" % (record.id, len(record), len(record.features), len(record.dbxrefs), len(record.annotations)))
NC_005816.1 9609 41 1 13
Here we take the reverse complement and specify a new identifier – but notice how most of the annotation is dropped (but not the features):
rc = record.reverse_complement(id="TESTING")
print("%s %i %i %i %i" % (rc.id, len(rc), len(rc.features), len(rc.dbxrefs), len(rc.annotations)))
TESTING 9609 41 0 0