Authors: | Tom Dunham |
---|---|
Date: | 2009-03-31 |
>>> from Bio.Seq import Seq >>> seq = Seq("AGTACACTGGT") >>> seq Seq('AGTACACTGGT', Alphabet()) >>> print seq AGTACACTGGT
Sequences behave a lot like strings
>>> seq[:2] Seq('AG', Alphabet()) >>> print seq[:2] AG >>> seq.count("A") 3 >>> Seq("ATCCGG") + Seq("TTGGAATC") Seq('ATCCGGTTGGAATC', Alphabet())
You can specify a specific alphabet:
>>> from Bio.Alphabet import IUPAC >>> seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna) >>> seq Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> seq Seq('AGTACACTGGT', IUPACUnambiguousDNA()) >>> seq.complement() Seq('TCATGTGACCA', IUPACUnambiguousDNA()) >>> seq.reverse_complement() Seq('ACCAGTGTACT', IUPACUnambiguousDNA())
Notice that the alphabet does not change
>>> seq = Seq('ATGAAGACAG', IUPAC.unambiguous_dna) >>> seq Seq('ATGAAGACAG', IUPACUnambiguousDNA()) >>> seq.transcribe() Seq('AUGAAGACAG', IUPACUnambiguousRNA())
Notice the alphabet changes
>>> seq Seq('ATGAAGACAG', IUPACUnambiguousDNA()) >>> mseq = seq.tomutable() >>> mseq MutableSeq('ATGAAGACAG', IUPACUnambiguousDNA())
>>> mseq.reverse_complement() >>> mseq MutableSeq('CTGTCTTCAT', IUPACUnambiguousDNA())
Notice that this behaviour means you cannot substitute Seq and MutableSeq objects.
Parsers yield SeqRecord objects:
>>> from Bio import SeqIO >>> p = SeqIO.parse(open(r"wgs_aacr_pro.dat", "rb"), "embl") >>> p.next() SeqRecord(seq=Seq('TG...GCA', IUPACAmbiguousDNA()), id='AACR01000001.1', name='AACR01000001', description='Pasteuria nishizawae str. North American PnPst10, whole genome shotgun sequence.', dbxrefs=[])
>>> rec = p.next()
>>> print rec ID: AACR01000002.1 Name: AACR01000002 Description: Pasteuria nishizawae str. North American PnEco4A, whole genome shotgun sequence. Number of features: 1 /taxonomy=['Bacteria', 'Firmicutes', 'Bacillales', 'Pasteuriaceae', 'Pasteuria'] /references=[<Bio.SeqFeature.Reference instance at 0x0155FAA8>] /accessions=['AACR01000002', 'AACR01000000', 'AACR00000000'] /data_file_division=PRO /organism=Pasteuria nishizawae str. North American /sequence_version=1 Seq('AATTCGTGGCGGGGCAAACGGGTTCATTATAAAAATTGTCGGAACGAGTTCAAC...ATT', IUPACAmbiguousDNA())
>>> rec.seq Seq('AATTCGTGGCGGGGCAAA...ATT', IUPACAmbiguousDNA()) >>> len(rec) 1124 >>> len(rec.seq) 1124
>>> rec.name 'AACR01000002' >>> len(rec.features) 1 >>> rec.features[0].type 'source' >>> rec.features[0].strand 1
>>> pprint(rec.annotations) {'accessions': ['AACR01000002', 'AACR01000000', 'AACR00000000'], 'data_file_division': 'PRO', 'organism': 'Pasteuria nishizawae str. North American', 'references': [<Bio.SeqFeature.Reference instance at 0x0155FAA8>], 'sequence_version': 1, 'taxonomy': ['Bacteria', 'Firmicutes', 'Bacillales', 'Pasteuriaceae', 'Pasteuria']}
from Bio import SeqIO in_file = SeqIO.parse(open("infile"), "embl") out_file = open(sys.argv[2], "w") SeqIO.write(in_file, out_file, "fasta")
write will accept lists or iterators.
See handout
1. Write a program that takes the name of a EMBL file and calculates some statstics on the lengths of the sequences within it - start with min and max.
2. Write a program that takes the name of a EMBL file and prints it in fasta format.