Authors: Tom Dunham
Date: 2009-03-31
So far we have populated, transformed and filtered lists:

    acc = []
    for item in list:
        if test(item):
            acc.append(transform(item))

Or:

    acc = [transform(item) for item in list if test(item)]

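It can be useful to imagine a program as a pipeline, with various units that each do a different kind of processing.
Suppose I wanted the sequence for the third male in a GenBank file: one unit could read records, another could keep only the males, and a third could stop after the third match.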
Generator expressions look like list comprehensions:
(i * 2 for i in xrange(100))
But they are evaluated lazily: each value is computed only when it is requested.
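One way to see the difference is to give each item a visible side effect. In this sketch, `noisy_double` is a hypothetical helper (not from the original notes) that records every call. It uses `range` and the built-in `next()` so that it runs on Python 2.6+ as well as Python 3.

```python
calls = []  # records which items have actually been computed

def noisy_double(i):
    # Hypothetical helper: note that we were called, then do the work.
    calls.append(i)
    return i * 2

# A list comprehension does all of the work immediately.
eager = [noisy_double(i) for i in range(5)]
calls_after_list = list(calls)          # [0, 1, 2, 3, 4]

del calls[:]                            # reset the log

# A generator expression computes nothing until a value is requested.
lazy = (noisy_double(i) for i in range(5))
calls_after_genexp = list(calls)        # []

first = next(lazy)                      # computes only the first item
calls_after_next = list(calls)          # [0]

print(calls_after_list)
print(calls_after_genexp)
print(calls_after_next)
```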
When you use yield inside a function, Python turns your function into a generator:

    >>> def gen():
    ...     yield 1
    ...
    >>> gen()
    <generator object at 0x00F4BFA8>
    >>> g = gen()
    >>> g.next()
    1

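A generator with more than one `yield` shows how execution pauses and resumes. This is a minimal sketch; `count_to_three` is a made-up name, and it uses the built-in `next(g)` (available since Python 2.6) rather than the `g.next()` method shown above, so that it also runs on Python 3.

```python
def count_to_three():
    # Each yield hands one value back and pauses here until the next request.
    yield 1
    yield 2
    yield 3

g = count_to_three()
first = next(g)     # runs up to the first yield
second = next(g)    # resumes and runs up to the second yield
third = next(g)     # resumes and runs up to the third yield

# When the function body finishes, the generator raises StopIteration.
try:
    next(g)
    exhausted = False
except StopIteration:
    exhausted = True

print(first)
print(second)
print(third)
print(exhausted)
```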
Lazy evaluation means that infinite series are no longer a problem:

    def fib():
        a, b = 0, 1
        while True:
            yield a
            a, b = b, a + b

The for loop "understands" iterators, and can process them as if they were a sequence, by repeatedly calling next and binding the result to the loop variable:

    for f in fib():
        print f

That will run forever.
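Roughly speaking, the `for` loop asks for an iterator, calls `next` until `StopIteration` is raised, and runs the loop body once per value. The sketch below spells that out by hand. To keep it finite it wraps `fib()` in `itertools.islice`, which is an addition for this example and not part of the original notes.

```python
from itertools import islice

def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# Take only the first 10 values so that the loop terminates.
it = iter(islice(fib(), 10))

seen = []
while True:
    try:
        f = next(it)        # what the for loop does on each pass
    except StopIteration:
        break               # the for loop ends quietly at this point
    seen.append(f)          # the loop body

print(seen)  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]
```

Writing `for f in islice(fib(), 10):` would be the usual way to visit the same ten values.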
Reading a large data file one line at a time is usually the most memory-efficient approach, because the whole file never has to be held in memory at once:

    for line in file:
        pass  # process line here

Generators allow us to write a function that accumulates the lines from a file that make up a record, builds a structure from them, and then yields that structure.
See handout
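As a rough illustration (this is not the handout's code), the sketch below assumes a made-up format in which each record is a run of `key: value` lines terminated by a line containing only `//`, the terminator that GenBank flat files use. The function name `read_records`, the field names `id`, `sex` and `seq`, and the sample data are all hypothetical. It accepts any iterable of lines, so an open file object would work in place of the list used here. The last lines chain it into a small pipeline that picks out the third male record, echoing the earlier GenBank example.

```python
from itertools import islice

def read_records(lines):
    # Accumulate the lines of one record, then yield it as a dict.
    record = {}
    for line in lines:
        line = line.strip()
        if line == "//":            # assumed end-of-record marker
            if record:
                yield record
            record = {}
        elif line:
            key, _, value = line.partition(":")
            record[key.strip()] = value.strip()
    if record:                      # final record with no terminator
        yield record

# Hypothetical sample data standing in for an open file.
sample = [
    "id: A1\n", "sex: male\n", "seq: ACGT\n", "//\n",
    "id: B2\n", "sex: female\n", "seq: GGCC\n", "//\n",
    "id: C3\n", "sex: male\n", "seq: TTAA\n", "//\n",
    "id: D4\n", "sex: male\n", "seq: CATG\n", "//\n",
    "id: E5\n", "sex: male\n", "seq: AAAA\n", "//\n",
]

# A lazy pipeline: parse, then filter, then take the third match.
records = read_records(sample)
males = (r for r in records if r.get("sex") == "male")
third_male = next(islice(males, 2, 3))

print(third_male["seq"])  # CATG
```

Because every stage is lazy, the record after the third male is never parsed unless something asks for it.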
1. What will the following program print?

    def pcount(i):
        print i
        return i

    print "-" * 50
    [pcount(i) for i in range(10)]
    print "-" * 50
    (pcount(i) for i in range(10))
    print "-" * 50
    [pcount(i) for i in xrange(10)]

2. The program:

    def countup():
        i = 1
        while True:
            yield i
            i = i + 1

    c = (i for i in countup())
    c.next()
    c.next()
    c.next()
    c.next()
    c.next()
    for en, i in enumerate(c):
        if en > 5:
            break
        print i
    print "Endloop"
    print c.next()

prints 6 7 8 9 10 11, then Endloop, then 13. Why is the last value 13?
3. Write a generator function that reads the crosswd.txt file and yields every word longer than 20 characters.
4. Write another generator function that reads the crosswd.txt file, but this time keeps track of the previous word as well as the current one, and yields both words joined with a space if the sum of their lengths is greater than 40.
5. This is a sample from a FASTQ file:

    @GAII01_3:1:2:225:1639
    GTATTGCCAATCTCTTATTGGCTGATTCATCTAATT
    +GAII01_3:1:2:225:1639
    hhhhhhhhhhhhhhhhhhhhhhhgWQhRPhEhFDgF
    @GAII01_3:1:2:286:1934
    GTTATTATAGATGAGAATGAAGAGTTATGTGGAGTT
    +GAII01_3:1:2:286:1934
    hhhhhhhhhhhhhhhhhhhhhhhhhh]hhhhhI\hh

Each line beginning with @ marks the start of a sequence record; the rest of that line is an identifier (which here always begins GAII01). The next line is the sequence of bases. It is followed by a line that starts with a + and repeats the same identifier, and then by a line giving an indication of the quality of the read (a string that is the same length as the sequence).
The sample above contains two records. A record could be read into a program as a tuple containing the id, sequence and quality:

    ("GAII01_3:1:2:225:1639",
     "GTATTGCCAATCTCTTATTGGCTGATTCATCTAATT",
     "hhhhhhhhhhhhhhhhhhhhhhhgWQhRPhEhFDgF")

Write a generator function that takes a file object as a parameter and yields tuples in this form. You may find the ``startswith`` method of the string object helpful. Remember that you can call the ``next`` method on an iterator yourself.