Authors: Tom Dunham
Date: 2009-03-31


So far we have populated, transformed and filtered lists:

acc = []
for item in list:
    if test(item):
        acc.append(transform(item))

acc = [transform(item) for item in list if test(item)]


It can be useful to imagine a program as a pipeline, made up of separate units that each do a different kind of processing.

If I wanted the sequence for the third male in a Genbank file
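The pipeline idea can be sketched with generator expressions and ``itertools.islice``. The records below are hypothetical dictionaries with made-up ``sex`` and ``sequence`` keys, standing in for whatever a real Genbank parser would produce; this is an illustration of the shape of a pipeline, not a Genbank reader.

```python
from itertools import islice

# Hypothetical records; a real program would parse these from a file.
records = [
    {"id": "r1", "sex": "female", "sequence": "ACGT"},
    {"id": "r2", "sex": "male", "sequence": "GGCA"},
    {"id": "r3", "sex": "male", "sequence": "TTAG"},
    {"id": "r4", "sex": "female", "sequence": "CCGA"},
    {"id": "r5", "sex": "male", "sequence": "ATAT"},
]

# Stage 1: keep only the male records.
males = (rec for rec in records if rec["sex"] == "male")
# Stage 2: pull the sequence out of each remaining record.
sequences = (rec["sequence"] for rec in males)
# Stage 3: skip two items and take the third.
third = next(islice(sequences, 2, 3))
print(third)
```

None of the stages does any work until the final ``next`` call asks for a value, and the pipeline stops reading records as soon as the third match is found.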




Creating Generators

When you use yield inside a function, Python turns your function into a generator:

>>> def gen():
...     yield 1
...
>>> gen()
<generator object at 0x00F4BFA8>
>>> g = gen()
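Calling the function does not run its body; the body runs only when a value is requested. A minimal sketch of that behaviour, using the built-in ``next`` function (available from Python 2.6; older code calls ``g.next()`` instead):

```python
def gen():
    yield 1

g = gen()
first = next(g)  # runs the body up to the yield and returns the yielded value
print(first)

try:
    next(g)  # the body now runs to its end, so there is nothing left to yield
    exhausted = False
except StopIteration:
    exhausted = True
print(exhausted)
```

Once the body finishes, the generator signals that it is exhausted by raising ``StopIteration``.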

Fibonacci, again

Lazy evaluation means that infinite series are no longer a problem:

def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b
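Because values are produced only on demand, a finite slice can be taken from the infinite series. A small sketch, assuming ``itertools.islice`` is an acceptable way to limit the output (the name ``first_ten`` is just for illustration):

```python
from itertools import islice

def fib():
    a, b = 0, 1
    while True:
        yield a
        a, b = b, a + b

# islice stops asking for values after ten items, so the while loop
# inside fib never gets the chance to run away.
first_ten = list(islice(fib(), 10))
print(first_ten)
```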

Fib and for

The for loop "understands" iterators, and can process them as if they were a sequence, by repeatedly calling next and binding the result to the loop variable:

for f in fib():
    print f

That will run forever.
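What the for loop does behind the scenes can be written out by hand. The sketch below uses a hypothetical finite generator, ``countdown``, so that it terminates; the ``while`` loop is roughly what ``for item in countdown(3)`` expands to.

```python
def countdown(n):
    # Hypothetical finite generator, used here so the loop ends.
    while n > 0:
        yield n
        n -= 1

seen = []
it = iter(countdown(3))
while True:
    try:
        item = next(it)  # what the for loop calls on each pass
    except StopIteration:
        break            # the for loop ends quietly at this point
    seen.append(item)    # the loop body
print(seen)
```

An infinite generator never raises ``StopIteration``, so a loop over it ends only if the loop body itself uses ``break``.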

Generating Records

For large data files, the safest and usually most memory-efficient approach is to read one line at a time, so the whole file never has to fit in memory:

for line in file:
    pass  # process line

Generators allow us to write a function that accumulates the lines from a file that make up a record, builds a structure from them, and then yields that structure.
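As a rough sketch of this pattern, assume a made-up format in which each record is a block of ``key: value`` lines and records are separated by blank lines. The format, the function name ``records`` and the sample data are all hypothetical.

```python
import io

def records(lines):
    """Yield one dict per blank-line-separated block of 'key: value' lines."""
    record = {}
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current record, if there is one.
            if record:
                yield record
                record = {}
            continue
        key, value = line.split(":", 1)
        record[key.strip()] = value.strip()
    # The last record may not be followed by a blank line.
    if record:
        yield record

# io.StringIO stands in for an open file here.
sample = io.StringIO(u"name: a\nlength: 4\n\nname: b\nlength: 7\n")
result = list(records(sample))
print(result)
```

The caller sees a stream of complete records and never has to deal with individual lines, while the file is still read one line at a time.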


See handout

  1. What will the following program print?:

    def pcount(i):
        print i
        return i
    print "-" * 50
    [pcount(i) for i in range(10)]
    print "-" * 50
    (pcount(i) for i in range(10))
    print "-" * 50
    [pcount(i) for i in xrange(10)]

  2. The program:

    def countup():
        i = 1
        while True:
             yield i
             i = i + 1
    c = (i for i in countup())
    for en, i in enumerate(c):
        if en > 5: break
        print i
    print "Endloop"

    prints 6 7 8 9 10 11 13. Why is the last value 13?

3. Write a generator function that reads the crosswd.txt file and yields every word longer than 20 characters.

4. Write another generator function that reads the crosswd.txt file, but this time keeps track of the last word as well as the current one, and yields both words joined with a space if the sum of their lengths is greater than 40.

  1. This is a sample from a FASTQ file:


    Each line beginning with @ represents the start of a sequence record; the rest of that line is an identifier (which always begins GAII01). The following line is the sequence of bases. This is followed by another line that starts with a +, then the same identifier, then an indication of the quality of the read (a string that is the same length as the sequence).

    The sample above contains two records. A record could be read into a program as a tuple of the form (id, sequence, quality).

    Write a generator function that takes a file object as a
    parameter and yields tuples in this form. You may find the
    ``startswith`` method of the string object helpful. Remember
    that you can call the ``next`` method on an iterator yourself.
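The last hint can be illustrated on a different, made-up format, so the exercise itself is left open. The sketch assumes lines that alternate between a title starting with ``#`` and a value; the format and the name ``titled_values`` are hypothetical.

```python
def titled_values(lines):
    """Yield (title, value) pairs from alternating '#title' / value lines."""
    it = iter(lines)
    for line in it:
        if line.startswith("#"):
            title = line[1:].strip()
            # Pull the following line from the same iterator by hand.
            # The default "" covers input that ends right after a title.
            value = next(it, "").strip()
            yield (title, value)

pairs = list(titled_values(["#first\n", "10\n", "#second\n", "20\n"]))
print(pairs)
```

Because the for loop and the explicit ``next`` call share one iterator, a line consumed by ``next`` is not seen again by the loop.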