Regular Expressions

Authors: Tom Dunham
Date: 2009-03-31

Regular Expressions

Links on regular expressions (including the origin of that quote)

And there is a good book

Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools, by Jeffrey E. F. Friedl. ISBN-10: 1-56592-257-3 | ISBN-13: 9781565922570

Syntax

Regular expressions are a mixture of literal characters and meta-characters.
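
As a small illustration of the difference, here is a sketch using Python's re module. The sample strings are invented for this example: a pattern of literal characters matches only itself, while the meta-character . stands in for any single character.

```python
import re

# A pattern made only of literal characters matches exactly that text.
literal = re.findall(r"cat", "cat cot cut")

# The meta-character . matches any single character (except a newline).
with_meta = re.findall(r"c.t", "cat cot cut")

# To match a meta-character literally, escape it with a backslash.
escaped = re.findall(r"c\.t", "c.t cat")

print(literal)    # ['cat']
print(with_meta)  # ['cat', 'cot', 'cut']
print(escaped)    # ['c.t']
```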

More Syntax

Common Character Classes
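
A few character classes turn up in most patterns. The sketch below shows how they behave in Python's re module; the sample string is invented for this illustration.

```python
import re

sample = "Room 42, floor 3"

digits = re.findall(r"\d", sample)       # \d matches one digit
words = re.findall(r"\w+", sample)       # \w matches a letter, digit or underscore
spaces = re.findall(r"\s", sample)       # \s matches one whitespace character
vowels = re.findall(r"[aeiou]", sample)  # [...] matches any one of the listed characters

print(digits)  # ['4', '2', '3']
print(words)   # ['Room', '42', 'floor', '3']
print(vowels)  # ['o', 'o', 'o', 'o']
```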

More Syntax

You can repeat matches with * (zero or more times) and + (one or more times).
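
A minimal sketch of the two repetition operators, using an invented list of words:

```python
import re

words = "tt tot toot tooot"

# o* allows zero or more o's, so "tt" matches as well.
star = re.findall(r"\bto*t\b", words)

# o+ requires at least one o, so "tt" is left out.
plus = re.findall(r"\bto+t\b", words)

print(star)  # ['tt', 'tot', 'toot', 'tooot']
print(plus)  # ['tot', 'toot', 'tooot']
```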

Regular Expressions in Python

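Regular expressions are provided by the re module. In the example below, allwords is a string holding the text to search, such as the contents of a word list: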

>>> import re
>>> re.findall(r"\bto[aeiou]+t\b", allwords)
['toit', 'toot', 'tout']

findall returns a list of all non-overlapping matches in a string. If nothing matches, the list is empty.
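
The following sketch can be run without a word list. The string sample_words is a made-up stand-in for the real data, and the second call shows what "non-overlapping" means in practice.

```python
import re

# A small stand-in for a word list, one word per line (invented sample data).
sample_words = "tint\ntoit\ntoot\ntot\ntout\ntrout\n"

matches = re.findall(r"\bto[aeiou]+t\b", sample_words)
print(matches)  # ['toit', 'toot', 'tout']

# Matches never overlap: once "aa" is found, the search resumes after it.
pairs = re.findall(r"aa", "aaaaa")
print(pairs)  # ['aa', 'aa']

# No match gives an empty list rather than an error.
nothing = re.findall(r"xyz", sample_words)
print(nothing)  # []
```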

Exercise

See handout

Using the wordlist CROSSWD.TXT,

  1. Find all words ending in ology
  2. Find words that start with dys
  3. Write an expression that matches 'gelatine', 'gemstone', 'gendarme', 'gene', 'genome', 'genuine', 'geophone', and 'germane'
  4. The file rebase_allenz.txt contains a header, a number of records and a set of references. Extract the references

Extraction

You can extract data from a string that matches a regular expression by enclosing the part you are interested in in parentheses, which makes it a group.
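
A short sketch of groups in Python's re module. The input strings are invented for this example; in particular, the "name sequence" line is a made-up layout, not the format of any real data file.

```python
import re

# One group: group(1) returns the text matched inside the first parentheses.
m = re.search(r"^dys(.+)$", "dysfunction")
suffix = m.group(1)
print(suffix)  # function

# Two groups: groups() returns all of them as a tuple.
m2 = re.search(r"^(\w+)\s+(\w+)$", "EcoRI GAATTC")
pair = m2.groups()
print(pair)  # ('EcoRI', 'GAATTC')

# With a group in the pattern, findall returns the group, not the whole match.
endings = re.findall(r"\bdys(\w+)", "dyslexia dystopia")
print(endings)  # ['lexia', 'topia']
```

Note that re.search returns None when nothing matches, so real code should check the result before calling group.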

Greediness

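Matching proceeds from left to right, and every repeated term matches as much as it can (it's greedy). Be careful with .* and .+, which will swallow everything that follows unless a later part of the pattern forces them to give characters back.

"^t(.+)(.*)t$" The second group will always be empty, because the greedy (.+) has already consumed everything up to the final t.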

Non-Greediness

You can change this behavior and force a non-greedy match by putting a ? after the repetition operator.

^t(.+?)(.*)t$ matches these words and extracts (first group, second group):

turnout u,rnou
turnspit u,rnspi
turret u,rre
tut u,
twangiest w,angies
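
The sketch below runs the greedy and non-greedy patterns side by side on the words listed above, so the effect of the ? can be seen directly. The variable names are chosen for this illustration only.

```python
import re

words = ["turnout", "turnspit", "turret", "tut", "twangiest"]

greedy = re.compile(r"^t(.+)(.*)t$")       # (.+) takes as much as it can
non_greedy = re.compile(r"^t(.+?)(.*)t$")  # (.+?) takes as little as it can

greedy_groups = {w: greedy.search(w).groups() for w in words}
non_greedy_groups = {w: non_greedy.search(w).groups() for w in words}

for w in words:
    print(w, greedy_groups[w], non_greedy_groups[w])
# First line printed: turnout ('urnou', '') ('u', 'rnou')
```

With the greedy pattern the second group is empty for every word; with the non-greedy pattern the first group holds a single character and the second group gets the rest.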

Flags

It can be useful to change the way matches are performed. You do this by passing flags, which can be combined with |:

re.search(r"\b.*ology\b", w, re.IGNORECASE)
re.search(r"\b.*ology\b", w, re.IGNORECASE|re.DOTALL)

The IGNORECASE flag makes the expression case-insensitive. The DOTALL flag makes . match any character, including newlines (by default it matches everything except a newline).
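
A small sketch of both flags in action, with invented input strings:

```python
import re

text = "Dyslexia DYSTOPIA dysentery"

# Without a flag, only the lower-case occurrence matches.
case_sensitive = re.findall(r"dys", text)
# IGNORECASE matches regardless of case.
case_insensitive = re.findall(r"dys", text, re.IGNORECASE)

# By default . does not match a newline, so this search fails (returns None).
dot_default = re.search(r"a.b", "a\nb")
# With DOTALL, . matches the newline as well.
dot_all = re.search(r"a.b", "a\nb", re.DOTALL)

# Flags are combined with | (bitwise or).
both = re.search(r"A.B", "a\nb", re.IGNORECASE | re.DOTALL)

print(case_sensitive)    # ['dys']
print(case_insensitive)  # ['Dys', 'DYS', 'dys']
print(dot_default)       # None
```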

Exercise

See handout

  1. Extract the eight field names from the header of rebase_allenz.txt
  2. Extract the enzyme names and recognition sequences from rebase_allenz.txt. Remove the point of cleavage information, and produce a list of pairs (tuples with 2 elements) containing (enzyme name, (list of sites)). Print those enzymes with more than one site (there are six).

The Bad News

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

-- Jamie Zawinski

Some more regex links

An online regex builder http://www.txt2re.com

The RegEx library http://regexlib.com/

Site with tutorials and more http://www.regular-expressions.info/