Authors: Tom Dunham
Date: 2009-03-31
Links on regular expressions (including the origin of that quote)
And there is a good book:
Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools, by Jeffrey E. F. Friedl. ISBN-10: 1-56592-257-3 | ISBN-13: 9781565922570
Regular expressions are a mixture of literal characters and meta-characters.
You can repeat matches with * (zero or more of the preceding item) and + (one or more of the preceding item).
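A minimal sketch of the difference between * and +, assuming Python 3 (for re.fullmatch). The word list is hypothetical and chosen only for illustration.

```python
import re

# Hypothetical sample words, not taken from the course word list.
words = ["tt", "tot", "toot", "tooot", "tat"]

# "o*" allows zero or more "o" characters, so "tt" qualifies.
star = [w for w in words if re.fullmatch(r"to*t", w)]

# "o+" requires at least one "o", so "tt" is excluded.
plus = [w for w in words if re.fullmatch(r"to+t", w)]

print(star)  # ['tt', 'tot', 'toot', 'tooot']
print(plus)  # ['tot', 'toot', 'tooot']
```

In both cases "tat" is rejected, because the "a" is a literal mismatch rather than a repetition issue.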
Regular expressions are provided by the re module:
>>> import re
>>> re.findall(r"\bto[aeiou]+t\b", allwords)
['toit', 'toot', 'tout']
findall will return a list of all non-overlapping matches in a string.
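To try this without the full word list, here is a small sketch. The sample string is a hypothetical stand-in for allwords (one word per line), and the second call illustrates what "non-overlapping" means in practice.

```python
import re

# Hypothetical stand-in for the word list: one word per line.
sample = "toast\ntoit\ntoot\ntout\ntot\n"

# Same pattern as above: "to", one or more vowels, then "t".
matches = re.findall(r"\bto[aeiou]+t\b", sample)
print(matches)  # ['toit', 'toot', 'tout']

# Non-overlapping: each character is used by at most one match,
# so five "o"s yield two matches of "oo", not four.
pairs = re.findall(r"oo", "ooooo")
print(pairs)  # ['oo', 'oo']
```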
>>> import re
>>> m = re.search(r"\bto[aeiou]+t\b", allwords)
>>> m.start()
916486
>>> m.end()
916490
>>> allwords[m.start():m.end()]
'toit'
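A short sketch of working with the match object that search returns, again using a hypothetical sample string in place of allwords. Note that search returns None when nothing matches, so it is worth checking before calling methods on the result.

```python
import re

# Hypothetical stand-in for the word list: one word per line.
sample = "toast\ntoit\ntoot\n"

m = re.search(r"\bto[aeiou]+t\b", sample)
if m is not None:
    first = m.group()            # the matched text
    span = (m.start(), m.end())  # its position in the string
    print(first, span)           # toit (6, 10)

# No word in the sample matches this pattern, so search returns None.
missing = re.search(r"\bzz[aeiou]+t\b", sample)
print(missing)  # None
```

m.group() gives the same text as slicing the string with m.start() and m.end().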
See handout
Using the wordlist CROSSWD.TXT,
You can extract data from a string that matches a regular expression by putting the part you are interested in inside a group.
Groups are defined by (). For example, "^t(..)t$" matches and extracts:
tact  ac
tart  ar
taut  au
text  ex
that  ha
tilt  il
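The extraction above could be done along these lines. This is a sketch with a hypothetical handful of words; m.group(1) returns the text captured by the first pair of parentheses.

```python
import re

# Hypothetical sample words; the real exercise would read a word list.
words = ["tact", "tart", "tent", "toast", "that", "cat"]

pattern = re.compile(r"^t(..)t$")

extracted = {}
for w in words:
    m = pattern.search(w)
    if m:
        # group(0) is the whole match, group(1) the first (...) group.
        extracted[w] = m.group(1)

print(extracted)  # {'tact': 'ac', 'tart': 'ar', 'tent': 'en', 'that': 'ha'}
```

"toast" is skipped because it has three letters between the two t's, and "cat" because it does not start with t.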
Matching happens from left to right, and every term matches as much as possible (it's greedy). Be careful with .* and .+
"^t(.+)(.*)t$" The second group will always be empty: the greedy (.+) consumes everything up to the final t, leaving nothing for (.*) to match.
You can change this behavior and force a non-greedy match by adding a ? after the quantifier.
^t(.+?)(.*)t$ Matches and extracts:
turnout    u,rnou
turnspit   u,rnspi
turret     u,rre
tut        u,
twangiest  w,angies
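The greedy and non-greedy patterns can be compared side by side on a single word. This is only a sketch; groups() returns all captured groups as a tuple.

```python
import re

word = "turnout"

# Greedy: (.+) takes everything it can, so (.*) is left with nothing.
greedy = re.search(r"^t(.+)(.*)t$", word).groups()

# Non-greedy: (.+?) takes as little as it can (one character),
# and the greedy (.*) picks up the rest.
lazy = re.search(r"^t(.+?)(.*)t$", word).groups()

print(greedy)  # ('urnou', '')
print(lazy)    # ('u', 'rnou')
```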
It can be useful to change the way matches are performed; you do this by passing flags:
re.search(r"\b.*oligy\b", w, re.IGNORECASE)
re.search(r"\b.*oligy\b", w, re.IGNORECASE|re.DOTALL)
The IGNORECASE flag makes the expression case-insensitive; the DOTALL flag makes . match everything, including newlines. Flags are combined with |.
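A small sketch of what each flag changes, using a hypothetical two-line string rather than the course data.

```python
import re

# Hypothetical two-line text for illustration.
text = "Alpha\nBETA\n"

# Without flags the match is case-sensitive, so "beta" is not found.
plain = re.search(r"beta", text)

# IGNORECASE lets "beta" match "BETA".
nocase = re.search(r"beta", text, re.IGNORECASE).group()

# By default "." does not match a newline, so "a.B" fails here.
no_dotall = re.search(r"a.B", text)

# DOTALL lets "." match the newline between "Alpha" and "BETA".
dotall = re.search(r"a.B", text, re.DOTALL).group()

# Flags are combined with |.
both = re.search(r"A.b", text, re.IGNORECASE | re.DOTALL).group()

print(plain, nocase, no_dotall, repr(dotall), repr(both))
```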
See handout
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
-- Jamie Zawinski
Some more regex links
An online regex builder http://www.txt2re.com
The RegEx library http://regexlib.com/
Site with tutorials and more http://www.regular-expressions.info/