Smith College Computer Science 250, Spring 2013

Pattern Matching Lab

February 13-15, 2013

Work with a partner in class. I also encourage you to work with a partner on the homework.

Use a Secure Shell program to log in to beowulf.csc with your class account. You can either use a secure shell to log in to beowulf now and do all your work there, or do your work on a local Mac or PC that runs python, and then sftp it later.

Eventually you'll have to sftp or scp your homework file to beowulf and submit it via your class account.

If you want to log in now, make sure you use a secure shell, whether on PC or Mac.

Use the ssh command to connect to another machine:

ssh 250b-xx@beowulf.csc.smith.edu

where xx are the two letters in your own class account username. ssh stands for Secure Shell.
When you are prompted, type in the password that you received with the account name. You will see some messages and finally you will see beowulf's prompt -- a short line of text ending with $.

Run the Python Interpreter, version 2.7
We will use version 2.7 to stay synchronized with the on-line tutorials.
To start the Python interpreter, just type in the command
python
and hit Enter. (If you want to use python3, that is fine; the re module works the same way. Use the print function instead of the print statement.)
You should obtain something like this:

[jfrankli@beowulf 250s13]$ python
Python 2.7.3 (default, Apr 30 2012, 21:18:11) 
[GCC 4.7.0 20120416 (Red Hat 4.7.0-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

where >>> is the Python interpreter's prompt. The prompt is prompting you for a Python command, or statement. Try this:

>>> print "Hello, World!"

and also this:

>>> print 5+7

and this:

>>> print 16+32, "is the sum of 16 and 32"

Next, check out how easy string concatenation is in python with the plus (+) operator (in python, + does not mean "or", as it does in our 250 textbook):

>>> x = "computer"
>>> y = "science"
>>> z = x + " " + y
>>> print z

Exit from the Python Interpreter
Exit from the Python interpreter by typing Ctrl-D.

Python's re module:
Regular expressions (or REs) are essentially a tiny, highly specialized programming language embedded inside Python and made available through the re module. Using this little language, you specify the rules for the set of possible strings that you want to match; this set might contain English sentences, or e-mail addresses, or TeX commands, or anything you like. You can then ask questions such as

"Does this string match the pattern?",
 or "Is there a match for the pattern anywhere in this string?".

You can also use REs to modify a string or to split it apart in various ways.

Most letters and characters will simply match themselves. For example, the regular expression

test

will match the string "test" exactly.
Meta-Characters: characters that are part of the re module's alphabet, but don't match themselves. Instead, they have special meaning to the re interpreter. The * is very much like the * we have seen in our theoretical regular expressions. In practice the re module allows other meta-characters. They are:

. ^ $ * + ? { } [ ] \ | ( )
Meta-character Highlights:

*         Matches zero or more times: ca*t will match "ct" (0 "a" characters), "cat" (1 "a"), "caaat" (3 "a" characters), and so forth.
+        Another repeating metacharacter is +, which matches one or more times.
?        The question mark character, ?, matches either one or zero times.
|        Vertical bar | is used as or: y|p would match either character y or p. (Do not put spaces around the bar; a space in a pattern matches a space.)
[ ]       Square brackets give a choice of a class of characters (like or in theoretical regular expressions). Let's consider the expression a[bcd]*b. This matches the letter "a", zero or more letters from the class [bcd], and finally ends with a "b". The [bcd]* part will first swallow the final b of the string, but then it backtracks so that the pattern's final b can match the final b in the string being searched.
{ }       The most complicated repeated qualifier is {m,n}, where m and n are base 10 positive integers. This qualifier means there must be at least m repetitions, and at most n of the character or expression appearing before the curly brackets. For example, a/{1,3}b will match "a/b", "a//b", and "a///b" but not "a////b", and not "ab".
\         The backslash is a metacharacter in the re regular expressions, and is used in combination with other characters or metacharacters to escape their usual interpretation. Thus \| is the vertical bar, \\ matches a single backslash, and \[ matches a left square bracket. \d is a single token matching a digit, otherwise d by itself is the character d.
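
The highlights above can be checked in a short script. This is only a sketch: the helper name fully_matches and the list of test cases are made up for illustration and are not part of the re module. print is written with parentheses so the script runs under both Python 2.7 and python3.

```python
import re

def fully_matches(pattern, text):
    # Hypothetical helper: True only when ALL of text matches pattern.
    # The (?: ... ) wrapper keeps | inside pattern from escaping the \Z anchor.
    return re.match(r'(?:' + pattern + r')\Z', text) is not None

# (pattern, text, expected result) taken from the highlights above
checks = [
    (r'ca*t', 'ct', True),           # * allows zero "a" characters
    (r'ca*t', 'caaat', True),
    (r'ca+t', 'ct', False),          # + needs at least one "a"
    (r'ca?t', 'caat', False),        # ? allows at most one "a"
    (r'y|p', 'p', True),
    (r'a[bcd]*b', 'abcdb', True),
    (r'a[bcd]*b', 'abcbd', False),   # must end with b
    (r'a/{1,3}b', 'a///b', True),
    (r'a/{1,3}b', 'a////b', False),
    (r'a/{1,3}b', 'ab', False),
    (r'\[', '[', True),              # escaped bracket is just a bracket
    (r'\d', 'd', False),             # \d is a digit, not the letter d
]

for pattern, text, expected in checks:
    result = fully_matches(pattern, text)
    print("%-10s %-8s %s" % (pattern, text, result))
```

Try adding a few cases of your own and predicting the result before you run it.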

A potential problem is that python strings also use the backslash to escape characters (e.g. \n means new line). To avoid confusion, we'll use python's raw string option for our searches.
Here's how it works:

The python language gives us the option of using a raw string. A raw string's characters are all simply characters and are not interpreted by python as anything else. This is especially useful when using the \ character, which is usually the escape character in python, java, C++, etc. and is used for print formatting. When using raw strings in python, it and all other characters, including all the meta-characters above, are simply characters.
To indicate that a string is raw, just put the r character in front of it. We can see the difference it makes in this python session, in which the \n causes formatting (a new line) in the first string, but is merely the backslash followed by the n in the second, raw string.
>>> m = "We are \numerous"
>>> print m
We are
umerous
>>> m = r"We are \numerous"
>>> print m
We are \numerous
>>>
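
Another way to see the difference is to count characters. The sketch below uses variable names chosen only for illustration; it shows that a raw string keeps the backslash as its own character, and that a raw pattern is the same string you would otherwise have to write with doubled backslashes.

```python
import re

plain = "\n"     # python turns backslash-n into ONE newline character
raw = r"\n"      # raw string: TWO characters, a backslash and an n
print(len(plain))
print(len(raw))

doubled = "\\d+"      # ordinary string: each backslash must be doubled
raw_pattern = r"\d+"  # raw string: write the pattern as the re module sees it
print(doubled == raw_pattern)

# Either spelling gives the re module the same pattern: one or more digits
print(re.search(raw_pattern, "room 250").group())
```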

Note: python strings can be delimited by either single or double quotes

How to Use re

Invoke the python interpreter again by typing python
Import the re module by typing import re
The re.match() function expects two strings as parameters. The first one contains the re with possible meta-characters, and the second string is the string to match.
>>> re.match(r'From\s*','Fromage amk')
<_sre.SRE_Match object at 0xb7f24870>
We are searching the string 'Fromage amk' for the expression 'From\s*'. This pattern requires the substring 'From' followed by zero or more white space characters. (\s matches any whitespace character, and * indicates zero or more of them). Do you remember that python is object-oriented? Because there is a match, we are given the address of an _sre.SRE_Match object in return. How nice!
Before we look further, let's try requiring one or more whitespaces after the From:
>>> print re.match(r'From\s+', 'Fromage amk')
None
We are searching the string 'Fromage amk' for the expression 'From\s+'. This pattern requires the substring 'From' followed by at least one white space character. (\s matches any whitespace character, and + indicates one or more of them).
The result printed is None, indicating no matching python object, i.e. no match. Why?

Now store the object's address in a variable and call the method group() to see the substring that matched:
>>> m = re.match(r'From\s*','Fromage amk')
>>> print m.group()
From

Now try these, and ask if you have any questions:
>>> m = re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
>>> m
<re.MatchObject instance at 80c5978>
>>> m.group()
'From '
The group() method reveals the substring that matches the pattern.
Here we demonstrate or, also called alternation, using the vertical bar '|', and the re search() method:
>>> a = 'i|y'
>>> b = 'Judy'
>>> c = 'Judi'
>>> m = re.search(a,b)
>>> m.group()
'y'
>>> m = re.search(a,c)
>>> m.group()
'i'
>>> m = re.match(a,c)
>>> ?????

o Click here for comparison of match() vs. search() vs. findall() vs. finditer()
Look up the difference between the match() and the search() functions in this nice tutorial on using regular expressions in python: Go to http://www.python.org/doc/howto/ and click on Regular Expression HOWTO. Some highlights from it are given below. Try each concept/example as you go.
Make sure that you understand these examples, discussing them with each other.
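
As a starting point for that discussion, here is a small side-by-side sketch of the four functions applied to one pattern. The sample text and variable names are invented for illustration; print is written with parentheses so it runs under both Python 2.7 and python3.

```python
import re

text = "cat bat rat"
pattern = r"[br]at"    # "bat" or "rat", but not "cat"

at_start = re.match(pattern, text)    # only tries at the very start of text
first = re.search(pattern, text)      # scans forward for the first match
all_strings = re.findall(pattern, text)                   # list of matched strings
spans = [m.span() for m in re.finditer(pattern, text)]    # one match object per hit

print(at_start)         # None, because the text starts with "cat"
print(first.group())    # bat
print(all_strings)      # ['bat', 'rat']
print(spans)            # [(4, 7), (8, 11)]
```

Use findall() when you only want the matched strings, and finditer() when you also need positions or groups for each match.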

Groups
Groups are marked by the "(", ")" metacharacters. "(" and ")" have much the same meaning as they do in mathematical expressions; they group together the expressions contained inside them. For example, you can repeat the contents of a group with a repeating qualifier, such as *, +, ?, or {m,n}. For example, (ab)* will match zero or more repetitions of "ab".
>>> m = re.search('(ab)*', 'ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.
>>> m = re.search('(a(b)c)d', 'abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
Backreferences
Backreferences in a pattern allow you to specify that the contents of an earlier capturing group must also be found at the current location in the string. For example, \1 will succeed if the exact contents of group 1 can be found at the current position, and fails otherwise. Remember that Python's string literals also use a backslash followed by numbers to allow including arbitrary characters in a string, so be sure to use a raw string when incorporating backreferences in a RE. For example, the following RE detects doubled words in a string.
>>> p = r'(\b\w+)\s+\1'
>>> m = re.search(p, 'Paris in the the spring')
>>> m.group()
'the the'
where \b means word boundary (the position between a word character and a non-word character) and \w means a word character (alphanumeric or underscore, i.e. equivalent to the class [a-zA-Z0-9_]). Backreferences like this aren't often useful for just searching through a string -- there are few text formats which repeat data in this way -- but you'll soon find out that they're very useful when performing string substitutions. Why wouldn't this pattern work on 'Paris in thethe spring' ?

Lookahead
To read about lookahead, go to the web site http://www.regular-expressions.info/lookaround.html and scroll down to the section called Positive and Negative Lookahead. Negative lookahead is indispensable if you want to match something that is not followed by something else.

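For example, suppose you want to match a "q" that is not followed by a "u". Negative lookahead provides the solution: q(?!u)
This matches a "q" not followed by a "u". In general the u inside the parentheses can be any regular expression.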
Positive lookahead works just the same. q(?=u) matches a q that is followed by a u, without making the u part of the match.
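
Here is a short sketch of both kinds of lookahead in python. The sample text is invented for illustration. Notice that the lookahead only checks what comes next; it never becomes part of the match.

```python
import re

text = "Iraq quit"

# Negative lookahead: positions of every q that is NOT followed by u
not_followed = [m.start() for m in re.finditer(r'q(?!u)', text)]
print(not_followed)       # [3], the q at the end of "Iraq"

# Positive lookahead: a q that IS followed by u; the u is not in the match
followed = re.search(r'q(?=u)', text)
print(followed.start())   # 5
print(followed.group())   # q

# Compare with the plain pattern qu, where the u is consumed
plain = re.search(r'qu', text)
print(plain.group())      # qu
```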

Examples to Consider
The link below gives examples. Before reading them over, use emacs to open a file
emacs lab250.py
and in emacs type in the following:
import re                               # import the reg expr module functions

def main():                             # define function called main
    a = "I like Red Hat Linux for development."
    m = re.search(r"(.{7})Linux", a)    # Any 7 characters then string 'Linux'
    b = m.group(1)                      # The first group in parentheses
    print "<" + b + ">"
    print m.group(0)                    # The whole match
Save and exit emacs, then invoke python to call the function:
jfrankli-PB1:~ jfrankli$ python
Python 2.3.5 (#1, Mar 20 2005, 20:38:20)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1809)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lab250
>>> lab250.main()
<ed Hat >
ed Hat Linux
>>>
Make sure you and your partner understand exactly how this all works before moving on.
You can also import and run re functions in the python interpreter:
jfrankli-PB1:~ jfrankli$ python
Python 2.3.5 (#1, Mar 20 2005, 20:38:20)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1809)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re    # import the reg expr module functions
>>> a = "I like Red Hat Linux for development."
>>> m = re.search(r"(.{7})Linux", a)
>>> b = m.group(1)
>>> print "<" + b + ">"
<ed Hat >
>>> print m.group(0)
ed Hat Linux

Now using the Regular-Expressions.info reference page, figure out what is happening in each example below:
>>> a = r"Valentines is 2/14/2012. Don\'t forget!"
>>> import re
>>> m = re.search(r"\D\d{1,2}/\d{1,2}/\d{2,4}\D", a)
>>> m.group(0)
How can you ensure that the date contains the full four-digit year?

How about this one?
>>> s = "My, how are you?"
>>> m = re.search("^how are you$", s)
>>>
Why didn't this match? What can you do to make it match, including the question mark?

Here's one that makes me love python:
import re

def simul():
    a = "11/21/1999"
    m = re.search(r"^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})", a)
    month, day, year = m.groups()
    print year, month, day
Make sure you understand this before moving on.

Now try this one. First get the test file either at the beowulf prompt:
beowulf$ getcopy test.tst
or download test.tst here. Then in emacs in your file lab250.py, add this function definition:
def brack():
    infile = open("test.tst", "r")
    x = infile.readline()                     # read one line from file
    while x != "":
        x = x[:-1]                            # slice off newline char at end of string
        m = re.search(r"\[\s*(.*)\s*\]", x)   # treat bracket as just bracket char
        if m:
            print m.group(1)                  # just print inside group
        x = infile.readline()
    infile.close()
Examine test.tst to figure out why the lines that matched did match.
Now, the following examples come from these sources:
http://www.regular-expressions.info/index.html is the main page, with menu on the left.

http://www.regular-expressions.info/tutorial.html

The following is one of the examples. Try each example out.

def email(addr):
    b = r"^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"
    m = re.search(b, addr, re.IGNORECASE)   # allow upper or lower case
    if m:                                   # m is None when addr does not match
        print m.group(0)
[A-Z0-9._%-]+ means one or more characters from the class inside the square brackets.
@ just means there has to be an at sign next.
[A-Z0-9.-]+ is a second character class (this one without _ and %), followed by \. which is how we get a literal period and avoid having it be a metacharacter. Lastly, [A-Z]{2,4} requires between 2 and 4 characters from A-Z.
A variant of this pattern uses \b at each end in place of ^ and $. \b means word boundary (anything that is not alphanumeric or underscore qualifies).

Remind yourself what the ^ and $ are doing.

The site at the URL below has examples of matching html tags, trimming white space, matching IP addresses, etc.: http://www.regular-expressions.info/examples.html. We examine one of its examples below. What is the ^ doing in this example?
>>> a = r'<html\b[^>]*>(.*?)</html>'
>>> m = re.search(a,'<html><H1>Test Page</H1><p>This is just a test</p></html>')
>>> m
<_sre.SRE_Match object at 0x96860>
>>> m.group()
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m.group(0)
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m.group(1)
'<H1>Test Page</H1><p>This is just a test</p>'
>>> m.group(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: no such group
Using backreference and lazy *: The ? in .*? makes the * lazy, so it matches as few characters as possible. This keeps the * from swallowing up the < of the closing tag.
The \1 refers back (backreferences) to the first grouping within parens, saying we have a match only if that same text appears again. Nice for matching html tags.
>>> a = r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>'
>>> a
'<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>'
>>> b = '<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> b
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m = re.search(a,b,re.IGNORECASE)
>>> m
<_sre.SRE_Match object at 0x741d0>
>>> m.groups()
('html', '<H1>Test Page</H1><p>This is just a test</p>')
>>>
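
To see the difference between the greedy * and the lazy *? directly, try a string with two tags of the same kind. This is a small sketch with an invented sample string; only the ? after the * differs between the two patterns.

```python
import re

html = '<p>one</p><p>two</p>'

# Greedy: .* grabs as much as it can, so it runs to the LAST </p>
greedy = re.search(r'<p>(.*)</p>', html).group(1)
print(greedy)    # one</p><p>two

# Lazy: .*? grabs as little as it can, so it stops at the FIRST </p>
lazy = re.search(r'<p>(.*?)</p>', html).group(1)
print(lazy)      # one
```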

Other Links
For a very succinct description of regular expressions in python, go to
Syntax of Regular Expressions in Python:
http://docs.python.org/lib/re-syntax.html
There is another tutorial at
http://gnosis.cx/TPiP/chap3.txt with many examples.

A short reference of reg expr
Don't forget to logout.