February 13-15, 2013
If you want to log in now, make sure you use a secure shell, whether on a PC or a Mac.
ssh 250b-xx@beowulf.csc.smith.edu
where xx are the two letters in your own class account username. ssh stands for Secure Shell; the ssh command is the Secure Shell client.
Run the Python interpreter, version 2.7
We will use version 2.7 to stay synchronized with the on-line tutorials.
To start the Python interpreter, just type in the command
python and hit Enter
(if you want to use python3 that is fine; the re module works the same;
use the print function instead of the print statement)
You should obtain something like this:
[jfrankli@beowulf 250s13]$ python
Python 2.7.3 (default, Apr 30 2012, 21:18:11)
[GCC 4.7.0 20120416 (Red Hat 4.7.0-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
where >>> is the Python interpreter's prompt. The prompt is asking you for a Python command, or statement. Try this:
>>> print "Hello, World!"
and also this:
>>> print 5+7
and this:
>>> print 16+32, "is the sum of 16 and 32"
Next, check out how easy string concatenation is in Python with the plus (+) operator (in Python, + does not mean "or", as it does in our 250 textbook):
>>> x = "computer"
>>> y = "science"
>>> z = x + " " + y
>>> print z
Exit from the Python Interpreter
Exit from the Python interpreter by typing Ctrl-D.
Python's re module:
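With a regular expression (RE) you can ask questions such as ``Does this string match the pattern?'', or ``Is there a match for the pattern anywhere in this string?''. You can also use REs to modify a string or to split it apart in various ways.
Most letters and characters simply match themselves. For example, the regular expression test will match the string "test" exactly.
The metacharacters, which do not simply match themselves, are:
. ^ $ * + ? { } [ ] \ | ( )
Meta-character Highlights: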
>>> m = "We are \numerous"
>>> print m
We are 
umerous
>>> m = r"We are \numerous"
>>> print m
We are \numerous
>>>
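The session above shows why patterns are usually written as raw strings. As a minimal sketch (the variable names plain and raw are only for illustration, and print() is used so that it runs under both Python 2.7 and python3), comparing the lengths of the two strings makes the difference visible:

```python
import re

plain = "We are \numerous"    # \n is turned into one newline character
raw = r"We are \numerous"     # the r prefix keeps the backslash and the n

print(len(plain))             # 15
print(len(raw))               # 16

# Without the r prefix, every backslash in a pattern has to be doubled.
print("\\d+" == r"\d+")       # True
print(re.search(r"\d+", "room 250").group())   # 250
```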
>>> re.match(r'From\s*','Fromage amk')
<_sre.SRE_Match object at 0xb7f24870>
We are searching the string 'Fromage amk' for the expression 'From\s*'. This pattern requires the substring 'From' followed by zero or more whitespace characters (\s matches any whitespace character, and * indicates zero or more of them). Do you remember that Python is object-oriented? Because there is a match, we are given an _sre.SRE_Match object in return, shown here with its memory address. How nice!
>>> print re.match(r'From\s+', 'Fromage amk')
None
We are searching the string 'Fromage amk' for the expression 'From\s+'. This pattern requires the substring 'From' followed by at least one whitespace character (\s matches any whitespace character, and + indicates one or more of them).
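To see * and + side by side, here is a small sketch that runs both patterns on the same string. The names subject, star, plus and spaced are made up for this illustration, and print() is used so it works in Python 2.7 and python3 alike:

```python
import re

subject = 'Fromage amk'
star = re.match(r'From\s*', subject)   # zero whitespace characters is enough
plus = re.match(r'From\s+', subject)   # needs at least one, so no match here

print(star is not None)                # True
print(plus is None)                    # True

# With whitespace after 'From', + matches and takes all of it.
spaced = re.match(r'From\s+', 'From   amk')
print(repr(spaced.group()))            # 'From   '
```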
>>> m = re.match(r'From\s*','Fromage amk')
>>> print m.group()
From
>>> m = re.match(r'From\s+', 'From amk Thu May 14 19:12:10 1998')
>>> m.group()
'From '
The group() method reveals the substring that matches the pattern.
>>> a = 'i|y'
>>> b = 'Judy'
>>> c = 'Judi'
>>> m = re.search(a,b)
>>> m.group()
'y'
>>> m = re.search(a,c)
>>> m.group()
'i'
>>> m = re.match(a,c)
>>> ?????
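As a rough sketch of how the four functions differ, the following runs each of them on one made-up string (text and pattern are illustrative values, not part of the lab files; print() keeps it usable in Python 2.7 and python3):

```python
import re

text = 'cat hat bat'
pattern = r'[a-z]at'

at_start = re.match(r'hat', text)        # match() only looks at the start
anywhere = re.search(r'hat', text)       # search() scans the whole string
all_hits = re.findall(pattern, text)     # findall() returns a list of strings
positions = [m.start() for m in re.finditer(pattern, text)]  # match objects

print(at_start)            # None
print(anywhere.group())    # hat
print(all_hits)            # ['cat', 'hat', 'bat']
print(positions)           # [0, 4, 8]
```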
Look up the difference between the match() and the search() functions in this nice tutorial on using regular expressions in Python: go to http://www.python.org/doc/howto/ and click on Regular Expression HOWTO.
>>> m = re.search('(ab)*', 'ab')
>>> m.group()
'ab'
>>> m.group(0)
'ab'
Subgroups are numbered from left to right, from 1 upward. Groups can be nested; to determine the number, just count the opening parenthesis characters, going from left to right.
>>> m = re.search('(a(b)c)d', 'abcd')
>>> m.group(0)
'abcd'
>>> m.group(1)
'abc'
>>> m.group(2)
'b'
Backreferences
>>> p = r'(\b\w+)\s+\1'
>>> m = re.search(p, 'Paris in the the spring')
>>> m.group()
'the the'
where \b matches a word boundary (the position between a word character and a non-word character) and \w matches a single word character (alphanumeric or underscore, i.e. equivalent to the class [a-zA-Z0-9_]). Backreferences like this aren't often useful for just searching through a string -- there are few text formats which repeat data in this way -- but you'll soon find out that they're very useful when performing string substitutions. Why wouldn't this pattern work on 'Paris in thethe spring'?
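As a small illustration of the substitution use mentioned above, this sketch uses re.sub() with the same pattern to collapse a doubled word into one. The names doubled and fixed are made up here, and print() is used so it runs under Python 2.7 and python3:

```python
import re

doubled = r'(\b\w+)\s+\1'     # a word, whitespace, then the same word again

# In the replacement string, \1 stands for whatever group 1 captured.
fixed = re.sub(doubled, r'\1', 'Paris in the the spring')
print(fixed)                  # Paris in the spring
```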
emacs lab250.py
and in emacs type in the following:
import re                                # import the reg expr module functions

def main():                              # define function called main
    a = "I like Red Hat Linux for development."
    m = re.search(r"(.{7})Linux", a)     # Any 7 characters then string 'Linux'
    b = m.group(1)                       # The first group in parentheses
    print "<" + b + ">"
    print m.group(0)                     # The whole match
Save and exit emacs, then invoke python to call the function:
jfrankli-PB1:~ jfrankli$ python
Python 2.3.5 (#1, Mar 20 2005, 20:38:20)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1809)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import lab250
>>> lab250.main()
<ed Hat >
ed Hat Linux
>>>
Make sure you and your partner understand exactly how this all works before moving on.
jfrankli-PB1:~ jfrankli$ python
Python 2.3.5 (#1, Mar 20 2005, 20:38:20)
[GCC 3.3 20030304 (Apple Computer, Inc. build 1809)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re        # import the reg expr module functions
>>> a = "I like Red Hat Linux for development."
>>> m = re.search(r"(.{7})Linux", a)
>>> b = m.group(1)
>>> print "<" + b + ">"
<ed Hat >
>>> print m.group(0)
ed Hat Linux
>>> a = "Valentines is 2/14/2012. Don't forget!"
>>> import re
>>> m = re.search(r"\D\d{1,2}/\d{1,2}/\d{2,4}\D", a)
>>> m.group(0)
How can you ensure that the date contains the full four-digit year?
>>> s = "My, how are you?"
>>> m = re.search("^how are you$", s)
>>>
Why didn't this match? What can you do to make it match, including the question mark?
import re

def simul():
    a = "11/21/1999"
    m = re.search(r"^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})", a)
    month,day,year = m.groups()
    print year, month, day
Make sure you understand this before moving on.
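One thing the function above does not handle is a string that fails to match: re.search() then returns None, and calling groups() on None raises an error. A possible variant, sketched here with the hypothetical names DATE and split_date (they are not part of the lab files), checks for None first and returns the pieces instead of printing them:

```python
import re

# Same pattern as above: month, day, year separated by / or -
DATE = r"^([01]?\d)[/-]([0123]?\d)[/-](\d{2,4})"

def split_date(text):
    """Return (year, month, day) as strings, or None if text is not a date."""
    m = re.search(DATE, text)
    if m is None:                  # no match: there are no groups to unpack
        return None
    month, day, year = m.groups()  # groups() returns groups 1, 2, 3 as a tuple
    return year, month, day

print(split_date("11/21/1999"))    # ('1999', '11', '21')
print(split_date("2-14-13"))       # ('13', '2', '14')
print(split_date("Nov 21, 1999"))  # None
```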
beowulf$ getcopy test.tst
or download test.tst. Then in emacs, in your file lab250.py, add this function definition:
def brack():
    infile = open("test.tst", "r")
    x = infile.readline()                      # read one line from file
    while x != "":
        x = x[:-1]                             # slice off newline char at end of string
        m = re.search(r"\[\s*(.*)\s*\]", x)    # treat bracket as just bracket char
        if m:
            print m.group(1)                   # just print inside group
        x = infile.readline()
    infile.close()
Figure out why the lines that matched did by examining test.tst.
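When examining the matches, it helps to remember that (.*) is greedy. The sketch below runs the same pattern, plus a lazy variant with (.*?), on two made-up lines; these sample strings are assumptions for illustration and are not taken from test.tst. print() is used so it runs in Python 2.7 and python3:

```python
import re

greedy = r"\[\s*(.*)\s*\]"     # the pattern used in brack()
lazy = r"\[\s*(.*?)\s*\]"      # same, but .*? takes as little as possible

line1 = "x = [  hello  ]"
g1 = re.search(greedy, line1).group(1)
l1 = re.search(lazy, line1).group(1)
print(repr(g1))                # 'hello  '  (greedy .* keeps the trailing spaces)
print(repr(l1))                # 'hello'

line2 = "[a] and [b]"
g2 = re.search(greedy, line2).group(1)
l2 = re.search(lazy, line2).group(1)
print(repr(g2))                # 'a] and [b'  (runs to the last bracket)
print(repr(l2))                # 'a'
```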
def email(addr):
    b = r"^[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4}$"
    m = re.search(b, addr, re.IGNORECASE)    # allow upper or lower case
    if m:                                    # m is None when addr does not match
        print m.group(0)
\b means word boundary (anything that is not alphanumeric or underscore qualifies as a boundary).
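A quick way to see what \b does is to try a pattern that uses it on a few tags. This is only a sketch: open_tag and the three test strings are invented for the example, and print() keeps it compatible with Python 2.7 and python3.

```python
import re

open_tag = r'<html\b[^>]*>'    # 'html' must end at a word boundary

plain_tag = re.search(open_tag, '<html>')
with_attr = re.search(open_tag, '<html lang="en">')
longer_name = re.search(open_tag, '<htmlx>')

print(plain_tag is not None)   # True: '>' is not a word character
print(with_attr is not None)   # True: a space is not a word character
print(longer_name is not None) # False: 'x' continues the word, no boundary
```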
>>> a = r'<html\b[^>]*>(.*?)</html>'
>>> m = re.search(a,'<html><H1>Test Page</H1><p>This is just a test</p></html>')
>>> m
<_sre.SRE_Match object at 0x96860>
>>> m.group()
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m.group(0)
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m.group(1)
'<H1>Test Page</H1><p>This is just a test</p>'
>>> m.group(2)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
IndexError: no such group
Using backreference and lazy *: The .*? keeps the * from swallowing up the <
>>> a = r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>'
>>> a
'<([A-Z][A-Z0-9]*)\\b[^>]*>(.*?)</\\1>'
>>> b = '<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> b
'<html><H1>Test Page</H1><p>This is just a test</p></html>'
>>> m = re.search(a,b,re.IGNORECASE)
>>> m
<_sre.SRE_Match object at 0x741d0>
>>> m.groups()
('html', '<H1>Test Page</H1><p>This is just a test</p>')
>>>
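To see what the lazy .*? buys you, the sketch below compares the pattern above with a greedy copy of it on a string that contains the same tag twice. The names tag, greedy_tag, page and pairs, and the sample strings, are assumptions made for this illustration; print() is used so it runs in Python 2.7 and python3.

```python
import re

tag = r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>'         # lazy, as above
greedy_tag = r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*)</\1>'   # greedy copy

page = '<p>one</p><p>two</p>'
lazy_body = re.search(tag, page, re.IGNORECASE).group(2)
greedy_body = re.search(greedy_tag, page, re.IGNORECASE).group(2)
print(lazy_body)      # one
print(greedy_body)    # one</p><p>two

# The backreference \1 makes each closing tag agree with its opening tag.
pairs = re.findall(tag, '<b>bold</b> and <i>italic</i>', re.IGNORECASE)
print(pairs)          # [('b', 'bold'), ('i', 'italic')]
```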
Don't forget to log out.