The World of Regex


“Mary has a problem and decides to use regex to solve it. Now Mary has two problems.”

Introduction

When working with text data, you are often required to write regular expressions to pre-process the data, extract useful information from it, create new features, and so on. In this post, I will start with the most basic regular expression syntax and slowly move on to more advanced concepts. It should serve as a reference for any regular expression magic you need to perform from time to time in the NLP world.

Agenda

  1. Rules for searching
  2. Using regex in Python
  3. Metacharacters
  4. Quantifiers
  5. Match groups
  6. Character classes
  7. Finding multiple matches using re.findall()
  8. Greedy vs Lazy ( .+ vs .+? )
  9. Alternatives
  10. Use of flags like flags=re.IGNORECASE
  11. Substitution and the use of \1 to reference groups
  12. Anchors
  13. Use of flag re.MULTILINE to tackle strings with \n
  14. Example: IMDB titles using groups, anchors and alternatives
  15. re.VERBOSE to improve readability of regex
  16. Use of re.compile()

Part 1: Rules for searching

  • Ordinary characters (also known as literals) match themselves exactly
  • Case-sensitive (by default)
  • Search proceeds through the string from start to end, stopping at first match
  • The entire pattern must match a contiguous sequence of characters
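These rules can be seen in a quick session (an illustrative snippet, using made-up strings):

```python
import re

# matching is case-sensitive by default, so 'CAT' does not match 'cat'
print(re.search(r'CAT', 'my cat'))           # None
# the search stops at the first match: only the first 'a' in 'banana'
print(re.search(r'a', 'banana').span())      # (1, 2)
```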

Part 2: Using regex in Python

# use built-in regex module
import re

# define the string to search
s = 'my 1st string!!'
# pattern as 'raw string', then string to search, returns 'match object'
re.search(r'st', s)
<_sre.SRE_Match object; span=(4, 6), match='st'>
# access the results using the 'group' method
re.search(r'st', s).group()
'st'
# returns 'None' if no match is found
re.search(r'sti', s)
# causes an error since 'None' does not have a 'group' method
# re.search(r'sti', s).group()
# better error handling
match = re.search(r'st', s)
if match:
    print(match.group())
st
# does not cause an error since condition fails
match = re.search(r'sti', s)
if match:
    print(match.group())

Part 3: Metacharacters

Metacharacters are the opposite of literal characters, because they represent something other than themselves.

Metacharacter   What it matches
.               any character except newline \n
\w              word character (letter, digit, underscore)
\W              non-word character
\d              digit (0 through 9)
\s              whitespace character (space, newline, return, tab, form feed)
\S              non-whitespace character
\.              literal period (a special character must be escaped to match it)

s = 'my 1st string!!'
re.search(r'..', s).group()
'my'
re.search(r'..t', s).group()
'1st'
re.search(r'\w\w', s).group()
'my'
re.search(r'\w\w\w', s).group()
'1st'
re.search(r'\W', s).group()
' '
re.search(r'\W\W', s).group()
'!!'
re.search(r'\W\wt', s).group()
' st'
re.search(r'\d..', s).group()
'1st'
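The last row of the table (escaping a special character with a backslash) has no example above; here is a small one, using a made-up filename:

```python
import re

# an unescaped dot matches any character, but '\.' matches only a literal period
re.search(r'\.\w+', 'report.txt').group()   # returns '.txt'
```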

Part 4: Quantifiers

Quantifiers modify the required quantity of a character or a pattern.

Quantifier   What it matches
a+           1 or more occurrences of ‘a’ (the pattern directly to its left)
a*           0 or more occurrences of ‘a’
a?           0 or 1 occurrence of ‘a’

s = 'sid is missing class'
re.search(r'miss\w+', s).group()
'missing'
re.search(r'is\w*', s).group()
'is'
re.search(r'is\w+', s).group()
'issing'
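The a? quantifier from the table has no example above; here is a small one on hypothetical strings:

```python
import re

# 'u?' matches zero or one 'u', so both spellings match
re.search(r'colou?r', 'my color TV').group()    # 'color'
re.search(r'colou?r', 'my colour TV').group()   # 'colour'
```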

+ and * are “greedy”, meaning that they try to use up as much of the string as possible:

s = 'Some text <h1>my heading</h1> More text'
re.search(r'<.+>', s).group()
'<h1>my heading</h1>'
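Appending ? to + or * makes the quantifier lazy, meaning it uses up as little of the string as possible. On the same HTML string:

```python
import re

s = 'Some text <h1>my heading</h1> More text'
# the lazy version stops at the first '>'
re.search(r'<.+?>', s).group()   # '<h1>'
```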

Quantifier   What it matches
a{3}         exactly 3 occurrences of ‘a’
a{3,}        3 or more occurrences of ‘a’
a{1,3}       1 to 3 occurrences of ‘a’

s = 'Sales on November 14: $250 for item 54321'
re.search(r'\d{3}', s).group()
'250'
re.search(r'\d{4,5}', s).group()
'54321'

Part 5: Match groups

Parentheses create logical groups within the match text:

  • match.group() corresponds to entire match text (as usual)
  • match.group(1) corresponds to first group
  • match.group(2) corresponds to second group

Note: There is no limit to the number of groups you can create.

s = 'my 1st string!!'
re.search(r'\d..', s).group()
'1st'
re.search(r'(\d)(..)', s).group()
'1st'
re.search(r'(\d)(..)', s).group(1)
'1'
re.search(r'(\d)(..)', s).group(2)
'st'
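Groups can also be named using the (?P&lt;name&gt;...) syntax, which makes patterns with many groups easier to read. A small sketch on the same string:

```python
import re

s = 'my 1st string!!'
m = re.search(r'(?P<digit>\d)(?P<rest>..)', s)
m.group('digit')   # '1'
m.group('rest')    # 'st'
```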

Example 1: FAA tower closures

A list of FAA tower closures has been copied from a PDF into the file faa.txt, which is stored in the data directory of the repository.

# read the file into a single string
with open('../data/faa.txt') as f:
    data = f.read()
# check the number of characters
len(data)
5574
# examine the first 500 characters
print(data[0:500])
FAA Contract Tower Closure List
(149 FCTs)
3-22-2013
LOC
ID Facility Name City State
DHN DOTHAN RGNL DOTHAN AL
TCL TUSCALOOSA RGNL TUSCALOOSA AL
FYV DRAKE FIELD FAYETTEVILLE AR
TXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR
GEU GLENDALE MUNI GLENDALE AZ
GYR PHOENIX GOODYEAR GOODYEAR AZ
IFP LAUGHLIN/BULLHEAD INTL BULLHEAD CITY AZ
RYN RYAN FIELD TUCSON AZ
FUL FULLERTON MUNI FULLERTON CA
MER CASTLE ATWATER CA
OXR OXNARD OXNARD CA
RAL RIVERSIDE MUNI RIVERSIDE CA
RNM RAMONA RAMONA CA
SAC SACRAMENTO EXECU
# examine the last 500 characters
print(data[-500:])
 YAKIMA WA
CWA CENTRAL WISCONSIN MOSINEE WI
EAU CHIPPEWA VALLEY RGNL EAU CLAIRE WI
ENW KENOSHA RGNL KENOSHA WI
Page 3 of 4
FAA Contract Tower Closure List
(149 FCTs)
3-22-2013
LOC
ID Facility Name City State
JVL SOUTHERN WISCONSIN RGNL JANESVILLE WI
LSE LA CROSSE MUNI LA CROSSE WI
MWC LAWRENCE J TIMMERMAN MILWAUKEE WI
OSH WITTMAN RGNL OSHKOSH WI
UES WAUKESHA COUNTY WAUKESHA WI
HLG WHEELING OHIO CO WHEELING WV
LWB GREENBRIER VALLEY LEWISBURG WV
PKB MID-OHIO VALLEY RGNL PARKERSBURG WV
Page 4 of 4

Your assignment is to create a list of tuples containing the tower IDs and the states they are located in.

Here is the expected output:

faa = [('DHN', 'AL'), ('TCL', 'AL'), ..., ('PKB', 'WV')]

import re
data
'FAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013\nLOC\nID Facility Name City State\nDHN DOTHAN RGNL DOTHAN AL\nTCL TUSCALOOSA RGNL TUSCALOOSA AL\nFYV DRAKE FIELD FAYETTEVILLE AR\nTXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR\nGEU GLENDALE MUNI GLENDALE AZ\nGYR PHOENIX GOODYEAR GOODYEAR AZ\nIFP LAUGHLIN/BULLHEAD INTL BULLHEAD CITY AZ\nRYN RYAN FIELD TUCSON AZ\nFUL FULLERTON MUNI FULLERTON CA\nMER CASTLE ATWATER CA\nOXR OXNARD OXNARD CA\nRAL RIVERSIDE MUNI RIVERSIDE CA\nRNM RAMONA RAMONA CA\nSAC SACRAMENTO EXECUTIVE SACRAMENTO CA\nSDM BROWN FIELD MUNI SAN DIEGO CA\nSNS SALINAS MUNI SALINAS CA\nVCV SOUTHERN CALIFORNIA LOGISTICS VICTORVILLE CA\nWHP WHITEMAN LOS ANGELES CA\nWJF GENERAL WM J FOX AIRFIELD LANCASTER CA\nBDR IGOR I SIKORSKY MEMORIAL BRIDGEPORT CT\nDXR DANBURY MUNI DANBURY CT\nGON GROTON-NEW LONDON GROTON (NEW LONDON) CT\nHFD HARTFORD-BRAINARD HARTFORD CT\nHVN TWEED-NEW HAVEN NEW HAVEN CT\nOXC WATERBURY-OXFORD OXFORD CT\nAPF NAPLES MUNI NAPLES FL\nBCT BOCA RATON BOCA RATON FL\nEVB NEW SMYRNA BEACH MUNI NEW SMYRNA BEACH FL\nFMY PAGE FIELD FORT MYERS FL\nHWO NORTH PERRY HOLLYWOOD FL\nLAL LAKELAND LINDER RGNL LAKELAND FL\nLEE LEESBURG INTL LEESBURG FL\nOCF OCALA INTL-JIM TAYLOR FIELD OCALA FL\nOMN ORMOND BEACH MUNI ORMOND BEACH FL\nPGD PUNTA GORDA PUNTA GORDA FL\nSGJ NORTHEAST FLORIDA RGNL ST AUGUSTINE FL\nSPG ALBERT WHITTED ST PETERSBURG FL\nSUA WITHAM FIELD STUART FL\nTIX SPACE COAST RGNL TITUSVILLE FL\nABY SOUTHWEST GEORGIA RGNL ALBANY GA\nAHN ATHENS/BEN EPPS ATHENS GA\nLZU GWINNETT COUNTY - BRISCOE FIELD LAWRENCEVILLE GA\nMCN MIDDLE GEORGIA RGNL MACON GA\nRYY COBB COUNTY- MCCOLLUM FIELD ATLANTA GA\nDBQ DUBUQUE RGNL DUBUQUE IA\nIDA IDAHO FALLS RGNL IDAHO FALLS ID\nLWS LEWISTON-NEZ PERCE COUNTY LEWISTON ID\nPage 1 of 4\nFAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013\nLOC\nID Facility Name City State\nPIH POCATELLO RGNL POCATELLO ID\nSUN FRIEDMAN MEMORIAL HAILEY ID\nALN ST LOUIS RGNL ALTON/ST LOUIS IL\nBMI CENTRAL IL RGNL ARPT AT BLOOMINGTON- 
NORMAL BLOOMINGTON/ NORMAL IL\nDEC DECATUR DECATUR IL\nMDH SOUTHERN ILLINOIS CARBONDALE/ MURPHYSBORO IL\nUGN WAUKEGAN RGNL CHICAGO/ WAUKEGAN IL\nBAK COLUMBUS MUNI COLUMBUS IN\nGYY GARY/CHICAGO INTL GARY IN\nHUT HUTCHINSON MUNI HUTCHINSON KS\nIXD NEW CENTURY AIRCENTER OLATHE KS\nMHK MANHATTAN RGNL MANHATTAN KS\nOJC JOHNSON COUNTY EXECUTIVE OLATHE KS\nTOP PHILIP BILLARD MUNI TOPEKA KS\nOWB OWENSBORO-DAVIESS COUNTY OWENSBORO KY\nPAH BARKLEY RGNL PADUCAH KY\nDTN SHREVEPORT DOWNTOWN SHREVEPORT LA\nBVY BEVERLY MUNI BEVERLY MA\nEWB NEW BEDFORD RGNL NEW BEDFORD MA\nLWM LAWRENCE MUNI LAWRENCE MA\nORH WORCESTER RGNL WORCESTER MA\nOWD NORWOOD MEMORIAL NORWOOD MA\nESN EASTON/NEWNAM FIELD EASTON MD\nFDK FREDERICK MUNI FREDERICK MD\nHGR HAGERSTOWN RGNL- RICHARD A HENSON FLD HAGERSTOWN MD\nMTN MARTIN STATE BALTIMORE MD\nSBY SALISBURY-OCEAN CITY WICOMICO RGNL SALISBURY MD\nBTL W K KELLOGG BATTLE CREEK MI\nDET COLEMAN A. YOUNG MUNI DETROIT MI\nSAW SAWYER INTL MARQUETTE MI\nANE ANOKA COUNTY-BLAINE ARPT(JANES FIELD) MINNEAPOLIS MN\nSTC ST CLOUD RGNL ST CLOUD MN\nBBG BRANSON BRANSON MO\nCOU COLUMBIA RGNL COLUMBIA MO\nGLH MID DELTA RGNL GREENVILLE MS\nHKS HAWKINS FIELD JACKSON MS\nHSA STENNIS INTL (HSA) BAY ST LOUIS MS\nOLV OLIVE BRANCH OLIVE BRANCH MS\nTUP TUPELO RGNL TUPELO MS\nGPI GLACIER PARK INTL KALISPELL MT\nEWN COASTAL CAROLINA REGIONAL NEW BERN NC\nHKY HICKORY RGNL HICKORY NC\nINT SMITH REYNOLDS WINSTON SALEM NC\nISO KINSTON RGNL JETPORT AT STALLINGS FLD KINSTON NC\nJQF CONCORD RGNL CONCORD NC\nASH BOIRE FIELD NASHUA NH\nTTN TRENTON MERCER TRENTON NJ\nPage 2 of 4\nFAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013\nLOC\nID Facility Name City State\nAEG DOUBLE EAGLE II ALBUQUERQUE NM\nSAF SANTA FE MUNI SANTA FE NM\nITH ITHACA TOMPKINS RGNL ITHACA NY\nRME GRIFFISS INTL ROME NY\nCGF CUYAHOGA COUNTY CLEVELAND OH\nOSU OHIO STATE UNIVERSITY COLUMBUS OH\nTZR BOLTON FIELD COLUMBUS OH\nLAW LAWTON-FORT SILL RGNL LAWTON OK\nOUN UNIVERSITY OF OKLAHOMA WESTHEIMER NORMAN OK\nPWA WILEY 
POST OKLAHOMA CITY OK\nSWO STILLWATER RGNL STILLWATER OK\nOTH SOUTHWEST OREGON RGNL NORTH BEND OR\nPDT EASTERN OREGON RGNL AT PENDLETON PENDLETON OR\nSLE MCNARY FLD SALEM OR\nTTD PORTLAND-TROUTDALE PORTLAND OR\nCXY CAPITAL CITY HARRISBURG PA\nLBE ARNOLD PALMER RGNL LATROBE PA\nLNS LANCASTER LANCASTER PA\nCRE GRAND STRAND NORTH MYRTLE BEACH SC\nGYH DONALDSON CENTER GREENVILLE SC\nHXD HILTON HEAD HILTON HEAD ISLAND SC\nMKL MC KELLAR-SIPES RGNL JACKSON TN\nNQA MILLINGTON RGNL JETPORT MILLINGTON TN\nBAZ NEW BRAUNFELS MUNI NEW BRAUNFELS TX\nBRO BROWNSVILLE/ SOUTH PADRE ISLAND INTL BROWNSVILLE TX\nCLL EASTERWOOD FIELD COLLEGE STATION TX\nCNW TSTC WACO WACO TX\nCXO LONE STAR EXECUTIVE HOUSTON TX\nGTU GEORGETOWN MUNI GEORGETOWN TX\nHYI SAN MARCOS MUNI SAN MARCOS TX\nRBD DALLAS EXECUTIVE DALLAS TX\nSGR SUGAR LAND RGNL HOUSTON TX\nSSF STINSON MUNI SAN ANTONIO TX\nTKI COLLIN COUNTY RGNL AT MC KINNEY DALLAS TX\nTYR TYLER POUNDS RGNL TYLER TX\nVCT VICTORIA RGNL VICTORIA TX\nOGD OGDEN-HINCKLEY OGDEN UT\nPVU PROVO MUNI PROVO UT\nLYH LYNCHBURG RGNL/ PRESTON GLENN FLD LYNCHBURG VA\nOLM OLYMPIA RGNL OLYMPIA WA\nRNT RENTON MUNI RENTON WA\nSFF FELTS FIELD SPOKANE WA\nTIW TACOMA NARROWS TACOMA WA\nYKM YAKIMA AIR TERMINAL/ MCALLISTER FIELD YAKIMA WA\nCWA CENTRAL WISCONSIN MOSINEE WI\nEAU CHIPPEWA VALLEY RGNL EAU CLAIRE WI\nENW KENOSHA RGNL KENOSHA WI\nPage 3 of 4\nFAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013\nLOC\nID Facility Name City State\nJVL SOUTHERN WISCONSIN RGNL JANESVILLE WI\nLSE LA CROSSE MUNI LA CROSSE WI\nMWC LAWRENCE J TIMMERMAN MILWAUKEE WI\nOSH WITTMAN RGNL OSHKOSH WI\nUES WAUKESHA COUNTY WAUKESHA WI\nHLG WHEELING OHIO CO WHEELING WV\nLWB GREENBRIER VALLEY LEWISBURG WV\nPKB MID-OHIO VALLEY RGNL PARKERSBURG WV\nPage 4 of 4\n'

Method 1: using re.search

  • Task 1: split the string on the newline character to create a list of lines
  • Task 2: write a function that keeps only the lines that:
    • start with a 3-character tower ID
    • end with a 2-letter state code
  • Task 3: apply the function to each line in the list.
lines = data.split('\n')
def filterStrings(line):
    # keep lines that start with a 3-character ID and end with a 2-letter state code
    match = re.search(r'^(\w{3}) .+ (\w{2})$', line)
    if match:
        return match.group(1), match.group(2)
faa = []
for line in lines:
    result = filterStrings(line)
    if result is not None:
        faa.append(result)
print(faa)
[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ'), ('GYR', 'AZ'), ('IFP', 'AZ'), ('RYN', 'AZ'), ('FUL', 'CA'), ('MER', 'CA'), ('OXR', 'CA'), ('RAL', 'CA'), ('RNM', 'CA'), ('SAC', 'CA'), ('SDM', 'CA'), ('SNS', 'CA'), ('VCV', 'CA'), ('WHP', 'CA'), ('WJF', 'CA'), ('BDR', 'CT'), ('DXR', 'CT'), ('GON', 'CT'), ('HFD', 'CT'), ('HVN', 'CT'), ('OXC', 'CT'), ('APF', 'FL'), ('BCT', 'FL'), ('EVB', 'FL'), ('FMY', 'FL'), ('HWO', 'FL'), ('LAL', 'FL'), ('LEE', 'FL'), ('OCF', 'FL'), ('OMN', 'FL'), ('PGD', 'FL'), ('SGJ', 'FL'), ('SPG', 'FL'), ('SUA', 'FL'), ('TIX', 'FL'), ('ABY', 'GA'), ('AHN', 'GA'), ('LZU', 'GA'), ('MCN', 'GA'), ('RYY', 'GA'), ('DBQ', 'IA'), ('IDA', 'ID'), ('LWS', 'ID'), ('PIH', 'ID'), ('SUN', 'ID'), ('ALN', 'IL'), ('BMI', 'IL'), ('DEC', 'IL'), ('MDH', 'IL'), ('UGN', 'IL'), ('BAK', 'IN'), ('GYY', 'IN'), ('HUT', 'KS'), ('IXD', 'KS'), ('MHK', 'KS'), ('OJC', 'KS'), ('TOP', 'KS'), ('OWB', 'KY'), ('PAH', 'KY'), ('DTN', 'LA'), ('BVY', 'MA'), ('EWB', 'MA'), ('LWM', 'MA'), ('ORH', 'MA'), ('OWD', 'MA'), ('ESN', 'MD'), ('FDK', 'MD'), ('HGR', 'MD'), ('MTN', 'MD'), ('SBY', 'MD'), ('BTL', 'MI'), ('DET', 'MI'), ('SAW', 'MI'), ('ANE', 'MN'), ('STC', 'MN'), ('BBG', 'MO'), ('COU', 'MO'), ('GLH', 'MS'), ('HKS', 'MS'), ('HSA', 'MS'), ('OLV', 'MS'), ('TUP', 'MS'), ('GPI', 'MT'), ('EWN', 'NC'), ('HKY', 'NC'), ('INT', 'NC'), ('ISO', 'NC'), ('JQF', 'NC'), ('ASH', 'NH'), ('TTN', 'NJ'), ('AEG', 'NM'), ('SAF', 'NM'), ('ITH', 'NY'), ('RME', 'NY'), ('CGF', 'OH'), ('OSU', 'OH'), ('TZR', 'OH'), ('LAW', 'OK'), ('OUN', 'OK'), ('PWA', 'OK'), ('SWO', 'OK'), ('OTH', 'OR'), ('PDT', 'OR'), ('SLE', 'OR'), ('TTD', 'OR'), ('CXY', 'PA'), ('LBE', 'PA'), ('LNS', 'PA'), ('CRE', 'SC'), ('GYH', 'SC'), ('HXD', 'SC'), ('MKL', 'TN'), ('NQA', 'TN'), ('BAZ', 'TX'), ('BRO', 'TX'), ('CLL', 'TX'), ('CNW', 'TX'), ('CXO', 'TX'), ('GTU', 'TX'), ('HYI', 'TX'), ('RBD', 'TX'), ('SGR', 'TX'), ('SSF', 'TX'), ('TKI', 'TX'), ('TYR', 'TX'), ('VCT', 'TX'), ('OGD', 'UT'), ('PVU', 'UT'), ('LYH', 'VA'), ('OLM', 'WA'), ('RNT', 'WA'), ('SFF', 'WA'), ('TIW', 'WA'), ('YKM', 'WA'), ('CWA', 'WI'), ('EAU', 'WI'), ('ENW', 'WI'), ('JVL', 'WI'), ('LSE', 'WI'), ('MWC', 'WI'), ('OSH', 'WI'), ('UES', 'WI'), ('HLG', 'WV'), ('LWB', 'WV'), ('PKB', 'WV')]

Method 2: using re.findall()

Since the dot metacharacter never matches a newline, each match is confined to a single line, so no anchors are needed here:

faa = re.findall(r'([A-Z]{3}) .+ ([A-Z]{2})', data)
faa
[('DHN', 'AL'),
 ('TCL', 'AL'),
 ('FYV', 'AR'),
 ('TXK', 'AR'),
 ('GEU', 'AZ'),
 ('GYR', 'AZ'),
 ('IFP', 'AZ'),
 ('RYN', 'AZ'),
 ('FUL', 'CA'),
 ('MER', 'CA'),
 ('OXR', 'CA'),
 ('RAL', 'CA'),
 ('RNM', 'CA'),
 ('SAC', 'CA'),
 ('SDM', 'CA'),
 ('SNS', 'CA'),
 ('VCV', 'CA'),
 ('WHP', 'CA'),
 ('WJF', 'CA'),
 ('BDR', 'CT'),
 ('DXR', 'CT'),
 ('GON', 'CT'),
 ('HFD', 'CT'),
 ('HVN', 'CT'),
 ('OXC', 'CT'),
 ('APF', 'FL'),
 ('BCT', 'FL'),
 ('EVB', 'FL'),
 ('FMY', 'FL'),
 ('HWO', 'FL'),
 ('LAL', 'FL'),
 ('LEE', 'FL'),
 ('OCF', 'FL'),
 ('OMN', 'FL'),
 ('PGD', 'FL'),
 ('SGJ', 'FL'),
 ('SPG', 'FL'),
 ('SUA', 'FL'),
 ('TIX', 'FL'),
 ('ABY', 'GA'),
 ('AHN', 'GA'),
 ('LZU', 'GA'),
 ('MCN', 'GA'),
 ('RYY', 'GA'),
 ('DBQ', 'IA'),
 ('IDA', 'ID'),
 ('LWS', 'ID'),
 ('PIH', 'ID'),
 ('SUN', 'ID'),
 ('ALN', 'IL'),
 ('BMI', 'IL'),
 ('DEC', 'IL'),
 ('MDH', 'IL'),
 ('UGN', 'IL'),
 ('BAK', 'IN'),
 ('GYY', 'IN'),
 ('HUT', 'KS'),
 ('IXD', 'KS'),
 ('MHK', 'KS'),
 ('OJC', 'KS'),
 ('TOP', 'KS'),
 ('OWB', 'KY'),
 ('PAH', 'KY'),
 ('DTN', 'LA'),
 ('BVY', 'MA'),
 ('EWB', 'MA'),
 ('LWM', 'MA'),
 ('ORH', 'MA'),
 ('OWD', 'MA'),
 ('ESN', 'MD'),
 ('FDK', 'MD'),
 ('HGR', 'MD'),
 ('MTN', 'MD'),
 ('SBY', 'MD'),
 ('BTL', 'MI'),
 ('DET', 'MI'),
 ('SAW', 'MI'),
 ('ANE', 'MN'),
 ('STC', 'MN'),
 ('BBG', 'MO'),
 ('COU', 'MO'),
 ('GLH', 'MS'),
 ('HKS', 'MS'),
 ('HSA', 'MS'),
 ('OLV', 'MS'),
 ('TUP', 'MS'),
 ('GPI', 'MT'),
 ('EWN', 'NC'),
 ('HKY', 'NC'),
 ('INT', 'NC'),
 ('ISO', 'NC'),
 ('JQF', 'NC'),
 ('ASH', 'NH'),
 ('TTN', 'NJ'),
 ('AEG', 'NM'),
 ('SAF', 'NM'),
 ('ITH', 'NY'),
 ('RME', 'NY'),
 ('CGF', 'OH'),
 ('OSU', 'OH'),
 ('TZR', 'OH'),
 ('LAW', 'OK'),
 ('OUN', 'OK'),
 ('PWA', 'OK'),
 ('SWO', 'OK'),
 ('OTH', 'OR'),
 ('PDT', 'OR'),
 ('SLE', 'OR'),
 ('TTD', 'OR'),
 ('CXY', 'PA'),
 ('LBE', 'PA'),
 ('LNS', 'PA'),
 ('CRE', 'SC'),
 ('GYH', 'SC'),
 ('HXD', 'SC'),
 ('MKL', 'TN'),
 ('NQA', 'TN'),
 ('BAZ', 'TX'),
 ('BRO', 'TX'),
 ('CLL', 'TX'),
 ('CNW', 'TX'),
 ('CXO', 'TX'),
 ('GTU', 'TX'),
 ('HYI', 'TX'),
 ('RBD', 'TX'),
 ('SGR', 'TX'),
 ('SSF', 'TX'),
 ('TKI', 'TX'),
 ('TYR', 'TX'),
 ('VCT', 'TX'),
 ('OGD', 'UT'),
 ('PVU', 'UT'),
 ('LYH', 'VA'),
 ('OLM', 'WA'),
 ('RNT', 'WA'),
 ('SFF', 'WA'),
 ('TIW', 'WA'),
 ('YKM', 'WA'),
 ('CWA', 'WI'),
 ('EAU', 'WI'),
 ('ENW', 'WI'),
 ('JVL', 'WI'),
 ('LSE', 'WI'),
 ('MWC', 'WI'),
 ('OSH', 'WI'),
 ('UES', 'WI'),
 ('HLG', 'WV'),
 ('LWB', 'WV'),
 ('PKB', 'WV')]
len(faa)
149

As a bonus task, use regular expressions to extract the number of closures listed in the second line of the file (149), and then use an assertion to check that the number of closures is equal to the length of the faa list.
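One possible approach to the bonus task (a sketch, assuming the count always appears as “(N FCTs)” in the file header; the header string below is copied from the output above):

```python
import re

# header snippet from faa.txt, as shown earlier
header = 'FAA Contract Tower Closure List\n(149 FCTs)\n3-22-2013'
closures = int(re.search(r'\((\d+) FCTs\)', header).group(1))
assert closures == 149   # in the real data, compare against len(faa)
```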

Example 2: Stack Overflow reputation

I have downloaded my Stack Overflow reputation history into the file reputation.txt, which is stored in the data directory of the course repository. (If you are a Stack Overflow user with a reputation of 10 or more, you should be able to download your own reputation history.)

We are only interested in the lines that begin with two dashes, such as:

-- 2012-08-30 rep +5 = 6

That line can be interpreted as follows: “On 2012-08-30, my reputation increased by 5, bringing my reputation total to 6.”

import re
# read the file into a single string
with open('../data/reputation.txt') as f:
    data = f.read()
len(data)
2205
data
'total votes: 36\n 2  12201376 (5)\n-- 2012-08-30 rep +5    = 6         \n 2  13822612 (10)\n-- 2012-12-11 rep +10   = 16        \n 2  13822612 (10)\n-- 2013-03-20 rep +10   = 26        \n-- 2013-12-05 rep 0     = 26        \n-- 2014-01-25 rep 0     = 26        \n 16  7141669 (2)\n-- 2014-03-19 rep +2    = 28        \n 1  12202249 (2)\n-- 2014-05-11 rep +2    = 30        \n 16 23599806 (2)\n 2  23597220 (10)\n-- 2014-05-12 rep +12   = 42        \n 2  13822612 (10)\n-- 2014-06-12 rep +10   = 52        \n 2  23597220 (10)\n-- 2014-06-26 rep +10   = 62        \n-- 2014-07-05 rep 0     = 62        \n-- 2014-09-02 rep 0     = 62        \n 2  23597220 (10)\n-- 2014-09-03 rep +10   = 72        \n-- 2014-10-28 rep 0     = 72        \n 2  23597220 (10)\n-- 2014-11-14 rep +10   = 82        \n 16 12107971 (2)\n-- 2014-11-18 rep +2    = 84        \n 16  3621018 (2)\n-- 2014-12-08 rep +2    = 86        \n 2  23597220 (10)\n-- 2014-12-09 rep +10   = 96        \n 16 16328613 (2)\n-- 2014-12-12 rep +2    = 98        \n 2  23597220 (10)\n-- 2014-12-24 rep +10   = 108       \n-- 2015-02-03 rep 0     = 108       \n 2  23597220 (10)\n-- 2015-02-20 rep +10   = 118       \n 2  23597220 (10)\n-- 2015-03-28 rep +10   = 128       \n 2  23597220 (10)\n-- 2015-04-26 rep +10   = 138       \n 2  13822612 (10)\n-- 2015-05-05 rep +10   = 148       \n 2  23597220 (10)\n-- 2015-05-26 rep +10   = 158       \n 2  23597220 (10)\n 2  23597220 (10)\n-- 2015-05-27 rep +20   = 178       \n-- 2015-06-09 rep 0     = 178       \n 2  23597220 (10)\n-- 2015-07-03 rep +10   = 188       \n-- 2015-07-06 rep 0     = 188       \n 2  23597220 (10)\n-- bonuses   (100)\n-- 2015-07-22 rep +110  = 298       \n 2  23597220 (10)\n-- 2015-08-21 rep +10   = 308       \n 2  23597220 (10)\n-- 2015-09-07 rep +10   = 318       \n 3   1839257 (-1)\n-- 2015-10-14 rep -1    = 317       \n\n** rep today: 0\n** rep this week (2015-11-08 - 2015-11-14): 0\n** rep this month (2015-11-01 - 2015-11-30): 0\n** rep this quarter 
(2015-10-01 - 2015-12-31): -1\n** rep this year (2015-01-01 - 2015-12-31): 109\n** rep from bonuses: 100\n** total rep 317 :)\n\ndays represented 34\nrep cap was reached via rep from upvotes *only* on 0 days\nearned at least 200 reputation on 0 days\nearned 10 reputation from suggested edits\n'
st_lines = data.split('\n')
print(st_lines)
['total votes: 36', ' 2  12201376 (5)', '-- 2012-08-30 rep +5    = 6         ', ' 2  13822612 (10)', '-- 2012-12-11 rep +10   = 16        ', ' 2  13822612 (10)', '-- 2013-03-20 rep +10   = 26        ', '-- 2013-12-05 rep 0     = 26        ', '-- 2014-01-25 rep 0     = 26        ', ' 16  7141669 (2)', '-- 2014-03-19 rep +2    = 28        ', ' 1  12202249 (2)', '-- 2014-05-11 rep +2    = 30        ', ' 16 23599806 (2)', ' 2  23597220 (10)', '-- 2014-05-12 rep +12   = 42        ', ' 2  13822612 (10)', '-- 2014-06-12 rep +10   = 52        ', ' 2  23597220 (10)', '-- 2014-06-26 rep +10   = 62        ', '-- 2014-07-05 rep 0     = 62        ', '-- 2014-09-02 rep 0     = 62        ', ' 2  23597220 (10)', '-- 2014-09-03 rep +10   = 72        ', '-- 2014-10-28 rep 0     = 72        ', ' 2  23597220 (10)', '-- 2014-11-14 rep +10   = 82        ', ' 16 12107971 (2)', '-- 2014-11-18 rep +2    = 84        ', ' 16  3621018 (2)', '-- 2014-12-08 rep +2    = 86        ', ' 2  23597220 (10)', '-- 2014-12-09 rep +10   = 96        ', ' 16 16328613 (2)', '-- 2014-12-12 rep +2    = 98        ', ' 2  23597220 (10)', '-- 2014-12-24 rep +10   = 108       ', '-- 2015-02-03 rep 0     = 108       ', ' 2  23597220 (10)', '-- 2015-02-20 rep +10   = 118       ', ' 2  23597220 (10)', '-- 2015-03-28 rep +10   = 128       ', ' 2  23597220 (10)', '-- 2015-04-26 rep +10   = 138       ', ' 2  13822612 (10)', '-- 2015-05-05 rep +10   = 148       ', ' 2  23597220 (10)', '-- 2015-05-26 rep +10   = 158       ', ' 2  23597220 (10)', ' 2  23597220 (10)', '-- 2015-05-27 rep +20   = 178       ', '-- 2015-06-09 rep 0     = 178       ', ' 2  23597220 (10)', '-- 2015-07-03 rep +10   = 188       ', '-- 2015-07-06 rep 0     = 188       ', ' 2  23597220 (10)', '-- bonuses   (100)', '-- 2015-07-22 rep +110  = 298       ', ' 2  23597220 (10)', '-- 2015-08-21 rep +10   = 308       ', ' 2  23597220 (10)', '-- 2015-09-07 rep +10   = 318       ', ' 3   1839257 (-1)', '-- 2015-10-14 rep -1    = 317       ', '', '** rep 
today: 0', '** rep this week (2015-11-08 - 2015-11-14): 0', '** rep this month (2015-11-01 - 2015-11-30): 0', '** rep this quarter (2015-10-01 - 2015-12-31): -1', '** rep this year (2015-01-01 - 2015-12-31): 109', '** rep from bonuses: 100', '** total rep 317 :)', '', 'days represented 34', 'rep cap was reached via rep from upvotes *only* on 0 days', 'earned at least 200 reputation on 0 days', 'earned 10 reputation from suggested edits', '']
st_lines[2]
'-- 2012-08-30 rep +5    = 6         '
re.findall(r'-- (\d{4}-\d{2}-\d{2}) .+ ([-+]?\d+) .+= (\d+)', st_lines[2])
[('2012-08-30', '+5', '6')]
re.findall(r'-- (\d{4}-\d{2}-\d{2}) .+ ([-+]?\d+) .+= (\d+)', data)
[('2012-08-30', '+5', '6'),
 ('2012-12-11', '+10', '16'),
 ('2013-03-20', '+10', '26'),
 ('2013-12-05', '0', '26'),
 ('2014-01-25', '0', '26'),
 ('2014-03-19', '+2', '28'),
 ('2014-05-11', '+2', '30'),
 ('2014-05-12', '+12', '42'),
 ('2014-06-12', '+10', '52'),
 ('2014-06-26', '+10', '62'),
 ('2014-07-05', '0', '62'),
 ('2014-09-02', '0', '62'),
 ('2014-09-03', '+10', '72'),
 ('2014-10-28', '0', '72'),
 ('2014-11-14', '+10', '82'),
 ('2014-11-18', '+2', '84'),
 ('2014-12-08', '+2', '86'),
 ('2014-12-09', '+10', '96'),
 ('2014-12-12', '+2', '98'),
 ('2014-12-24', '+10', '108'),
 ('2015-02-03', '0', '108'),
 ('2015-02-20', '+10', '118'),
 ('2015-03-28', '+10', '128'),
 ('2015-04-26', '+10', '138'),
 ('2015-05-05', '+10', '148'),
 ('2015-05-26', '+10', '158'),
 ('2015-05-27', '+20', '178'),
 ('2015-06-09', '0', '178'),
 ('2015-07-03', '+10', '188'),
 ('2015-07-06', '0', '188'),
 ('2015-07-22', '+110', '298'),
 ('2015-08-21', '+10', '308'),
 ('2015-09-07', '+10', '318'),
 ('2015-10-14', '-1', '317')]
reputation = re.findall(r'-- (\d{4}-\d{2}-\d{2}) .+ ([-+]?\d+) .+= (\d+)', data)
import pandas as pd
reputation_df = pd.DataFrame(reputation, columns=['date', 'score', 'cumulative_score'])
reputation_df.score = reputation_df.score.astype(int)
reputation_df.cumulative_score = reputation_df.cumulative_score.astype(int)
print(reputation_df.info())
reputation_df
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34 entries, 0 to 33
Data columns (total 3 columns):
date                34 non-null object
score               34 non-null int64
cumulative_score    34 non-null int64
dtypes: int64(2), object(1)
memory usage: 896.0+ bytes
None
date score cumulative_score
0 2012-08-30 5 6
1 2012-12-11 10 16
2 2013-03-20 10 26
3 2013-12-05 0 26
4 2014-01-25 0 26
5 2014-03-19 2 28
6 2014-05-11 2 30
7 2014-05-12 12 42
8 2014-06-12 10 52
9 2014-06-26 10 62
10 2014-07-05 0 62
11 2014-09-02 0 62
12 2014-09-03 10 72
13 2014-10-28 0 72
14 2014-11-14 10 82
15 2014-11-18 2 84
16 2014-12-08 2 86
17 2014-12-09 10 96
18 2014-12-12 2 98
19 2014-12-24 10 108
20 2015-02-03 0 108
21 2015-02-20 10 118
22 2015-03-28 10 128
23 2015-04-26 10 138
24 2015-05-05 10 148
25 2015-05-26 10 158
26 2015-05-27 20 178
27 2015-06-09 0 178
28 2015-07-03 10 188
29 2015-07-06 0 188
30 2015-07-22 110 298
31 2015-08-21 10 308
32 2015-09-07 10 318
33 2015-10-14 -1 317

Your assignment is to create a list of tuples containing only these dated entries, including the date, reputation change (regardless of whether it is positive/negative/zero), and running total.

Here is the expected output:

rep = [('2012-08-30', '+5', '6'), ('2012-12-11', '+10', '16'), ..., ('2015-10-14', '-1', '317')]

As a bonus task, convert this list of tuples into a pandas DataFrame. It should have appropriate column names, and the second and third columns should be of type integer (rather than string/object).

Alternatives

Alternatives define multiple patterns, any one of which can produce a match. The alternatives are separated by a pipe (|) and are usually wrapped in parentheses to limit the scope of the alternation:

s = 'I live at 100 First St, which is around the corner.'
re.search(r'\d+ .+ (Ave|St|Rd)', s).group()
'100 First St'
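Note that with re.findall(), a capturing group changes what is returned: you would get only 'St' back. A non-capturing group (?:...) keeps the alternation without capturing it (a small sketch on the same string):

```python
import re

s = 'I live at 100 First St, which is around the corner.'
# (?:...) groups the alternatives without creating a capture group
re.findall(r'\d+ .+? (?:Ave|St|Rd)', s)   # ['100 First St']
```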
with open('../data/yelp.csv') as f:
    reviews = f.read()
len(reviews)
8091031
reviews.split('\n')[:10]
['business_id,date,review_id,stars,text,type,user_id,cool,useful,funny',
 '9yKzy9PApeiPPOUJEtnvkg,2011-01-26,fWKvX83p0-ka4JS3dc6E5A,5,"My wife took me here on my birthday for breakfast and it was excellent.  The weather was perfect which made sitting outside overlooking their grounds an absolute pleasure.  Our waitress was excellent and our food arrived quickly on the semi-busy Saturday morning.  It looked like the place fills up pretty quickly so the earlier you get here the better.',
 '',
 "Do yourself a favor and get their Bloody Mary.  It was phenomenal and simply the best I've ever had.  I'm pretty sure they only use ingredients from their garden and blend them fresh when you order it.  It was amazing.",
 '',
 'While EVERYTHING on the menu looks excellent, I had the white truffle scrambled eggs vegetable skillet and it was tasty and delicious.  It came with 2 pieces of their griddled bread with was amazing and it absolutely made the meal complete.  It was the best ""toast"" I\'ve ever had.',
 '',
 'Anyway, I can\'t wait to go back!",review,rLtl8ZkDX5vH5nAx9C3q5Q,2,5,0',
 'ZRJwVLyzEJq1VAihDhYiow,2011-07-27,IjZ33sJrzXqU-0X6U8NwyA,5,"I have no idea why some people give bad reviews about this place. It goes to show you, you can please everyone. They are probably griping about something that their own fault...there are many people like that.',
 '']
re.findall(r'Mr \w+', reviews)
['Mr H',
 'Mr Chao',
 'Mr Chao',
 'Mr Mu',
 'Mr MS',
 'Mr Chu',
 'Mr don',
 'Mr Goodcents',
 'Mr and',
 'Mr didn',
 'Mr Owner',
 'Mr Sparky']
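Alternation is handy for honorifics like these. A sketch on a made-up sample (the sample string is hypothetical, not taken from yelp.csv):

```python
import re

sample = 'Ask Mr Chao, Mrs Lee, or Dr Smith for directions.'
# the alternation tries 'Mr' first, then backtracks to 'Mrs' where needed
re.findall(r'(?:Mr|Mrs|Dr) \w+', sample)   # ['Mr Chao', 'Mrs Lee', 'Dr Smith']
```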

Substitution

s = 'my twitter account is @shra1, my emails are shravan@hotmail.com and shravan@yahoo.com'
# regex to find hotmail and yahoo
re.findall(r'@(hotmail|yahoo)\.com', s)
['hotmail', 'yahoo']
re.findall(r'\w+@[\w.]+', s)
['shravan@hotmail.com', 'shravan@yahoo.com']
re.findall(r'(\w+)@[\w.]+', s)
['shravan', 'shravan']
re.findall(r'((\w+)@[\w.]+)', s)
[('shravan@hotmail.com', 'shravan'), ('shravan@yahoo.com', 'shravan')]
# replace hotmail.com and yahoo.com with gmail.com
re.sub(r'@(hotmail|yahoo)\.com', r'@gmail.com', s)
'my twitter account is @shra1, my emails are shravan@gmail.com and shravan@gmail.com'
re.sub(r'(\w+)@[\w.]+', r'\1@gmail.com', s)
'my twitter account is @shra1, my emails are shravan@gmail.com and shravan@gmail.com'

Anchors

Anchors match a position rather than a character: ^ ties the pattern to the start of the string, and $ to the end.

s = 'my twitter account is @shra1, my emails are shravan@hotmail.com and shravan@yahoo.com'
re.search(r'[\w.@]+$', s).group()
'shravan@yahoo.com'
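The ^ anchor works the same way at the start of the string (a quick sketch on the same string):

```python
import re

s = 'my twitter account is @shra1, my emails are shravan@hotmail.com and shravan@yahoo.com'
# ^ anchors the match to the beginning of the string
re.search(r'^\w+', s).group()   # 'my'
```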

IMDB Exercise

import pandas as pd
imdb = pd.read_csv('../data/imdb_100.csv')
print(imdb.columns)
Index(['star_rating', 'title', 'content_rating', 'genre', 'duration',
       'actors_list'],
      dtype='object')
print(imdb.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
star_rating       100 non-null float64
title             100 non-null object
content_rating    100 non-null object
genre             100 non-null object
duration          100 non-null int64
actors_list       100 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 4.8+ KB
None
titles = imdb.title.tolist()
titles
['The Shawshank Redemption',
 'The Godfather',
 'The Godfather: Part II',
 'The Dark Knight',
 'Pulp Fiction',
 '12 Angry Men',
 'The Good, the Bad and the Ugly',
 'The Lord of the Rings: The Return of the King',
 "Schindler's List",
 'Fight Club',
 'The Lord of the Rings: The Fellowship of the Ring',
 'Inception',
 'Star Wars: Episode V - The Empire Strikes Back',
 'Forrest Gump',
 'The Lord of the Rings: The Two Towers',
 'Interstellar',
 "One Flew Over the Cuckoo's Nest",
 'Seven Samurai',
 'Goodfellas',
 'Star Wars',
 'The Matrix',
 'City of God',
 "It's a Wonderful Life",
 'The Usual Suspects',
 'Se7en',
 'Life Is Beautiful',
 'Once Upon a Time in the West',
 'The Silence of the Lambs',
 'Leon: The Professional',
 'City Lights',
 'Spirited Away',
 'The Intouchables',
 'Casablanca',
 'Whiplash',
 'American History X',
 'Modern Times',
 'Saving Private Ryan',
 'Raiders of the Lost Ark',
 'Rear Window',
 'Psycho',
 'The Green Mile',
 'Sunset Blvd.',
 'The Pianist',
 'The Dark Knight Rises',
 'Gladiator',
 'Terminator 2: Judgment Day',
 'Memento',
 'Taare Zameen Par',
 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb',
 'The Departed',
 'Cinema Paradiso',
 'Apocalypse Now',
 'The Great Dictator',
 'The Prestige',
 'Back to the Future',
 'The Lion King',
 'The Lives of Others',
 'Alien',
 'Paths of Glory',
 'Django Unchained',
 '3 Idiots',
 'Grave of the Fireflies',
 'The Shining',
 'M',
 'WALL-E',
 'Witness for the Prosecution',
 'Munna Bhai M.B.B.S.',
 'American Beauty',
 'Das Boot',
 'Princess Mononoke',
 'Amelie',
 'North by Northwest',
 'Rang De Basanti',
 'Jodaeiye Nader az Simin',
 'Citizen Kane',
 'Aliens',
 'Vertigo',
 'Oldeuboi',
 'Once Upon a Time in America',
 'Double Indemnity',
 'Star Wars: Episode VI - Return of the Jedi',
 'Toy Story 3',
 'Braveheart',
 'To Kill a Mockingbird',
 'Requiem for a Dream',
 'Lawrence of Arabia',
 'A Clockwork Orange',
 'Bicycle Thieves',
 'The Kid',
 'Swades',
 'Reservoir Dogs',
 'Eternal Sunshine of the Spotless Mind',
 'Taxi Driver',
 'Dilwale Dulhania Le Jayenge',
 "Singin' in the Rain",
 'All About Eve',
 'Yojimbo',
 'The Sting',
 'Rashomon',
 'Amadeus']
s = 'The Shawshank Redemption'
re.findall(r'^(A|An|The) ', s, flags=re.IGNORECASE)
['The']
[re.sub(r'^(A|An|The) (.+)', r'\2, \1', title) for title in titles]
['Shawshank Redemption, The',
 'Godfather, The',
 'Godfather: Part II, The',
 'Dark Knight, The',
 'Pulp Fiction',
 '12 Angry Men',
 'Good, the Bad and the Ugly, The',
 'Lord of the Rings: The Return of the King, The',
 "Schindler's List",
 'Fight Club',
 'Lord of the Rings: The Fellowship of the Ring, The',
 'Inception',
 'Star Wars: Episode V - The Empire Strikes Back',
 'Forrest Gump',
 'Lord of the Rings: The Two Towers, The',
 'Interstellar',
 "One Flew Over the Cuckoo's Nest",
 'Seven Samurai',
 'Goodfellas',
 'Star Wars',
 'Matrix, The',
 'City of God',
 "It's a Wonderful Life",
 'Usual Suspects, The',
 'Se7en',
 'Life Is Beautiful',
 'Once Upon a Time in the West',
 'Silence of the Lambs, The',
 'Leon: The Professional',
 'City Lights',
 'Spirited Away',
 'Intouchables, The',
 'Casablanca',
 'Whiplash',
 'American History X',
 'Modern Times',
 'Saving Private Ryan',
 'Raiders of the Lost Ark',
 'Rear Window',
 'Psycho',
 'Green Mile, The',
 'Sunset Blvd.',
 'Pianist, The',
 'Dark Knight Rises, The',
 'Gladiator',
 'Terminator 2: Judgment Day',
 'Memento',
 'Taare Zameen Par',
 'Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb',
 'Departed, The',
 'Cinema Paradiso',
 'Apocalypse Now',
 'Great Dictator, The',
 'Prestige, The',
 'Back to the Future',
 'Lion King, The',
 'Lives of Others, The',
 'Alien',
 'Paths of Glory',
 'Django Unchained',
 '3 Idiots',
 'Grave of the Fireflies',
 'Shining, The',
 'M',
 'WALL-E',
 'Witness for the Prosecution',
 'Munna Bhai M.B.B.S.',
 'American Beauty',
 'Das Boot',
 'Princess Mononoke',
 'Amelie',
 'North by Northwest',
 'Rang De Basanti',
 'Jodaeiye Nader az Simin',
 'Citizen Kane',
 'Aliens',
 'Vertigo',
 'Oldeuboi',
 'Once Upon a Time in America',
 'Double Indemnity',
 'Star Wars: Episode VI - Return of the Jedi',
 'Toy Story 3',
 'Braveheart',
 'To Kill a Mockingbird',
 'Requiem for a Dream',
 'Lawrence of Arabia',
 'Clockwork Orange, A',
 'Bicycle Thieves',
 'Kid, The',
 'Swades',
 'Reservoir Dogs',
 'Eternal Sunshine of the Spotless Mind',
 'Taxi Driver',
 'Dilwale Dulhania Le Jayenge',
 "Singin' in the Rain",
 'All About Eve',
 'Yojimbo',
 'Sting, The',
 'Rashomon',
 'Amadeus']
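As a variation (not shown in the original), the same move-the-article substitution can be written with named groups, which makes the replacement string self-documenting. `art` and `rest` are names I chose for illustration:

```python
import re

title = 'The Shawshank Redemption'

# (?P<name>...) names a group; \g<name> references it in the replacement,
# so the substitution reads as "rest, article" instead of "\2, \1"
moved = re.sub(r'^(?P<art>A|An|The) (?P<rest>.+)', r'\g<rest>, \g<art>', title)
print(moved)  # Shawshank Redemption, The
```

Named groups are especially helpful once a pattern has more than two or three groups and the numeric backreferences become hard to keep straight.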

Part 15: re.VERBOSE to improve readability of regex

# read the file into a single string
with open('../data/reputation.txt') as f:
    data = f.read()
reputation = re.findall(r'-- (\d{4}-\d{2}-\d{2}) .+ ([-+]?\d+) .+= (\d+)', data)
reputation[:5]
[('2012-08-30', '+5', '6'),
 ('2012-12-11', '+10', '16'),
 ('2013-03-20', '+10', '26'),
 ('2013-12-05', '0', '26'),
 ('2014-01-25', '0', '26')]

When you use re.VERBOSE, unescaped whitespace in the pattern is ignored (and everything after an unescaped # is treated as a comment), which is why we get back an empty list with no matches. So what is the purpose of this flag?

Well, the main use of this flag is to improve the readability of your regular expressions. If you come back a few days later and look at this regex, it will look really complex. With the re.VERBOSE flag, you can make it much easier to read.

re.findall(r'-- (\d{4}-\d{2}-\d{2}) .+ ([-+]?\d+) .+= (\d+)', data, flags=re.VERBOSE)
[]

We need to escape the spaces for re.VERBOSE to work.

re.findall(r'--\ (\d{4}-\d{2}-\d{2})\ .+\ ([-+]?\d+)\ .+=\ (\d+)', data, flags=re.VERBOSE)
[('2012-08-30', '+5', '6'),
 ('2012-12-11', '+10', '16'),
 ('2013-03-20', '+10', '26'),
 ('2013-12-05', '0', '26'),
 ('2014-01-25', '0', '26'),
 ('2014-03-19', '+2', '28'),
 ('2014-05-11', '+2', '30'),
 ('2014-05-12', '+12', '42'),
 ('2014-06-12', '+10', '52'),
 ('2014-06-26', '+10', '62'),
 ('2014-07-05', '0', '62'),
 ('2014-09-02', '0', '62'),
 ('2014-09-03', '+10', '72'),
 ('2014-10-28', '0', '72'),
 ('2014-11-14', '+10', '82'),
 ('2014-11-18', '+2', '84'),
 ('2014-12-08', '+2', '86'),
 ('2014-12-09', '+10', '96'),
 ('2014-12-12', '+2', '98'),
 ('2014-12-24', '+10', '108'),
 ('2015-02-03', '0', '108'),
 ('2015-02-20', '+10', '118'),
 ('2015-03-28', '+10', '128'),
 ('2015-04-26', '+10', '138'),
 ('2015-05-05', '+10', '148'),
 ('2015-05-26', '+10', '158'),
 ('2015-05-27', '+20', '178'),
 ('2015-06-09', '0', '178'),
 ('2015-07-03', '+10', '188'),
 ('2015-07-06', '0', '188'),
 ('2015-07-22', '+110', '298'),
 ('2015-08-21', '+10', '308'),
 ('2015-09-07', '+10', '318'),
 ('2015-10-14', '-1', '317')]
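Escaping every space is tedious. As an alternative sketch (not from the original), the shorthand `\s`, which matches any whitespace character and is not stripped by re.VERBOSE, can stand in for the escaped spaces. Tested here on a single made-up line in the same format:

```python
import re

data = '-- 2012-08-30 rep +5    = 6'

# \s survives re.VERBOSE because it is an escape sequence,
# so no backslash before each literal space is needed
matches = re.findall(r'--\s(\d{4}-\d{2}-\d{2})\s.+\s([-+]?\d+)\s.+=\s(\d+)',
                     data, flags=re.VERBOSE)
print(matches)  # [('2012-08-30', '+5', '6')]
```

Note that `\s` also matches tabs and newlines, so it is slightly looser than a literal escaped space.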

In Python, a multi-line string is delimited by triple quotes ('''). Let's change the above regex to use a multi-line string.

# Lets take a peek at the original data
print(data[:300])
total votes: 36
 2  12201376 (5)
-- 2012-08-30 rep +5    = 6         
 2  13822612 (10)
-- 2012-12-11 rep +10   = 16        
 2  13822612 (10)
-- 2013-03-20 rep +10   = 26        
-- 2013-12-05 rep 0     = 26        
-- 2014-01-25 rep 0     = 26        
 16  7141669 (2)
-- 2014-03-19 rep +2    = 28
re.findall(r'''
--\                         # starting with two dashes and a space
(\d{4}-\d{2}-\d{2})\        # followed by a date and a space (first match group)
.+\                         # followed by one or more characters and a space
([-+]?\d+)\                 # followed by an optional - or + and some digits (second match group)
.+=\                        # followed by one or more characters, then = and a space
(\d+)                       # followed by one or more digits (third match group)
''', data, flags=re.VERBOSE)
[('2012-08-30', '+5', '6'),
 ('2012-12-11', '+10', '16'),
 ('2013-03-20', '+10', '26'),
 ('2013-12-05', '0', '26'),
 ('2014-01-25', '0', '26'),
 ('2014-03-19', '+2', '28'),
 ('2014-05-11', '+2', '30'),
 ('2014-05-12', '+12', '42'),
 ('2014-06-12', '+10', '52'),
 ('2014-06-26', '+10', '62'),
 ('2014-07-05', '0', '62'),
 ('2014-09-02', '0', '62'),
 ('2014-09-03', '+10', '72'),
 ('2014-10-28', '0', '72'),
 ('2014-11-14', '+10', '82'),
 ('2014-11-18', '+2', '84'),
 ('2014-12-08', '+2', '86'),
 ('2014-12-09', '+10', '96'),
 ('2014-12-12', '+2', '98'),
 ('2014-12-24', '+10', '108'),
 ('2015-02-03', '0', '108'),
 ('2015-02-20', '+10', '118'),
 ('2015-03-28', '+10', '128'),
 ('2015-04-26', '+10', '138'),
 ('2015-05-05', '+10', '148'),
 ('2015-05-26', '+10', '158'),
 ('2015-05-27', '+20', '178'),
 ('2015-06-09', '0', '178'),
 ('2015-07-03', '+10', '188'),
 ('2015-07-06', '0', '188'),
 ('2015-07-22', '+110', '298'),
 ('2015-08-21', '+10', '308'),
 ('2015-09-07', '+10', '318'),
 ('2015-10-14', '-1', '317')]

Another example of making regular expressions more readable.

# read the file into a single string
with open('../data/faa.txt') as f:
    data = f.read()
print(data[:300])
FAA Contract Tower Closure List
(149 FCTs)
3-22-2013
LOC
ID Facility Name City State
DHN DOTHAN RGNL DOTHAN AL
TCL TUSCALOOSA RGNL TUSCALOOSA AL
FYV DRAKE FIELD FAYETTEVILLE AR
TXK TEXARKANA RGNL-WEBB FIELD TEXARKANA AR
GEU GLENDALE MUNI GLENDALE AZ
GYR PHOENIX GOODYEAR GOODYEAR AZ
IFP LAUGHLIN/BULL
faa = re.findall(r'([A-Z]{3}) .+ ([A-Z]{2})', data)
faa[:5]
[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ')]
# make the above regex more readable
# Step 1, use re.VERBOSE
# Step 2, escape the spaces
# Step 3, use a multi-line string
# Step 4, put each part of the pattern on its own line
# Step 5, align the parts and add comments (with a space after #)
faa = re.findall(r'''
([A-Z]{3})\        # line starts with three capital letters and a space
.+\                # followed by one or more characters and a space
([A-Z]{2})         # ending in two capital letters
''', data, flags=re.VERBOSE)
faa[:5]
[('DHN', 'AL'), ('TCL', 'AL'), ('FYV', 'AR'), ('TXK', 'AR'), ('GEU', 'AZ')]

Part 16: re.compile()

s = '-- 2012-08-30 rep +5    = 6'
# find the date in the above string
date = re.compile(r'\d{4}-\d{2}-\d{2}')
# Method 1, call 'search' on the compiled pattern object
date.search(s).group()
'2012-08-30'
# Method 2, pass the compiled pattern to re.search as the pattern
re.search(date, s).group()
'2012-08-30'
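Compiling pays off when the same pattern is applied many times, since the pattern is parsed only once. A small sketch (with made-up input lines in the reputation format) reusing the compiled object:

```python
import re

lines = ['-- 2012-08-30 rep +5    = 6',
         '-- 2012-12-11 rep +10   = 16']

# compile once, then reuse the pattern object for every line
date = re.compile(r'\d{4}-\d{2}-\d{2}')
dates = [date.search(line).group() for line in lines]
print(dates)  # ['2012-08-30', '2012-12-11']
```

The re module also caches recently used patterns internally, so for a handful of calls the speed difference is small; the bigger win is giving the pattern a reusable name.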

Span

date.search(s).span()
(3, 13)
s[3:13]
'2012-08-30'
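Besides span(), the match object exposes the two endpoints individually via start() and end(), which is handy when you only need one of them:

```python
import re

s = '-- 2012-08-30 rep +5    = 6'
m = re.search(r'\d{4}-\d{2}-\d{2}', s)

# start() and end() are the same numbers span() returns, separately
print(m.start(), m.end())    # 3 13
print(s[m.start():m.end()])  # 2012-08-30
```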
