Regular expressions: repeating a capturing group and making the inner group non-capturing

In the dark old days when I was working on a Windows laptop, I used to use a tool called RegexBuddy to help me write regular expressions.

Up until recently, I didn't really get regular expressions. Working with Django forced me to change that and I'm so thankful I took the time to delve deeper into them. These days, I pretty much use them on a daily basis and it shaves so much time off of some mundane tasks that I can't believe I ever got by without them. (They're especially helpful since TextMate has a regular expression engine built into its Find and Replace dialog.)

Today, I needed to repeat a capture group that could occur one or more times in a string, and I kept getting just the last iteration. Some quick googling brought up a very informative page, "Repeating a Capturing Group vs. Capturing a Repeated Group", by the author of RegexBuddy.

It appears I was making a common mistake by repeating a capturing group instead of capturing a repeated group.
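To see the difference in isolation, here is a minimal sketch on a toy string. The input `"1,2,3,"` and the variable names are my own, chosen only for illustration:

```python
import re

text = "1,2,3,"

# Repeating a capturing group: the group is overwritten on every
# iteration, so only the last iteration is left at the end.
repeated_capture = re.match(r"(\d,)+", text).group(1)

# Capturing a repeated group: the quantifier sits inside the capturing
# group, so the group spans all iterations.
captured_repeat = re.match(r"((?:\d,)+)", text).group(1)

print(repeated_capture)  # 3,
print(captured_repeat)   # 1,2,3,
```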

The code in question is to parse Google App Engine datastore keys so that I capture the whole key path, including all ancestors. A sample string:

s = "datastore_types.Key.from_path('Parent', 1L, 'Child', 30L, _app=u'myapp')"

So my first attempt, the flawed one, was:

import re

# Flawed: the capturing group itself is repeated, so it keeps only the last iteration.
r = r"datastore_types\.Key\.from_path\(('.*?', \d*?L, )+_app=u'.*?'\)"
rc = re.compile(r)
rc.match(s).groups()

>>> ("'Child', 30L, ",)

What I should have written, to capture the repeated group:

# The outer group wraps the repetition, so it captures every iteration.
r = r"datastore_types\.Key\.from_path\((('.*?', \d*?L, )+)_app=u'.*?'\)"
rc = re.compile(r)
rc.match(s).groups()

>>> ("'Parent', 1L, 'Child', 30L, ", "'Child', 30L, ")

This returns both the outer group (the repeated group, which is what we want) and the last iteration of the inner group (which we don't care about).

To tidy it up further, you can make the inner group non-capturing with (?:...), so it no longer shows up in the results. The final version looks like this:

# (?:...) groups the repeated part without capturing it.
r = r"datastore_types\.Key\.from_path\(((?:'.*?', \d*?L, )+)_app=u'.*?'\)"
rc = re.compile(r)
rc.match(s).groups()

>>> ("'Parent', 1L, 'Child', 30L, ",)
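If you need each kind/ID pair on its own rather than the whole path as one string, one option is a second pass over the captured text with re.findall. This is only a sketch: the names `path_pattern`, `pair_pattern` and `pairs` are mine, and it assumes every ID is a run of digits followed by `L`, as in the sample string.

```python
import re

s = "datastore_types.Key.from_path('Parent', 1L, 'Child', 30L, _app=u'myapp')"

# The final pattern from above (dots escaped so they match literally):
# the outer group captures the whole key path.
path_pattern = re.compile(
    r"datastore_types\.Key\.from_path\(((?:'.*?', \d*?L, )+)_app=u'.*?'\)"
)
path = path_pattern.match(s).group(1)

# Second pass: findall returns one (kind, id) tuple per pair in the path.
pair_pattern = re.compile(r"'(.*?)', (\d+)L")
pairs = [(kind, int(id_)) for kind, id_ in pair_pattern.findall(path)]

print(pairs)  # [('Parent', 1), ('Child', 30)]
```

A single match only ever keeps one value per group, so a second pass like this is a simple way to get at the individual iterations.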

I may be more comfortable with regular expressions, but there's still so much to learn! :)

Update: And, like most things, the actual solution I ended up going with is much simpler:

r = r'datastore_types.Key.from_path\((.*?), _app'
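
For comparison, here is what that simpler pattern captures on the sample string, using re.match as in the earlier snippets. Note that the trailing comma and space are no longer part of the group, since the pattern consumes them just before `_app`:

```python
import re

s = "datastore_types.Key.from_path('Parent', 1L, 'Child', 30L, _app=u'myapp')"

# Lazily capture everything between the opening parenthesis and ", _app".
r = r'datastore_types.Key.from_path\((.*?), _app'
m = re.match(r, s)

print(m.groups())  # ("'Parent', 1L, 'Child', 30L",)
```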
