[Python-il] [python-il]location in file
cben at users.sf.net
Thu May 27 12:40:40 IDT 2010
On Wed, May 26, 2010 at 19:47, Yitzhak Wiener <Yitzhak.Wiener at dspg.com> wrote:
> Hi Shai,
> It worked. Thanks.
> Adding the '?' after the '*' solved the time problem. I found it in the
> python documentation but didn't really understand the logic of that. Why
> it has effect? The '*' is before so it should still be greedy according
> to logic!? Shouldn't it?
As Rani said, "*?" is special syntax. Historically, regexp syntax started
out simple and elegant, but every time people wanted a new feature they had
to pick some combination of characters that wasn't used before (most of
these sticking a question mark in unnatural places), resulting in the
incredible mess that it is today :-(
See http://www.regular-expressions.info/repeat.html for details on the
precise meaning of greedy vs. lazy (aka non-greedy) repetition.
[That site is an excellent (though a bit technical) resource on everything
humanity knows about regular expressions.]
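
To make the greedy vs. lazy difference concrete, here is a minimal sketch.
The sample string is made up for illustration; only the standard `re` module
is used.

```python
import re

text = "<a> and <b>"  # made-up sample input

# Greedy: ".*" grabs as much as possible, then gives characters back
# one at a time until the rest of the pattern (">") can match.
greedy = re.search(r"<.*>", text).group()

# Lazy: ".*?" grabs as little as possible, and extends only when the
# rest of the pattern fails to match.
lazy = re.search(r"<.*?>", text).group()

print(repr(greedy))
print(repr(lazy))
```

The greedy version runs to the last ">" in the string, while the lazy one
stops at the first.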
Note that laziness only changes the *order* in which the regexp engine tries
the possibilities; it doesn't change which possibilities there are.
So when your file does match your pattern, the lazy version will match much
faster; but if your file doesn't have the expected structure, it will still
need to go through all possibilities until it concludes that it can't match,
which could take a lot of time.
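
A small sketch of that point, using made-up inputs and simplified patterns
(not your real file format): when a match exists, the two versions may pick
different matches, but when no match exists both give the same answer, and
both have to exhaust every possibility to get there.

```python
import re

# Simplified stand-ins for the real patterns (an assumption for this demo).
greedy = re.compile(r"SECTION HEADER.*RAW DATA:", re.DOTALL)
lazy = re.compile(r"SECTION HEADER.*?RAW DATA:", re.DOTALL)

good = "SECTION HEADER\nfoo\nRAW DATA: 1\nRAW DATA: 2\n"
bad = "SECTION HEADER\nfoo\nbar\n"  # no "RAW DATA:" anywhere

# On matching input, both succeed, but they stop at different places.
greedy_good = greedy.search(good).group()
lazy_good = lazy.search(good).group()

# On non-matching input, laziness doesn't help: both return None.
greedy_bad = greedy.search(bad)
lazy_bad = lazy.search(bad)

print(repr(greedy_good))
print(repr(lazy_good))
print(greedy_bad, lazy_bad)
```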
Compare that to your line-by-line loop logic:
You read until you find the "SECTION HEADER";
from that point you read until you find "RAW DATA:";
from that point you collect the hex words.
But you never go back to revise previous decisions!
E.g. you never say "this line doesn't look like RAW DATA, let's go back and
try to look for another SECTION HEADER" :-)
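
The loop logic described above could be sketched roughly like this. It is
not Rani's actual code: the function name, the state names and the
"8 hex digits" definition of a hex word are all assumptions made for
illustration.

```python
import re

# Assumption: a "hex word" is exactly 8 hex digits. Adjust to the real format.
HEX_WORD = re.compile(r"\b[0-9A-Fa-f]{8}\b")

def collect_hex_words(lines):
    """Walk forward through the lines once, never revisiting a decision."""
    state = "want_header"
    words = []
    for line in lines:
        if state == "want_header":
            if "SECTION HEADER" in line:
                state = "want_raw_data"   # committed: never look for it again
        elif state == "want_raw_data":
            if "RAW DATA:" in line:
                state = "collecting"
        else:  # state == "collecting"
            words.extend(HEX_WORD.findall(line))
    return words

sample = [
    "junk",
    "SECTION HEADER",
    "size: 2",
    "RAW DATA:",
    "DEADBEEF 00000001",
    "0000FFFF",
]
result = collect_hex_words(sample)
print(result)
```

Each line is looked at exactly once, so the running time is proportional to
the file size no matter what the file contains.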
A regexp engine (by default) does this kind of stuff (called "backtracking")
all the time!
It gives it a lot of flexibility, but in your case you don't want that.
Lazy repetition is a partial fix - it makes it try the simplest case you
want *first*, which is usually enough.
The true fix would be to tell it "once you found the SECTION HEADER, commit
to that decision and never go back". There is an advanced regexp construct
(http://www.regular-expressions.info/atomic.html) to do that, but Python's
re module doesn't support it (as of this writing).
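
For the curious, there is a well-known workaround that gets the same
"commit and never go back" effect in engines without atomic groups: capture
inside a lookahead, then match the captured text with a backreference. This
is a sketch on a toy pattern, not something tuned for your file. (Newer
Python versions, 3.11 and later, also accept the atomic group syntax
"(?>...)" directly.)

```python
import re

text = "aaa"

# Ordinary repetition: "a+" first takes all three a's, then backtracks
# and gives one back so that the final "a" can match.
plain = re.match(r"a+a", text)

# Atomic-like: the lookahead captures "aaa" into group 1 and the engine
# never re-enters a lookahead to try a shorter capture. "\1" then consumes
# "aaa", nothing is left for the final "a", and the whole match fails.
atomic_like = re.match(r"(?=(a+))\1a", text)

print(plain.group())
print(atomic_like)
```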
1. If you didn't understand any of the above, don't worry. It's slightly
advanced.
2. When simple program logic (like Rani sent you) does the job, stick to it.
It's better for beginners because you understand perfectly what's going on.
3. When you *grow tired* of programming such logic, or when you need to
match complex patterns that you have no idea how to program yourself,
take the time to learn regular expressions.
They're a powerful tool (for *some* uses) that every programmer should know,
but they're more of a black box - and one that will bite you if you don't
understand how the box works!
Mastering them will pay off - but be warned that you will make a lot of
mistakes on the way.
Beni Cherniavsky-Paskin <cben at users.sf.net>