[Python-il] problem in script

Ori Peleg oripel at gmail.com
Wed Jan 20 16:06:53 IST 2010


The "search" and "match" methods of compiled regular expression
objects accept optional "pos" and "endpos" arguments that limit the search
range, so the string does not have to be sliced.
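
A minimal sketch of that, assuming Python 3 and a made-up sample string (only
the marker names come from the thread). The point is that positions returned
by a search started at "pos" are still offsets into the full string:

```python
import re

# Hypothetical sample text; the real file contents are not shown in the thread.
text = "s_paddr=1 ... SECTION HEADER ... s_paddr=2 ... s_size=3"

header = re.compile(r"SECTION HEADER")
paddr = re.compile(r"s_paddr")

a_match = header.search(text)

# pos: start searching only after expression A.
b_match = paddr.search(text, a_match.end())

# endpos: search only the part of the string before expression A.
earlier = paddr.search(text, 0, a_match.start())
```

Here b_match points at the second "s_paddr", while earlier points at the one
that precedes the header.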

On Wed, Jan 20, 2010 at 3:47 PM, Yitzhak Wiener <Yitzhak.Wiener at dspg.com> wrote:

>  Wow, Benny, this was great coaching.
>
> I appreciate it so much.
>
> The reason I opened it as an array is that I do need to edit 16-bit
> ints in the raw data section of this file.
>
> Best Regards,
>
> Yitzhak
>   ------------------------------
>
> *From:* beni.cherniavsky at gmail.com [mailto:beni.cherniavsky at gmail.com] *On
> Behalf Of *Beni Cherniavsky
> *Sent:* Wednesday, January 20, 2010 12:19 PM
> *To:* Yitzhak Wiener
> *Cc:* python-il at hamakor.org.il
>
> *Subject:* Re: [Python-il] problem in script
>
>
>
>
>
> On Tue, Jan 19, 2010 at 18:11, Yitzhak Wiener <Yitzhak.Wiener at dspg.com>
> wrote:
>
> Hi Guys,
>
>
>
> May I ask you a question?
>
> I am trying to write a script that looks for a string expression
> (expression A) in a file and, after finding it, searches for two other
> expressions (B & C) located a few lines after the first expression.
>
> These two expressions appear several times in the file, which is why I need
> to search for expression A first; the next occurrences of B & C after it are
> what I am looking for.
>
> If the expressions are fixed strings, you don't really need regexps - just
> use str.index() which takes optional start,stop parameters:
>
>
>
> *a_pos = s.index("MultiProgPage_Code at c0 - SECTION HEADER")*
>
> *b_pos = s.index("s_paddr", a_pos)*
>
> *c_pos = s.index("s_size", b_pos)  # or a_pos?*
>
>
>
> [If any of these never occurs, .index() will raise ValueError]
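
The str.index() chain above could be wrapped up roughly like this. It is only
a sketch, assuming Python 3; the helper name find_in_order and the sample text
are made up for illustration, and each marker is searched for after the end of
the previous hit:

```python
def find_in_order(s, markers):
    """Return the start offset of each marker, each searched after the previous hit."""
    positions = []
    start = 0
    for marker in markers:
        pos = s.index(marker, start)  # raises ValueError if the marker is missing
        positions.append(pos)
        start = pos + len(marker)
    return positions


# Hypothetical sample: an unrelated s_size appears before the header.
sample = "s_size=0\nHEADER\ns_paddr=16\ns_size=32\n"

try:
    offsets = find_in_order(sample, ["HEADER", "s_paddr", "s_size"])
except ValueError:
    offsets = None  # one of the markers never occurs after the previous one
```

The early "s_size=0" is skipped because the search for "s_size" only starts
after the "s_paddr" hit.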
>
>
>
> If you need the flexibility of regexps, the module-level re functions don't
> take start,stop parameters, but you can slice the string itself:
>
>
>
> *a_match = re.search("MultiProgPage_Code at c0 - SECTION HEADER", s)*
>
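> *b_match = re.search("s_paddr", s[a_match.start():])*
>
> *c_match = re.search("s_size", s[a_match.start() + b_match.start():])  # or s[a_match.start():]?*
>
> [Match positions are relative to the slice, so add the slice offset back
> to get a position in s.]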
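
One thing to watch with the slicing approach: the match object reports offsets
within the slice, not within the original string. A small sketch, assuming
Python 3 and a made-up sample string:

```python
import re

# Hypothetical sample: an unrelated s_paddr appears before the header.
s = "s_paddr=0\nHEADER\ns_paddr=16\n"

a_match = re.search("HEADER", s)
base = a_match.end()                      # where the slice starts in s

b_match = re.search("s_paddr", s[base:])  # searched in the slice only
b_pos = base + b_match.start()            # convert back to an offset in s
```

Using b_match.start() directly as an index into s would point at the wrong
place; b_pos is the position in the full string.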
>
>
>
> But the whole point of regular expressions is that you can also express "A,
> then B, then C" at once:
>
>
>
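> *match = re.search("MultiProgPage_Code at c0 - SECTION HEADER.*?(s_paddr).*?(s_size)", s, re.DOTALL)*
>
> [re.DOTALL lets "." match newlines, since B and C are on later lines than A;
> the non-greedy ".*?" stops at the first occurrence after A.]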
>
> *b_pos = match.start(1)*
>
> *c_pos = match.start(2)*
>
>
>
> If you don't know the order of s_paddr/s_size, the regexp is much trickier.
>
> I guess you want to look for things after "s_paddr", "s_size", so you want
> match.end(1).
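
To make the single-regexp variant concrete, here is a sketch assuming Python 3.
The dump layout (a field name followed by a hex value on each line) is a guess
for illustration only; the thread does not show the real file. The pattern
needs re.DOTALL to cross line breaks, and the groups capture the values that
follow "s_paddr" and "s_size":

```python
import re

# Hypothetical dump text; the field layout is an assumption.
dump = (
    "s_paddr 0x0000\n"
    "MultiProgPage_Code - SECTION HEADER\n"
    "s_paddr 0x00c0\n"
    "s_size 0x0200\n"
)

pattern = re.compile(
    r"SECTION HEADER.*?s_paddr\s+(0x[0-9a-fA-F]+).*?s_size\s+(0x[0-9a-fA-F]+)",
    re.DOTALL,  # let "." match newlines so the pattern can span lines
)

m = pattern.search(dump)
paddr = int(m.group(1), 16)  # value after the first s_paddr following the header
size = int(m.group(2), 16)   # value after the first s_size following that
```

The "s_paddr 0x0000" line before the header is ignored because the pattern is
anchored on the header text first.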
>
>
>
> => Of these 3 ways, the first is probably simplest and cleanest.
>
>
>
> You seem to be parsing a COFF file, right?
>
> Regexps are not well-suited to parsing binary formats.
>
> The manual way to parse them is to work with strings and use the
> array/struct modules to parse specific parts.
>
> (See my advice below, mixed with your code.)
>
>
>
> If you intend to do a lot with COFF, consider the hachoir
> <http://bit.ly/hachoir> and Construct <http://construct.wikispaces.com/>
> frameworks.
>
> They allow parsing/modifying binary formats in a *declarative* way - your
> code looks like a *description* of the format, not like *actions* needed
> to parse it.
>
> And they have built-in definitions for a lot of formats.  E.g. both have
> ELF and PE (windows exe format) though not COFF.
>
> *Note however that PE is based on COFF, so I guess you can massage it a
> little and get a full COFF parser...*
>
>
>
>
>
> I attached the script I use for finding expression A, but now I don’t know
> how to tell the script to start searching for expression B & C from point A.
>
>
>
>  Some notes on how your code can be simplified in Python:
>
>
>
>  *from array import array*
>
>
> *import os, stat, re*
>
>
> *#get coff file size*
>
> *file_size = os.stat("project_release.dump")[stat.ST_SIZE]*
>
>
>  Since Python 2.2, the result of os.stat still behaves like a tuple but
> can also be accessed through named attributes:
>
> *os.stat("project_release.dump").st_size*
>
>
> *a = array('H')*
>
> *f =  open("project_release.dump","rb")*
>
> *f2 =  open("project_release_out.dump","wb")*
>
>  IMHO, it's cleaner to write a function that takes a string and returns a
> string,
> and do all file reading/writing at the end, where you call the function.
>
> This one is a question of taste, you might well disagree...
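
A sketch of that structure, assuming Python 3 (where the file contents are
bytes rather than str). The function names and the placeholder edit, which
zeroes the two bytes after a marker, are made up for illustration and are not
what the original script does:

```python
def patch_dump(data: bytes) -> bytes:
    """Return a modified copy of the input; no file access in here."""
    marker = b"HEADER"  # hypothetical marker
    pos = data.index(marker) + len(marker)
    # Placeholder edit: overwrite the two bytes that follow the marker.
    return data[:pos] + b"\x00\x00" + data[pos + 2:]


def main(in_path: str, out_path: str) -> None:
    """All file reading and writing happens here, around one function call."""
    with open(in_path, "rb") as f:
        data = f.read()
    with open(out_path, "wb") as f:
        f.write(patch_dump(data))
```

Keeping patch_dump free of file handling means it can be tried on a short
literal such as b"xxHEADER\x12\x34yy" without touching the disk.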
>
>
> *a.fromfile( f,(file_size/2) )*
>
> *s = a.tostring()*
>
> Why use an array object to read the file, when all you seem to do with it
> is convert it to a string?
> I'd simply do:
>
>
>
> *s = open("project_release.dump","rb").read()*
>
>
>
> Then, if/when you need to parse parts of it as 16-bit ints, convert those
> parts to arrays: *array('H', s[start:stop])*
>
> This also gives you the flexibility to parse different parts as different
> types.  See also the struct module.
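
For the 16-bit editing case, something along these lines should work. It is a
sketch assuming Python 3 (tobytes() replaces the older tostring()) and a
platform where array type 'H' is 2 bytes wide; the chunk below is a stand-in
for s[start:stop], not real file data:

```python
import struct
from array import array

# Stand-in for s[start:stop]: four 16-bit values in native byte order.
chunk = struct.pack("=4H", 1, 2, 3, 4)

words = array("H", chunk)   # parse the slice as unsigned 16-bit ints
words[2] = 0xBEEF           # edit one value in place
patched = words.tobytes()   # back to raw bytes, same length as the chunk

# struct is handy when the layout or byte order has to be spelled out.
first, second = struct.unpack_from("=2H", patched, 0)
```

The patched bytes can then be spliced back into the full contents with
s[:start] + patched + s[stop:]. If the file's byte order differs from the
machine's, struct with an explicit "<" or ">" prefix is the safer choice.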
>
>
>
> Note that reading the whole file and then constructing the array() also
> saves checking the size and calling a.fromfile() separately!
>
>
>
>
> *#search in coff for beginning of "MultiProgPage_Code" code section in
> coff file.*
>
> *#We need the beginning address and size of this section*
>
> *pattern = re.compile ("MultiProgPage_Code at c0 - SECTION HEADER")*
>
> *result = pattern.search(s)*
>
>
>  You don't have to separately compile regexps - just directly call
> functions like re.search(regexp_string, s).
>
> [Compilation was supposed to improve performance when you use the same
> regexp a lot,
> but the re module has a cache of compiled regexps, so it usually doesn't
> matter.]
>
>
>
> And as I said above, s.index() is probably simpler than regexps for your
> needs.
>
>
>
>  *#result is a MatchObject, and therefore result.start() holds the location
> of expression A in the file.*
>
> *#now we need to find the value of the first time s_paddr and s_size are
> found after expression A*
>
>
>
>
>
> --
> Beni Cherniavsky-Paskin <cben at users.sf.net>
>
>
> ______________________________________________________________________
> DSP Group, Inc. automatically scans all emails and attachments using
> MessageLabs Email Security System.
> _____________________________________________________________________
>
>
>
> _______________________________________________
> Python-il mailing list
> Python-il at hamakor.org.il
> http://hamakor.org.il/cgi-bin/mailman/listinfo/python-il
>
>


-- 
Check out my blog: http://orip.org