[Python-il] problem in script

Beni Cherniavsky cben at users.sf.net
Wed Jan 20 12:18:43 IST 2010


On Tue, Jan 19, 2010 at 18:11, Yitzhak Wiener <Yitzhak.Wiener at dspg.com> wrote:

>  Hi Guys,
>
>
>
> May I ask you a question?
>
> I am trying to write a script that looks for some string expression
> (expression A) in a file, and after it finds it, searches for 2 other
> expressions (B & C) which are located a few lines after the first expression.
>
> These 2 expressions appear a few times in this file, that’s why I need to
> search for expression A first, and the next time B & C appear is what
> I search for.
>

If the expressions are fixed strings, you don't really need regexps - just
use str.index() which takes optional start,stop parameters:

*a_pos = s.index("MultiProgPage_Code at c0 - SECTION HEADER")*
*b_pos = s.index("s_paddr", a_pos)*
*c_pos = s.index("s_size", b_pos)  # or a_pos?*

[If any of these never occurs, .index() will raise ValueError]
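
A minimal sketch of the chained .index() calls with the ValueError case handled. The sample text and the helper name find_in_order are made up for illustration; the real dump layout may well differ.

```python
def find_in_order(s, *needles):
    """Find each needle in turn, starting each search where the previous
    needle was found. Returns a list of positions, or None if one is missing."""
    positions = []
    pos = 0
    for needle in needles:
        try:
            pos = s.index(needle, pos)
        except ValueError:
            # .index() raises ValueError when the substring is absent
            return None
        positions.append(pos)
    return positions


# Hypothetical sample: an s_paddr appears before A and must be skipped.
sample = ("s_paddr 0x0\n"
          "MultiProgPage_Code at c0 - SECTION HEADER\n"
          "s_paddr 0x100\n"
          "s_size 0x20\n")

positions = find_in_order(sample,
                          "MultiProgPage_Code at c0 - SECTION HEADER",
                          "s_paddr",
                          "s_size")
```

Here positions holds the offsets of A, B and C in that order, and a missing expression gives None instead of an exception.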

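If you need the flexibility of regexps, the module-level re.search() doesn't
take start,stop parameters (a compiled pattern's .search(s, pos) does), but
you can slice the string itself. Positions found in a slice are relative to
the slice, so add the offset back:

*a_match = re.search("MultiProgPage_Code at c0 - SECTION HEADER", s)*
*b_match = re.search("s_paddr", s[a_match.start():])*
*b_pos = a_match.start() + b_match.start()*
*c_match = re.search("s_size", s[b_pos:])  # or s[a_match.start():]?*
*c_pos = b_pos + c_match.start()*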
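
As a rough check of the offset arithmetic, the sketch below finds B after A in two ways: by slicing and adding the slice offset back, and by passing a start position to a compiled pattern's search() method. The sample text is invented for the example.

```python
import re

# Hypothetical sample text; the real dump will look different.
s = ("s_paddr 0x0\n"
     "MultiProgPage_Code at c0 - SECTION HEADER\n"
     "s_paddr 0x100\n"
     "s_size 0x20\n")

a_match = re.search(r"MultiProgPage_Code at c0 - SECTION HEADER", s)

# Option 1: search a slice, then convert the slice-relative position
# back into a position in the whole string.
b_rel = re.search(r"s_paddr", s[a_match.end():])
b_pos_from_slice = a_match.end() + b_rel.start()

# Option 2: a compiled pattern's search() accepts a start position,
# and the resulting match positions refer to the whole string.
b_match = re.compile(r"s_paddr").search(s, a_match.end())
b_pos_from_pos = b_match.start()
```

Both routes should land on the same s_paddr, the one after the section header.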
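But the whole point of regular expressions is that you can also express "A,
then B, then C" at once (re.DOTALL lets "." cross line breaks, and the
non-greedy ".*?" stops at the first s_paddr/s_size after A rather than the
last):

*match = re.search("MultiProgPage_Code at c0 - SECTION HEADER.*?(s_paddr).*?(s_size)", s, re.DOTALL)*
*b_pos = match.start(1)*
*c_pos = match.start(2)*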

If you don't know the order of s_paddr/s_size, the regexp is much trickier.
I guess you want to look for things after "s_paddr", "s_size", so you want
match.end(1).
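
A sketch of the single-regexp approach on made-up sample text that contains s_paddr/s_size both before and after the section of interest. It assumes A, B and C sit on different lines, so it passes re.DOTALL and uses non-greedy ".*?"; re.escape() is used only to keep the fixed header text literal.

```python
import re

# Hypothetical sample with three sections; only the middle one is wanted.
s = ("s_paddr 0x0\ns_size 0x8\n"
     "MultiProgPage_Code at c0 - SECTION HEADER\n"
     "s_paddr 0x100\ns_size 0x20\n"
     "Other - SECTION HEADER\n"
     "s_paddr 0x200\ns_size 0x40\n")

header = re.escape("MultiProgPage_Code at c0 - SECTION HEADER")
# .*? is non-greedy, so each group matches the first occurrence after A;
# re.DOTALL makes "." match newlines too.
match = re.search(header + r".*?(s_paddr).*?(s_size)", s, re.DOTALL)

b_end = match.end(1)  # position just after "s_paddr"
c_end = match.end(2)  # position just after "s_size"
```

The text following each keyword then starts at s[b_end:] and s[c_end:].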

=> Of these 3 ways, the first is probably simplest and cleanest.

You seem to be parsing a COFF file, right?
Regexps are not well-suited to parsing binary formats.
The manual way to parse them is to work with strings, and use the array and
struct modules to parse specific parts.
(See my advice below, mixed with your code.)

If you intend to do a lot with COFF, consider the
hachoir <http://bit.ly/hachoir> and
Construct <http://construct.wikispaces.com/> frameworks.
They allow parsing/modifying binary formats in a *declarative* way - your
code looks like a *description* of the format, not like *actions* needed to
parse it.
And they have built-in definitions for a lot of formats.  E.g. both have ELF
and PE (windows exe format) though not COFF.
*Note however that PE is based on COFF, so I guess you can massage it a
little and get a full COFF parser...*


>
> I attached the script I use for finding expression A, but now I don’t know
> how to tell the script to start searching for expression B & C from point A.
>
>
>
Some notes on how your code can be simplified in Python:

> *from array import array*
> *import os, stat, re*
>
> *#get coff file size*
> *file_size = os.stat("project_release.dump")[stat.ST_SIZE]*
>
Since Python 2.2, the result of os.stat() still behaves like a tuple but can
also be accessed through named attributes:
*os.stat("project_release.dump").st_size*

> *a = array('H')*
> *f =  open("project_release.dump","rb")*
> *f2 =  open("project_release_out.dump","wb")*
>
IMHO, it's cleaner to write a function that takes a string and returns a
string, and do all file reading/writing at the end, where you call the
function.
This one is a question of taste, you might well disagree...
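
One possible shape for that split, as a sketch only: process() is a hypothetical placeholder for whatever the script should do to the data (here it returns the input unchanged), and main() is the only place that touches files.

```python
def process(data):
    """Hypothetical placeholder: take the file contents, return the new
    contents. The real transformation would go here."""
    return data


def main(in_path, out_path):
    # All file I/O lives here, at the edge of the program.
    with open(in_path, "rb") as f:
        data = f.read()
    result = process(data)
    with open(out_path, "wb") as f:
        f.write(result)
```

With this layout process() can be tried on small hand-made inputs without creating any files, and main("project_release.dump", "project_release_out.dump") does the real run.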

> *a.fromfile( f,(file_size/2) )*
> *s = a.tostring()*
>
Why use an array object to read the file, when all you seem to do with it is
convert it to a string?
I'd simply do:

*s = open("project_release.dump","rb").read()*

Then, if/when you need to parse parts of it as 16-bit ints, convert those
parts to arrays: *array('H', s[start:stop])*
This also gives you the flexibility to parse different parts as different
types.  See also the struct module.

Note that reading the whole file first and then constructing the array() from
it also saves checking the file size and calling a.fromfile() separately!
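
A small sketch of parsing slices with array and struct. The blob and its field layout are invented for the example and are not the COFF layout; it is written for Python 3, where a file opened with "rb" yields bytes rather than str.

```python
import struct
from array import array

# Made-up binary blob: four 16-bit words in native byte order,
# followed by one 32-bit little-endian value.
blob = array("H", [1, 2, 3, 4]).tobytes() + struct.pack("<I", 0xC0)

# Parse only the slice of interest as 16-bit unsigned ints.
# array uses the machine's native byte order.
words = array("H", blob[0:8])

# struct states the byte order explicitly and can mix field types.
(addr,) = struct.unpack("<I", blob[8:12])
```

array is convenient for a run of same-sized values, while struct suits fixed headers made of fields of different sizes.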

> *#search in coff for beginning of "MultiProgPage_Code" code section in
> coff file.*
> *#We need the beginning address and size of this section*
> *pattern = re.compile ("MultiProgPage_Code at c0 - SECTION HEADER")*
> *result = pattern.search(s)*
>
You don't have to separately compile regexps - just directly call functions
like re.search(regexp_string, s).
[Compilation was supposed to improve performance when you use the same
regexp a lot, but the re module has a cache of compiled regexps, so it
usually doesn't matter.]

And as I said above, s.index() is probably simpler than regexps for your
needs.


> *#result is MatchObject, and therefore result.start() holds the location
> of expression A in the file.*
> *#now we need to find the value of the first time s_paddr , and s_size are
> found after expression A *
>
-- 
Beni Cherniavsky-Paskin <cben at users.sf.net>