[Python-il] problem in script

Yitzhak Wiener Yitzhak.Wiener at dspg.com
Wed Jan 20 15:47:00 IST 2010

Wow, Benny, this was great coaching.

I appreciate it so much.

The reason I opened it as array is because I indeed need to edit 16bit
int's in raw data section of this file.




Best Regards,



From: beni.cherniavsky at gmail.com [mailto:beni.cherniavsky at gmail.com] On
Behalf Of Beni Cherniavsky
Sent: Wednesday, January 20, 2010 12:19 PM
To: Yitzhak Wiener
Cc: python-il at hamakor.org.il
Subject: Re: [Python-il] problem in script



On Tue, Jan 19, 2010 at 18:11, Yitzhak Wiener <Yitzhak.Wiener at dspg.com> wrote:

Hi Guys,


May I ask you a question? 

I am trying to write a script that looks for some string expression
(expression A) in a file, and after it finds it, searches for 2 other
expressions (B & C) which are located a few lines after the first.

These 2 expressions appear a few times in this file, that's why I need to
search for expression A first; the next occurrence of B & C after it is
what I am searching for.

If the expressions are fixed strings, you don't really need regexps -
just use str.index() which takes optional start,stop parameters:


a_pos = s.index("MultiProgPage_Code at c0 - SECTION HEADER")

b_pos = s.index("s_paddr", a_pos)

c_pos = s.index("s_size", b_pos)  # or a_pos?


[If any of these never occurs, .index() will raise ValueError]
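A minimal sketch of the same chain with the ValueError case handled, written for Python 3. The helper name find_after and the sample text are hypothetical, made up only for illustration; the real dump will look different.

```python
def find_after(s, needle, start=0):
    """Return the index of needle in s at or after start, or None if absent."""
    try:
        return s.index(needle, start)
    except ValueError:
        return None


# Hypothetical sample text standing in for the real dump.
sample = (
    "s_paddr 0x0000\n"
    "MultiProgPage_Code at c0 - SECTION HEADER\n"
    "s_paddr 0x1000\n"
    "s_size 0x0200\n"
)

a_pos = find_after(sample, "MultiProgPage_Code at c0 - SECTION HEADER")
b_pos = find_after(sample, "s_paddr", a_pos)  # skips the s_paddr before the header
c_pos = find_after(sample, "s_size", b_pos)
```

Because every position is an index into the full string, the results can be used directly for slicing later on.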


If you need the flexibility of regexps, the module-level re.search()
doesn't take start,stop parameters, but you can slice the string itself:


a_match = re.search("MultiProgPage_Code at c0 - SECTION HEADER", s)

a_pos = a_match.start()

b_match = re.search("s_paddr", s[a_pos:])

b_pos = a_pos + b_match.start()  # positions are relative to the slice

c_match = re.search("s_size", s[b_pos:])  # or s[a_pos:]?
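As an alternative to slicing, compiled pattern objects do accept a start position: pattern.search(string, pos). A short sketch for Python 3 follows; the sample text and the variable names are hypothetical. With this form the match positions stay absolute, so no offset arithmetic is needed.

```python
import re

# Hypothetical sample text standing in for the real dump.
sample = (
    "s_paddr 0x0000\n"
    "MultiProgPage_Code at c0 - SECTION HEADER\n"
    "s_paddr 0x1000\n"
    "s_size 0x0200\n"
)

header_re = re.compile(re.escape("MultiProgPage_Code at c0 - SECTION HEADER"))
paddr_re = re.compile("s_paddr")
size_re = re.compile("s_size")

a_match = header_re.search(sample)
# The second argument is the position in the full string to start from.
b_match = paddr_re.search(sample, a_match.end())
c_match = size_re.search(sample, b_match.end())
```

Each search returns None when nothing is found, so real code would check the result before calling .end() on it.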


But the whole point of regular expressions is that you can also express
"A, then B, then C" at once:


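match = re.search("MultiProgPage_Code at c0 - SECTION HEADER.*?(s_paddr).*?(s_size)", s, re.DOTALL)

b_pos = match.start(1)

c_pos = match.start(2)

[re.DOTALL is needed so that .* can cross line breaks; the non-greedy
.*? stops at the first s_paddr and s_size after the header.]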


If you don't know the order of s_paddr/s_size, the regexp is much

I guess you want to look for things after "s_paddr", "s_size", so you
want match.end(1).
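A sketch of using match.end() to pick up whatever follows each name, for Python 3. The "name value" layout of the sample text is an assumption made only for illustration; the real dump format may differ, and the whitespace split would then need adjusting.

```python
import re

# Hypothetical sample text; the "name value" layout is an assumption.
sample = (
    "s_paddr 0x0000\n"
    "MultiProgPage_Code at c0 - SECTION HEADER\n"
    "s_paddr 0x1000\n"
    "s_size 0x0200\n"
)

pattern = re.compile(
    r"MultiProgPage_Code at c0 - SECTION HEADER.*?(s_paddr).*?(s_size)",
    re.DOTALL,  # let .*? span several lines
)
match = pattern.search(sample)

# match.end(n) is the index just past group n, i.e. where its value begins.
after_paddr = sample[match.end(1):].split(None, 1)[0]
after_size = sample[match.end(2):].split(None, 1)[0]
```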


=> Of these 3 ways, the first is probably simplest and cleanest.


You seem to be parsing a COFF file, right?

Regexps are not well-suited to parsing binary formats.

The manual way to parse them is to work with strings, and the
array/struct module to parse specific parts.

(See my advice below, mixed in with your code.)


If you intend to do a lot with COFF, consider the hachoir
<http://bit.ly/hachoir>  and Construct
<http://construct.wikispaces.com/>  frameworks.

They allow parsing/modifying binary formats in a declarative way - your
code looks like a description of the format, not like actions needed to
parse it.

And they have built-in definitions for a lot of formats.  E.g. both have
ELF and PE (windows exe format) though not COFF.

Note however that PE is based on COFF, so I guess you can massage it a
little and get a full COFF parser...



	I attached the script I use for finding expression A, but now I
don't know how to tell the script to start searching for expression B &
C from point A.


Some notes how your code can be simplified in Python:


	from array import array


	import os, stat, re


	#get coff file size

	file_size = os.stat("project_release.dump")[stat.ST_SIZE]


Since python 2.2, the result of os.stat still pretends to be a tuple but
can also be accessed with named attributes:
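For illustration, a small Python 3 sketch comparing the two access styles. It uses a temporary file as a hypothetical stand-in for project_release.dump so that it runs anywhere.

```python
import os
import stat
import tempfile

# Hypothetical stand-in for project_release.dump: a 10-byte temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 10)
    path = tmp.name

info = os.stat(path)
size_by_index = info[stat.ST_SIZE]  # tuple-style access, as in the script
size_by_attr = info.st_size         # named attribute, same value

os.remove(path)
```

os.path.getsize(path) returns the same number if the size is all that is needed.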



	a = array('H')

	f =  open("project_release.dump","rb")

	f2 =  open("project_release_out.dump","wb")

IMHO, it's cleaner to write a function that takes a string and returns a
and do all file reading/writing at the end, where you call the function.

This one is a question of taste, you might well disagree...
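One possible shape for that separation, sketched for Python 3. The names transform and main are hypothetical, and transform is only a placeholder that returns an unmodified copy; the real editing logic would replace its body.

```python
def transform(data: bytes) -> bytes:
    """Hypothetical placeholder: return a (possibly modified) copy of the contents."""
    patched = bytearray(data)
    # Real edits to `patched` would go here.
    return bytes(patched)


def main(in_path: str, out_path: str) -> None:
    """Do all file reading and writing in one place, around the pure function."""
    with open(in_path, "rb") as src:
        data = src.read()
    result = transform(data)
    with open(out_path, "wb") as dst:
        dst.write(result)
```

A call such as main("project_release.dump", "project_release_out.dump") would then be the only line that touches the disk, and transform can be tried out on small byte strings without any files.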


	a.fromfile( f,(file_size/2) )

	s = a.tostring()

Why use an array object to read the file, when all you seem to do with
it is convert it to a string?
I'd simply do:


s = open("project_release.dump","rb").read()


Then, if/when you need to parse parts of it as 16-bit ints, convert
those parts to arrays: array('H', s[start:stop])

This also gives you the flexibility to parse different parts as
different types.  See also the struct module.
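A sketch of parsing and editing a slice as 16-bit ints in Python 3, where the file contents are a bytes object. The buffer, the offsets and the little-endian byte order are assumptions for illustration; array('H') uses the machine's native byte order, hence the byteswap guard.

```python
import struct
import sys
from array import array

# Hypothetical 8-byte buffer standing in for the file contents:
# four little-endian 16-bit words 1, 2, 3, 4.
s = struct.pack("<4H", 1, 2, 3, 4)
start, stop = 2, 6  # hypothetical byte offsets of the part to edit

# Parse just that part as unsigned 16-bit ints.
words = array("H", s[start:stop])
if sys.byteorder != "little":
    words.byteswap()  # data is assumed little-endian
values = list(words)

# struct makes the byte order explicit and reads a single field in place.
(second,) = struct.unpack_from("<H", s, start)

# Edit one word and splice the result back into a new bytes object.
words[0] = 0xBEEF
if sys.byteorder != "little":
    words.byteswap()  # back to the on-disk byte order
patched = s[:start] + words.tobytes() + s[stop:]
```

The length of the slice has to be a multiple of the item size (2 bytes for 'H' on common platforms), otherwise array() raises ValueError.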


Note that reading the file and then constructing the array() also saves
checking the size and calling a.fromfile() separately!



	#search in coff for beginning of "MultiProgPage_Code" code section in coff file.

	#We need the beginning address and size of this section

	pattern = re.compile ("MultiProgPage_Code at c0 - SECTION HEADER")

	result = pattern.search(s)


You don't have to separately compile regexps - just directly call
functions like re.search(regexp_string, s).

[Compilation was supposed to improve performance when you use the same
regexp a lot, 
but the re module has a cache of compiled regexps, so it usually doesn't


And as I said above, s.index() is probably simpler than regexps for your


	#result is MatchObject, and therefore result.start() holds the location of expression A in the file.

	#now we need to find the value of the first time s_paddr and s_size are found after expression A



Beni Cherniavsky-Paskin <cben at users.sf.net>

DSP Group, Inc. automatically scans all emails and attachments using
MessageLabs Email Security System.
