[Python-il] Determining if a string is RTL
aronovitch at gmail.com
Sun Jan 4 01:40:50 IST 2009
On Sat, Jan 3, 2009 at 11:11 PM, Dotan Cohen <dotancohen at gmail.com> wrote:
> 2009/1/3 Meir Kriheli <meir at mksoft.co.il>:
>> In that case he can use pygtk's pango module with the function Ori
>> pointed to:
1) Let us make things more explicit:
From the original regex, I guess you want to find whether the string
contains any RTL characters, which is different from the question of
finding the base direction (English text with some Hebrew words in the
middle still has LTR base direction).
In that case, the relevant function is pango_unichar_direction, and
from python it would look something like this:
>>> has_rtl = pango.DIRECTION_RTL in map(pango.unichar_direction, text.decode("utf-8"))
If you wish to find the base direction instead, the code would be something like:
>>> base_is_rtl = pango.DIRECTION_RTL == pango.find_base_dir(text, len(text))
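For readers without pygtk, the same two questions can be approximated with the standard library's unicodedata module. This is a swapped-in technique, not the Pango calls above: the names has_rtl and base_direction are hypothetical helpers, the choice of bidi classes "R" and "AL" as "RTL" is an assumption, and the first-strong-character rule is only a rough stand-in for what pango.find_base_dir does. It is written for a current Python 3, where str is already Unicode.

```python
import unicodedata

# Assumption: treat the strong right-to-left bidi classes as "RTL":
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = frozenset(["R", "AL"])


def has_rtl(text):
    """Return True if any character in text has a strong RTL bidi class."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


def base_direction(text):
    """Return "rtl" or "ltr" according to the first strong character.

    Returns None when the text has no strong character at all
    (for example digits or punctuation only).
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in RTL_CLASSES:
            return "rtl"
        if bidi_class == "L":
            return "ltr"
    return None
```

With this sketch, has_rtl("abc אבג") is true while base_direction("abc אבג") is "ltr", which is exactly the distinction made above between "contains RTL characters" and "has RTL base direction".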
2) Plug: reviving my abandoned fribidi-py code...
One problem with the above code (first case) is that the function is
called char-by-char (i.e. the "in" operator actually does a python
loop), which might be slow if you have a long text.
Now, the fribidi C library does provide a function, "get_types", for
calculating bidi props of the whole string. Unfortunately, the current
python interface - pyfribidi (by Kobi and Nir) - wraps only the main
functionality (log2vis) and not this low-level function.
I once started a project called fribidi-py for a complete wrapping of
FriBidi, but abandoned it, mainly because once pyfribidi was done,
there seemed to be no urgent need for the lower level functionality.
Since this post made me recall that work, I checked, and it seems that
it is still functional enough to achieve the goal described here,
although the resulting code looks ugly (the project was at a very
preliminary stage, and a lot of stuff was left unwrapped). If somebody
wants to hack/revive it - see below. Currently this is an unusable
solution, but it seems that it should not be too hard to make this
usable.
For lack of time, I will probably not continue this myself unless
people tell me it would be very useful/important - but if someone
wishes to take it up, I will gladly help and maybe join.
Code is available at http://amit.freeshell.org/fribidi-py_0.1.4.tar.gz
No makefile (sorry, I did say preliminary code...). To use: unpack,
go to the directory, and run the following:
$ . gen_types
$ ln -s . fribidi
OK - now let's check for RTL chars:
>>> from fribidi import *
>>> u = u'abc אבג'
>>> sbuf = (FriBidiChar * len(u))(*map(ord,u))
>>> rbuf = (FriBidiCharType * len(u))()
>>> get_types(sbuf, len(u), rbuf)
>>> [x%2 for x in rbuf]
[0, 0, 0, 0, 1, 1, 1]
Of course, several Python loops are done here, which makes it even less
efficient than the Pango method described above. However, these loops
can be avoided using numpy and some more ctypes hacks - I just did not
want to make the example more ugly than it already is...
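As a rough idea of what such a ctypes hack could look like, the sketch below fills the input buffer without the map(ord, u) loop by encoding the string to UTF-32 in native byte order and copying the bytes straight into a ctypes array. It assumes that FriBidiChar is a 32-bit unsigned integer; to_ucs4_buffer is a hypothetical helper that is not part of fribidi-py, and the code targets a current Python 3 rather than the Python 2 session above.

```python
import ctypes
import sys


def to_ucs4_buffer(text):
    """Copy text into a ctypes array of 32-bit code points.

    Assumption: the C side expects one unsigned 32-bit integer per
    character, in native byte order.
    """
    # The explicit -le/-be codecs do not emit a byte order mark.
    codec = "utf-32-le" if sys.byteorder == "little" else "utf-32-be"
    raw = text.encode(codec)
    length = len(raw) // 4
    # from_buffer_copy fills the whole array in one C-level copy.
    return (ctypes.c_uint32 * length).from_buffer_copy(raw)
```

The returned array could then be passed where sbuf is used above, provided the element type really matches. Reading the result back without a loop would be the job of numpy (for example by viewing rbuf as an integer array and masking the low bit), which is left out here.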