Thursday, February 23, 2012

Moses: recaser issues

Recently, I have been trying to set up a Moses-based MT demo.
I found that moses/scripts/recaser/recase.perl does quite a lot beyond using Moses to translate uncased text into cased text:
(1) by default, the moses.ini configuration file of the MT system used for recasing has distortion-limit 6, which allows reordering; the recase.perl script changes the distortion limit to 1 by passing the option "-dl 1" to the Moses decoder.
(2) the recase.perl script also uses some rules for recasing, e.g., for English, it always keeps some specific words ("a","after","against","al-.+","and","any","as","at","be","because","between","by","during","el-.+","for","from","his","in","is","its","last","not","of","off","on","than","the","their","this","to","was","were","which","will","with") lowercase, as is usual for function words in title-cased text such as headlines;
(3) the script also uppercases the initial word of a sentence.
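
To make rules (2) and (3) more concrete, here is a minimal Python sketch. It is not a port of recase.perl: the names SPECIAL_WORDS, is_special and uppercase_initial are made up for illustration, and it assumes that entries such as "al-.+" are regular expressions that must match a whole token.

```python
import re

# Word list copied from the post. Assumption: entries such as "al-.+" are
# regular expressions that must match the whole (lowercased) token.
SPECIAL_WORDS = [
    "a", "after", "against", "al-.+", "and", "any", "as", "at", "be",
    "because", "between", "by", "during", "el-.+", "for", "from", "his",
    "in", "is", "its", "last", "not", "of", "off", "on", "than", "the",
    "their", "this", "to", "was", "were", "which", "will", "with",
]
SPECIAL_RE = re.compile(r"^(?:%s)$" % "|".join(SPECIAL_WORDS))


def is_special(token):
    """Return True if the token matches an entry of the fixed word list."""
    return SPECIAL_RE.match(token.lower()) is not None


def uppercase_initial(tokens):
    """Rule (3): uppercase the first letter of the first token of a sentence."""
    if not tokens:
        return []
    first = tokens[0]
    return [first[:1].upper() + first[1:]] + list(tokens[1:])
```

A rule-based recaser would consult is_special() for every token to decide whether the fixed-casing rule applies, and call uppercase_initial() once per sentence.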

Monday, February 20, 2012

Moses: pruning phrase tables

According to the page:

(1) I first download the source code of SALM from:
then I go to the directory:
and run command:
make allO32
make allO64
(There are some errors: make: *** No rule to make target `../../Bin/Linux/Search/SampleNGramIns.O32', needed by `allO32'. )
Note that I compiled SALM using g++-4.1; I had tried g++-4.4 first, but the compilation failed.

(2) I found that the latest Moses obtained via git has no sub-directory named sigtest-filter, so I copied sigtest-filter from an older version of Moses obtained via svn.
I go to the sigtest-filter directory and run the command:
make SALMDIR=/path/to/SALM
(using g++-4.4)

Friday, February 10, 2012

Python: multi-threading problem

The canonical implementation of the Python programming language is written in C. The term “CPython” is used when necessary to distinguish this implementation from others such as Jython or IronPython.

In CPython, there is an important lock called the global interpreter lock (GIL), which is the mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines. Past efforts to create a “free-threaded” interpreter (one which locks shared data at a much finer granularity) have not been successful because performance suffered in the common single-processor case. It is believed that overcoming this performance issue would make the implementation much more complicated and therefore costlier to maintain.

The GIL effectively prevents threads from running Python code in parallel. This is mainly a problem for CPU-bound work; I/O-bound threads are much less affected, because the GIL is released while a thread waits on I/O.

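Possible ways around the GIL:
(1) Jython, which has no GIL;
(2) Cython, which can release the GIL around code that does not touch Python objects.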
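
The difference between CPU-bound and I/O-bound threads can be observed with a small timing experiment. The following is only a sketch under the assumption of a standard CPython build with the GIL; the helper names (count_down, nap, timed_sequential, timed_threaded) are invented for this example, and the actual timings depend on the machine.

```python
import threading
import time


def count_down(n):
    """CPU-bound work: a pure-Python loop that needs the GIL to run."""
    while n > 0:
        n -= 1


def nap(seconds):
    """I/O-like work: time.sleep() releases the GIL while waiting."""
    time.sleep(seconds)


def timed_sequential(func, arg, repeats):
    """Call func(arg) `repeats` times in a row; return elapsed seconds."""
    start = time.time()
    for _ in range(repeats):
        func(arg)
    return time.time() - start


def timed_threaded(func, arg, repeats):
    """Run func(arg) in `repeats` threads at once; return elapsed seconds."""
    threads = [threading.Thread(target=func, args=(arg,))
               for _ in range(repeats)]
    start = time.time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start
```

Comparing timed_sequential(count_down, 5000000, 4) with timed_threaded(count_down, 5000000, 4) should show little or no gain from the threads, whereas timed_threaded(nap, 1.0, 4) should finish in roughly one second instead of the four seconds needed by timed_sequential(nap, 1.0, 4).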

Thursday, February 9, 2012

Compiling latest Moses from git

When I used the command:
./bjam --with-srilm=srilm-1.6.0 --with-irstlm=irstlm-5.70.04 --with-giza=giza-pp --with-boost=boost_1_48_0 -j1
to compile the latest Moses checked out with git,
I got the following errors:
gcc.compile.c++ moses/src/LM/bin/gcc-4.4.1/release/debug-symbols-on/link-static/threading-multi/Factory.o
In file included from moses/src/LM/ORLM.h:8,
from moses/src/LM/Factory.cpp:41:
moses/src/DynSAInclude/onlineRLM.h:22: error: reference to ‘Vocab’ is ambiguous
moses/src/LM/SRI.h:33: error: candidates are: struct Vocab
moses/src/DynSAInclude/vocab.h:17: error: class Moses::Vocab

My solution was to replace every "Vocab" with "Moses::Vocab" in the files under moses/src/DynSAInclude.
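
The replacement can also be scripted. The sketch below is an assumption about how one might do it, not what I actually ran: qualify_vocab is a hypothetical helper that rewrites only the bare identifier Vocab and leaves Moses::Vocab, longer identifiers and lowercase file names such as vocab.h untouched.

```python
import re

# Match the bare identifier Vocab, but not Moses::Vocab (preceded by ':'),
# MyVocab (preceded by a word character) or Vocabulary (followed by one).
BARE_VOCAB = re.compile(r"(?<![\w:])Vocab\b")


def qualify_vocab(source_text):
    """Return source_text with every bare Vocab rewritten as Moses::Vocab."""
    return BARE_VOCAB.sub("Moses::Vocab", source_text)
```

One would read each file under moses/src/DynSAInclude, pass its contents through qualify_vocab(), and write the result back. It is worth reviewing the resulting diff, because a blanket replacement also touches declarations (for example the line that defines the class itself), which may need to stay unqualified.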

Sunday, February 5, 2012

Python: buffering problem when using 'for line in sys.stdin'

Recently, I found a buffering problem with the following Python code:
import sys
for line in sys.stdin:
    print line

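With this code, after I type a sentence into the terminal, I get no output. In Python 2, iterating over a file object uses an internal read-ahead buffer, so lines are not handed to the loop as soon as they are typed.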

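After investigating for a while, I came to the following solution (using readline() instead):
import sys
line = sys.stdin.readline()
while len(line) != 0:  # readline() returns '' only at end of input
    print line
    line = sys.stdin.readline()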
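
The same readline()-based loop can be written more compactly with the two-argument form of iter(), which keeps calling a function until it returns a sentinel value. This is a sketch, not the code I originally used; echo_lines and its parameters are names chosen for this example.

```python
import sys


def echo_lines(stream=sys.stdin, out=sys.stdout):
    """Echo each line as soon as readline() returns it."""
    # iter(callable, sentinel) calls stream.readline() repeatedly and stops
    # when it returns '' (end of input), avoiding the file iterator.
    for line in iter(stream.readline, ""):
        out.write(line)
        out.flush()  # push the line out immediately
```

Calling echo_lines() with no arguments reads from the terminal and prints every line right after it is entered.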

Saturday, February 4, 2012

Perl bug: splitting UTF-8 encoded Chinese string

I found a bug in Perl when I used the regular expression /\s+/ to split the Chinese string "我想去你家,可以吗?我还想去月球,你想去吗?", which was encoded in UTF-8.