Thursday, February 23, 2012

Moses: recaser issues

Recently, I have been trying to set up a Moses-based MT demo.
I found that the moses/scripts/recaser/recase.perl script actually does more than just use Moses to translate uncased text into cased text:
(1) by default, the moses.ini configuration file of the MT system for recasing uses distortion-limit 6, which allows reordering; the recase.perl script changes the distortion limit to 1 by passing the option "-dl 1" to the Moses decoder.
(2) the recase.perl script also uses some rules for recasing, e.g., for English, it always keeps certain words ("a","after","against","al-.+","and","any","as","at","be","because","between","by","during","el-.+","for","from","his","in","is","its","last","not","of","off","on","than","the","their","this","to","was","were","which","will","with") in lowercase;
(3) the script also uppercases the initial word of a sentence.

Monday, February 20, 2012

Moses: pruning phrase tables

According to the page:
http://www.statmt.org/moses/?n=Moses.AdvancedFeatures#ntoc16

(1) I first download the source code of SALM from:
http://projectile.sv.cmu.edu/research/public/tools/salm/salm.htm
then I go to the directory:
SALM/Distribution/Linux
and run the commands:
make allO32
make allO64
(There are some errors: make: *** No rule to make target `../../Bin/Linux/Search/SampleNGramIns.O32', needed by `allO32'. )
Note that I compiled SALM with g++-4.1; I had tried g++-4.4 first, but the build failed.

(2) I found that the latest Moses checked out with git has no sub-directory named sigtest-filter, so I copied sigtest-filter from an older version of Moses checked out with svn.
I go to the sigtest-filter directory and run the command:
make SALMDIR=/path/to/SALM
(using g++-4.4)

Friday, February 10, 2012

Python: multi threading problem

The canonical implementation of the Python programming language is written in C. The term “CPython” is used when necessary to distinguish this implementation from others such as Jython or IronPython.

In CPython, there is an important lock called the global interpreter lock (GIL), the mechanism used by the CPython interpreter to assure that only one thread executes Python bytecode at a time. This simplifies the CPython implementation by making the object model (including critical built-in types such as dict) implicitly safe against concurrent access. Locking the entire interpreter makes it easier for the interpreter to be multi-threaded, at the expense of much of the parallelism afforded by multi-processor machines. Past efforts to create a “free-threaded” interpreter (one which locks shared data at a much finer granularity) have not been successful because performance suffered in the common single-processor case. It is believed that overcoming this performance issue would make the implementation much more complicated and therefore costlier to maintain.

The GIL effectively prevents threads from running Python bytecode in parallel. It is mainly a problem for CPU-bound work; it matters much less for I/O-bound threads, because the lock is released while a thread waits on blocking I/O.

Solutions:
(1) Jython is free of GIL;
(2) Cython, which can release the GIL around code that does not touch Python objects;
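
To make the CPU-bound case concrete, here is a minimal sketch (Python 3 syntax) that runs the same pure-Python loop sequentially and in threads. The function names and job sizes are made up for illustration. On a standard CPython build with the GIL, the threaded version returns the same results but is typically no faster than the sequential one; the exact timings depend on the machine, so none are quoted here.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def count_down(n):
    """Pure-Python CPU-bound loop; the thread holds the GIL while it runs."""
    while n > 0:
        n -= 1
    return n


def run_sequential(jobs):
    """Run every job one after another and return (results, seconds)."""
    start = time.perf_counter()
    results = [count_down(n) for n in jobs]
    return results, time.perf_counter() - start


def run_threaded(jobs):
    """Run every job in its own thread and return (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        results = list(pool.map(count_down, jobs))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    jobs = [2000000] * 4  # hypothetical workload size
    _, t_seq = run_sequential(jobs)
    _, t_thr = run_threaded(jobs)
    print("sequential: %.2fs, threaded: %.2fs" % (t_seq, t_thr))
```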

Thursday, February 9, 2012

Compiling latest Moses from git

When I used the command:
./bjam --with-srilm=srilm-1.6.0 --with-irstlm=irstlm-5.70.04 --with-giza=giza-pp --with-boost=boost_1_48_0 -j1
to compile the latest Moses checked out with git,
I got the following errors:
gcc.compile.c++ moses/src/LM/bin/gcc-4.4.1/release/debug-symbols-on/link-static/threading-multi/Factory.o
In file included from moses/src/LM/ORLM.h:8,
from moses/src/LM/Factory.cpp:41:
moses/src/DynSAInclude/onlineRLM.h:22: error: reference to ‘Vocab’ is ambiguous
moses/src/LM/SRI.h:33: error: candidates are: struct Vocab
moses/src/DynSAInclude/vocab.h:17: error: class Moses::Vocab


My solution is to replace every "Vocab" with "Moses::Vocab" in moses/src/DynSAInclude/onlineRLM.h.

Sunday, February 5, 2012

Python: buffering problem when using 'for line in sys.stdin'

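Recently, I ran into a buffering problem with the following Python code:
import sys
for line in sys.stdin:
    print line

With this loop, after I type a sentence into the terminal, I get no output. In Python 2, iterating over a file object uses a hidden read-ahead buffer, so lines are not delivered until the buffer fills or the input ends.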

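After investigating for a while, I came to the following solution (using readline() instead):
import sys
infile = sys.stdin
while True:
    line = infile.readline()
    if not line:  # an empty string means end of input
        break
    print line,  # trailing comma: the line already ends with a newline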
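
The same readline()-based idea can also be written with the two-argument form of iter(), which keeps calling readline() until it returns the empty string. This is only a sketch in Python 3 syntax; the helper names read_lines and echo are made up for illustration, and the demo feeds a simulated stream instead of a real terminal.

```python
import io
import sys


def read_lines(stream):
    """Yield lines one at a time by calling readline() until it returns ''."""
    # iter(callable, sentinel) calls stream.readline() repeatedly and stops
    # as soon as the sentinel '' (end of input) comes back.
    for line in iter(stream.readline, ""):
        yield line


def echo(stream, out=sys.stdout):
    """Write each line back as soon as it has been read."""
    for line in read_lines(stream):
        out.write(line)
        out.flush()  # push the output to the terminal immediately


if __name__ == "__main__":
    # Simulated input; pass sys.stdin instead to echo what is typed.
    echo(io.StringIO("first\nsecond\n"))
```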

Saturday, February 4, 2012

Perl bug: splitting UTF-8 encoded Chinese string

I found a bug in Perl when I used the regular expression /\s+/ to split the Chinese string "我想去你家,可以吗?我还想去月球,你想去吗?", which was encoded in UTF-8.
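
The note above does not say what actually went wrong. One plausible cause, and this is only an assumption, is that the string was matched as raw bytes rather than as decoded characters: the UTF-8 encoding of "你" ends in the byte 0xA0, which looks like a no-break space when each byte is read as a Latin-1 character, and so can match \s. The sketch below reproduces that effect in Python 3 (not Perl) to illustrate the mechanism.

```python
import re

text = "我想去你家,可以吗?我还想去月球,你想去吗?"
raw = text.encode("utf-8")

# Correctly decoded: the sentence contains no whitespace, so nothing is split.
decoded_ok = raw.decode("utf-8")
parts_ok = re.split(r"\s+", decoded_ok)

# Mis-decoded byte by byte as Latin-1: the last byte of each "你" (0xA0)
# becomes U+00A0 (no-break space), which \s matches.
decoded_wrong = raw.decode("latin-1")
parts_wrong = re.split(r"\s+", decoded_wrong)

print(len(parts_ok), len(parts_wrong))
```

If this is indeed the cause, decoding the input as UTF-8 before applying the regular expression avoids the spurious splits.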

Thursday, January 19, 2012

Moses phrase-based decoder analysis

(1). The entry point is main() in moses-cmd/src/Main.cpp (int main(int argc, char* argv[]))

(2). Main.cpp first calls parameter->LoadParam(argc, argv) to load and check the parameters from the moses.ini configuration file and the command line; the model files are not loaded at this point

(3). Main.cpp then calls StaticData::LoadDataStatic(parameter) to load the weights and models according to the parameters from step (2)
(3.1) StaticData::LoadDataStatic(parameter) calls StaticData::LoadData(Parameter *parameter)
(3.1.1) StaticData::LoadData(Parameter *parameter) loads the weights and models by calling, e.g., StaticData::LoadLanguageModels() and LoadPhraseTables()
(3.1.1.1) StaticData::LoadLanguageModels() calls LanguageModel* CreateLanguageModel(LMImplementation lmImplementation, const std::vector<FactorType> &factorTypes, size_t nGramOrder, const std::string &languageModelFile, float weight, ScoreIndexManager &scoreIndexManager, int dub) to create LM instances. The top-level LM class is class LanguageModel : public StatefulFeatureFunction; LanguageModel is the parent class of LanguageModelSingleFactor and LanguageModelMultiFactor, and LanguageModelInternal is a subclass of LanguageModelSingleFactor.
In Moses, the main LM-specific interfaces of classes like LanguageModelInternal are bool load(...) and float GetValue(const std::vector<const Word*> &contextFactor, State* finalState = 0, unsigned int* len = 0) const: the former loads an LM file, while the latter calculates the probability of the n-gram stored in contextFactor. The class LanguageModel implements the general interface of a feature function, e.g., Evaluate(..)

(4). Main.cpp uses IOWrapper *ioWrapper = GetIODevice(staticData) to set up the input device (an input file or standard input)

(5). Main.cpp uses vector<float> weights = staticData.GetAllWeights() to check the weights

(6). Main.cpp starts the main loop of translating input instances (text, confusion network, or lattice):
(6.1). use ReadInput(*ioWrapper,staticData.GetInputType(),source) to load an input, which is saved in source
(6.2). set up the translation manager by calling Manager manager(*source, staticData.GetSearchAlgorithm()); the call to staticData.InitializeBeforeSentenceProcessing(source) initializes the translation/language models for this sentence; the language model list is StaticData.m_languageModel; the default search algorithm is SearchNormal;
(6.3). expand translation hypotheses stack by stack until the end of the input sentence using manager.ProcessSentence()
(6.3.1). ProcessSentence() first resets the statistics using staticData.ResetSentenceStats(m_source)
(6.3.2). ProcessSentence() then collects translation options for the input sentence
(6.3.3). ProcessSentence() calls the search algorithm to process the input using m_search->ProcessSentence()
(6.4). pick the best translation (maximum a posteriori decoding)
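
The per-sentence steps in (6.3) and (6.4) can be illustrated with a toy sketch in Python. This is not Moses code: the phrase table, scores, and function names are invented for illustration, the search is strictly monotone (no reordering), and there is no language model or other feature function. It only shows the shape of the loop: collect translation options, expand hypotheses stack by stack, then pick the best complete hypothesis.

```python
from collections import namedtuple

# A hypothesis covers the first `covered` source words.
Hypothesis = namedtuple("Hypothesis", "covered output score")

# Hypothetical phrase table: source phrase -> list of (target, log-probability).
PHRASE_TABLE = {
    ("das",): [("the", -0.2), ("that", -1.0)],
    ("haus",): [("house", -0.1)],
    ("das", "haus"): [("the house", -0.15)],
}


def collect_translation_options(source):
    """Find every phrase-table entry that matches a span of the source."""
    options = {}
    for start in range(len(source)):
        for end in range(start + 1, len(source) + 1):
            phrase = tuple(source[start:end])
            if phrase in PHRASE_TABLE:
                options[(start, end)] = PHRASE_TABLE[phrase]
    return options


def decode(source, stack_size=10):
    """Monotone stack decoding; returns the best output or None."""
    options = collect_translation_options(source)
    # stacks[n] holds the hypotheses that cover the first n source words.
    stacks = [[] for _ in range(len(source) + 1)]
    stacks[0].append(Hypothesis(0, "", 0.0))
    for n in range(len(source)):
        # Prune: only the best `stack_size` hypotheses are expanded.
        stacks[n].sort(key=lambda h: h.score, reverse=True)
        for hyp in stacks[n][:stack_size]:
            for (start, end), targets in options.items():
                if start != n:  # monotone: continue exactly where we stopped
                    continue
                for target, logprob in targets:
                    output = (hyp.output + " " + target).strip()
                    stacks[end].append(
                        Hypothesis(end, output, hyp.score + logprob))
    final = stacks[len(source)]
    if not final:
        return None  # no sequence of phrases covers the whole input
    return max(final, key=lambda h: h.score).output


if __name__ == "__main__":
    print(decode(["das", "haus"]))
```

In the real decoder the hypothesis score combines all weighted feature functions (including the language models loaded in step (3)), and reordering is allowed up to the distortion limit.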