Wednesday, September 7, 2011

A bug in the Moses tokenizer

When you set the -l option to an unknown language, the Moses tokenizer reports that it will fall back to English. However, the fallback is incomplete: it applies only to the handling of periods (.), while single quotation marks (') are tokenized differently from the English case.

For example, given the input "I'm a boy.", setting -l en or omitting the -l option produces "I 'm a boy ."; setting -l abc, which is not a known language code, produces "I ' m a boy ."
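
To make the difference concrete, here is a minimal Python sketch that models the two apostrophe behaviours described above. It is not the Moses code (the real tokenizer is a Perl script with many more rules); the function name `split_apostrophes` and its simplified regular expressions are assumptions made purely for illustration.

```python
import re


def split_apostrophes(text: str, lang: str = "en") -> str:
    """Toy model of the apostrophe handling described in this post.

    Hypothetical helper, not part of Moses: it only mimics the two
    outputs reported above for English and for an unknown language.
    """
    # Detach a sentence-final period; in this toy model the period is
    # treated the same way in both cases.
    text = re.sub(r"\.$", " .", text)

    if lang == "en":
        # English-style rule: split before the apostrophe and keep it
        # attached to what follows ("I'm" -> "I 'm").
        text = re.sub(r"([A-Za-z])'([A-Za-z])", r"\1 '\2", text)
    else:
        # Behaviour seen for an unknown language: every apostrophe
        # becomes its own token ("I'm" -> "I ' m").
        text = text.replace("'", " ' ")

    # Collapse any runs of whitespace introduced by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(split_apostrophes("I'm a boy.", "en"))
    print(split_apostrophes("I'm a boy.", "abc"))
```

If the apostrophe handling really did fall back to English, both calls would print the same line. In the behaviour described here they differ, which is what matters when the tokenized text is later fed to a model trained on English-style tokens.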
