Tuesday, May 8, 2012

sentence-level alignment tools for statistical machine translation

I recently came across the following sentence-level alignment tools for statistical machine translation (SMT). Given a pair of parallel documents, these tools pair up the sentences that have the same meaning in the two languages. Sentence alignment is also the first step in building an SMT system.

(1) CTK: Champollion Tool Kit
http://champollion.sourceforge.net/
Note: this tool (from the LDC) uses translation lexicons to align sentences. One disadvantage is that it does not work well when the two documents differ greatly in their number of sentences.
CTK v1.2 supports three language pairs (English-Chinese in two encodings):
    English Chinese (GB)
    English Chinese (UTF8)
    English Arabic (UTF8)
    English Hindi (UTF8)
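
To illustrate the general idea of lexicon-based scoring mentioned in the note above, here is a minimal Python sketch. It is not CTK's actual scoring function: the function name, the toy lexicon, and the English-French example are all hypothetical and only show how a translation lexicon can tell a likely sentence pair from an unlikely one.

```python
def lexicon_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens that have at least one lexicon
    translation present in the target sentence (hypothetical measure)."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(1 for word in src_tokens if lexicon.get(word, set()) & tgt_set)
    return hits / len(src_tokens)


# Toy lexicon, made up for illustration only.
toy_lexicon = {
    "the": {"le", "la"},
    "red": {"rouge"},
    "house": {"maison"},
}

good = lexicon_score(["the", "red", "house"], ["la", "maison", "rouge"], toy_lexicon)
bad = lexicon_score(["the", "red", "house"], ["le", "chat", "dort"], toy_lexicon)
print(good, bad)
```

A real aligner would combine such pair scores with a search over the whole document, so that the sentence order is taken into account.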

(2) Gale-Church Aligner
This is a length-based sentence-level alignment algorithm dating from the early 1990s. Fortunately, Chris Crowner has implemented it for NLTK.
http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/align/align.py?r=8552&spec=svn8552
Note that the Python code is in nltk_contrib, not in the main NLTK release.
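
For readers who want to see the idea behind a length-based aligner, here is a simplified Python sketch in the spirit of Gale and Church. It is not the nltk_contrib code. The function names are my own, and the priors, the length ratio C, and the variance S2 are the values commonly quoted for the original paper, so please treat them as assumptions (I split the priors for 1-0/0-1 and 2-1/1-2 evenly myself). Sentence length is measured in characters.

```python
import math

# Bead types (source sentences, target sentences) with assumed priors.
PRIORS = {
    (1, 1): 0.89,
    (1, 0): 0.00495, (0, 1): 0.00495,
    (2, 1): 0.0445, (1, 2): 0.0445,
    (2, 2): 0.011,
}
C = 1.0   # assumed expected target characters per source character
S2 = 6.8  # assumed variance of the length difference


def length_cost(len_src, len_tgt):
    """Negative log of the two-tailed probability of the length mismatch."""
    if len_src == 0 and len_tgt == 0:
        return 0.0
    mean = (len_src + len_tgt / C) / 2.0
    delta = (len_tgt - len_src * C) / math.sqrt(mean * S2)
    prob = math.erfc(abs(delta) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|delta|))
    return -math.log(max(prob, 1e-300))


def align(src, tgt):
    """Return a list of (source sentences, target sentences) beads."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), prior in PRIORS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_src = sum(len(s) for s in src[i:ni])
                len_tgt = sum(len(t) for t in tgt[j:nj])
                new_cost = cost[i][j] + length_cost(len_src, len_tgt) - math.log(prior)
                if new_cost < cost[ni][nj]:
                    cost[ni][nj] = new_cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the beads.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    beads.reverse()
    return beads


src = ["Hello there.", "How are you today?", "Bye."]
tgt = [s.upper() for s in src]  # stand-in "translation" with equal lengths
for bead in align(src, tgt):
    print(bead)
```

Because the cost only looks at lengths, the sketch needs no lexicon at all, which is also why such aligners can go wrong on documents with many insertions or deletions.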

(3) MTTK: Machine Translation Toolkit
http://mi.eng.cam.ac.uk/~wjb31/distrib/mttkv1/
Note: this tool is supposed to support sentence-level alignment, but I have not yet figured out how to do it with the tool.

(4) Align
http://www.cse.unt.edu/~rada/wa/tools/aberger/align.html
Note: this tool was developed by Adam Berger, and can be downloaded from:
http://www.cse.unt.edu/~rada/wa/tools/aberger/align.tar
It supports sentence-level alignment using some anchor labels.

(5) Bleualign
https://github.com/rsennrich/Bleualign
This tool requires automatic translations of one side of the unaligned corpus, and then uses a modified BLEU score to find the sentence-level alignments. Of course, you need a seed SMT system to generate the automatic translations. The tool is written in Python.
One problem I found with this aligner is that it can use the same target-side sentence multiple times in the output alignments.
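
The translation-based idea can be sketched in a few lines of Python. This is not Bleualign's code or its modified BLEU: overlap_score is a simple clipped n-gram overlap that I made up as a stand-in, the threshold value is an arbitrary assumption, and monotone_align is a plain dynamic program that keeps sentence order and uses each target sentence at most once.

```python
from collections import Counter


def ngrams(tokens, n):
    """Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_score(hyp, ref, max_n=2):
    """Mean clipped n-gram overlap (a stand-in for a BLEU-like score)."""
    scores = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        total = max(sum(hyp_counts.values()), sum(ref_counts.values()))
        if total == 0:
            continue
        matches = sum((hyp_counts & ref_counts).values())
        scores.append(matches / total)
    return sum(scores) / len(scores) if scores else 0.0


def monotone_align(translated_src, tgt, threshold=0.3):
    """Return (source index, target index) pairs, in order, one-to-one."""
    n, m = len(translated_src), len(tgt)
    sim = [[overlap_score(a.lower().split(), b.lower().split()) for b in tgt]
           for a in translated_src]
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = sim[i - 1][j - 1]
            take = best[i - 1][j - 1] + pair if pair >= threshold else -1.0
            best[i][j] = max(best[i - 1][j], best[i][j - 1], take)
    # Backtrack to recover the chosen pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        pair = sim[i - 1][j - 1]
        if pair >= threshold and best[i][j] == best[i - 1][j - 1] + pair:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return pairs


# Pretend these are machine translations of the source side.
translated = ["the cat sat on the mat", "it was raining", "we went home"]
target = ["the cat sat on the mat", "an unrelated extra sentence",
          "it was raining", "we went home"]
pairs = monotone_align(translated, target)
print(pairs)
```

The one-to-one constraint in the dynamic program is a design choice of this sketch. It avoids reusing a target sentence, but it also cannot produce 1-to-2 or 2-to-1 alignments.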

(6) Microsoft Bilingual Sentence Aligner
https://www.microsoft.com/en-us/download/details.aspx?id=52608
This is a sentence aligner written in Perl. It uses sentence length.


16 comments:

noha said...

Can anyone help me with Python code for translating from Arabic to English, please?

Pidong WANG said...

To the best of my knowledge, there is no statistical machine translation decoder written in Python so far, so you had better use Moses to build your translation system. Of course, before building the system, you need to prepare some Arabic-English parallel training data. One free way of getting training data is to take it from open-source parallel corpora, e.g. OPUS.

Anonymous said...

How do I get an access_token for iOS and Android devices? Any translation code snippet for iOS would help me a lot.

Mohammad N said...

Hello Noha,

Please check the Kriya decoder, which is an implementation of a hierarchical phrase-based (Hiero) SMT system. It is implemented entirely in Python and includes both grammar extractor and decoder modules.
Please see the PBML paper for technical details specific to this implementation:
Baskaran Sankaran, Majid Razmara and Anoop Sarkar. 2012. Kriya – An end-to-end Hierarchical Phrase-based MT System. The Prague Bulletin of Mathematical Linguistics (PBML), (97), 83-98.

cidermole said...

And there's also Bob Moore's excellent "Bilingual Sentence Aligner".

Currently residing at

http://research.microsoft.com/en-us/downloads/aafd5dcf-4dcc-49b2-8a22-f7055113e656/

though Microsoft seems to change download links.
