Friday, March 2, 2012

Train huge language models

(1) using SRILM
(1.1) counting ngrams
Don't use ngram-count directly to count N-grams. Instead, use the make-batch-counts and merge-batch-counts scripts described in training-scripts(1). That way you can create N-gram counts limited only by the maximum file size on your system.
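The idea behind batch counting can be sketched in a few lines of Python. This is only an illustration of the count-then-merge technique, not the SRILM scripts themselves: the function names, the toy corpus and the batch size are all made up for the example, and sentence-boundary tags are left out for brevity. Each batch is counted on its own and sorted, and the sorted batches are then merged as streams, so the full set of counts never has to sit in memory at once.

```python
import heapq
from collections import Counter
from itertools import groupby
from operator import itemgetter


def count_batch(sentences, order):
    """Count all n-grams up to `order` in one batch; return them sorted."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for n in range(1, order + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return sorted(counts.items())


def merge_batches(batches):
    """Stream-merge sorted (ngram, count) lists, summing duplicate n-grams."""
    merged = heapq.merge(*batches, key=itemgetter(0))
    for ngram, group in groupby(merged, key=itemgetter(0)):
        yield ngram, sum(count for _, count in group)


# Toy corpus and batch size, both hypothetical.
corpus = ["a b a b", "b a b", "a b c"]
batch_size = 2
batches = [count_batch(corpus[i:i + batch_size], order=2)
           for i in range(0, len(corpus), batch_size)]
total = dict(merge_batches(batches))
```

In a real setting each sorted batch would be written to its own file and the merge would read those files line by line, which is why the approach is limited by disk space rather than by memory.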
(1.2) training language models from ngram counts
You are likely to run out of memory either because of the size of ngram counts, or of the LM being built. The following are strategies for reducing the memory requirements for training LMs.
Assuming you are using Good-Turing or Kneser-Ney discounting, don't use ngram-count in "raw" form. Instead, use the make-big-lm wrapper script described in the training-scripts(1) man page.
Switch to using the "_c" or "_s" versions of the SRI binaries. For instructions on how to build them, see the INSTALL file. Once built, set your executable search path accordingly, and try make-big-lm again.
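Raise the minimum counts for N-grams included in the LM, i.e., the values of the options -gt2min, -gt3min, -gt4min, etc. N-grams that occur fewer times than the minimum are left out of the model, so higher values give a smaller LM. The higher order N-grams typically get higher minimum counts.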
Get a machine with more memory. If you are hitting the limitations of a 32-bit machine architecture, get a 64-bit machine and recompile SRILM to take advantage of the expanded address space. (The MACHINE_TYPE=i686-m64 setting is for systems based on 64-bit AMD processors, as well as recent compatibles from Intel.) Note that 64-bit pointers will require a memory overhead in themselves, so you will need a machine with significantly, not just a little, more memory than 4GB.
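To see why minimum counts matter for memory, here is a small Python sketch of a per-order count cutoff. It only mimics the filtering idea behind options like -gt2min and -gt3min; it is not how SRILM implements them, and the function name, the counts and the cutoff values are invented for the example.

```python
from collections import Counter


def apply_min_counts(counts, min_counts):
    """Keep n-grams whose count reaches the cutoff for their order.

    `min_counts` maps n-gram order to a minimum count. Orders without
    an entry default to 1, so nothing of that order is dropped.
    """
    return {ngram: c for ngram, c in counts.items()
            if c >= min_counts.get(len(ngram), 1)}


# Hypothetical counts and cutoffs: bigrams need 2, trigrams need 3.
counts = Counter({
    ("a",): 5,
    ("a", "b"): 3,
    ("b", "c"): 1,
    ("a", "b", "c"): 1,
    ("a", "b", "a"): 2,
})
kept = apply_min_counts(counts, {2: 2, 3: 3})
```

Since most distinct higher-order n-grams in a large corpus occur only once or twice, even a small cutoff on the highest orders removes a large share of the entries.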

(2) using IRSTLM

Training a language model from huge amounts of data can be very expensive in both memory and time. The IRSTLM toolkit features algorithms and data structures suitable to estimate, store, and access very large LMs. IRSTLM is open source and available for download.

Typically, LM estimation starts with the collection of n-grams and their frequency counts. Then, smoothing parameters are estimated for each n-gram level; infrequent n-grams are optionally pruned and, finally, an LM file is created containing n-grams with probabilities and back-off weights. This procedure can be very demanding in terms of memory and time if applied to huge corpora. IRSTLM provides a simple way to split LM training into smaller and independent steps, which can be distributed among independent processes.
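One way to make the steps independent is to partition the vocabulary and let each step count only the n-grams that start with a word from its own part. The Python sketch below shows that scheme under stated assumptions: the text above does not say how IRSTLM performs the split, so the partitioning rule, the function names and the toy corpus are illustrative only.

```python
from collections import Counter


def split_vocabulary(word_freq, k):
    """Greedily assign words to k buckets with roughly equal total frequency."""
    buckets = [set() for _ in range(k)]
    loads = [0] * k
    # Place frequent words first, always into the currently lightest bucket.
    for word, freq in sorted(word_freq.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = loads.index(min(loads))
        buckets[lightest].add(word)
        loads[lightest] += freq
    return buckets


def count_for_bucket(sentences, order, bucket):
    """Count n-grams of length `order` whose first word is in `bucket`."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - order + 1):
            if words[i] in bucket:
                counts[tuple(words[i:i + order])] += 1
    return counts


# Toy corpus, split into 2 parts (hypothetical values).
corpus = ["a b a b", "b a b", "a b c"]
word_freq = Counter(w for s in corpus for w in s.split())
parts = [count_for_bucket(corpus, 2, bucket)
         for bucket in split_vocabulary(word_freq, 2)]
```

Because every n-gram has exactly one first word, the parts never overlap, and each call to count_for_bucket could run in a separate process with only its own share of the counts in memory.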

The procedure relies on a training script that uses little memory and implements the Witten-Bell smoothing method. (An approximation of the modified Kneser-Ney smoothing method is also available.) First, create a directory named stat under your working directory, where the script will save many temporary files; then run the script with options as in this example:
-i "gunzip -c corpus.gz" -n 3 -o train.irstlm.gz -k 10

The script builds a 3-gram LM (option -n) from the output of the specified input command (-i), splitting the training procedure into 10 steps (-k). The LM is saved to the output file (-o) train.irstlm.gz in an intermediate ARPA format. The compile-lm command can then convert this file into either a compiled version or a standard ARPA version of the LM.
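For readers who have not met Witten-Bell smoothing, here is a minimal Python sketch of the interpolated form for bigrams. It is a textbook-style illustration, not IRSTLM's implementation: the class name and toy corpus are made up, the unigram distribution is left unsmoothed, and sentence-boundary tags are omitted. For a history h seen c(h) times and followed by T(h) distinct word types, the estimate is (c(h, w) + T(h) * P(w)) / (c(h) + T(h)).

```python
from collections import Counter, defaultdict


class WittenBellBigram:
    """Interpolated Witten-Bell bigram model over whitespace-split text."""

    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.followers = defaultdict(set)   # distinct words seen after h
        self.history_count = Counter()      # total bigram count for h
        for sentence in sentences:
            words = sentence.split()
            self.unigrams.update(words)
            for h, w in zip(words, words[1:]):
                self.bigrams[(h, w)] += 1
                self.history_count[h] += 1
                self.followers[h].add(w)
        self.total = sum(self.unigrams.values())

    def prob(self, w, h):
        """Return P(w | h), falling back to the unigram for unseen h."""
        p_uni = self.unigrams[w] / self.total
        c_h = self.history_count[h]
        if c_h == 0:
            return p_uni
        t_h = len(self.followers[h])
        return (self.bigrams[(h, w)] + t_h * p_uni) / (c_h + t_h)


# Toy corpus, hypothetical.
model = WittenBellBigram(["a b a b", "b a b", "a b c"])
vocab = sorted(model.unigrams)
mass = sum(model.prob(w, "a") for w in vocab)
```

The more distinct continuations a history has relative to its count, the more probability mass is handed to the lower-order distribution, and the probabilities for any seen history still sum to one over the vocabulary.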

For a detailed description of the procedure and of the other commands available in IRSTLM, please refer to the user manual supplied with the package.