Recently, I used SRILM's ngram-count tool to extract the n-grams of a corpus.
However, I found that for n >= 3 the tool discards low-frequency n-grams by default when it builds the language model.
In fact, we can get all the n-grams and their counts with the tool's -write option, which is the better choice if you only care about the n-grams and not the probabilities.
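
As a rough illustration of the -write approach, here is a minimal Python sketch that assembles and runs such a command. It assumes ngram-count is installed and on your PATH. The helper names (build_ngram_count_cmd, write_ngram_counts) and the file names are made up for this example; only the -text, -order and -write options come from SRILM itself.

```python
import subprocess


def build_ngram_count_cmd(corpus_path, counts_path, order=3):
    """Build an ngram-count command line that only writes raw counts.

    No -lm option is passed, so no language model is estimated; the
    counts are simply written to counts_path.
    """
    return [
        "ngram-count",
        "-text", corpus_path,    # input corpus, one sentence per line
        "-order", str(order),    # maximum n-gram length to count
        "-write", counts_path,   # output file for the n-gram counts
    ]


def write_ngram_counts(corpus_path, counts_path, order=3):
    """Run ngram-count (assumed to be on PATH) and raise if it fails."""
    subprocess.run(
        build_ngram_count_cmd(corpus_path, counts_path, order),
        check=True,
    )
```

Calling write_ngram_counts("corpus.txt", "corpus.counts") would then be equivalent to running ngram-count -text corpus.txt -order 3 -write corpus.counts in a shell, where corpus.txt and corpus.counts are placeholder file names.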