This project is a language model toolkit built on the .NET Framework. It currently supports two language modeling algorithms: n-gram language modeling with Kneser-Ney smoothing, and recurrent neural network language modeling ported from RNNLM by Tomas Mikolov.
With this project, users can train a language model with the pipeline tool and compute a sentence's probability with the decoder.
N-gram language modeling with Kneser-Ney smoothing
To train a model, run build.bat in the pipeline directory:
build.bat [input file] [output file]
By default, build.bat generates a 4-gram language model. To change the n-gram order, edit build.bat.
The project provides two ways to use the model: a console tool and an API for developers.
lm_score.exe [word breaker dictionary] [language model] [ngram-order] <input file> <output file>
[word breaker dictionary] : the lexical dictionary loaded by the word breaker
[language model] : the language model file used by the tool
[ngram-order] : the maximum n-gram order used for scoring (for example, 4 for a 4-gram model)
<input file> : input file containing the text to be scored by the language model
<output file> : output file containing the text scored by the language model
For example:
lm_score.exe wordbreak_dict.txt chsLM.txt 4 input.txt output.txt
Each line of <output file> has the following tab-separated format:
Text \t Probability \t Number of OOVs \t Perplexity
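Because each output line is tab-separated, the results are straightforward to read from another program. The following is a minimal C# sketch, assuming exactly four tab-separated fields per line, no tab characters inside the text itself, and '.' as the decimal separator. The ScoredLine type and the LmScoreOutput.ParseLine helper are hypothetical names used for illustration; they are not part of the toolkit.

```csharp
using System;
using System.Globalization;

// Hypothetical container for one line of lm_score.exe output (not part of the toolkit).
public sealed class ScoredLine
{
    public string Text;
    public double Probability;
    public int Oovs;
    public double Perplexity;
}

public static class LmScoreOutput
{
    // Parses one line of the form: Text \t Probability \t Number of OOVs \t Perplexity.
    // Assumption: exactly four tab-separated fields and invariant-culture numbers.
    public static ScoredLine ParseLine(string line)
    {
        string[] fields = line.Split('\t');
        if (fields.Length != 4)
        {
            throw new FormatException(
                "Expected 4 tab-separated fields, got " + fields.Length);
        }

        return new ScoredLine
        {
            Text = fields[0].Trim(),
            Probability = double.Parse(fields[1].Trim(), CultureInfo.InvariantCulture),
            Oovs = int.Parse(fields[2].Trim(), CultureInfo.InvariantCulture),
            Perplexity = double.Parse(fields[3].Trim(), CultureInfo.InvariantCulture)
        };
    }
}
```

A caller could apply LmScoreOutput.ParseLine to each line returned by System.IO.File.ReadLines for the output file, for example to sort sentences by perplexity.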
API for developers
The toolkit also exposes an API so that developers can use the language model in their own projects. The steps are as follows:
1. Add LMDecoder.dll as a reference to your project.
2. Create an LMDecoder.LMDecoder instance.
3. Call LoadLM(string strFileName) to load the language model from a file. strFileName specifies the path and file name of the language model.
4. Call LMResult GetSentProb(string strText, int order) to score a string. strText is the string to score and order is the maximum n-gram order. The return value is an LMResult object containing the result. Its structure is as follows:
public class LMResult
{
    public double logProb;    // the log probability score of the given string
    public int oovs;          // the number of OOV tokens
    public double perplexity; // the perplexity of the given string
}
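The four steps above can be put together roughly as follows. This is a minimal sketch, assuming the project already references LMDecoder.dll and that LMDecoder.LMDecoder has a public parameterless constructor. The model file name is taken from the console example, and the input string is a placeholder. Whether the text must be word-broken before it is passed to GetSentProb is not stated here, so treat that as an open assumption.

```csharp
using System;

public static class Program
{
    public static void Main()
    {
        // Step 2: create the decoder (step 1 is the reference to LMDecoder.dll).
        // Assumption: a public parameterless constructor is available.
        var decoder = new LMDecoder.LMDecoder();

        // Step 3: load the language model from a file.
        decoder.LoadLM("chsLM.txt");

        // Step 4: score a string with a maximum order of 4.
        // "text to score" is a placeholder input.
        var result = decoder.GetSentProb("text to score", 4);

        // Print the three fields of LMResult.
        Console.WriteLine(
            "logProb={0}, oovs={1}, perplexity={2}",
            result.logProb, result.oovs, result.perplexity);
    }
}
```

The sketch uses var for the return value so that it does not depend on which namespace LMResult is declared in.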
Recurrent neural network language modeling