Wednesday, 29 June 2016

Train Moses for translation (for example English-Hindi)

In this tutorial we will train Moses to translate text from English to Hindi. Before beginning, you need to have Moses installed on your machine.
Check out my blog post on installing Moses and its components.
You need a parallel corpus to train Moses for English-to-Hindi translation.
Download corpora.en, corpora.hi and the Indic NLP Library.


Corpus Input and making directory :

Make a new directory "corpus" in the MachineTranslation directory. Make a directory "training" inside the directory corpus and put the two files corpora.en and corpora.hi in it.

Extract the zip file of Indic nlp library into the MachineTranslation directory.
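
Moses expects the two corpus files to be line-aligned: line N of corpora.en must be the translation of line N of corpora.hi. Before pre-processing, it may be worth confirming that both files have the same number of lines. The sketch below is one way to do that; the function name check_parallel is made up for this illustration and is not part of Moses.

```shell
# Hypothetical helper (not part of Moses): verify that two corpus files
# have the same number of lines, as a parallel corpus should.
check_parallel() {
    local src="$1" tgt="$2" n_src n_tgt
    # tr strips the padding that some wc implementations add
    n_src=$(wc -l < "$src" | tr -d ' ')
    n_tgt=$(wc -l < "$tgt" | tr -d ' ')
    if [ "$n_src" -eq "$n_tgt" ]; then
        echo "OK: $n_src lines in each file"
    else
        echo "MISMATCH: $src has $n_src lines, $tgt has $n_tgt" >&2
        return 1
    fi
}

# Example, run from the MachineTranslation directory:
#   check_parallel corpus/training/corpora.en corpus/training/corpora.hi
```

The same check can be repeated after each pre-processing step, since none of the steps before cleaning should change the number of lines.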


Pre-process corpora :

Hindi Normalisation :
cd corpus
python ../indic_nlp_library-master/src/indicnlp/normalize/indic_normalize.py training/corpora.hi corpora.norm.hi [True]
Hindi Tokenization :
python ../indic_nlp_library-master/src/indicnlp/tokenize/indic_tokenize.py corpora.norm.hi corpora.tok.hi hi
English Tokenization :
../mosesdecoder-master/scripts/tokenizer/tokenizer.perl -l en < training/corpora.en > corpora.tok.en
Create Truecase model for English :
../mosesdecoder-master/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus corpora.tok.en
Create Truecase model for Hindi :
../mosesdecoder-master/scripts/recaser/train-truecaser.perl --model truecase-model.hi --corpus corpora.tok.hi
Truecasing English :
../mosesdecoder-master/scripts/recaser/truecase.perl --model truecase-model.en < corpora.tok.en > corpora.true.en
Truecasing Hindi :
../mosesdecoder-master/scripts/recaser/truecase.perl --model truecase-model.hi < corpora.tok.hi > corpora.true.hi


Cleaning of English and Hindi :
../mosesdecoder-master/scripts/training/clean-corpus-n.perl corpora.true en hi corpora.clean 1 80
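
The last two arguments, 1 and 80, are the minimum and maximum sentence length in tokens: a sentence pair is dropped when either side is empty or longer than 80 tokens. To make that concrete, here is a simplified sketch of the length filter only. It is not a replacement for clean-corpus-n.perl, which applies further checks (for example a limit on the length ratio between the two sides). The function name filter_pairs is invented for this example, and the sketch assumes the corpus contains no tab characters.

```shell
# Simplified illustration of the length filter in clean-corpus-n.perl.
# Usage: filter_pairs <in-base> <lang1> <lang2> <out-base> <min> <max>
# Reads <in-base>.<lang1> and <in-base>.<lang2>,
# writes <out-base>.<lang1> and <out-base>.<lang2>.
filter_pairs() {
    local inbase="$1" l1="$2" l2="$3" out="$4" min="$5" max="$6"
    : > "$out.$l1"   # create (or empty) the output files first
    : > "$out.$l2"
    # paste joins line N of both files with a tab; awk counts tokens per side
    paste "$inbase.$l1" "$inbase.$l2" | awk -F'\t' \
        -v min="$min" -v max="$max" -v o1="$out.$l1" -v o2="$out.$l2" '
    {
        n1 = split($1, a, " ")   # number of whitespace-separated tokens
        n2 = split($2, b, " ")
        if (n1 >= min && n1 <= max && n2 >= min && n2 <= max) {
            print $1 > o1
            print $2 > o2
        }
    }'
}

# Example with the same arguments as the command above:
#   filter_pairs corpora.true en hi corpora.sketch 1 80
```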

LM Building :
cd ../
mkdir lm
cd lm/
../irstlm-5.80.08/bin/add-start-end.sh < ../corpus/corpora.clean.hi > corpora.sb.hi
export IRSTLM=/home1/MachineTranslation/irstlm-5.80.08/
../irstlm-5.80.08/bin/build-lm.sh -i corpora.sb.hi -t ./tmp -p -s improved-kneser-ney -o corpora.lm.hi
../irstlm-5.80.08/bin/compile-lm corpora.lm.hi.gz --text=yes corpora.arpa.hi

../mosesdecoder-master/bin/build_binary -i corpora.arpa.hi corpora.blm.hi
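
The first command in this section, add-start-end.sh, wraps every sentence in the boundary symbols <s> and </s> so that the language model can learn which words tend to begin and end a sentence. The effect can be reproduced roughly with sed, as in the sketch below. This only illustrates the idea and is not the IRSTLM script itself; the function name add_boundaries is made up for this example.

```shell
# Rough equivalent of what add-start-end.sh does to each line:
# prefix "<s> " and append " </s>". Reads stdin, writes stdout.
add_boundaries() {
    sed -e 's/^/<s> /' -e 's/$/ <\/s>/'
}

# Example:
#   echo "यह एक वाक्य है" | add_boundaries
```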

Training :
cd ../
mkdir working
cd working
nohup nice ../mosesdecoder-master/scripts/training/train-model.perl -root-dir train -corpus  ../corpus/corpora.clean -f en -e hi -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home1/MachineTranslation/lm/corpora.blm.hi:8 -external-bin-dir ../mosesdecoder-master/tools >& training.out &

Adjust the paths to match your machine. The train-model.perl command takes a very long time, and the time depends on the size of the corpora. The corpora attached to this blog have around 7 lakh (700,000) lines, and training took 7-8 hours for me. All output from the command is redirected to the file training.out in the working directory, so you can follow the progress by opening that file.
To check whether the process has finished, enter the command ps.
When the job is complete, the shell displays a line beginning with [1]+ Done the next time you press Enter.
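
Because training runs in the background, it helps to record its process ID so that you can check on it later, even from another terminal. The sketch below shows the general pattern with a short sleep standing in for train-model.perl; the helper name job_state is hypothetical. In the shell that started the job, $! holds the PID of the most recent background job, and kill -0 tests whether a process of yours exists without sending it a signal.

```shell
# Hypothetical helper: report whether a process with the given PID exists.
job_state() {
    if kill -0 "$1" 2>/dev/null; then
        echo "running"
    else
        echo "finished"
    fi
}

# Pattern (sleep stands in for the real training command):
#   nohup nice sleep 60 > training.out 2>&1 &
#   pid=$!                 # PID of the background job just started
#   job_state "$pid"       # prints "running" while the job is alive
#   tail -f training.out   # follow the log; press Ctrl-C to stop watching
#   wait "$pid"            # block until the job ends (same shell only)
```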



Binarize Phrase Table and Lexical Reordering Table :

After the training is completed, translation works but it is too slow. For a large corpus such as the one in this example, it will not work at all. To get good speed, i.e. translation within a second, you need to binarize some files.


Binarize Phrase Table :
/home1/MachineTranslation/mosesdecoder-master/bin/CreateOnDiskPt 1 1 4 100 2 /home1/MachineTranslation/working/train/model/phrase-table.gz /home1/MachineTranslation/working/phrase-table.1.folder
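Adjust the paths to match your machine. The five numbers are, in order: the number of source factors, the number of target factors, the number of scores per phrase pair (4 here, matching num-features=4 in moses.ini), the table limit (at most 100 translations kept per source phrase) and the index of the score used for sorting. After this command, a folder named phrase-table.1.folder is created in the working directory; it contains the data for the binarised phrase table.

After binarising the phrase table, you need to make some changes to the moses.ini file in the working/train/model directory.
Open moses.ini in an editor and search for the line: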
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home1/MachineTranslation/working/train/model/phrase-table.gz input-factor=0 output-factor=0
Comment out this line by inserting a '#' before it, then add the following line after it:
PhraseDictionaryOnDisk name=TranslationModel0 num-features=4 path=/home1/MachineTranslation/working/phrase-table.1.folder input-factor=0 output-factor=0
Note the changes we have made: PhraseDictionaryMemory becomes PhraseDictionaryOnDisk, and the path points to the new folder. The path will depend on your machine.
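
The same edit can be scripted. The sketch below comments out the PhraseDictionaryMemory line and adds a PhraseDictionaryOnDisk line with the new path, writing the result back to the file. The function name use_ondisk_phrase_table is invented for this example, and it assumes the path contains no '&' or backslash characters (awk treats them specially). Keep a backup of moses.ini before trying it.

```shell
# Hypothetical helper: switch moses.ini from the in-memory phrase table
# to the on-disk one. Usage: use_ondisk_phrase_table <moses.ini> <new-path>
use_ondisk_phrase_table() {
    local ini="$1" new_path="$2" tmp
    tmp=$(mktemp) || return 1
    awk -v p="$new_path" '
    /^PhraseDictionaryMemory / {
        print "#" $0                      # keep the original line, commented out
        line = $0
        sub(/^PhraseDictionaryMemory/, "PhraseDictionaryOnDisk", line)
        sub(/path=[^ ]+/, "path=" p, line)
        print line
        next
    }
    { print }                             # every other line is copied unchanged
    ' "$ini" > "$tmp" && cat "$tmp" > "$ini"
    local status=$?
    rm -f "$tmp"
    return "$status"
}

# Example (paths as used in this post):
#   cp working/train/model/moses.ini working/train/model/moses.ini.bak
#   use_ondisk_phrase_table working/train/model/moses.ini \
#       /home1/MachineTranslation/working/phrase-table.1.folder
```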

Binarize Lexical Reordering Table :
/home1/MachineTranslation/mosesdecoder-master/bin/CreateOnDiskPt 1 1 6 100 2 /home1/MachineTranslation/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz /home1/MachineTranslation/working/reordering-table.wbe-msd-bidirectional-fe.1.gz
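Adjust the paths to match your machine. The third argument is 6 here because the reordering table has six scores per entry (num-features=6 in moses.ini). After this command, a folder named reordering-table.wbe-msd-bidirectional-fe.1.gz is created in the working directory (despite the .gz suffix it is a folder, not a compressed file); it contains the data for the binarised lexical reordering table. Note that Moses also ships a dedicated tool for this step, processLexicalTable, which is the one its documentation describes for reordering tables, so it is worth checking the decoder's start-up messages to confirm that the table is loaded.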

After binarising the lexical reordering table, you need to make some changes to the moses.ini file in the working/train/model directory.
Open moses.ini in an editor and search for the line:
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home1/MachineTranslation/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz
Comment out this line by inserting a '#' before it, then add the following line after it:
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home1/MachineTranslation/working/reordering-table.wbe-msd-bidirectional-fe.1.gz
Note the change we have made: only the path differs, and it now points to the binarised table. The path will depend on your machine.
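
A wrong path in moses.ini is an easy mistake to make when editing by hand. As a quick sanity check, the sketch below lists every path= value on the uncommented lines of a moses.ini and reports the ones that do not exist on disk. The function name check_ini_paths is hypothetical, and the sketch assumes that paths contain no spaces. A reported path is not always an error, since some Moses features take a file prefix rather than a full file name.

```shell
# Hypothetical helper: report path= entries in moses.ini that do not exist.
# Returns 0 when every path exists, 1 otherwise.
check_ini_paths() {
    local ini="$1" p missing=0
    # skip commented lines, then pull out the value of each path=... token
    for p in $(grep -v '^[[:space:]]*#' "$ini" | grep -o 'path=[^ ]*' | cut -d= -f2-); do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            missing=1
        fi
    done
    return "$missing"
}

# Example, run from the MachineTranslation directory:
#   check_ini_paths working/train/model/moses.ini
```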

Testing :

At this stage we are ready for testing. We can now test Moses by translating from English to Hindi.
Run the following command from the MachineTranslation directory to start Moses:
mosesdecoder-master/bin/moses -f working/train/model/moses.ini
Type 'hello' and it should be translated to 'नमस्ते'.
Now you can translate anything from English to Hindi using Moses.
The basic syntax for running Moses is:
<Location of moses binary file> -f <Location of moses.ini>
To translate input from a file (you may need this for translating multiple lines):
mosesdecoder-master/bin/moses -f working/train/model/moses.ini < inputEn.txt > outputHi.txt
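
For repeated use, the file-based command can be wrapped in a small function that checks its inputs first. This is only a sketch: the name translate_file and its argument order are made up for this example, and the decoder is called exactly as in the command above.

```shell
# Hypothetical wrapper around: <moses binary> -f <moses.ini> < input > output
translate_file() {
    local moses_bin="$1" ini="$2" input="$3" output="$4"
    [ -x "$moses_bin" ] || { echo "cannot execute: $moses_bin" >&2; return 1; }
    [ -r "$ini" ]       || { echo "cannot read: $ini" >&2; return 1; }
    [ -r "$input" ]     || { echo "cannot read: $input" >&2; return 1; }
    # one input sentence per line in, one translated line out
    "$moses_bin" -f "$ini" < "$input" > "$output"
}

# Example, run from the MachineTranslation directory:
#   translate_file mosesdecoder-master/bin/moses \
#       working/train/model/moses.ini inputEn.txt outputHi.txt
```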

Comments:

  1. python ../indic_nlp_library-master/src/indicnlp/normalize/indic_normalize.py training/corpora.hi corpora.norm.hi [True]

    File "/home/rajpmanjupc/mosesdecoder/corpus/indic_nlp_library-master/src/indicnlp/normalize/indic_normalize.py", line 83
    text=re.sub(ur'([\u0900-\u097f]):',ur'\\1\u0903',text)
    ^
    SyntaxError: invalid syntax

    text=re.sub(ur'([\u0900-\u097f]):',u'\\1\u0903',text)

  3. Could you explain how you choose the parameters for the CreateOnDiskPt command?

    thank you very much for this great tutorial

  4. Could you please help why Hindi tokenizer not working...its giving same file again

  5. Thanks for giving a clear description on how to create the English-Hindi Translation model using Moses.

    Can you please include evaluation of the model also? I want to find the BLEU score of the generated model. Can you please help for this.


    Thanks in advance.
