In this tutorial we will train Moses to translate text from English to Hindi. Before beginning, you need to have Moses installed on your machine.
See my blog post on installing Moses and its components.
You need a parallel corpus to train Moses for translation from English to Hindi.
Download corpora.en, corpora.hi and the Indic NLP Library.
Corpus input and directory setup :
Make a new directory "corpus" in the MachineTranslation directory. Inside corpus, make a directory "training" and put the two files corpora.en and corpora.hi in it.
Extract the zip file of the Indic NLP Library into the MachineTranslation directory.
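The directory setup described above can be sketched in shell as follows. This is only an illustration: the variable MT_ROOT is my own name for the MachineTranslation directory, and the loop assumes the two downloaded corpus files sit in the current directory.

```shell
# Assumption: MT_ROOT is the MachineTranslation directory (defaults to ./MachineTranslation).
MT_ROOT="${MT_ROOT:-$PWD/MachineTranslation}"

# Create corpus/training in one step (-p also creates missing parent directories).
mkdir -p "$MT_ROOT/corpus/training"

# Copy the two corpus files if they are present in the current directory.
for f in corpora.en corpora.hi; do
    if [ -f "$f" ]; then
        cp "$f" "$MT_ROOT/corpus/training/"
    else
        echo "note: $f not found in $PWD, copy it into $MT_ROOT/corpus/training manually" >&2
    fi
done
```

Set MT_ROOT before running if your MachineTranslation directory lives elsewhere, for example MT_ROOT=/home1/MachineTranslation.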
Pre-process corpora :
Hindi Normalisation :
cd corpus
python ../indic_nlp_library-master/src/indicnlp/normalize/indic_normalize.py training/corpora.hi corpora.norm.hi [True]
Hindi Tokenization :
python ../indic_nlp_library-master/src/indicnlp/tokenize/indic_tokenize.py corpora.norm.hi corpora.tok.hi hi
English Tokenization :
../mosesdecoder-master/scripts/tokenizer/tokenizer.perl -l en < training/corpora.en > corpora.tok.en
Create Truecase model for English :
../mosesdecoder-master/scripts/recaser/train-truecaser.perl --model truecase-model.en --corpus corpora.tok.en
Create Truecase model for Hindi :
../mosesdecoder-master/scripts/recaser/train-truecaser.perl --model truecase-model.hi --corpus corpora.tok.hi
Truecasing English :
../mosesdecoder-master/scripts/recaser/truecase.perl --model truecase-model.en < corpora.tok.en > corpora.true.en
Truecasing Hindi :
../mosesdecoder-master/scripts/recaser/truecase.perl --model truecase-model.hi < corpora.tok.hi > corpora.true.hi
Cleaning of English and Hindi :
../mosesdecoder-master/scripts/training/clean-corpus-n.perl corpora.true en hi corpora.clean 1 80
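In the cleaning command, the trailing 1 and 80 are the minimum and maximum sentence lengths in tokens; sentence pairs outside that range are dropped from both sides. Training needs the two cleaned files to stay line-aligned, so a quick sanity check is worthwhile. The helper below is a hypothetical function of my own (check_parallel is not part of Moses); it only compares line counts.

```shell
# check_parallel BASE SRC TGT
# Succeeds only if BASE.SRC and BASE.TGT have the same number of lines.
check_parallel() {
    # tr strips the padding that some wc implementations add.
    src_lines=$(wc -l < "$1.$2" | tr -d ' ')
    tgt_lines=$(wc -l < "$1.$3" | tr -d ' ')
    echo "$1.$2: $src_lines lines, $1.$3: $tgt_lines lines"
    [ "$src_lines" -eq "$tgt_lines" ]
}
```

From the corpus directory, it could be called as: check_parallel corpora.clean en hi || echo "line counts differ"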
LM Building :
cd ../
mkdir lm
cd lm/
../irstlm-5.80.08/bin/add-start-end.sh < ../corpus/corpora.clean.hi > corpora.sb.hi
export IRSTLM=/home1/MachineTranslation/irstlm-5.80.08/
../irstlm-5.80.08/bin/build-lm.sh -i corpora.sb.hi -t ./tmp -p -s improved-kneser-ney -o corpora.lm.hi
../irstlm-5.80.08/bin/compile-lm corpora.lm.hi.gz --text=yes corpora.arpa.hi
../mosesdecoder-master/bin/build_binary -i corpora.arpa.hi corpora.blm.hi
Training :
cd ../
mkdir working
cd working
nohup nice ../mosesdecoder-master/scripts/training/train-model.perl -root-dir train -corpus ../corpus/corpora.clean -f en -e hi -alignment grow-diag-final-and -reordering msd-bidirectional-fe -lm 0:3:/home1/MachineTranslation/lm/corpora.blm.hi:8 -external-bin-dir ../mosesdecoder-master/tools >& training.out &
Adjust the paths for your machine. The last command takes a very long time; how long depends on the size of the corpora. The corpora attached to this blog are around 7 lakh (700,000) lines, and training took 7-8 hours for me. You can follow progress by opening the file training.out in the working directory, since all output from this command is redirected to that file.
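The -lm argument of the training command packs four colon-separated fields into one value. As I understand the train-model.perl convention, they are factor:order:filename:type, where order 3 is a trigram model and type 8 selects KenLM (which reads the file produced by build_binary). The sketch below only splits the value so each field can be inspected; the variable names are my own.

```shell
# The -lm value from the training command.
lm_spec="0:3:/home1/MachineTranslation/lm/corpora.blm.hi:8"

# Split on ':' into the four fields.
IFS=: read -r lm_factor lm_order lm_path lm_type <<EOF
$lm_spec
EOF

echo "factor=$lm_factor order=$lm_order type=$lm_type"
echo "path=$lm_path"

# The path ends up in moses.ini, so it is worth checking before a long training run.
[ -f "$lm_path" ] || echo "warning: $lm_path does not exist yet" >&2
```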
To check whether the process has completed, enter the command ps.
When the operation is complete, the shell displays a line starting with [1]+ Done.
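Besides ps, the shell can test a background job directly. The sketch below uses sleep as a stand-in for the training command so that it runs anywhere; train_pid, state_before and train_status are names I chose for illustration.

```shell
# Stand-in for the long-running training command.
# In practice this would be the nohup ... train-model.perl ... & line.
sleep 1 &
train_pid=$!    # PID of the most recent background job

# kill -0 sends no signal; it only tests whether the process still exists.
if kill -0 "$train_pid" 2>/dev/null; then
    state_before="running"
else
    state_before="finished"
fi
echo "training job $train_pid is $state_before"

# wait blocks until the job ends and returns its exit status.
wait "$train_pid"
train_status=$?
echo "training job exited with status $train_status"
```

While the job is running, tail -f training.out shows new log lines as they are written. Note that $! and wait only work in the shell that started the job.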
Binarize Phrase Table and Lexical Reordering Table :
After training is completed, translation works but it is too slow. For a large corpus, as in this example, it won't work at all. To work at good speed, i.e. translation within a second, you need to binarize some files.
Binarize Phrase Table :
/home1/MachineTranslation/mosesdecoder-master/bin/CreateOnDiskPt 1 1 4 100 2 /home1/MachineTranslation/working/train/model/phrase-table.gz /home1/MachineTranslation/working/phrase-table.1.folder
Enter the paths as per your PC. After the above command, a folder named phrase-table.1.folder is created in the working directory; it contains the data required for the binarised phrase table.
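The five numbers passed to CreateOnDiskPt are positional. The sketch below gives each one a name, based on my reading of the tool's usage message; treat the names and comments as my interpretation and check them against the usage text printed by your own build. It prints the command instead of running it.

```shell
# Positional arguments of CreateOnDiskPt, named for readability (names are my labels).
num_source_factors=1   # one factor (surface form) on the English side
num_target_factors=1   # one factor (surface form) on the Hindi side
num_scores=4           # matches num-features=4 in moses.ini
table_limit=100        # keep at most 100 translations per source phrase
sort_score_index=2     # index of the score used to rank translations

# Paths as used in this tutorial; adjust to your machine.
MOSES="${MOSES:-/home1/MachineTranslation/mosesdecoder-master}"
WORK="${WORK:-/home1/MachineTranslation/working}"

cmd="$MOSES/bin/CreateOnDiskPt $num_source_factors $num_target_factors $num_scores $table_limit $sort_score_index $WORK/train/model/phrase-table.gz $WORK/phrase-table.1.folder"

# Print the command for inspection instead of running it.
echo "$cmd"
```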
After binarising the phrase table you need to make some changes in the moses.ini file in the working/train/model directory.
Open the moses.ini file in an editor and search for the line:
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home1/MachineTranslation/working/train/model/phrase-table.gz input-factor=0 output-factor=0
Comment the line out by inserting a '#' mark before it, and then enter the following line after it:
PhraseDictionaryOnDisk name=TranslationModel0 num-features=4 path=/home1/MachineTranslation/working/phrase-table.1.folder input-factor=0 output-factor=0
Note the two changes: PhraseDictionaryMemory becomes PhraseDictionaryOnDisk, and the path now points to phrase-table.1.folder. The path will depend on your PC.
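The same moses.ini edit can be scripted instead of done by hand. The sketch below is one possible way to do it with awk, not part of Moses; it works on a throwaway sample file so it is safe to run, and it writes the result to a new file rather than editing in place.

```shell
# Build a small sample moses.ini in a temporary directory.
# To use this for real, point ini at a backup copy of working/train/model/moses.ini.
workdir=$(mktemp -d)
ini="$workdir/moses.ini"
cat > "$ini" <<'EOF'
[feature]
UnknownWordPenalty
PhraseDictionaryMemory name=TranslationModel0 num-features=4 path=/home1/MachineTranslation/working/train/model/phrase-table.gz input-factor=0 output-factor=0
EOF

new_path="/home1/MachineTranslation/working/phrase-table.1.folder"

# Comment out the in-memory entry and add the on-disk entry right after it.
awk -v new_path="$new_path" '
/^PhraseDictionaryMemory / {
    print "#" $0
    sub(/^PhraseDictionaryMemory/, "PhraseDictionaryOnDisk")
    sub(/path=[^ ]+/, "path=" new_path)
    print
    next
}
{ print }
' "$ini" > "$ini.new"

cat "$ini.new"
```

Review the .new file before replacing the original, since a wrong path here only shows up when the decoder starts.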
Binarize Lexical Reordering Table :
/home1/MachineTranslation/mosesdecoder-master/bin/CreateOnDiskPt 1 1 6 100 2 /home1/MachineTranslation/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz /home1/MachineTranslation/working/reordering-table.wbe-msd-bidirectional-fe.1.gz
Enter the paths as per your PC. After the above command, a folder named reordering-table.wbe-msd-bidirectional-fe.1.gz is created in the working directory; it contains the data required for the binarised lexical reordering table.
After binarising the lexical reordering table you need to make some changes in the moses.ini file in the working/train/model directory.
Open the moses.ini file in an editor and search for the line:
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home1/MachineTranslation/working/train/model/reordering-table.wbe-msd-bidirectional-fe.gz
Comment the line out by inserting a '#' mark before it, and then enter the following line after it:
LexicalReordering name=LexicalReordering0 num-features=6 type=wbe-msd-bidirectional-fe-allff input-factor=0 output-factor=0 path=/home1/MachineTranslation/working/reordering-table.wbe-msd-bidirectional-fe.1.gz
Note that only the path changes here. The path will depend on your PC.
Testing :
At this stage we are ready to test Moses for translating from English to Hindi.
Run the following command from the MachineTranslation directory to start Moses:
mosesdecoder-master/bin/moses -f working/train/model/moses.ini
Type 'hello' and it should be translated to 'नमस्ते'.
Now you can translate anything from English to Hindi using Moses.
The basic syntax for running Moses is:
<Location of moses binary file> -f <Location of moses.ini>
To translate input from a file (you may need this for translating multiple lines):
mosesdecoder-master/bin/moses -f working/train/model/moses.ini < inputEn.txt > outputHi.txt
python ../indic_nlp_library-master/src/indicnlp/normalize/indic_normalize.py training/corpora.hi corpora.norm.hi [True]
File "/home/rajpmanjupc/mosesdecoder/corpus/indic_nlp_library-master/src/indicnlp/normalize/indic_normalize.py", line 83
text=re.sub(ur'([\u0900-\u097f]):',ur'\\1\u0903',text)
^
SyntaxError: invalid syntax
text=re.sub(ur'([\u0900-\u097f]):',u'\\1\u0903',text)
Could you explain how you choose the parameters for the CreateOnDiskPt command?
Thank you very much for this great tutorial.
Could you please help with why the Hindi tokenizer is not working? It gives the same file back.
Thanks for giving a clear description of how to create the English-Hindi translation model using Moses.
Can you please include evaluation of the model as well? I want to find the BLEU score of the generated model. Can you please help with this?
Thanks in advance.