moses-smt / mgiza
A word alignment tool based on famous GIZA++, extended to support multi-threading, resume training and incremental training.
We're going to close this page soon. Please use the Moses mailing list (moses-support) instead. Don't forget to subscribe to the mailing list before you post:
http://mailman.mit.edu/mailman/listinfo/moses-support
You can still create a pull request if you have bug fixes you want to share.
Hi!
I have been using mgiza and I have noticed that the generated files do not contain the same information across different executions, not even the same number of lines. This happens when -ncpus != 1. I have tested using the same files and changing -ncpus to 1, 2 and 8. Only when -ncpus 1 is provided did the two executions produce exactly the same output files.
Command:
ncpus="1" # deterministic
#ncpus="2" # non-deterministic
#ncpus="8" # non-deterministic
for iteration in 1 2; do
  mgiza -ncpus $ncpus -CoocurrenceFile corpus.fr-en.cooc -c corpus.fr-en-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -mh 5 -m5 0 -model1dumpfrequency 1 -o test${iteration}.ncpus${ncpus}.corpus.fr-en -s corpus.en.vcb -t corpus.fr.vcb -emprobforempty 0.0 -probsmooth 1e-7
done
# Compare each output file of run 1 with its counterpart from run 2, ignoring line order
for f1 in test1.ncpus${ncpus}.corpus.fr-en*; do
  f2=$(echo "$f1" | sed 's/^test1/test2/')
  c=$(comm -3 <(sort "$f1") <(sort "$f2") | wc -l)
  if [[ "$c" != "0" ]]; then
    echo "Not equal: $f1 - $f2"
  fi
done
The files were generated using Bitextor 8.2, with data from this WARC. The files needed to reproduce the results are attached to this issue (for corpus.fr-en.cooc.1.zip and corpus.fr-en.cooc.2.zip you will need to decompress them and run cat corpus.fr-en.cooc.1 corpus.fr-en.cooc.2 > corpus.fr-en.cooc).
input_mgiza.zip
corpus.fr-en.cooc.2.zip
corpus.fr-en.cooc.1.zip
The issue is in the Get_File_Spec() function, in mgizapp/src/file_spec.h:
calling user = getenv("USER"); when the USER variable is not set causes a segmentation fault.
I worked around it by calling export USER='www-data' (I was invoking the training chain from within a PHP script), but is this really necessary just to write a log file?
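The same workaround can be applied by whatever process launches the training chain, without changing the parent environment. What follows is only a minimal Python sketch and not part of mgiza: the helper names env_with_user and run_with_user and the fallback value "unknown" are hypothetical, and the child command is a stand-in for the real mgiza invocation.

```python
import os
import subprocess
import sys


def env_with_user(fallback="unknown"):
    """Return a copy of the current environment in which USER is always set."""
    env = dict(os.environ)
    # Only fill in USER when it is missing; an existing value is left alone.
    env.setdefault("USER", fallback)
    return env


def run_with_user(cmd, fallback="unknown"):
    """Run cmd with USER guaranteed to be defined in the child's environment."""
    return subprocess.run(cmd, env=env_with_user(fallback), check=True,
                          stdout=subprocess.PIPE, universal_newlines=True)


if __name__ == "__main__":
    # Stand-in child process; replace it with the real mgiza command line.
    child = [sys.executable, "-c", "import os; print(os.environ['USER'])"]
    result = run_with_user(child)
    print(result.stdout.strip())
```

This only hides the symptom; the crash itself comes from using the result of getenv without checking whether the variable exists.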
Hi, when I follow the steps cmake . and make, the build fails with this error:
[ 51%] Linking CXX executable ../bin/d4norm
ld: library not found for -lrt
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [bin/d4norm] Error 1
make[1]: *** [src/CMakeFiles/d4norm.dir/all] Error 2
make: *** [all] Error 2
I wonder how to solve this problem, thanks!
Hello, I tried to build probabilistic dictionaries (I need them for training a Bicleaner model), but as a result I get something like:
afterwards NULL 0.0000124
pension NULL 0.0000372
truss NULL 0.0000124
birthday NULL 0.0000744
commemorate NULL 0.0000248
The entire second column is "NULL".
The command I used is:
mosesdecoder/scripts/training/train-model.perl --alignment grow-diag-final-and --root-dir bicleaner_inf/ --corpus bicleaner_inf/corpus.clean --e en --f zh --mgiza -mgiza-cpus 8 --parallel --first-step 1 --last-step 4 --external-bin-dir mgiza/mgizapp/bin/
It looks like the major error occurs in mgiza:
Merging A3.final.part* tables
Executing: enchmodels/mgiza/mgizapp/bin/merge_alignment.py enchmodels/bicleaner_inf/giza.zh-en/zh-en.A3.final.part*> enchmodels/bicleaner_inf/giza.zh-en/zh-en.A3.final
Traceback (most recent call last):
File "enchmodels/mgiza/mgizapp/bin/merge_alignment.py", line 32, in
st1 = files[i].readline();
File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 84: ordinal not in range(128)
Exit code: 1
After that it prints a whole series of errors like:
Use of uninitialized value $a in scalar chomp at enchmodels/mosesdecoder/scripts/training/LexicalTranslationModel.pm line 105
Use of uninitialized value in substitution (s///) at enchmodels/mosesdecoder/scripts/training/LexicalTranslationModel.pm line 40.
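The traceback above shows Python 3.5 decoding the A3.final part files with the ascii codec, which fails on the first non-ASCII byte (0xe5 is a common lead byte in UTF-8 encoded Chinese text). A minimal sketch of the usual remedy, assuming the part files are UTF-8, is to open them with an explicit encoding instead of relying on the locale default. The helper name read_first_lines and the sample file content are hypothetical, and this is not the actual code of merge_alignment.py.

```python
import io
import os
import tempfile


def read_first_lines(paths, encoding="utf-8"):
    """Read the first line of each file, decoding with an explicit encoding."""
    lines = []
    for path in paths:
        # Passing encoding= makes decoding independent of LANG/LC_ALL.
        with io.open(path, "r", encoding=encoding) as handle:
            lines.append(handle.readline())
    return lines


if __name__ == "__main__":
    # Build a tiny stand-in part file that contains the byte 0xe5.
    with tempfile.TemporaryDirectory() as tmp:
        part = os.path.join(tmp, "zh-en.A3.final.part000")
        with io.open(part, "wb") as out:
            out.write(b"\xe5\xa5\xbd NULL\n")
        print(read_first_lines([part]))
```

An alternative that needs no code change is to run the training under a UTF-8 locale, since open() without an encoding argument falls back to the locale's preferred encoding.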
I can't find any official documentation or usage of mgiza.
Please add some usage notes or a tutorial with an example to the README.
I am trying to produce word alignments for individual sentences. For this purpose I am using the "force align" functionality of mgiza++. Unfortunately, when I load a big N table (fertility), mgiza crashes with a segmentation fault.
In particular, I have initially run mgiza on the full training parallel corpus using the default settings of the Moses script:
/project/qtleap/software/moses-2.1.1/bin/training-tools/mgiza -CoocurrenceFile /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de.cooc -c /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en-de-int-train.snt -m1 5 -m2 0 -m3 3 -m4 3 -model1dumpfrequency 1 -model4smoothfactor 0.4 -ncpus 24 -nodumps 0 -nsmooth 4 -o /local/tmp/elav01/selection-mechanism/systems/de-en/training/giza.1/en-de -onlyaldumps 0 -p0 0.999 -s /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/de.vcb -t /local/tmp/elav01/selection-mechanism/systems/de-en/training/prepared.1/en.vcb
Afterwards, via the mgiza force-align script, I run the following command:
/project/qtleap/software/moses-2.1.1/mgizapp-code/mgizapp//bin/mgiza giza.en-de/en-de.gizacfg -c /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en-de.snt -o /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de -s /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./de.vcb -t /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/prepared./en.vcb -m1 0 -m2 0 -mh 0 -coocurrence /local/tmp/elav01/selection-mechanism/systems/de-en/falign/qtmp_SOVBrE/giza./en-de.cooc -restart 11 -previoust giza.en-de/en-de.t3.final -previousa giza.en-de/en-de.a3.final -previousd giza.en-de/en-de.d3.final -previousn giza.en-de/en-de.n3.final -previousd4 giza.en-de/en-de.d4.final -previousd42 giza.en-de/en-de.D4.final -m3 0 -m4 1
This runs fine, until I get the following error:
We are going to load previous N model from giza.en-de/en-de.n3.final
Reading fertility table from giza.en-de/en-de.n3.final
Segmentation fault (core dumped)
The n-table that is failing has about 300k entries. For this reason, I thought I should check whether the size is the problem, so I truncated the table to 60k entries. And it works! But the alignments are not good.
I am struggling to fix this, so any help would be appreciated. I am running a freshly installed mgiza on Ubuntu 12.04.
Is there any README or documentation on how to run incremental training? I couldn't find any related materials. Please help.
Hi,
Why do I get "permission denied" when I try to run mgiza from the Moses train-model.perl script? What am I doing wrong? Please help.
I used 64 CPUs to run mgiza, and my training schedule is five iterations of IBM Model 1, five of HMM, three of Model 3 and three of Model 4. But when I run the first HMM iteration, thread No. 9 always fails. It returns error code 139 and no output files are written. What happened? Could you tell me what code 139 means?
Thank you.
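On the exit code itself: by shell convention, a status above 128 usually means the process was terminated by signal number (status - 128), so 139 corresponds to signal 11, which is SIGSEGV (segmentation fault) on Linux. The following is a small sketch of that arithmetic using Python's standard signal module; the function name describe_exit_status is made up for illustration, and it says nothing about why the thread crashed.

```python
import signal


def describe_exit_status(status):
    """Interpret a shell-style exit status, mapping 128+N to signal N."""
    if status > 128:
        signum = status - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # The number does not correspond to a signal known on this platform.
            name = "unknown signal"
        return "killed by signal {} ({})".format(signum, name)
    return "exited with status {}".format(status)


if __name__ == "__main__":
    print(describe_exit_status(139))
```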
As of this writing it is http://www.kyloo.net/software/ instead of http://geek.kyloo.net/software/
Sorry if this isn't the correct place to ask, but I noticed that this repository is still active and I didn't see any other information about support.
I asked a question on stackoverflow regarding my issues with installing mgiza.
Does anyone have experience with this kind of problem? In particular, is there something I should change in CMakeLists.txt to make installation work?
I have trained a model on a big corpus, and now I want to obtain alignments for some new data.
As in ./scripts/force-align-moses.sh, the .vcb and .cooc files are newly generated and the .classes files are reused from the existing model; then I use mgiza to obtain the results.
However, nearly half of the sentences are missing from en2cn.A3.final.part000-047, hence I can't use ./scripts/merge_result.py to merge the results.
Where could my problem be?
When I try to build mgiza with cmake, I get the output below.
Could you please tell me what the reason may be?
export BOOST_ROOT=/home/user1/moses/boost
cmake .
-- The C compiler identification is GNU 4.9.3
-- The CXX compiler identification is GNU 4.9.3
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- You have not set the install dir, default to './inst', if
you want to set it, use cmake -DCMAKE_INSTALL_PREFIX to do so
-- Performing Test TR1_SHARED_PTR_USE_TR1_MEMORY
-- Performing Test TR1_SHARED_PTR_USE_TR1_MEMORY - Success
-- Performing Test TR1_UNORDERED_MAP_USE_TR1_UNORDERED_MAP
-- Performing Test TR1_UNORDERED_MAP_USE_TR1_UNORDERED_MAP - Success
Boost 1.41 found.
Found Boost components:
thread;system
Boost found
-- Boost_INCLUDE_DIR :
-- Found Threads: TRUE
-- Configuring done
-- Generating done
-- Build files have been written to: /home/user1/Word_Alignment/a/mgiza/mgizapp
make
Scanning dependencies of target snt2coocrmp
[ 1%] Building CXX object src/CMakeFiles/snt2coocrmp.dir/snt2cooc-reduce-mem-preprocess.cpp.o
[ 2%] Linking CXX executable ../bin/snt2coocrmp
[ 2%] Built target snt2coocrmp
Scanning dependencies of target snt2plain
[ 4%] Building CXX object src/CMakeFiles/snt2plain.dir/snt2plain.cpp.o
[ 5%] Linking CXX executable ../bin/snt2plain
[ 5%] Built target snt2plain
Scanning dependencies of target snt2cooc
[ 7%] Building CXX object src/CMakeFiles/snt2cooc.dir/snt2cooc.cpp.o
[ 8%] Linking CXX executable ../bin/snt2cooc
[ 8%] Built target snt2cooc
Scanning dependencies of target mgiza_lib
[ 10%] Building CXX object src/CMakeFiles/mgiza_lib.dir/alignment.cpp.o
[ 11%] Building CXX object src/CMakeFiles/mgiza_lib.dir/AlignTables.cpp.o
[ 13%] Building CXX object src/CMakeFiles/mgiza_lib.dir/ATables.cpp.o
[ 14%] Building C object src/CMakeFiles/mgiza_lib.dir/cmd.c.o
[ 16%] Building CXX object src/CMakeFiles/mgiza_lib.dir/collCounts.cpp.o
[ 17%] Building CXX object src/CMakeFiles/mgiza_lib.dir/Dictionary.cpp.o
[ 19%] Building CXX object src/CMakeFiles/mgiza_lib.dir/ForwardBackward.cpp.o
[ 20%] Building CXX object src/CMakeFiles/mgiza_lib.dir/getSentence.cpp.o
[ 22%] Building CXX object src/CMakeFiles/mgiza_lib.dir/hmm.cpp.o
[ 23%] Building CXX object src/CMakeFiles/mgiza_lib.dir/HMMTables.cpp.o
[ 25%] Building CXX object src/CMakeFiles/mgiza_lib.dir/logprob.cpp.o
[ 26%] Building CXX object src/CMakeFiles/mgiza_lib.dir/model1.cpp.o
[ 27%] Building CXX object src/CMakeFiles/mgiza_lib.dir/model2.cpp.o
[ 29%] Building CXX object src/CMakeFiles/mgiza_lib.dir/model2to3.cpp.o
[ 30%] Building CXX object src/CMakeFiles/mgiza_lib.dir/model345-peg.cpp.o
[ 32%] Building CXX object src/CMakeFiles/mgiza_lib.dir/model3.cpp.o
/home/user1/Word_Alignment/a/mgiza/mgizapp/src/model3.cpp:735:0: warning: "TRAIN_ARGS" redefined
#define TRAIN_ARGS perp, trainViterbiPerp, sHandler1, true, alignfile.c_str(), true, modelName,is_final
^
/home/user1/Word_Alignment/a/mgiza/mgizapp/src/model3.cpp:481:0: note: this is the location of the previous definition
#define TRAIN_ARGS perp, trainViterbiPerp, sHandler1, dump_files, alignfile.c_str(), true, modelName,is_final
^
[ 33%] Building CXX object src/CMakeFiles/mgiza_lib.dir/model3_viterbi.cpp.o
[ 35%] Building CXX object src/CMakeFiles/mgiza_lib.dir/model3_viterbi_with_tricks.cpp.o
[ 36%] Building CXX object src/CMakeFiles/mgiza_lib.dir/MoveSwapMatrix.cpp.o
[ 38%] Building CXX object src/CMakeFiles/mgiza_lib.dir/myassert.cpp.o
[ 39%] Building CXX object src/CMakeFiles/mgiza_lib.dir/NTables.cpp.o
[ 41%] Building CXX object src/CMakeFiles/mgiza_lib.dir/Parameter.cpp.o
[ 42%] Building CXX object src/CMakeFiles/mgiza_lib.dir/parse.cpp.o
[ 44%] Building CXX object src/CMakeFiles/mgiza_lib.dir/Perplexity.cpp.o
[ 45%] Building CXX object src/CMakeFiles/mgiza_lib.dir/reports.cpp.o
[ 47%] Building CXX object src/CMakeFiles/mgiza_lib.dir/SetArray.cpp.o
[ 48%] Building CXX object src/CMakeFiles/mgiza_lib.dir/transpair_model3.cpp.o
[ 50%] Building CXX object src/CMakeFiles/mgiza_lib.dir/transpair_model4.cpp.o
[ 51%] Building CXX object src/CMakeFiles/mgiza_lib.dir/transpair_model5.cpp.o
[ 52%] Building CXX object src/CMakeFiles/mgiza_lib.dir/TTables.cpp.o
[ 54%] Building CXX object src/CMakeFiles/mgiza_lib.dir/utility.cpp.o
[ 55%] Building CXX object src/CMakeFiles/mgiza_lib.dir/vocab.cpp.o
[ 57%] Linking CXX static library ../lib/libmgiza.a
[ 57%] Built target mgiza_lib
Scanning dependencies of target mgiza
[ 58%] Building CXX object src/CMakeFiles/mgiza.dir/main.cpp.o
[ 60%] Linking CXX executable ../bin/mgiza
[ 60%] Built target mgiza
Scanning dependencies of target hmmnorm
[ 61%] Building CXX object src/CMakeFiles/hmmnorm.dir/hmmnorm.cxx.o
[ 63%] Linking CXX executable ../bin/hmmnorm
[ 63%] Built target hmmnorm
Scanning dependencies of target plain2snt
[ 64%] Building CXX object src/CMakeFiles/plain2snt.dir/plain2snt.cpp.o
[ 66%] Linking CXX executable ../bin/plain2snt
[ 66%] Built target plain2snt
Scanning dependencies of target symal
[ 67%] Building CXX object src/CMakeFiles/symal.dir/symal.cpp.o
[ 69%] Building C object src/CMakeFiles/symal.dir/cmd.c.o
[ 70%] Linking CXX executable ../bin/symal
[ 70%] Built target symal
Scanning dependencies of target d4norm
[ 72%] Building CXX object src/CMakeFiles/d4norm.dir/d4norm.cxx.o
[ 73%] Linking CXX executable ../bin/d4norm
[ 73%] Built target d4norm
Scanning dependencies of target mkcls
[ 75%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/GDAOptimization.cpp.o
[ 76%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/general.cpp.o
[ 77%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/HCOptimization.cpp.o
[ 79%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/IterOptimization.cpp.o
[ 80%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/KategProblem.cpp.o
[ 82%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/KategProblemKBC.cpp.o
[ 83%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/KategProblemTest.cpp.o
[ 85%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/KategProblemWBC.cpp.o
[ 86%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/mkcls.cpp.o
[ 88%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/MYOptimization.cpp.o
[ 89%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/Optimization.cpp.o
[ 91%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/Problem.cpp.o
[ 92%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/ProblemTest.cpp.o
[ 94%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/RRTOptimization.cpp.o
[ 95%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/SAOptimization.cpp.o
[ 97%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/StatVar.cpp.o
[ 98%] Building CXX object src/mkcls/CMakeFiles/mkcls.dir/TAOptimization.cpp.o
[100%] Linking CXX executable ../../bin/mkcls
[100%] Built target mkcls
make install
[ 2%] Built target snt2coocrmp
[ 5%] Built target snt2plain
[ 8%] Built target snt2cooc
[ 57%] Built target mgiza_lib
[ 60%] Built target mgiza
[ 63%] Built target hmmnorm
[ 66%] Built target plain2snt
[ 70%] Built target symal
[ 73%] Built target d4norm
[100%] Built target mkcls
Install the project...
-- Install configuration: ""
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/lib/libmgiza.a
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./mgiza
-- Set runtime path of "inst/./mgiza" to ""
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./snt2cooc
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./snt2plain
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./plain2snt
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./symal
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./hmmnorm
-- Set runtime path of "inst/./hmmnorm" to ""
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./d4norm
-- Set runtime path of "inst/./d4norm" to ""
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./snt2coocrmp
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./mkcls
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./force-align-moses.sh
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./giza2bal.pl
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./merge_alignment.py
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./plain2snt-hasvcb.py
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./sntpostproc.py
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./force-align-moses-old.sh
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./run.sh
-- Installing: /home/user1/Word_Alignment/a/mgiza/mgizapp/inst/./snt2cooc.pl
Your web site is unreachable.
$ curl http://www.kyloo.net/software/
curl: (7) Failed to connect to www.kyloo.net port 80: Network is unreachable
I just installed the latest Cygwin to build MGIZA++ for the first time. With small or big input, it segfaults every time in the tmodel constructor called from main.cpp:585. The CoocurrenceFile does exist.
Any ideas on what to do?
Is there a reasonably current Visual Studio version out there that builds it? I'm pretty rusty with gdb.
Is there a similar tool implemented with C# somewhere?
Thanks.