Comments (31)

lpla avatar lpla commented on May 26, 2024

We are still updating the whole documentation. In fact, we created this page a few hours ago: https://github.com/bitextor/bitextor/wiki/Sample-of-a-bitext

from bitextor.

bhaddow avatar bhaddow commented on May 26, 2024

This seems ideal - I will give it a try.

bhaddow avatar bhaddow commented on May 26, 2024

This is the output I get when running the sample:

close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr
close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr

bhaddow avatar bhaddow commented on May 26, 2024

The tmx got chopped ...

<?xml version="1.0"?>
<tmx version="1.4">
 <header
   adminlang="en"
   srclang="en"
   o-tmf="PlainText"
   creationtool="bitextor"
   creationtoolversion="4.0"
   datatype="PlainText"
   segtype="sentence"
   creationdate="20180621T104019"
   o-encoding="utf-8">
 </header>
 <body>
 </body>
</tmx>
 

bhaddow avatar bhaddow commented on May 26, 2024

I'm not sure how to debug this - the output doesn't give me much of a clue.

lpla avatar lpla commented on May 26, 2024

Did you try the options -L PATH and/or -I PATH? These options help by exposing all the intermediate/temporary logs and files.

bhaddow avatar bhaddow commented on May 26, 2024

All the log entries are about image/pdf/bib files apart from this one:

Jun 21, 2018 10:50:28 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.

bhaddow avatar bhaddow commented on May 26, 2024

It is downloading websites, but does not seem to align any:

-rw-r--r-- 1 bhaddow users      0 Jun 21 10:50 bitextorlett2lettr.log
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:50 bitextorlett2idx.log
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:50 bitextoridx2ridx_lang2-lang1.log
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:50 bitextoridx2ridx_lang1-lang2.log
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:50 bitextoralignsegments.log
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:50 bitextoraligndocuments.log
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:50 bitextorcleantextalign.log
-rwxr--r-- 1 bhaddow users    229 Jun 21 10:54 sample.sh
-rw-r--r-- 1 bhaddow users    842 Jun 21 10:54 bitextorett2lett.log
-rw-r--r-- 1 bhaddow users   2892 Jun 21 10:54 bitextorcrawl.log
-rw-r--r-- 1 bhaddow users  86387 Jun 21 10:54 crawl
-rw-r--r-- 1 bhaddow users    946 Jun 21 10:54 bitextorcrawl2ett.log
-rw-r--r-- 1 bhaddow users  48329 Jun 21 10:54 crawl2ett
-rw-r--r-- 1 bhaddow users  43989 Jun 21 10:54 ett2lett
-rw-r--r-- 1 bhaddow users  45196 Jun 21 10:54 lett2lettr
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:54 distancefilter-lang2-lang1
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:54 distancefilter-lang1-lang2
-rw-r--r-- 1 bhaddow users     26 Jun 21 10:54 bitextordistancefilter_lang1-lang2.log
-rw-r--r-- 1 bhaddow users     26 Jun 21 10:54 bitextordistancefilter_lang2-lang1.log
-rw-r--r-- 1 bhaddow users   7752 Jun 21 10:54 lett2idx
-rw-r--r-- 1 bhaddow users     37 Jun 21 10:54 idx2ridx-lang2-lang1
-rw-r--r-- 1 bhaddow users     37 Jun 21 10:54 idx2ridx-lang1-lang2
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:54 aligndocuments
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:54 alignsegments
-rw-r--r-- 1 bhaddow users      0 Jun 21 10:54 distancefilter

lpla avatar lpla commented on May 26, 2024

It looks like something is wrong with the distancefilter files, as they are empty. My files match yours in name and size except for distancefilter-lang?-lang?, distancefilter, aligndocuments and alignsegments.

Maybe it is a problem with the vocabulary/dictionary file en-ca.dic? I updated the URL on the wiki page; it now points to the raw file instead of the GitHub page (which is bad for wget).

bhaddow avatar bhaddow commented on May 26, 2024

I checked out bitextor-data to get the dictionary. It looks OK:

buri]bhaddow: head /home/bhaddow/code/bitextor/data-github/dics/en-ca.dic
en      ca
3GPP    3GPP
A/H1N1  A/H1N1
A/H5N1  A/H5N1
ADP     ADP
AFP     AFP
ANSI    ANSI
AT&T    AT&T
ATP     ATP
API     API

lpla avatar lpla commented on May 26, 2024

Wow, this is really weird. I am unable to reproduce the problem and I have no clue which error is occurring underneath, as the stderr messages are anything but descriptive.

Could you upload a zip with all the files you listed with ls -l a few comments back? I will diff them against mine to look for differences between your run and mine.

bhaddow avatar bhaddow commented on May 26, 2024

Sure
http://www.statmt.org/bhaddow/for-leo.tgz

bhaddow avatar bhaddow commented on May 26, 2024

This message that appears in the output:

close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr
close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr

suggests that a Python program somewhere in the pipeline is trying to report an error, but failing because of a broken pipe.
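
For readers unfamiliar with the mechanism, here is a minimal Python sketch of what a broken pipe looks like from the writer's side. It is not bitextor code: the function name is made up for illustration, and it assumes a POSIX system where CPython ignores SIGPIPE (its default), so the failure surfaces as an exception instead of killing the process.

```python
import os


def write_after_reader_closed(message: bytes) -> str:
    """Write to a pipe whose read end is already closed and report what happens."""
    read_fd, write_fd = os.pipe()
    os.close(read_fd)  # the "downstream" process has gone away
    try:
        os.write(write_fd, message)
        return "written"
    except BrokenPipeError:
        # With SIGPIPE ignored, the kernel's EPIPE error becomes this exception.
        return "broken pipe"
    finally:
        os.close(write_fd)


print(write_after_reader_closed(b"Traceback: something went wrong\n"))
```

As far as I know, when Python 2 hits the same failure while flushing its streams at interpreter shutdown, it prints the "close failed in file object destructor" lines quoted above instead of the original error.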

lpla avatar lpla commented on May 26, 2024

Sorry for the delayed reply. It has been a holiday weekend in Alicante since last Friday.

I found some strange behaviour when diffing against the tgz you posted above. The lett2idx files have different content, which is not a good sign. I noticed that some words with accents are split: "traducció", for example, has two entries in your file, "traducci" and "ó". That points to a problem with NLTK's wordpunct_tokenize, but it is weird that no specific error is shown apart from the sys.stderr ones.
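
To illustrate the symptom: NLTK's wordpunct_tokenize is, as far as I know, a regular-expression tokenizer built on the pattern \w+|[^\w\s]+. The Python 3 sketch below uses only the re module with that pattern, and forces ASCII-only matching to imitate a tokenizer that does not treat accented letters as word characters. Whether this is what actually went wrong on the affected machine is an assumption; nothing in the diff confirms it.

```python
import re

# Pattern assumed to match the one behind NLTK's wordpunct_tokenize.
WORDPUNCT = r"\w+|[^\w\s]+"


def tokenize(text: str, unicode_aware: bool = True) -> list:
    """Split text into runs of word characters and runs of punctuation."""
    # re.ASCII restricts \w to [a-zA-Z0-9_], so accented letters stop being "word" characters.
    flags = 0 if unicode_aware else re.ASCII
    return re.findall(WORDPUNCT, text, flags)


print(tokenize("traducció"))                       # accented letter kept inside the word
print(tokenize("traducció", unicode_aware=False))  # word is cut before the accented letter
```

With ASCII-only matching, the accented letter falls into the "punctuation" branch of the pattern, which gives the same two entries seen in the lett2idx file.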

I did not find any other relevant difference or any clue that allowed me to reproduce the problem in any available machines we have (Ubuntu 14.04, 16.04 and 18.04 clean installed).

Could I have more information or access to a virtualenv, machine image or the machine itself?

bhaddow avatar bhaddow commented on May 26, 2024

There's possibly a difference in my locale which is affecting tokenisation.

I'm running on ubuntu 16.04. I'm starting to have a look at it now. I will try to set it up on Azure to see if I have the same problem, and so that I can share the machine.

bhaddow avatar bhaddow commented on May 26, 2024

I'm finding this really hard to debug. Below is a listing of my working and temporary directories. Where does the working directory start to differ from yours? Can you post your working directory somewhere?
Are any of the empty files suspicious (distancefilter, idx)?

[buri]bhaddow: ls -l working/
total 244
-rw-r--r-- 1 bhaddow users 86387 Jun 26 16:58 crawl
-rw-r--r-- 1 bhaddow users 48329 Jun 26 16:58 crawl2ett
-rw-r--r-- 1 bhaddow users     0 Jun 26 16:58 distancefilter-lang1-lang2
-rw-r--r-- 1 bhaddow users     0 Jun 26 16:58 distancefilter-lang2-lang1
-rw-r--r-- 1 bhaddow users 43989 Jun 26 16:58 ett2lett
-rw-r--r-- 1 bhaddow users    37 Jun 26 16:58 idx2ridx-lang1-lang2
-rw-r--r-- 1 bhaddow users    37 Jun 26 16:58 idx2ridx-lang2-lang1
-rw-r--r-- 1 bhaddow users  7752 Jun 26 16:58 lett2idx
-rw-r--r-- 1 bhaddow users 45196 Jun 26 16:58 lett2lettr
[buri]bhaddow: ls -l /tmp/BUILDDICTTMP.C3KWIj/
total 648
prw-r--r-- 1 bhaddow users      0 Jun 26 16:58 crawl.ZTFdpH
-rw------- 1 bhaddow users      0 Jun 26 16:58 ett.4NJG6Z
-rw------- 1 bhaddow users 602791 Jun 26 16:58 hunalign_dic.4Mgyio
-rw------- 1 bhaddow users   7752 Jun 26 16:58 idx.gKVozP
-rw------- 1 bhaddow users      0 Jun 26 16:58 idx.XBBqBw
prw-r--r-- 1 bhaddow users      0 Jun 26 16:58 index_pipe.7dIqs3
prw-r--r-- 1 bhaddow users      0 Jun 26 16:58 index_pipe.nsG9qc
prw-r--r-- 1 bhaddow users      0 Jun 26 16:58 lett.0ZG9yL
-rw------- 1 bhaddow users  45196 Jun 26 16:58 lettr.HCfzut
prw-r--r-- 1 bhaddow users      0 Jun 26 16:58 output_pipe.Vg9KYV
-rw------- 1 bhaddow users      0 Jun 26 16:58 ridx.eNy2M4
-rw------- 1 bhaddow users      0 Jun 26 16:58 ridx.kPRanZ

bhaddow avatar bhaddow commented on May 26, 2024

Running on Azure appears to work. I get a tmx file with content, and I do not have the strange tokenisation issues that you noticed. I have some content in the distancefilter files (see below).

Do you think the lett2idx problem is the cause?

-rw-rw-r-- 1 bhaddow bhaddow  4404 Jun 26 16:41 aligndocuments
-rw-rw-r-- 1 bhaddow bhaddow  4405 Jun 26 16:41 alignsegments
-rw-rw-r-- 1 bhaddow bhaddow 86387 Jun 26 16:41 crawl
-rw-rw-r-- 1 bhaddow bhaddow 48329 Jun 26 16:41 crawl2ett
-rw-rw-r-- 1 bhaddow bhaddow    16 Jun 26 16:41 distancefilter
-rw-rw-r-- 1 bhaddow bhaddow    30 Jun 26 16:41 distancefilter-lang1-lang2
-rw-rw-r-- 1 bhaddow bhaddow    29 Jun 26 16:41 distancefilter-lang2-lang1
-rw-rw-r-- 1 bhaddow bhaddow 43989 Jun 26 16:41 ett2lett
-rw-rw-r-- 1 bhaddow bhaddow    38 Jun 26 16:41 idx2ridx-lang1-lang2
-rw-rw-r-- 1 bhaddow bhaddow    37 Jun 26 16:41 idx2ridx-lang2-lang1
-rw-rw-r-- 1 bhaddow bhaddow  7629 Jun 26 16:41 lett2idx
-rw-rw-r-- 1 bhaddow bhaddow 45196 Jun 26 16:41 lett2lettr

lpla avatar lpla commented on May 26, 2024

That last file listing is exactly right. I have the same sizes for all files. So I guess there must be something odd about the locale on your first machine that we have never encountered.

If any error output is being redirected to stdout (by mistake, of course, and if we find it we will fix it ASAP), it should be in the temporary files idx.XXXXXX (idx.gKVozP in your case) or hunalign_dic.XXXXXX (hunalign_dic.4Mgyio in your case).

Tomorrow I will try to reproduce your problem making changes to the locale variables.

bhaddow avatar bhaddow commented on May 26, 2024

I have fixed the lett2idx problem by upgrading NLTK to v3.3 (I had v3.1 previously). Now it handles non-ascii characters correctly. See #7

However, I still get no output. Comparing the output on my server with the output on Azure, the only difference is in the distancefilter-* files, which are empty on my server. So that's the next thing to check.
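
This kind of size comparison is easy to script. The following Python sketch is only an illustration: size_differences is a made-up helper, and the two directories are created on the fly with file sizes borrowed from the listings in this thread. It reports every file whose size differs between two working directories.

```python
import os
import tempfile


def size_differences(dir_a: str, dir_b: str) -> dict:
    """Map file name -> (size in dir_a, size in dir_b) for files whose sizes differ.

    A file missing from one directory is reported with size None.
    """
    def sizes(path):
        return {name: os.path.getsize(os.path.join(path, name))
                for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))}

    a, b = sizes(dir_a), sizes(dir_b)
    return {name: (a.get(name), b.get(name))
            for name in sorted(set(a) | set(b))
            if a.get(name) != b.get(name)}


# Demo with throwaway directories standing in for the two machines.
with tempfile.TemporaryDirectory() as server, tempfile.TemporaryDirectory() as azure:
    layouts = ((server, {"crawl": 86387, "distancefilter-lang1-lang2": 0}),
               (azure, {"crawl": 86387, "distancefilter-lang1-lang2": 30}))
    for directory, files in layouts:
        for name, size in files.items():
            with open(os.path.join(directory, name), "wb") as handle:
                handle.write(b"x" * size)
    differences = size_differences(server, azure)

print(differences)
```

Only sizes are compared, so files with equal size but different content (which is possible) would need a real diff.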

bhaddow avatar bhaddow commented on May 26, 2024

I see the problem now. bitextor-rank is failing, because keras is not set up correctly (I think it is trying to use the gpu version of tensorflow on a non-gpu server). When I switch to a gpu server it works, and I get the expected corpus.

The error from bitextor-rank is the one that gets swallowed. I'm not sure how to get it into the logs, as that would have made debugging much simpler. It's called in the long piped expression in align_documents_and_segments(). Removing the final tee gets me an error message on the console, but I don't know if that's the right thing to do.

lpla avatar lpla commented on May 26, 2024

I can now partially reproduce the issue. You are right about tensorflow. On a server without a dedicated GPU, I uninstalled the tensorflow package with pip and installed tensorflow-gpu instead. The Python error pops up on the console, but I also get the Tensorflow errors in bitextordistancefilter_lang1-lang2.log. I don't know why your file had only the usual "Using Tensorflow" log line.

I will try to reproduce the problem more closely on Azure, but I still don't have access to it.

bhaddow avatar bhaddow commented on May 26, 2024

The fact that the error gets swallowed on my server could be down to timing. Python tries to print the exception, but the pipe is closed too quickly, hence the messages on the console about the closed sys.stderr.

Would it be sensible for bitextor to exit if bitextor-rank fails? Just giving an empty tmx is confusing since it just makes me think there is no parallel data. The long pipe in align_documents_and_segments() is very hard to debug.
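
One way to make a failing stage visible, sketched in Python rather than in bitextor's own bash: start each stage as a separate process and keep every exit status instead of only the last one. The helper name run_pipeline and the two toy stages are assumptions for illustration and are not part of bitextor.

```python
import subprocess
import sys


def run_pipeline(stages):
    """Run a list of argv lists as a pipeline; return (final stdout, exit codes)."""
    procs = []
    for argv in stages:
        stdin = procs[-1].stdout if procs else subprocess.DEVNULL
        proc = subprocess.Popen(argv, stdin=stdin, stdout=subprocess.PIPE)
        if procs:
            # The parent must not keep the pipe open, or EOF never reaches downstream.
            procs[-1].stdout.close()
        procs.append(proc)
    output, _ = procs[-1].communicate()
    return output, [proc.wait() for proc in procs]


# Toy stand-ins: stage 0 fails without producing output, stage 1 copies stdin to stdout.
failing = [sys.executable, "-c", "import sys; sys.exit(3)"]
copier = [sys.executable, "-c", "import sys; sys.stdout.write(sys.stdin.read())"]

output, codes = run_pipeline([failing, copier])
failed = [index for index, code in enumerate(codes) if code != 0]
print(output, codes, failed)
```

The last stage succeeds and the output is simply empty, which is the same shape as the empty tmx. Only the per-stage exit codes show that stage 0 failed, so a caller can stop there instead of carrying on.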

lpla avatar lpla commented on May 26, 2024

I just committed some small changes to stderr handling. Could you try to reproduce the initial problem with the latest version of 'master'? If the tensorflow GPU error is shown on the console and bitextor-rank is named there, then we are done.

With these changes, all errors should be printed to the console if you don't pass the -L option (as in the wiki sample).

bhaddow avatar bhaddow commented on May 26, 2024

Still the same lack of error messages, with or without -L. In fact even the "tensorflow" message has been removed now.

Note what happens when I try to use keras:

[buri]bhaddow: python -c 'import keras'
Using TensorFlow backend.
Illegal instruction (core dumped)
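
For reference, "Illegal instruction" means the process was killed by the SIGILL signal rather than ending with an ordinary Python exception, which is why there is no traceback. Below is a small Python sketch for decoding such an exit status. describe_returncode is a hypothetical helper; it relies on two conventions: Python's subprocess module reports death by signal as a negative number, and shells report it as 128 plus the signal number.

```python
import signal


def describe_returncode(code: int) -> str:
    """Turn a process exit status into a short human-readable description."""
    if code == 0:
        return "ok"
    # subprocess convention: -N means "killed by signal N".
    # Shell convention: 128 + N means the same thing.
    signal_number = -code if code < 0 else code - 128
    if signal_number > 0:
        try:
            return "killed by " + signal.Signals(signal_number).name
        except ValueError:
            pass  # not a known signal number, fall through to a plain status
    return "exit status %d" % code


print(describe_returncode(-signal.SIGILL))
print(describe_returncode(128 + signal.SIGILL))
print(describe_returncode(1))
```

A check like this around the bitextor-rank call would at least name the signal, even when the process dies before it can print anything itself.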

bhaddow avatar bhaddow commented on May 26, 2024

If I remove the tee from the long pipe, and add the following after the wait

cp $DISTANCEFILTER12OUT  $RINDEX1
cp $DISTANCEFILTER21OUT  $RINDEX2

then I get a better error message. Well it at least tells me that something went badly wrong, although it doesn't tell me which process in the pipe fails.

It also seems to work when I switch back to the GPU host.

I don't understand the need for the tee. You want the same output in $DISTANCEFILTER12OUT and $RINDEX1, right? So just send stdout to one, and then copy?

lpla avatar lpla commented on May 26, 2024

OK, first of all, your keras error is different from mine. So I will try to reproduce this one first, but without Azure access I don't know how easy that will be.

The tensorflow backend message was removed intentionally in the latest commit.

What I don't understand, and what is freaking me out, is that tee is hiding your error messages.

I know that $DISTANCEFILTER??OUT and $RINDEX? have the same value, but $DISTANCEFILTER12OUT only holds a path if you use the -I option. Otherwise it is empty, so your cp will not work without -I. Personally, I am not a huge fan of this practice, but this is how bitextor was designed years ago.

bhaddow avatar bhaddow commented on May 26, 2024

Does this work?

I replaced the call to bitextor-rank with

(/home/bhaddow/code/bitextor/install/bin/bitextor-rank $DOCSIMTHRESHOLD -m $MODEL -w $WEIGHTS && echo "bitextor-rank failed")

Without the -L option I get the output below on the console, with -L the output ends up in the log file

[buri]bhaddow: ./sample.sh 
/home/bhaddow/code/bitextor/install/bin/bitextor: line 356: 30429 Illegal instruction     (core dumped) /home/bhaddow/code/bitextor/install/bin/bitextor-rank $DOCSIMTHRESHOLD -m $MODEL -w $WEIGHTS
close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr
close failed in file object destructor:
sys.excepthook is missing
lost sys.stderr
<?xml version="1.0"?>
<tmx version="1.4">
 <header
   adminlang="en"
   srclang="en"
   o-tmf="PlainText"
   creationtool="bitextor"
   creationtoolversion="4.0"
   datatype="PlainText"
   segtype="sentence"
   creationdate="20180627T124916"
   o-encoding="utf-8">
 </header>
 <body>
 </body>
</tmx>

lpla avatar lpla commented on May 26, 2024

I just got the exact error that was at the origin of this issue (the Illegal instruction one). Now I know why it is happening and why we never saw it. It is a problem with Tensorflow 1.6 and certain AMD CPUs. It is not related to GPUs. I tried to replicate your error on Azure (now that I have access), but I could not reproduce either the error or the missing messages about it. I always got problems with Tensorflow, but the errors were reported correctly and the messages were informative enough to debug.

Now, on the only AMD CPU server we have in Alacant, I got the error after downgrading Tensorflow from 1.8 to 1.6, which seems to be the only affected version. I bet you also have this version installed on your server.

So tomorrow I will test the last trick you posted and work out a fix for this.

You definitely found a tangled bug.

bhaddow avatar bhaddow commented on May 26, 2024

It's good that you managed to reproduce the problem.

So the main problem was that my environment was messed up. I ran the "make install" of bitextor on a different server, so I didn't notice the keras/tensorflow issue.

I would say that the bug in bitextor is that it hid the error message, making it hard to see what was happening. So it would be good if we could find a way to avoid swallowing error messages.

Another enhancement request would be some progress indication in bitextor. pv can be useful for this, or just more logging.

lpla avatar lpla commented on May 26, 2024

I think I just fixed this problem, and related future ones, by adding some unofficial bash strict mode variables. I didn't see any regressions with this mode in my tests, and bitextor.sh now reports errors in pipelines properly. Could you try these latest commits?
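
For anyone landing here later: "unofficial bash strict mode" usually refers to putting set -euo pipefail at the top of a script. Which options the commits actually added is not stated in this thread, so treat that line as an assumption. The Python sketch below runs the same two-line bash script with and without that prefix; bash_status is a made-up helper, and it needs bash on the PATH.

```python
import subprocess


def bash_status(script: str, strict: bool):
    """Run a bash script, optionally in 'strict mode'; return (exit code, stdout)."""
    prefix = "set -euo pipefail\n" if strict else ""
    result = subprocess.run(["bash", "-c", prefix + script],
                            capture_output=True, text=True)
    return result.returncode, result.stdout


# 'false' stands in for a failing stage such as bitextor-rank; 'cat' for the rest of the pipe.
SCRIPT = "false | cat > /dev/null\necho 'still running'"

print(bash_status(SCRIPT, strict=False))
print(bash_status(SCRIPT, strict=True))
```

Without the prefix, the pipeline's status is that of its last command, so the failure of the first stage is invisible and the script carries on. With pipefail the pipeline reports the failure, and -e stops the script at that point.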

Btw, just for the historical record, the tensorflow error is explained in tensorflow/tensorflow#17411

About the progress bar: it is a really nice idea that we could design and include in our next major release. But we need to think of a good, uniform way to print global or partial progress, because the way data flows through bitextor changes depending on the chosen parameters/modules, and right now some of those modules are not pipeable by design.

bhaddow avatar bhaddow commented on May 26, 2024

Great, thanks, that's much more informative now!

Agreed that there is significant work in adding progress indicators.
