Giter Site home page Giter Site logo

Comments (11)

kosloot avatar kosloot commented on May 28, 2024 1

still puzzling...
Somehow its seems that programs get slower with increasing size of the OUTPUT files.
Checked this with 'mblem' too.
First thought: 'pv' get in the way. But also without pv is slows down.
Next guess: using cout and redirecting slows it down.
So I added an output parameter to mblem, to write directly to a file.
But to no avail... at least, nor really.
But for output files < 1.5 Gb it seems reasonable. So splitting seems the way to go.
Bit still....
This is nerve-wrecking....

from frog.

kosloot avatar kosloot commented on May 28, 2024 1

Hi,
thanks for the update.
to start of: I am surprised about the numbers files you mention. We are used to processing 1000 files per hour or so. So probably your files are very, very large?

The idea of thread starving might be interesting to look at, but:
drum-roll...
I am working on a re-implementation of Frog, aiming at speed, memory footprint and less locking.
So fixing the current Frog is not a high priority.
The work is going well, but slow. A first release is scheduled a month (or two...) form now.

Feel free to checkout the new_datastructure branch of frog and test it.
You will also need the foliareader branch for libfolia then.
NO WARRANTY

from frog.

kosloot avatar kosloot commented on May 28, 2024

Well, i am not sure what is happening here, but my first guess is that the process gets low on memory.
Do you have enough RAM to run this kind of large input?
frog will construct a large FoLiA document, covering the whole input. I wouldn't be surprised if it needs 50G of RAM or more.

This is the result of an unfortunate design decision long ago.
For the time being, it might be a solution to split the input file into smaller parts.

You also might experiment with the -n option (one sentence per line)
or the --threads option, setting it to 1

from frog.

gmjonker avatar gmjonker commented on May 28, 2024

Frog is currenly using 577MB of memory, see screenshot.
I will try the -n and --threads=1.
If it doesn't help, I will split indeed.
screen shot 2018-05-14 at 17 43 39

from frog.

kosloot avatar kosloot commented on May 28, 2024

Hmm, this is strange.
it might be that Frog is deadlocking on the threads.
is it producing any output?
if you have the rights, you could try 'strace -p 25773'
if the only thing it shows is a futex-wait then your are probably lost.

Side note:
Is this problem reproducible? Or just once in a while?

Is all else fails, is it possible to give me the offensive file?

from frog.

gmjonker avatar gmjonker commented on May 28, 2024

Frog is still producing output, but just very slowly.
The behaviour is always the same.
I should also mention that I'm running frog in proycon/lamachine
I'm now testing with --threads=1

from frog.

kosloot avatar kosloot commented on May 28, 2024

hmm. OK. getting out of clues then....
Is the 2.5 Gb available for me? for analyzing.

from frog.

gmjonker avatar gmjonker commented on May 28, 2024

--threads=1 doesn't help.

from frog.

asharkinasuit avatar asharkinasuit commented on May 28, 2024

An update perhaps: I've been running frog 0.15 (based on ucto 0.13.2, libfolia 1.13, timbl 6.4.12, ticcutils 0.19, mbt 3.3.2) for the past several days on the COW corpus. I knew there was no way that was all fitting in memory, so I split it into files of 10k sentences each and asked frog to run on them using frog --testdir=. -n --skip=cmnp --xmldir=. --nostdout (just to be complete; the point is probably the -n --threads stuff, but I can say I only need the lemmas and didn't want to slow things down with a parse.)
I let it run over the weekend and it looks like it processed about 80 files on Friday, 70 on Saturday and 40 on Sunday. The new clue to the problem I would contribute is that it started with 8 threads, but this morning I found it using only one thread (by CPU usage anyway). Memory is not a problem on the machine I'm using, but maybe the threads are dying or getting killed for some reason? Not sure this relates to the above problems, since there apparently threads don't help, but maybe this is still worth looking into.

PS Sorry I didn't check to see if threads died or just CPU use reduced; I can let it run this week and check when it slows down again.

from frog.

asharkinasuit avatar asharkinasuit commented on May 28, 2024

Yea the files are just under a megabyte each, 10k sentences a pop. I suppose we'll see how it goes when the new version comes out.

from frog.

kosloot avatar kosloot commented on May 28, 2024

Well, the new implementation is around for quite a while now. Don't know if it made any difference.
But at the moment work on Frog is on a hold. except for real bugs.
If work will ever be picked up is dependent on real demand, accompanied by some currency.
So closing this issue for now.

from frog.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.