
Comments (23)

MCOfficer commented on May 18, 2024

It feels wrong to use a database as a data exchange format, but I can't seem to find any arguments against it. Weird.

from opendirectorydownloader.

KoalaBear84 commented on May 18, 2024

See https://github.com/KoalaBear84/OpenDirectoryDownloader/releases/tag/v1.9.2.6 for intermediate TXT saving.


Chaphasilor commented on May 18, 2024

One thing that's still good is the parent/child structure, which you can't really do nicely in any line-based format like TXT/CSV.

Yeah, we are aware of that. I thought about either explicitly listing the 'path' for each file (maybe using ids) or adding special lines that indicate the 'path' of each following file (url). Could be pseudo-nested, or explicit.

Thought about this issue before and wanted to completely rewrite it all to use SQLite; then RAM also doesn't matter while still scanning. The SQLite part is too much effort for NOW, but I will hopefully try just around New Year, no promises.

I'm not familiar with SQLite, isn't it a database? If it's just a file format that is easy to parse, that would be nice, but I believe a database would make ODD more complicated and harder to use/wrap.

The Reddit part is already logged in History.log.

I'll check it out tomorrow. Does it only contain the Reddit markdown stuff? The reason I'm asking is because parsing stdout to find and extract that table is more of a workaround than a solution ^^

The TXT part I added (ugly) and will be released in some minutes :)

You're awesome! <3

Already using CSV for other tools so I could also easily add that. Will do when I have some time.

Don't rush this. CSV was simply one format that came to mind for having easily-parsable files with items that contain meta-info. There might be better formats than this.
I'm not opposed to something completely custom, if that's more efficient :)


MCOfficer commented on May 18, 2024

Here's an idea making use of JSONLines. It's not pretty, but I don't believe one can actually represent nested structures in a streaming manner:

{ "type": "root", "url": "https://OD/", "subdirectories": ["public"] }
{ "type": "file", "url": "https://OD/test.txt", "name": "test.txt", "size": 351 }
{ "type": "directory", "url": "https://OD/public", "name": "public", "subdirectories": [] }

No matter how you put it, it will be pretty hard to rebuild a nested structure from this data format, but that's what JSON is for.
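
Rebuilding a tree from records like the ones above could look roughly like this: a minimal Python sketch, where the field names and the longest-URL-prefix matching are assumptions based on the example lines, not ODD's actual output.

```python
import json

# Example JSONLines records, mirroring the ones above.
lines = [
    '{"type": "root", "url": "https://OD/", "subdirectories": ["public"]}',
    '{"type": "file", "url": "https://OD/test.txt", "name": "test.txt", "size": 351}',
    '{"type": "directory", "url": "https://OD/public", "name": "public", "subdirectories": []}',
]

dirs = {}   # url -> directory record
files = []  # file records, attached to parents in a second pass

for line in lines:
    record = json.loads(line)
    if record["type"] in ("root", "directory"):
        record["files"] = []
        dirs[record["url"]] = record
    else:
        files.append(record)

# Attach each file to the directory whose URL is its longest prefix.
for f in files:
    parent = max((u for u in dirs if f["url"].startswith(u)), key=len)
    dirs[parent]["files"].append(f)
```

The two-pass approach works on any record order; a streaming variant would need parents emitted before their children.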


Chaphasilor commented on May 18, 2024

I just put a 'proof of concept' together here:

CodeSandbox: naughty-sinoussi-m4kjt

I'm using a jsonlines-based format, where the first line contains the general info about the scan, the second line contains the directory structure and the following lines contain meta-info about the directories and files.
What's important is that a respective parent directory has to come before any of its child directories and/or files.

The tool can take the regular JSON output and parse it into the new file format, for testing purposes. Only works with small files (<50kB), due to the limitations discussed above.

It can also take the new file format and parse it into the old format, proving that the new format perfectly preserves all info and the OD structure, without many of the previous drawbacks.
If applied correctly, the parser could be optimized in a way that allows processing indefinitely large files (depending on the use case).

Would love to hear your thoughts on this @KoalaBear84 @MCOfficer :D

Edit: The file format is just an example. We could use that one, but if there are even better ways to do it, I'm all for it!


MCOfficer commented on May 18, 2024

The file format would work - I lack the experience to design something better, tbh.

One thing i found counterintuitive is that each file has an ID, which is actually the ID of its parent directory. Should either be named accordingly, or be moved to a per-file basis.


Chaphasilor commented on May 18, 2024

One thing i found counterintuitive is that each file has an ID, which is actually the ID of its parent directory. Should either be named accordingly, or be moved to a per-file basis.

Yeah, those keys could (should) be renamed. I'll think about a better naming scheme tonight!


Chaphasilor commented on May 18, 2024

Okay, I've renamed the parameter to directoryId for type==file and kept it as id for type==dir. This should be clear enough now.

I also fixed a bug that caused the directory names to get lost in translation; now the only difference between the original file and the reconstructed file is the IDs and some slight reordering.

From where I'm standing, the only thing left to do is to implement this in C#/.NET.
I'm just not sure how the JSON output is constructed right now, or whether it can easily be changed to the new format...

If @KoalaBear84 could point me in the general direction, I'd be willing to contribute a PR to offer the new format alongside the currently available ones.


On a different note:
Once the new format is implemented and tested, we could still keep the normal JSON available, because that's just easier to work with in most cases.
And even though putting out huge JSON files obviously isn't a problem for ODD, we could add a warning/confirmation "dialog" if the user tries to save a very large scan in JSON format, offering to save it in the new format instead.
And then for applications like our automated scanners, we could specify a flag so that the output is always in the new format and we don't need to worry about two formats :)


Chaphasilor commented on May 18, 2024

Ah okay, I took a quick look at it but didn't think it through xD

Makes sense 👌🏻


KoalaBear84 commented on May 18, 2024

No, this isn't any promise, as I see it's even more work than I thought. I have to rewrite a lot of code and handle all the parent/subdirectory things with database keys/IDs.

But I've made a start. What I expected was right: it does get slower, because for every URL it wants to process it checks whether it's already in the database.

Besides that, it needs to insert every directory and url on disk, which also takes time.

For now it looks like an open directory with 950 directories and 15,228 files goes from 9 seconds to 16 seconds scanning/processing time. But... that is still with all queueing in memory, and all of that has to be rewritten to use SQLite too.

So... I started it as a test, but 95% is yet to be done, and this already took 4 hours to check 😮


KoalaBear84 commented on May 18, 2024

Yes, HashSet is also the same in C#; I have used it before. Funnily enough, I didn't use it in the current implementation. I also want to make a 'resume' function, so that you can pause the current scan (because you need to restart the server/computer) and continue it later. HashSet is probably a good choice for this problem.
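
In principle, pause/resume comes down to persisting the visited set and the pending queue. A minimal sketch; the file name `scan-state.json` and its layout are made up for illustration and are not part of ODD:

```python
import json
import os

STATE_FILE = "scan-state.json"  # hypothetical file name, not part of ODD

def save_state(visited, queue):
    """Persist the visited-URL set and the pending queue so a scan can resume."""
    with open(STATE_FILE, "w") as f:
        json.dump({"visited": sorted(visited), "queue": list(queue)}, f)

def load_state():
    """Restore a previously saved scan state, or start fresh."""
    if not os.path.exists(STATE_FILE):
        return set(), []
    with open(STATE_FILE) as f:
        state = json.load(f)
    return set(state["visited"]), state["queue"]

# Demo round-trip: save, then restore.
visited = {"https://OD/", "https://OD/public/"}
queue = ["https://OD/public/docs/"]
save_state(visited, queue)
restored_visited, restored_queue = load_state()
os.remove(STATE_FILE)  # clean up the demo file
```

With the scan state in SQLite instead, the database file itself would serve as the resume point.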

Indeed, I have used "Data Source=:memory:" as well; it was my first test, and it inserted 10,000 URLs in 450 ms. Then I changed to using disk, which took 120 SECONDS for 10,000 URLs, but that was before performance optimizations 😇

I think it will be fast enough, especially when the OD becomes very big and we have no more memory issues. Writing the URLs file will also no longer depend on memory and will be a lot faster when the OD has a lot of URLs: we can just query the database, and everything will stream from database to file.
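
The "stream from database to file" part is the key win: iterating a cursor row by row keeps memory flat no matter how many URLs the scan produced. A sketch with illustrative table and column names:

```python
import sqlite3

# Build a small demo database (in ODD this would be the scan's SQLite file).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE urls (url TEXT)")
con.executemany(
    "INSERT INTO urls VALUES (?)",
    [(f"https://OD/file{i}.bin",) for i in range(3)],
)

# Write the URLs file by iterating the cursor: one row in memory at a time.
with open("scan-urls.txt", "w") as out:
    for (url,) in con.execute("SELECT url FROM urls ORDER BY url"):
        out.write(url + "\n")
```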

Refactored some more now; rewrote it all to the native SQLite library. Hopefully more news 'soon'.

Ahh, looks like the 5 SQLite library DLLs needed are only 260 kB, I expected MBs 😃


Chaphasilor commented on May 18, 2024

SQLite DBs can be in-memory. You can just write to your in-memory database, and then persist it to the disk once the scan is complete.

I believe our goal was to reduce memory usage, yes? 😉

Maybe a combination of HashSet and SQLite really is the way to go, combining speed with efficiency...

But I guess @KoalaBear84 knows best 😇


KoalaBear84 commented on May 18, 2024

Also linking this issue to #20 😜


KoalaBear84 commented on May 18, 2024

Hey @Chaphasilor and @MCOfficer

Yes, I of course ran into issues like this. The biggest JSON is 6.23 GB, which still works on my local machine with 48 GB RAM, but yes, it's bad 😂

JSON has some positive things, but the size and the RAM usage aren't among them. One thing that's still good is the parent/child structure, which you can't really do nicely in any line-based format like TXT/CSV.

Thought about this issue before and wanted to completely rewrite it all to use SQLite; then RAM also doesn't matter while still scanning. When I see the queue reach 100,000 I mostly stop scanning 🤣 The SQLite part is too much effort for NOW, but I will hopefully try just around New Year, no promises.

The Reddit part is already logged in History.log. Errors are not in a separate log, but that could maybe already be done by adding your own nlog.config; I'm not sure about that though, because I changed it recently for the single-file releases.

The TXT part I added (ugly) and will be released in some minutes :)

Already using CSV for other tools so I could also easily add that. Will do when I have some time.


Chaphasilor commented on May 18, 2024

No matter how you put it, it will be pretty hard to rebuild a nested structure from this data format, but that's what JSON is for.

I believe IDs might help us out:

Just rebuild only the structure of the OD with JSON, as compact as possible: include just the dir names along with a random ID, without all the files and meta-info, and put it as the first line.
It might be long, but even with thousands of subdirectories it should still be parsable.

And then below that, for each ID, add the necessary meta info.
If the type is dir, the ID refers to the actual directory.
If the type is file, the ID refers to the parent directory.

If I'm not missing something obvious, this should make it possible to rebuild the nested structure?
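
To make the idea concrete, such a file could look like the records below. All field names and IDs are invented for illustration; the first line holds the compact tree, and later lines are keyed by ID as described above.

```python
import json

lines = [
    # Line 1: the directory tree, names plus random IDs, nothing else.
    '{"id": "a1", "name": "/", "subdirectories": [{"id": "b2", "name": "public", "subdirectories": []}]}',
    # Following lines: metadata per ID. For a file, the ID is the parent dir's.
    '{"type": "dir", "id": "a1", "url": "https://OD/"}',
    '{"type": "file", "id": "b2", "name": "readme.txt", "size": 123}',
]

tree = json.loads(lines[0])
meta = [json.loads(l) for l in lines[1:]]

# Group file records under their parent directory's ID, so files can be
# attached to the tree in a single pass over the metadata lines.
files_by_dir = {}
for m in meta:
    if m["type"] == "file":
        files_by_dir.setdefault(m["id"], []).append(m)
```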


KoalaBear84 commented on May 18, 2024

True. I'll take a look at it another time. Too much going on right now with the homeschooling part as an extra added bonus 😂

Also want to rewrite to a SQLite version. Then it doesn't matter at all how big a directory is. Right now the whole directory structure is built up in memory exactly like it is in the JSON, and that's not particularly great for processing, as we have all experienced. Who would have imagined ODs with 10K directories and 10M files 🤣

And because SQLite has an implementation in nearly every language, it is portable.


Chaphasilor commented on May 18, 2024

And because SQLite has an implementation in nearly every language, it is portable.

Does that mean SQLite can dump its internal DB so we can import it somewhere else? Or how does SQLite help with exporting and importing scan results?


KoalaBear84 commented on May 18, 2024

SQLite is just a database; you can use a client for every programming language to read it, and import it wherever you want.


MCOfficer commented on May 18, 2024

But I've made a start. What I expected was right: it does get slower, because for every URL it wants to process it checks whether it's already in the database.

I assume you are using one SQLite DB for every scanned OD. In that case, you could maintain a HashSet (or whatever C#'s equivalent is) of all visited URLs, which is significantly faster to check against.
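
The idea in Python terms (C#'s HashSet<string>.Add similarly reports whether the item was new); a toy sketch, not ODD code:

```python
visited = set()  # in-memory set of already-seen URLs

def should_process(url):
    """True the first time a URL is seen; afterwards False, with no DB query."""
    if url in visited:  # O(1) average-case hash lookup, no disk round trip
        return False
    visited.add(url)
    return True
```

The database stays the source of truth; the set is only a fast membership cache in front of it.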

Besides that, it needs to insert every directory and url on disk, which also takes time.

This idea may be more complicated, and I lack the experience to judge it, but:
SQLite DBs can be in-memory. You can just write to your in-memory database, and then persist it to the disk once the scan is complete.

This may also make the HashSet unnecessary, as in-memory DBs are typically blazingly fast. I'm not sure how much faster they are when reading though, because reading can be cached quite efficiently.
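
A sketch of that pattern using SQLite's backup API, which is exposed in Python as sqlite3.Connection.backup (equivalents exist in other bindings); table and file names are illustrative:

```python
import os
import sqlite3
import tempfile

# Scan into an in-memory database...
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE urls (url TEXT)")
mem.execute("INSERT INTO urls VALUES ('https://OD/test.txt')")
mem.commit()

# ...then persist the whole thing to disk in one pass once the scan is done.
path = os.path.join(tempfile.gettempdir(), "scan-demo.sqlite")
disk = sqlite3.connect(path)
mem.backup(disk)  # copies the complete in-memory DB into the file
disk.close()

# Reopen the file to confirm the data survived the round trip.
check = sqlite3.connect(path)
count = check.execute("SELECT COUNT(*) FROM urls").fetchone()[0]
check.close()
```

The trade-off, of course, is that a purely in-memory database gives up the low-RAM benefit until the final persist.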


KoalaBear84 commented on May 18, 2024

Hmm. For the performance optimization I use a "write-ahead log"; this works great, but 'pauses' every 500 or 1000 records/inserts.
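
For reference, the write-ahead-log setting plus batched commits can be sketched like this; the batch size of 500 mirrors the numbers above, and the table name is invented:

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.gettempdir(), "wal-demo.sqlite")
if os.path.exists(path):
    os.remove(path)  # start from a clean database for the demo

con = sqlite3.connect(path)
# WAL: inserts append to a log instead of rewriting main-file pages on
# every transaction, which is much cheaper for insert-heavy workloads.
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]

con.execute("CREATE TABLE urls (url TEXT)")
BATCH = 500  # commit every 500 inserts; each commit is where a 'pause' lands
for i in range(1200):
    con.execute("INSERT INTO urls VALUES (?)", (f"https://OD/f{i}",))
    if (i + 1) % BATCH == 0:
        con.commit()
con.commit()  # flush the final partial batch
con.close()
```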

I was thinking I might want to have some sort of queue for inserting the files: process directories on the fly and do the files in a separate thread. That way we can maybe have both.

Also a note for myself 😃


Chaphasilor commented on May 18, 2024

Coming back to this issue: did I understand correctly that when using disk-based SQLite, the memory usage would be near zero, no matter how large the OD?
This would indeed be a very compelling argument, but it should probably be optional unless it's really needed. Using up a few MBs or even GBs to scan twice as fast might be useful in some cases...


maaaaz commented on May 18, 2024

I also want to make a 'resume' function, so that you can pause the current scan (because you need to restart the server/computer) and continue it later.

Yes, a resume feature would be awesome!


KoalaBear84 commented on May 18, 2024

Well... I sort of gave up on resume. Currently the whole processing of URLs depends on all data being in memory, because it looks 'up' 4 levels of directories to see if we are in a recursive/endless loop...
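
That loop check could be approximated like this. This is a heuristic sketch only; ODD's actual 4-level check differs in detail:

```python
from urllib.parse import urlparse

def looks_recursive(url, depth=4):
    """Flag a URL whose last `depth` path segments contain repeats,
    which usually indicates a recursive / endless directory."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    tail = segments[-depth:]
    return len(tail) == depth and len(set(tail)) < depth
```

A purely local check like this needs only the URL itself, but catching more elaborate loops is why the real implementation keeps the directory data in memory.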

It's very hard to rewrite everything, which costs a lot of time that I don't have / don't want to spend. 🙃

