Multithreading about borg HOT 93 OPEN

borgbackup avatar borgbackup commented on May 22, 2024 47
Multithreading

from borg.

Comments (93)

lassepe avatar lassepe commented on May 22, 2024 28

@abebeos @boris22x. You are flooding the inbox of 21+ people that are subscribed to this issue. Please: stay on topic and watch your language. Both of you are being unnecessarily rude and aggressive in this thread.

Your behavior does not contribute to this issue being fixed any earlier; quite the opposite: with your behavior you make this thread unpleasant to read and even more unpleasant to work on. Open-source developers are humans and most people don't feel encouraged to work on something because of a random person's rant. Please, don't be that open-source user.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024 17

Releasing 1.2 is the next goal, 1 milestone ticket still open.
After that there will be fixing of whatever was not discovered in beta/rc.
Then, some crypto improvements and then multithreading, see milestones.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024 13

No.

from borg.

seeker avatar seeker commented on May 22, 2024 12

To any repository owners / moderators: Perhaps hide off topic comments on this issue (including my own responses) to keep it clean?

from borg.

bluet avatar bluet commented on May 22, 2024 8

No updates for a while here, hope everything's going well.
Now people have more CPU cores but are still stuck on one core.

(just added some more to the bounty)

from borg.

deathtrip avatar deathtrip commented on May 22, 2024 8

hashbackup looks actually very nice (especially the -p0 option to... switch off multithreading). Why do you insist on enhancing borg-backup? What does it have that hashbackup hasn't?

Hashbackup is proprietary software, with a bunch of marketing speak and no documentation on the inner workings of it on their website.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024 7

If you are here just because of the bounty, I guess you better search for more attractive bounties elsewhere.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024 7

multithreading (and maybe other parallelism) is next milestone after borg2.

from borg.

boris22x avatar boris22x commented on May 22, 2024 6

I have multithreaded (hopefully) borg-compatible backups working in my implementation
source: #37 (comment)

@l29ah, impressive. I assume that it would take less time to validate your implementation than to tweak borg-backup towards multithreading.

@boris22x, hashbackup looks actually very nice (especially the -p0 option to... switch off multithreading). Why do you insist on enhancing borg-backup? What does it have that hashbackup hasn't?

After the comment from Ronny I do not plan to spend a single $ supporting borgbackup. I wanted to do it out of nostalgia, but if someone is just blind and tells people to f*ck off when they point out that the software does not meet 2023 requirements, it is not worth a single cent.

from borg.

FelixSchwarz avatar FelixSchwarz commented on May 22, 2024 6

Nah, I was just here to work on some bounties, but borg-backup is a sinking ship.

Terribly amateurish processes and contributors.

Me following @boris22x, and out of here.

Thank you @abebeos for making your point. Now I think we can go back to actually using this issue to track progress/share information about borgbackup's multithreading implementation.

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024 5

There is quite a nice bounty on this! I'd like to cash in on it. Can anyone who has been active for a while summarize what needs to be done? I have some quarantine time on my hands that I would like to monetize :P

from borg.

duven87 avatar duven87 commented on May 22, 2024 5

What is the status of the issue?

from borg.

RonnyPfannschmidt avatar RonnyPfannschmidt commented on May 22, 2024 5

It's rare to see such malevolent and misinformed comments.

from borg.

boris22x avatar boris22x commented on May 22, 2024 5

I stand by my comment and it's abundantly clear that spending any more time on the perpetrators is a waste of it.

It is demotivating to support you in any way if you ignore basic functionality requests and you think in 2023 there is no need for multicore implementation. But as mentioned above I already moved on due to this limitation.

from borg.

boris22x avatar boris22x commented on May 22, 2024 5

hashbackup looks actually very nice (especially the -p0 option to... switch off multithreading). Why do you insist on enhancing borg-backup? What does it have that hashbackup hasn't?

Hashbackup is proprietary software, with a bunch of marketing speak and no documentation on the inner workings of it on their website.

While borgbackup is stuck with the functionality in the year 2000, hashbackup is trying to solve the problems users have. If I cannot use borgbackup to backup my data at home or in the company I will look for another working solution. And I wonder why you keep ignoring your users, but it is your choice guys. Good luck with that approach and continue to use your single core CPUs and floppy drives... ;) Unsubscribing from the discussion.

from borg.

seeker avatar seeker commented on May 22, 2024 5

@FelixSchwarz Might be an idea to flag/report abebeos comments. Only contributors, collaborators and owners seem to be able to do this: Reporting a comment

from borg.

alexander-rieder avatar alexander-rieder commented on May 22, 2024 4

Any news on this?

from borg.

wzyboy avatar wzyboy commented on May 22, 2024 3

Regarding large (in size and/or in number of files) backups. Here is my use case.

In the past, I only used BorgBackup for not-very-busy servers. But recently I decided to try to back up a CI server. With lots of node_modules/ (Node.js) and vendor/ (Go) directories, the server has tons of small files.

The initial backup (~42 GiB) took only 42 minutes. That's about 1 GiB per minute. Considering the large number of files and network transfer time (I'm backing up from a data center near San Jose, CA to AWS us-east-1 N. Virginia region over SSH), it's quite impressive.

Here is the most recent backup:

Time (start): Tue, 2018-03-20 01:46:24
Time (end):   Tue, 2018-03-20 02:06:34
Duration: 20 minutes 9.91 seconds
Number of files: 6134998
Utilization of max. archive size: 2%

Even with a single core, the speed is acceptable. However, the fact that only one of the many cores of the CI server reaches 100% usage during the backup makes me think "what a waste of CPU cores" :-)
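For a rough sense of scale, the figures quoted above can be turned into throughput numbers. This is only back-of-the-envelope arithmetic on the values from the comment; the variable names are made up for illustration, and the files/s figure describes an incremental run in which most files are presumably unchanged.

```python
# Figures taken from the backup statistics quoted above.
files = 6_134_998
duration_s = 20 * 60 + 9.91          # "20 minutes 9.91 seconds"

files_per_second = files / duration_s
print(f"{files_per_second:.0f} files/s")

# Initial backup: ~42 GiB in 42 minutes, i.e. about 1 GiB per minute.
initial_mib_per_s = (42 * 1024) / (42 * 60)
print(f"{initial_mib_per_s:.1f} MiB/s")
```

That works out to roughly five thousand files per second for the incremental run and about 17 MiB/s for the initial one, all on a single core.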

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024 3

We're shortly after the borg 1.2 "hydrogen" release, so current focus is on:

  • fixes for 1.2.x
  • work on next milestone "helium" (mostly crypto and some more cleanups, likely borg 1.3 when released)

That work will take quite a while, so planning for the next milestone, "lithium", which comes after that, is still some time away.

I intentionally split the multithreading stuff off into a separate milestone because it will likely cause quite a lot of changes and also potential bugs, and it benefits from some crypto improvements done before it. Not having too much stuff in one milestone also makes more frequent releases possible (see how long 1.2.0 took after 1.1.0).

from borg.

boris22x avatar boris22x commented on May 22, 2024 3

We made a very simple decision. Instead of trying to deal with low Borg backup performance caused by the single-core implementation, we switched to another solution that is able to utilize all CPU cores. I made the same decision for my personal backups. I have to agree with @abebeos: having a single-core implementation in 2023 is a project-management issue, where the project management clearly does not understand user requirements in 2023. If the Borg backup project management thinks that a single-core implementation is sufficient in 2023, then Borg backup could be used mostly by home users to back up a few GBs of data, but not by power users who need to back up several TBs of data, and definitely not by enterprises that need to back up several hundred TBs of data on a daily basis. @RonnyPfannschmidt I have no idea who defines priorities for Borg backup, but I would rather see a multi-core implementation as a priority than, e.g., the recompress functionality. This is how I feel as a user who left Borg backup due to the missing multi-core implementation. Good luck!

from borg.

infectormp avatar infectormp commented on May 22, 2024 3

@boris22x CPU with 128 cores is expensive. Have you supported borg in any way, apart from criticism?

from borg.

boris22x avatar boris22x commented on May 22, 2024 3

@boris22x CPU with 128 cores is expensive. Have you supported borg in any way, apart from criticism?

So if a user will tell you that you have a problem you will tell them to F*CK OFF as they have not paid you? Do you realize that on your main project site, there is no support button that would allow anyone to send you i.e. money via Patreon or any similar platform? Do you realize that if I go to "Pay services" (https://www.borgbackup.org/support/commercial.html) I still do not see a way to support the project? I am using Hetzner, I was using borgbase.com (until I stopped because of the missing multi-core implementation), whether you get anything from them I have no idea. I understand that you feel criticized, but not everyone is a developer that could work on the project on the DEV site. And if you guys do not have a Support us by paying money on your main website then why do you complain about missing support?

from borg.

l29ah avatar l29ah commented on May 22, 2024 3

I have multithreaded (hopefully) borg-compatible backups working in my implementation
source: #37 (comment)

@l29ah, impressive. I assume that it would take less time to validate your implementation than to tweak borg-backup towards multithreading.

Try it. I think so as well, but i'm not a Python developer.

from borg.

enkore avatar enkore commented on May 22, 2024 2

Just an idea

Traditional multi-threading (shared data, locking, queuing etc.) is often neither simple to develop nor test. I have been using ZeroMQ to effectively multi-thread things in a few situations now, and feel that it makes things much easier with essentially no relevant overhead (for inproc://), especially when using zerocopy.

I understand that in borg files are essentially processed by a pipeline of algorithms:

reading => chunking => hashing / encrypting => storage

This might lend itself well to the pipeline pattern in zmq, e.g. every one of these stages is handled by one or more nodes (=threads). Work distribution, queuing etc. are handled by zmq transparently.

This zmq approach to multithreading is not unlike microservices. It would probably mean extensive refactoring for Borg. OTOH it would lead to Borg scaling very well with available CPU resources.

from borg.

knutov avatar knutov commented on May 22, 2024 2

What is the current state of this issue?

With borg create --compression auto,zstd,3 on modern hardware (ssd) backup process speed is limited by speed of one cpu core. And it can be much faster on multiple cores.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024 2

See the milestones. Next release (1.2) will not yet have it.

It's quite a lot of work to restructure borg for this, better funding would be desirable also.

from borg.

br-olf avatar br-olf commented on May 22, 2024 2

Do you need help on this issue?

from borg.

l29ah avatar l29ah commented on May 22, 2024 2

Ouch.
Now it seems like rewriting borg in haskell would be easier than implementing the multithreading. At least, it seems like a manageable task for $1500.

from borg.

l29ah avatar l29ah commented on May 22, 2024 2

I have multithreaded (hopefully) borg-compatible backups working in my implementation (https://github.com/l29ah/hyborg), but i got too lazy to write integration tests to verify that it backs up all the corner cases correctly just as borg itself. You can try it if you feel like.

from borg.

RonnyPfannschmidt avatar RonnyPfannschmidt commented on May 22, 2024 2

I stand by my comment and it's abundantly clear that spending any more time on the perpetrators is a waste of it.

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024 1

I'm ready to start work on it. I've sent Borg an email and waiting for a confirmation, but have yet to receive that.

from borg.

FelixSchwarz avatar FelixSchwarz commented on May 22, 2024 1

I'm not TW (obviously) but my impression was that this issue ("multithreading") is mostly an epic "umbrella" issue. There are a lot of nuances involved (and probably some larger changes to borg's internal architecture).

I'm pretty sure the current bounty (USD 1500) is very very low compared to the regular rates of a professional software developer who needs to spend time on this (otherwise @ThomasWaldmann and others would have solved this a long time ago).

From what I can see #929 is a better place to start. The first step would be to define realistic intermediate goals (there are several operations which could benefit from using multiple CPU cores) and agree on ways to measure the impact.

from borg.

enkore avatar enkore commented on May 22, 2024 1

Implementing multithreading isn't actually that much work, my prototype which mostly worked fine from a few years ago should still be in my repo – just a few hundred lines of diff or so. But actually making it go faster, especially when processing small chunks, is way more involved (pure Python isn't going to cut it, especially these days). IIRC it was about 20-30 % faster while using about 100 % more CPU when processing large files, and 20 % or so slower when processing small files. That would have been on a quad-core Xeon.

The design discussed 2016-2017 should be solid and scalable to 8+ cores, but many threads will need to run their consume/produce-loops without the GIL, i.e. only in native code. It's basically just a lot of plumbing and testing work.

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024 1

I reacted to this one because from all the ones i checked out, this one seemed the fastest to accomplish. Just curious though, this bounty has been active for 5 years, is it never meant to be solved?

If you are here just because of the bounty, I guess you better search for more attractive bounties elsewhere.

from borg.

FabioPedretti avatar FabioPedretti commented on May 22, 2024 1

Interesting approach by restic to scale tasks based on whether an operation is CPU or IO-bound: restic/restic#3611

from borg.

boris22x avatar boris22x commented on May 22, 2024 1

I understand the possible practical problems with a single-core implementation: it is not feasible to back up such huge repositories. But with a multicore system (28 or even 64 cores), a lot of RAM (we have a minimum of 2TB here) and fast storage, I do not really see the practical problems.
https://borgbackup.readthedocs.io/en/latest/faq.html#usage-limitations - nothing here
https://borgbackup.readthedocs.io/en/latest/internals/data-structures.html#indexes-caches-memory-usage - memory usage is explained here
https://borgbackup.readthedocs.io/en/1.0.11/internals.html - if I assume the medium size of an item entry is 2kB (~100MB size files or more ACLs/xattrs), the limit will be ~32 million files/directories per archive, resulting in a maximum archive size of 3PB+, which is sufficient for an enterprise backup. And yes, we can have thousands of small repositories because of software limitations, or we can use the other software as before.
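The 3PB+ estimate above can be checked with a few lines of arithmetic. This sketch only multiplies the two figures from the comment (~32 million items, ~100MB per file); decimal units are an assumption, and the variable names are illustrative.

```python
max_items = 32_000_000          # ~32 million files/directories per archive
avg_file_size = 100 * 10**6     # ~100 MB per file, decimal units (assumption)

total_bytes = max_items * avg_file_size
total_pb = total_bytes / 10**15
print(f"{total_pb} PB")
```

With those inputs the product is 3.2 PB, which matches the "3PB+" quoted in the comment.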

from borg.

boris22x avatar boris22x commented on May 22, 2024 1

@boris22x and which solution are you using now?

To be constructive here - we switched to hashbackup, which seems to be able to utilize more than a single core.

from borg.

boris22x avatar boris22x commented on May 22, 2024 1

Do you realize that on your main project site, there is no support button that would allow anyone to send you i.e. money via Patreon or any similar platform?

https://github.com/borgbackup/borg#helping-donations-and-bounties-becoming-a-patron

https://www.borgbackup.org/ - This is your main project site you get by google search "borgbackup", github is the DEV site. If not, then sorry, again, there is bad communication towards the users.

Sorry, I found the button there...

from borg.

boris22x avatar boris22x commented on May 22, 2024 1

@ThomasWaldmann - what would be the $$$ amount you need to get the multi-core support done?

from borg.

enkore avatar enkore commented on May 22, 2024 1

summary where multithreading [or its plans] stand as of 2023?

Nowhere in particular (I think). Note that most discussions and plans, and also the prototypes, are for borg create only, not other operations like borg prune (which can also be parallelized to some extent, but has non-trivial time/space trade-offs and other issues, primarily prefetching).

from borg.

struthio avatar struthio commented on May 22, 2024 1

multithreading (and maybe other parallelism) is next milestone after borg2.

Any time estimate ?

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

About the question in the commit comment about longer "user time" when running multithreaded compared to single-threaded:

The answer might be that this was a dual core cpu with hyperthreading. While this looks like 4 cores to the OS (and Python), when really using it like 4 cores, each of these cores might appear as a bit slower CPU than when just using a single core, which might explain the longer "user time" needs.

It still was quite faster on the wall clock, so we don't worry.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

Some things are TODO there:

  • remote repo tests are broken (and thus currently disabled)
  • that delayer thread is strange, better solution?
  • likely AES counter uniqueness is broken
  • needs way more analysis / optimization
  • fix the tests, they are quite broken / hang right now

Also, the bounty should be higher, this is quite a lot of work.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

Interesting idea. Can you say what the pros of this approach are compared to Python's Queue module (which is in stdlib and threadsafe)? The Queue module is what the multithreading code uses currently.
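For readers unfamiliar with the stdlib approach mentioned here, a minimal sketch of a staged pipeline built on `queue.Queue` could look like the following. It is not borg's code: the stage functions, the fixed-size chunker (borg uses a content-defined chunker) and the in-memory `store` dict are all hypothetical stand-ins, chosen only to show how bounded queues connect reading, chunking, hashing and storage.

```python
import hashlib
import queue
import threading

DONE = object()  # sentinel marking the end of the stream


def feed(blobs, outbox):
    """'Reading' stage: push raw file contents into the pipeline."""
    for blob in blobs:
        outbox.put(blob)
    outbox.put(DONE)


def stage(func, inbox, outbox):
    """Generic stage: apply func to each item and forward every result."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate shutdown to the next stage
            return
        for result in func(item):
            outbox.put(result)


def chunk(data, size=4):
    # Fixed-size chunking, a simplification for illustration only.
    for start in range(0, len(data), size):
        yield data[start:start + size]


def hash_chunk(piece):
    yield hashlib.sha256(piece).hexdigest(), piece


def run_pipeline(blobs):
    # Bounded queues give back-pressure between stages.
    q_read, q_chunks, q_hashed = (queue.Queue(maxsize=16) for _ in range(3))
    threads = [
        threading.Thread(target=feed, args=(blobs, q_read)),
        threading.Thread(target=stage, args=(chunk, q_read, q_chunks)),
        threading.Thread(target=stage, args=(hash_chunk, q_chunks, q_hashed)),
    ]
    for t in threads:
        t.start()
    store = {}  # 'storage' stage: deduplicate by chunk hash
    while True:
        item = q_hashed.get()
        if item is DONE:
            break
        digest, piece = item
        store.setdefault(digest, piece)
    for t in threads:
        t.join()
    return store
```

Calling `run_pipeline([b"aaaabbbb", b"aaaacccc"])` stores three unique chunks, since the shared `b"aaaa"` chunk is deduplicated. Note that pure-Python stages like these still contend for the GIL, so a layout like this only pays off when the heavy work happens in native code.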

from borg.

enkore avatar enkore commented on May 22, 2024

Ah I see how you approached it, it is conceptually already very close. I also see the "delayer problem" now.

On a technical level the zmq approach is language agnostic, so it's [it can be] straightforward to implement entire stages in C with no Python/GIL involvement at all (and the ability to still do this zerocopy: separate, pure C-thread runs a node in the same process, so inproc:// can be used between Python and C parts, without ever copying the data).

from borg.

stilsch avatar stilsch commented on May 22, 2024

Regarding compression:
We are using https://github.com/madler/pigz for parallel compression of petabytes of archives. Maybe it could help here. We never had an issue - and the performance on multicore systems is awesome!

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

@schuft69 please see the ticket about compression about why this is not that simple in case of borg.

from borg.

enkore avatar enkore commented on May 22, 2024

Each chunk (0 byte ... 8 MB) is compressed independently, so parallel compressors tend to not work well there; instead we'll run separate stages (hashing, encryption, compression) in parallel and maybe multiple operations in parallel as well (e.g. compressing multiple chunks independently using independent compressors in parallel).

We're working on drafts regarding this topic; we will put more work forward here once 1.1 reaches freeze / RC status.
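To illustrate the "compress multiple chunks independently in parallel" idea, here is a small sketch using only the standard library. It is an assumption-laden stand-in, not borg's design: `zlib` substitutes for borg's compressors, and `compress_chunks` and its parameters are hypothetical names.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_chunks(chunks, workers=4, level=6):
    """Compress each chunk on its own, several at a time.

    Every chunk gets an independent zlib stream, so any chunk can later
    be decompressed without the others. pool.map preserves input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: zlib.compress(c, level), chunks))


chunks = [b"a" * 1000, b"b" * 2000, b""]
compressed = compress_chunks(chunks)
restored = [zlib.decompress(c) for c in compressed]
```

Threads can help here because the compression itself runs in native code; how much it helps for very small chunks is exactly the open question raised in this thread.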

from borg.

sjuxax avatar sjuxax commented on May 22, 2024

The multithreaded branch hasn't been updated since March 2016, about 1.25 years, and is missing about half of the commits in the current HEAD. How much of that approach remains viable?

Lack of some MT support makes borgbackup basically unusable for me. Even basic support would make things much more bearable (for example, compressing multiple chunks simultaneously).

I'd like to make contributions to help, but I'm not sure where to start right now. What's the current status on MT draft/direction?

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

@sjuxax that was just an experiment and should not be used.

real, better working MT is planned for borg 1.2.

coding work on 1.2 has not started yet, we are still finishing 1.1. there is some planning, see milestones and tickets here.

from borg.

enkore avatar enkore commented on May 22, 2024

The multithreading branch was never intended to be merged; it was one of now several tests. The Borg 1.2 entry in the (project management) wiki might be of interest. Some further tests and planning have been conducted, but partly not published yet (some of that stuff is not in English and has to be translated / rewritten first).

The current plan (this is from March) looks basically like this:

[Image: borg-mt1, diagram of the planned multithreaded pipeline]

Grey boxes are individual actors / threads. Arrows indicate channels: orange are queue-like, green is RPC, violet is queue-like for metadata and blue is the same for errors.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

@sjuxax btw, why is it unusable without multithreading for you?

from borg.

sjuxax avatar sjuxax commented on May 22, 2024

I'm mainly interested in using it to back up large, frequently-changing targets. Even if the job would normally complete without wrapping around on itself (i.e., the prior job wouldn't finish before the next interval), it's not worth the extra cost and risk to keep the backup going in the background for much longer than it would take other backup targets to complete the job.

If I was backing up around 10G of stuff, it probably wouldn't be a big deal. I'm interested in backing up things that are much larger, almost always >= 100G with many small files and some very large files (5-10G+). I have not tested lately, but back in April when I last tested, I definitely felt it was too slow for regular use.

As an example, I'm backing up a 5T filesystem right now (from disks that like to choke every few hours). While I recognize other limitations in borgbackup may make a 5T backup implausible, it would be convenient to be able to use it for more common tasks in the 100G-500G range, but without multithreading, it's just too slow for me (especially the compression stage; I'm not encrypting within borgbackup).

I've tried several things over the last little bit, including SquashFS, fsarchiver, and others. Right now I am using ZFS on loopback with compression and deduplication enabled and rsyncing the tree over. That seems to be working the best, and it makes it easy to resume the process when the disks hang.

from borg.

enkore avatar enkore commented on May 22, 2024

5T works quite well; the largest backup set I know of that was "publicly admitted" to use Borg is >40T. What's clear of course is that Borg 1.0 and 1.1 have fundamental limitations on their processing speed... whether this is a nasty problem for a big backup set has more to do with its workload and less with the sheer size of it. Fast-changing sets require the most processing, while other workloads are usually less problematic. Files that don't change are faster to process than files that do; bigger files are slower than smaller files (<=one chunk) etc.

But yeah, I'm aware of these limitations and I'll / we'll try to address them with 1.2.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

Well, you need to consider the amount of change between backups. If it is too slow to backup the changes, yeah, then you have a problem.

The total amount (as long as it is within the limitations) doesn't matter much, borg skips unchanged files rather quickly (be careful whether inode numbers are stable for your fs and always mount at same mountpoint).

Assuming you have enough backup space (after considering how much you save due to historical dedup), you can use lz4 compression (you won't need multiple threads then to get fast compression).

Encryption is also fast, IF you have AES-NI.

It's not perfect yet (thus our plans for 1.2), but quite ok - not only for small backups.

from borg.

enkore avatar enkore commented on May 22, 2024

Here's my prototype from March which roughly corresponds to the current plan.

Easy refactorings like FSOProcessors and MetadataCollector seem prudent to me for a start.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

hmm, guess it makes more sense if you do a PR with that and I review it. And you could finish reviewing the crypto-aead stuff, so we can merge it first, so I don't need to rebase it again / fix conflicts again.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

btw, the branch has only a few changesets with huge changes. I imagined somewhat smaller steps / a lower-risk approach.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

work on this (besides planning, experiments) has not started yet and likely some crypto stuff needs to be solved first (e.g. avoiding AES counter mode issues).

from borg.

br-olf avatar br-olf commented on May 22, 2024

OK sure.

Anyways here is my idea:
The chunker should fill multiple chunk queues, one for each thread we want to spawn.
Then we can process each queue with a separate thread.
This way we can parallelize the whole pipeline (hashing, cache lookup, compressing and maybe even encrypting).

This would require the cache to be thread-safe or read-only and will probably result in RAM demands scaling with the number of threads.
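A rough sketch of that idea, under heavy assumptions: one queue per worker thread, round-robin distribution standing in for the chunker, and a lock-protected set standing in for the cache. All names (`worker`, `backup`, `seen`, `stored`) are hypothetical, and `zlib` stands in for the real compressor.

```python
import hashlib
import queue
import threading
import zlib


def worker(inbox, seen, lock, stored):
    """Hash, look up, and compress every chunk from this worker's own queue."""
    while True:
        chunk = inbox.get()
        if chunk is None:          # sentinel: no more chunks for this worker
            return
        digest = hashlib.sha256(chunk).digest()
        with lock:                 # the shared cache must be thread-safe
            if digest in seen:
                continue           # duplicate chunk, nothing to store
            seen.add(digest)
        compressed = zlib.compress(chunk)
        with lock:
            stored[digest] = compressed


def backup(chunks, n_threads=4):
    # One unbounded queue per thread; this is where RAM use can grow
    # with the number of threads, as noted in the comment above.
    queues = [queue.Queue() for _ in range(n_threads)]
    seen, stored, lock = set(), {}, threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(q, seen, lock, stored))
        for q in queues
    ]
    for t in threads:
        t.start()
    for i, chunk in enumerate(chunks):   # round-robin in place of the chunker
        queues[i % n_threads].put(chunk)
    for q in queues:
        q.put(None)
    for t in threads:
        t.join()
    return stored
```

For example, `backup([b"x" * 100, b"y" * 100, b"x" * 100])` returns two entries, because the repeated chunk is caught by the shared cache. The lock around the cache is the obvious contention point, which is one reason a read-only cache was suggested as an alternative.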

from borg.

Justinzobel avatar Justinzobel commented on May 22, 2024

Added some more cash to the bounty. Hope we can get this feature soon!

from borg.

Justinzobel avatar Justinzobel commented on May 22, 2024

I say if Jean wants to give it a go and has nothing better to do then give it their best. :)

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024

Yeah, but i am an actual engineer. Would deff solve for the bounty, even if it takes me a whole week. So i mailed asking for detailed info on completion conditions etc to solve it properly. I will start once i have a reply and a go ahead on this platform :)

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024

I'm not TW (obviously) but my impression was that this issue ("multithreading") is mostly an epic "umbrella" issue. There are a lot of nuances involved (and probably some larger changes to borg's internal architecture).

I'm pretty sure the current bounty (USD 1500) is very very low compared to the regular rates of a professional software developer who needs to spend time on this (otherwise @ThomasWaldmann and others would have solved this a long time ago).

From what I can see #929 is a better place to start. The first step would be to define realistic intermediate goals (there are several operations which could benefit from using multiple CPU cores) and agree on ways to measure the impact.

Yeah that was my first impression as well. It would be easier to have clear and concise goals to claim the bounty. While I am applying for jobs and interviews I could work on it. Seems right up my alley to be honest.

from borg.

FelixSchwarz avatar FelixSchwarz commented on May 22, 2024

Would deff solve for the bounty, even if it takes me a whole week.

I won't claim much knowledge about borg's internals (I contributed only tiny bits to borg) so maybe I'm wrong, but I am pretty confident that a week is not enough. I guess with all the required architectural discussions, getting agreement on the preferred approach, coding, testing and benchmarking this is easily a month of work (at least). I'd be happy to be proved wrong though :-)

That's why working on smaller steps towards "multithreading" would be more beneficial. IMHO the first approach is unlikely to fully implement the desired features but with multiple developers each contributing some pieces we might get a version of borgbackup which utilizes the power of multicore machines.

Anyway: Outsiders discussing this in this ticket is unlikely to get real progress. If @jean-phillipe88 wants to start working on this I propose you read the relevant tickets in this tracker and then come up with a high-level step-by-step plan of the changes (and probably some more in-depth description of the necessary changes of the first step).

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024

Would deff solve for the bounty, even if it takes me a whole week.

I won't claim much knowledge about borg's internals (I contributed only tiny bits to borg) so maybe I'm wrong, but I am pretty confident that a week is not enough. I guess with all the required architectural discussions, getting agreement on the preferred approach, coding, testing and benchmarking this is easily a month of work (at least). I'd be happy to be proved wrong though :-)

That's why working on smaller steps towards "multithreading" would be more beneficial. IMHO the first approach is unlikely to fully implement the desired features but with multiple developers each contributing some pieces we might get a version of borgbackup which utilizes the power of multicore machines.

Anyway: Outsiders discussing this in this ticket is unlikely to get real progress. If @jean-phillipe88 wants to start working on this I propose you read the relevant tickets in this tracker and then come up with a high-level step-by-step plan of the changes (and probably some more in-depth description of the necessary changes of the first step).

Yeah, I can see why you think a month would be the minimum duration using this approach, and also why the bounty is way too low to attract any serious interest. I would not really discuss any architectures or get agreements, but just push to a fork for review to see if the results are worth it.

I'm used to being hired to upgrade someone else's code, going in with zero knowledge beforehand. I've never worked with large groups trying to contribute, and I see a lot of problems: getting even small changes in would take forever using this approach.

But don't misunderstand me, i came here because of the bounty. If it is unlikely that i could earn it, i probably won't delve into it.

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024

I mean, this issue has been open for 5 years now. Is it ever meant to be marked as SOLVED?

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024

that is what i thought indeed.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

the plan to work on multithreading is for helium milestone, after doing crypto work.

current master branch is still hydrogen and not released yet, but in alpha.

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024

Does that mean the bounty is currently inactive?

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

There is no "inactive" state for bounties. Once they are set, they are active.

But that does not necessarily mean they can be done at any time (without investing a lot of time on also solving the prerequisites).

Also, there are better and worse defined / scoped bounties and for new contributors I would rather suggest working on some smaller scope bounty / project.

There are lots of open tickets and this one is definitely one of the harder to solve ones.

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024

That is why I sent you an email requesting a clear and concise definition of the prerequisites for fulfilling the bounty requirements. I am a pro developer, I came across this because of the bounty, and I am fairly confident in my ability to solve it.

Extra work on the way to it is possible, and if that work has bounties too, great; but honestly, I am a bounty hunter :)

from borg.

jean-phillipe88 avatar jean-phillipe88 commented on May 22, 2024

Did you get my email btw?

There is no "inactive" state for bounties. Once they are set, they are active.

But that does not necessarily mean they can be done at any time (without investing a lot of time on also solving the prerequisites).

Also, there are better and worse defined / scoped bounties and for new contributors I would rather suggest working on some smaller scope bounty / project.

There are lots of open tickets and this one is definitely one of the harder to solve ones.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

To all backers of the bounty on bountysource for this issue, you need to urgently get active or your backing may get lost to bountysource.

See #5230 for details (I posted a copy of my email there, which you can slightly modify and reuse for the email you need to write to them).

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

About my previous comment: no need to get active any more, this was resolved, see #5230.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

As a general comment:

This is no easy issue and thus not suited for new contributors who are unfamiliar with the codebase and the required changes that need to be implemented before even starting to work on this issue.

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

@l29ah guess you're way too optimistic, but of course it depends on how quick you are in haskell and what you expect per hour.

if it is 0 per hour, you'll have infinite time. :-)

from borg.

l29ah avatar l29ah commented on May 22, 2024

The documentation of internals of borgbackup is just too good, i have to give writing a parallel borg-compatible backup tool a try :)

from borg.

enkore avatar enkore commented on May 22, 2024

Do one better - learn from the issues documented in those docs and design something that avoids these issues ;)

from borg.

boris22x avatar boris22x commented on May 22, 2024

5T works quite well; the largest backup set I know of that was "publicly admitted" to use Borg is >40T. What's clear of course is that Borg 1.0 and 1.1 have fundamental limitations on their processing speed... whether this is a nasty problem for a big backup set has more to do with its workload and less with the sheer size of it. Fast-changing sets require the most processing, while other workloads are usually less problematic. Files that don't change are faster to process than files that do; bigger files are slower than smaller files (<=one chunk) etc.

But yeah, I'm aware of these limitations and I'll / we'll try to address them with 1.2.

I would love to use borg for all my backups, but I have ~300TB of data to be backed up and with the current single-core implementation it is just not feasible (weeks? months? no idea).

from borg.

flxai avatar flxai commented on May 22, 2024

@l29ah @ThomasWaldmann Thanks. Is it possible to build on this to create a PR?

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

@flxai I like Python and am not going to switch borgbackup to Haskell.

@boris22x I guess you don't want to put 300TB into a single repo. And if you have multiple/many repos, you can run multiple/many borg in parallel. Initial backup will still be an effort, but future backups will be quicker.
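The "multiple repos, multiple borg processes" idea could be sketched roughly as follows. This is only an illustration, not something from the borg project: the helper names `build_create_command` and `run_parallel_backups`, the repository paths, and the worker count are all hypothetical, and the command line assumes the borg 1.x `borg create REPO::ARCHIVE PATH` form. Threads are enough here because each one only waits on a separate borg process.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_create_command(repo, archive, source):
    """Build a borg 1.x style 'create' command (hypothetical helper)."""
    return ["borg", "create", f"{repo}::{archive}", source]


def run_parallel_backups(jobs, max_workers=4, runner=None):
    """Run one borg process per (repo, archive, source) job.

    Returns the exit codes in the same order as the jobs.
    'runner' can be swapped out for a dry run; by default it
    executes the command and returns its exit code.
    """
    if runner is None:
        runner = lambda cmd: subprocess.run(cmd).returncode
    commands = [build_create_command(*job) for job in jobs]
    # Each worker thread blocks on its own borg process, so the
    # processes themselves run in parallel on separate cores.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))


if __name__ == "__main__":
    # Assumed layout: one repository per data subset.
    jobs = [
        ("/backup/repo-a", "nightly", "/data/a"),
        ("/backup/repo-b", "nightly", "/data/b"),
    ]
    # Dry run: print the commands instead of executing borg.
    run_parallel_backups(jobs, runner=lambda cmd: print(" ".join(cmd)) or 0)
```

Each repository keeps its own lock, index and cache, so the processes do not contend with each other; the trade-off is that deduplication only happens within a repository, not across them.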

from borg.

boris22x avatar boris22x commented on May 22, 2024

Why not? The 300TB of data is highly compressible, so IMHO it makes sense to put it into one repository. If there are technical reasons against this, it would be good to state somewhere in the documentation that borgbackup supports backups up to xTB of data. The other solution I am using to back up the 300TB does the job well. The current single-core limitation prevents us from using borg on a bigger scale (~5PB of data).

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

In borg < 2.0, there is a documented archive size limitation (which is quite high, but check it yourself).

What I meant are rather practical problems, like e.g. borg check taking rather long for huge repos, borg client and server needing a lot of memory, repo locking, etc. - multiple smaller repos are just easier to deal with.

from borg.

infectormp avatar infectormp commented on May 22, 2024

@abebeos I suspect you think this is easy. So, you can fork the project, roll it back 8 years and start adding multithreading yourself. Also, please roll back all libraries and python itself to an 8-year-old state.

from borg.

infectormp avatar infectormp commented on May 22, 2024

@boris22x and which solution are you using now?

from borg.

boris22x avatar boris22x commented on May 22, 2024

@abebeos I suspect you think this is easy. So, you can fork the project, roll it back 8 years and start adding multithreading yourself. Also, please roll back all libraries and python itself to an 8-year-old state.

I do not think anyone thinks that this is easy, but speaking for myself, I feel that this has had no priority for 8 years. While we have CPUs with 128 cores, Borg backup can utilize only 1 core. If you do not see this as a priority, OK.

from borg.

infectormp avatar infectormp commented on May 22, 2024

@boris22x and what solution are you using now that utilises all CPU cores?

from borg.

infectormp avatar infectormp commented on May 22, 2024

Do you realize that on your main project site, there is no support button that would allow anyone to send you e.g. money via Patreon or any similar platform?

https://github.com/borgbackup/borg#helping-donations-and-bounties-becoming-a-patron

from borg.

boris22x avatar boris22x commented on May 22, 2024

Do you realize that on your main project site, there is no support button that would allow anyone to send you e.g. money via Patreon or any similar platform?

https://github.com/borgbackup/borg#helping-donations-and-bounties-becoming-a-patron

https://www.borgbackup.org/ - This is the main project site you get from a Google search for "borgbackup"; GitHub is the dev site. If not, then sorry, but again, this is bad communication towards the users.

from borg.

grinapo avatar grinapo commented on May 22, 2024

A few side notes (with apologies in advance to all the readers of this issue).

  • Issues (and possibly parts of issues) can be converted to, or redirected to, discussions. That may be a much better way to discuss topics that are related to this issue but do not directly affect it.
  • Hiding discussion comments may help onlookers (like myself) see how the actual issue is progressing.
  • Could someone summarize where multithreading [or the plans for it] stands as of 2023? I am right now looking at a prune that has been running at 100% CPU for more than 3 hours, with about 5% disk utilisation. (Granted, info says that the repo contains about 5PB of data, which is both amusing and scary; details are in #7766)

from borg.

ThomasWaldmann avatar ThomasWaldmann commented on May 22, 2024

borg2 might take quite a while if some storage related changes get implemented before release (see "breaking" label). but these changes likely would be also useful for more parallelism.

from borg.

debuglevel avatar debuglevel commented on May 22, 2024

Just a side note for the few people using borg on Windows:
some kind of parallelism might also greatly improve borg on Windows. I did not do any comparisons or benchmarks (so please be gentle :D), but NTFS seems to be painfully slow with single-threaded tools.

I'm also using Microsoft Robocopy (which seems to be a rather weird rsync clone) to synchronize around a million files once an hour, and it is actually quite fast (38 minutes). Turns out, it spins up 8 threads by default ;-).

from borg.
