Long clone times (manopt issue, closed)

jbriales commented on August 19, 2024
Long clone times

from manopt.

Comments (17)

NicolasBoumal commented on August 19, 2024

Hello Jesus,

Thanks for your input on this; it is a very good observation. I'm fairly illiterate when it comes to Git, so I appreciate any other input of this kind you may have.

I read up on submodules and github pages this morning. Here's my understanding:

  1. It would be easy to move /web and /releases to separate, public repositories while preserving the git history, using filter-branch. That would address your suggestion directly.

  2. As you pointed out, it is possible to include /web and /releases as submodules of the main manopt repository, which would have the advantage of making the main repo look exactly the same as it did before (this means I would not need to change the release scripts, some webpage links etc.), and my understanding is that when someone would clone manopt, that would not, by default, clone the contents of those submodules. But you advised against that approach -- can you tell me more about why?

  3. Another issue I've been pondering (without action) for a while is: the website manopt.org is hosted with OVH, and that host is blocked in China, which is a problem for many users. One possibility could be to move the website to GitHub Pages. My understanding is that the webpages of a github repo need to be in that repo to be available on that repo's website (seems reasonable). It seems having the website as a submodule is fine for that purpose. So, I'd be tempted to go down this path, but I'm interested in hearing what you think about (2) above first.
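A minimal sketch of option 1, run on a throwaway repository so that nothing real is touched. The folder name web, the file names and the commit identity are placeholders chosen for illustration. It uses the --subdirectory-filter option of git filter-branch, which keeps only the history of one folder and makes that folder the new root.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the deprecation pause in recent Git versions
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Build a toy repository with code at the root and a website under web/.
git init -q main
cd main
echo 'code' > solver.m
mkdir web
echo '<html>v1</html>' > web/index.html
git add .
git commit -q -m 'Initial code and website'
echo '<html>v2</html>' > web/index.html
git commit -q -am 'Update website'
echo 'more code' >> solver.m
git commit -q -am 'Update code only'
cd ..

# Work on a separate clone so the original repository stays untouched.
git clone -q main web-only
cd web-only
git filter-branch --subdirectory-filter web HEAD >/dev/null 2>&1

# Only the commits that touched web/ remain, with index.html at the root.
web_commits=$(git rev-list --count HEAD)
web_files=$(git ls-files)
echo "commits: $web_commits, files: $web_files"
```

The rewritten clone could then be pushed to a new, empty repository, while the original keeps its full history until it is cleaned separately.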

Thanks!
Nicolas

NicolasBoumal commented on August 19, 2024

Hello again Jesus,

I went ahead and did the following:

  1. I filter-branch'd /releases into a new repo, called manopt-releases; then I removed /releases from the main manopt repo. I did not add it back as a submodule, because we don't really need it for anything: it's just for safe-keeping. I also changed the release script accordingly.

  2. I created a new repo called m2html (that's the third-party package I use to generate Matlab documentation), moved the m2html code there (it was formerly in manopt/reference/m2html), and finally added m2html as a submodule of manopt. This means that now, by default, when you clone manopt it won't pull the m2html code (you'll just get an empty folder unless you explicitly pull from there). And that's just as well, because manopt users should never need that code anyway.
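The submodule behaviour described in item 2 can be reproduced on toy repositories. This is only a sketch: the repository names m2html-demo and main are stand-ins, and the protocol.file.allow setting is there because recent Git versions refuse local-path submodules by default.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Stand-in for the third-party package repository.
git init -q m2html-demo
( cd m2html-demo && echo 'doc generator' > m2html.m && git add . && git commit -q -m 'Add tool' )

# Stand-in for the main repository, which references the package as a submodule.
git init -q main
cd main
echo 'code' > solver.m
git add .
git commit -q -m 'Initial code'
git -c protocol.file.allow=always submodule --quiet add "$work/m2html-demo" reference/m2html
git commit -q -m 'Add m2html as a submodule'
cd ..

# A plain clone records the submodule but leaves its folder empty.
git clone -q main fresh
files_before=$(ls -A fresh/reference/m2html | wc -l | tr -d ' ')

# The submodule contents arrive only on explicit request.
( cd fresh && git -c protocol.file.allow=always submodule --quiet update --init )
files_after=$(ls fresh/reference/m2html)
echo "before: $files_before entries, after: $files_after"
```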

Having removed /releases already reduces the size of the repo quite a bit. Admittedly, /web is the bigger one: I didn't make any changes to that one yet because the ramifications run deeper. Hoping you'll have some insight to share on that, re my previous message.

Best,
Nicolas

NicolasBoumal commented on August 19, 2024

Hello Jesus,

I now filter-branch'd /web out of the manopt repo and into a new repo called manopt-web.

For now, I also activated GitHub Pages on manopt-web to see how it works. If it works well, I will try to redirect manopt.org to the corresponding url, i.e., https://nicolasboumal.github.io/manopt-web/ -- still need to figure that out.

I deleted /web from the main manopt repo, and it seems that there is no need to add it as a submodule. The reason I was considering that is because it would have given a nicer url, namely, https://nicolasboumal.github.io/manopt/ (not manopt-web), but in retrospect I think I now understand that the proper way would have been to create a repo called manopt.github.io instead of manopt-web. If I can manage to connect manopt.org to it anyway, it doesn't matter.

One observation though: it seems that when I clone the manopt repo, it's still large even though I removed /web and /releases. I suspect that's because the history of those folders is still in there, and that gets cloned as well. Is there any way to deal with that?
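The suspicion can be checked on a toy repository: deleting a folder in a new commit removes it from the working tree, but the objects stay reachable through earlier commits. The folder and file names below are placeholders.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

mkdir web
echo 'large website asset' > web/asset.txt
echo 'code' > solver.m
git add .
git commit -q -m 'Code and website'

# Remove the folder the ordinary way, in a new commit.
git rm -r -q web
git commit -q -m 'Remove website from the working tree'

# The folder is gone from the current tree...
in_tree=$(git ls-files web | wc -l | tr -d ' ')
# ...but the file is still listed among the objects reachable from history.
in_history=$(git rev-list --objects --all | grep -c ' web/asset.txt$')
echo "tracked now: $in_tree, still in history: $in_history"
```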

Best,
Nicolas

NicolasBoumal commented on August 19, 2024

So, apparently, that thing about naming the repo manopt.github.io might be incorrect: that might be only for personal / organization webpages, as opposed to project webpages which is the case here.

It seems that the proper way might have been to simply create a new branch called gh-pages, and to commit the website only to that branch rather than to the master branch. If clones indeed only copy the master branch, then that would avoid copying the website documents. Not sure if that's accurate. In any case, having things in two different repos is not a big deal.
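On the open question of whether clones copy only the master branch: as far as I know, a plain git clone fetches every branch of the remote, and only the --single-branch option restricts it to one. The sketch below checks this on a toy repository; the repository and file names are placeholders.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Toy remote with code on the default branch and a website on gh-pages.
git init -q origin-demo
cd origin-demo
echo 'code' > solver.m
git add .
git commit -q -m 'Code'
main_branch=$(git symbolic-ref --short HEAD)
git checkout -q -b gh-pages
echo '<html></html>' > index.html
git add .
git commit -q -m 'Website'
git checkout -q "$main_branch"
cd ..

# Default clone versus a clone restricted to the remote's default branch.
git clone -q origin-demo full
git clone -q --single-branch origin-demo slim

full_has=$(git -C full branch -r | grep -c 'origin/gh-pages' || true)
slim_has=$(git -C slim branch -r | grep -c 'origin/gh-pages' || true)
echo "gh-pages in default clone: $full_has, in single-branch clone: $slim_has"
```

So a gh-pages branch keeps the default checkout free of website files, but users would need --single-branch (or a shallow clone) to avoid downloading the website history as well.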

NicolasBoumal commented on August 19, 2024

I redirected the domain name manopt.org to the GitHub Pages repo manopt-web. Hopefully, this now works everywhere (could only test from my own computer, and through a number of VPNs), in particular I hope the website is now accessible in China.

If anything seems off, could you kindly let me know? Thanks for prompting these changes!

NicolasBoumal commented on August 19, 2024

(I seem to be getting some issues with HTTPS, in that when I connect to the website a first time I'm told it's not secure, but then subsequent connections are fine; not too sure if that was a fluke or if something is misconfigured.)

jbriales commented on August 19, 2024

Hi Nicolas! Sorry for the delay in replying...
I see you already addressed many of the steps involved.

As for why I suggested keeping /web and /releases independent rather than as submodules, it was mainly for the sake of keeping the repository as minimal as possible (that is, reduced to the actual functionality for manopt users and almost nothing else 😄)

Unfortunately I have zero experience on using GitHub Pages. It is in my long term to-do list to learn about it, but I never got the opportunity/urgency to do it. I already see from your comments that you figured out way more details than I knew about (like having a gh-pages branch, which I have seen elsewhere). There must be plenty of examples, I checked e.g. the OpenGV library.
As you say, having everything web-related in gh-pages would keep the default master clone light.
I see the web folder is actually still in the main repository, so if you want to move web from there to a gh-pages branch, that should be doable.

As for the HTTPS issue, it also appears as Not secure for me. Otherwise I can access it. But I don't know much about developing/maintaining websites :(

Finally, about the clone size, you suspected right. The changes you made cleaned up the current structure (which is already good, since it keeps the repository scalable, e.g. when adding new release backups in the separate repository), but the previous git history still holds all those files. When you clone master from the repository, the .git folder contains everything that was ever committed to master throughout its history.
In order to actually decrease the current repository size, you need to REWRITE HISTORY, which comes with some caveats. Yet, again, it's doable. In this sense, I'd recommend:

  • Have a look at this doc, which gives a full perspective on the approach, involved tools, caveats and workarounds.
  • For removing a complete directory from history, check this stackoverflow answer (I have used it before myself).
    The key step is doing:
    git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch <directory_to_remove> -r' --prune-empty --tag-name-filter cat -- --all
    This basically creates a list of all files removed in the directory and forces their deletion across history with filter-branch.
    And later git push origin master --force, where --force is necessary to overwrite the GitHub repository's content and history with those of your local repository.
    In any case, be careful when doing anything with '--force'. Always make sure you understand (moderately at least!) what's going to happen, and create a local backup copy of the entire repository before doing any hard change, just in case.
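One possible way to take the backup mentioned above is a mirror clone, which copies every branch and tag. The sketch below uses a toy repository; the names demo and backup.git are placeholders, and for a real project the source would be the repository URL.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"

# Toy repository with a second branch and a tag.
git init -q demo
( cd demo && echo 'code' > solver.m && git add . && git commit -q -m 'Code' \
  && git branch side && git tag v1 )

# A mirror clone is a bare copy of all refs, taken before any history rewrite.
git clone -q --mirror demo backup.git

orig_refs=$(git -C demo for-each-ref --format='%(refname) %(objectname)' | sort)
backup_refs=$(git -C backup.git for-each-ref --format='%(refname) %(objectname)' | sort)
echo "$backup_refs"
```

For the push itself, git push --force-with-lease origin master is a more cautious variant of --force: it refuses to overwrite the remote branch if that branch has moved since the last fetch.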

Let me know if you have any further questions or I can help anyhow! 👍

NicolasBoumal commented on August 19, 2024

Hello Jesus,

Thanks for your response.

> I see the web folder is actually still in the main repository

That was an oversight on my part, I forgot to push changes on my computer. It's resolved now: /web is no longer on the master branch of manopt; all web files are on manopt-web (with history preserved.)

> As for the HTTPS issue, it also appears as Not secure for me. Otherwise I can access it.

Did you mean that your browser asked you to explicitly allow it to go to manopt.org without encryption, or did you mean that you get to the website directly, only your browser indicates with a logo in the url field that the connection is not secure?

> In order to actually decrease the current repository size, you need to REWRITE HISTORY, which comes with some caveats.

I suspected as much. Thanks for the instructions -- I will consider it, but, as you said, I should be careful :).

As a side note, I contacted a few manopt users from China, and they all confirmed they can now access the website :).

(Just saw now that the domain name started redirecting to the former hosting service (which is apparent because there I moved all files in a "backup" folder), which is very odd... perhaps I need to wait 24h for the DNS update to propagate.)

Thanks!
Nicolas

NicolasBoumal commented on August 19, 2024

Hello again,

Just writing some things down here, because this took a few hours to figure out.

The issue I had: http://manopt.org and http://www.manopt.org were working fine; also, https://manopt.org was working fine, but https://www.manopt.org was not: typing in that URL, the browser would show a "security alert" page. This was an issue because the site is best known as www.manopt.org, and the top result for manopt in Google is https://www.manopt.org.

Here is what seems to do the trick:

DNS side

  1. I added four A records for manopt.org pointing to the IP addresses listed in GitHub's documentation under "Configuring A records with your DNS provider".

  2. I added one CNAME record for www.manopt.org pointing to nicolasboumal.github.io (there is supposed to be a dot appended to that, but I believe my provider does that automatically because I was not allowed to add it myself). Formerly, I had it redirect to manopt.org; I don't know which one it should be, but for now things appear to work. I may have to change that though, because I'm not sure everything has synced. I was very confused about this one, because the website itself is at nicolasboumal.github.io/manopt-web, but apparently the path is irrelevant for the CNAME record.

GitHub side

In the options of the manopt-web repo, I did this:

  1. Selected the source as the gh-pages branch

  2. Set the custom domain as www.manopt.org -- I think this was the real culprit. I had originally set it as manopt.org, under the impression that this was mandatory (I only ever saw examples of that form), but the problem went away right when I made that change.

  3. Clicked "Enforce HTTPS", though not sure if that was part of the solution.

-- hopefully, this does the trick.

Best,
Nicolas

NicolasBoumal commented on August 19, 2024

Hello Jesus,

I'm finally getting back to this. I let it sit for a while because this seemed like risky business and not particularly urgent, but let's give it a try. I'm keeping track of everything here for future reference.

I checked out my local master branch and made sure there were no dangling, uncommitted changes etc. My local branch is up to date with the remote branch.

I ran this to check the size of the repo:
git count-objects -vH
The result was: about 40 MB in the pack, and 40 MB of garbage.

Following your recommendations, I ran these commands to erase folders web/ and releases/ recursively:

git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch web -r' --prune-empty --tag-name-filter cat -- --all
git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch releases -r' --prune-empty --tag-name-filter cat -- --all

Then I ran the garbage collector:
git gc --aggressive

This generated the following message:
"Unlink of file '.git/objects/pack/pack-###some code###.pack' failed. Should I try again? (y/n)"
Answering "y" kept failing. Closing Matlab resolved this issue.

Then I re-assessed the size of the repo:
git count-objects -vH
The good news is that the garbage was gone, but the pack still occupied pretty much 40 MB, not much less.

To figure out what is taking space, I first ran this command, taken from a conversation here:
git verify-pack -v .git/objects/pack/pack-###some other code###.idx | sort -k 3 -n | tail -10

Then, I ran this command a few times using the sha1's that came out of the previous command:
git rev-list --objects --all | grep [first few chars of the sha1 from previous output]

It turned out that a long, long time ago there was a folder called publicrelations/ that contained a video, so I removed that too:
git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch publicrelations -r' --prune-empty --tag-name-filter cat -- --all

Then this again:
git gc --aggressive
git count-objects -vH

But the pack still occupies ~40 MB...

I thought maybe I need to force push for things to somehow "sink in":
git push origin master --force

But it doesn't have an effect on the pack size.
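One common reason for a pack that refuses to shrink, which may or may not be the whole story here, is that git filter-branch keeps a backup of every rewritten ref under refs/original/, and the reflog also remembers the old commits. As long as those references exist, git gc will not discard the old objects. The sketch below reproduces this on a toy repository (the file names are placeholders) and applies the cleanup sequence described in the filter-branch documentation.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the deprecation pause in recent Git versions
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
echo 'code' > solver.m
mkdir publicrelations
echo 'pretend this is a large video' > publicrelations/video.txt
git add .
git commit -q -m 'Code and video'
blob=$(git rev-parse HEAD:publicrelations/video.txt)

# Same kind of rewrite as in the thread.
git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch publicrelations' \
  --prune-empty --tag-name-filter cat -- --all >/dev/null 2>&1

# The blob is still stored, because refs/original/ and the reflog point at old history.
if git cat-file -e "$blob" 2>/dev/null; then after_filter=present; else after_filter=gone; fi

# Drop the backup refs, expire the reflog, then prune unreachable objects.
git for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then after_cleanup=present; else after_cleanup=gone; fi
echo "after filter-branch: $after_filter, after cleanup: $after_cleanup"
```

Any other branch or tag that still points into the old history keeps the objects alive in the same way, so this cleanup only helps once every ref has been rewritten or deleted.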

Playing some more with the commands to see what takes up space in the pack, I found that there were also some .mat files in a long gone "tests/test_RSVRG/data/pca" folder, each a few hundred KB, but that's not much of a concern. I ran the filter-branch command on that folder anyway just for good measure, but I think the real issue is that the videos still occupy space in the pack even though they were written out of history (or at least, so I think).

I'll try to finish this later. Hopefully, this didn't break anything. Maybe insofar as cloning size is concerned it's already solved; this I don't know.

NicolasBoumal commented on August 19, 2024

I think the actual size of the repo may have been reduced as planned, even though the pack size on my local disk is still ~40 MB (in fact, it appears to go up by some amount after each filter-branch).

As it stands, SourceTree gives me a weird looking history where each commit appears twice:

[Screenshot: SourceTree history view in which each commit appears twice]

Maybe I need to delete my local copy and clone from remote to reset this. I'll try this next.

NicolasBoumal commented on August 19, 2024

Deleting my local files and cloning anew from remote did not solve the issue: each commit still appears twice, and the pack size is about 41 MB (a bit larger than before all the changes described here). On the bright side, it doesn't seem like I broke anything, but it's hard to say for sure.

NicolasBoumal commented on August 19, 2024

Ok, the double history appears because there is a remote branch whose history hasn't been rewritten. Presumably, that's also the reason disk usage didn't go down. Somehow, that branch kept all the tags, so I am moving them to the current master branch one by one. One small issue is that, historically, the last commit of a release was a website update: those commits no longer appear in the rewritten history, so I'm placing the "release" tags on the last commit that affected anything in the actual release package.
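A hedged sketch of the tag move on a toy repository: the tag name release-demo and the file names are placeholders, and solver.m stands in for "anything in the actual release package". The last commit that touched a given path can be found with git rev-list -1, and git tag -f moves an existing tag.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo

echo 'release code' > solver.m
git add .
git commit -q -m 'Release code'
echo 'news' > website.html
git add .
git commit -q -m 'Website update'
git tag release-demo            # the tag sits on the website commit

# Find the last commit that touched the release package itself.
code_commit=$(git rev-list -1 HEAD -- solver.m)

# Move the tag there (-f is required because the tag already exists).
git tag -f release-demo "$code_commit" >/dev/null
tag_target=$(git rev-parse 'release-demo^{commit}')
echo "tag now points at $tag_target"
```

A moved tag has to be force-pushed to replace the one on the remote, for instance with git push --force origin release-demo.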

NicolasBoumal commented on August 19, 2024

There are substantial issues now with the branch bmsymfixedrankpolarfactory which corresponds to an old pull request that was never merged. I should have handled this before any history rewrites -- won't happen twice.

NicolasBoumal commented on August 19, 2024

In order to salvage the commits from Bamdev's PR #12 , I followed the instructions here:

http://blog.asquareb.com/blog/2014/06/19/making-a-git-pull-request-for-specific-commits/

Here is what I effectively did:

  • Check out master branch.
  • Create new branch (and check it out): bmsymfixedrankpolarfactory_new
  • Add specific commits to it that correspond to the changes that happened in the former bmsymfixedrankpolarfactory branch; this should work because those changes affected a very limited number of files that do not exist in any other branch:

git cherry-pick 0cb5565c56cd040ab45908719866c3995d2e7d51^..2d4dc6aec4a6da0d4ca07680dd3e51eae5f8d47a
git cherry-pick d439a395a3bc7a2bf894419c6265f3dbd67b0a78
git cherry-pick ceb33881742b97a1b6a6d285e179a5c6488efbb5
git cherry-pick 80a83e9d6a6a9fb8b601eb8978063308397ffacf -m 1
git cherry-pick 3e0186981a84e7923d4606143067b209a0505572^..2c2a47878c2bc8e1fe458d4ebe16df0772a1574d

(The ^ before .. indicates that the first commit should be included too.)

(The "-m 1" option is because that commit is a merge, so we have to specify who's the mainline, that is, specify the parent; which parent is which can be seen in SourceTree by selecting the merge commit. That particular command said that this commit is empty, which seems fine.)

(There was some trial-and-error involved, in particular because there is a merge into the branch we are trying to save. Anyway: "git cherry-pick --abort" is very helpful to stop and try something else.)
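The effect of the ^ in a cherry-pick range can be checked on a toy repository. In the sketch below the branch names feature and salvaged and the file names are placeholders; A..B excludes commit A itself, whereas A^..B starts from A's parent and therefore includes A.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
echo 'base' > base.txt
git add .
git commit -q -m 'Base'
main_branch=$(git symbolic-ref --short HEAD)

# Three commits on a branch we want to salvage, each touching its own file.
git checkout -q -b feature
for n in 1 2 3; do
  echo "$n" > "f$n.txt"
  git add "f$n.txt"
  git commit -q -m "Feature $n"
done
first=$(git rev-parse HEAD~2)
last=$(git rev-parse HEAD)

# How many commits each range notation selects.
exclusive=$(git rev-list --count "$first..$last")
inclusive=$(git rev-list --count "$first^..$last")

# Replay the inclusive range onto a new branch started from the main branch.
git checkout -q "$main_branch"
git checkout -q -b salvaged
git cherry-pick "$first^..$last" >/dev/null 2>&1
picked=$(git rev-list --count "$main_branch..salvaged")
echo "exclusive: $exclusive, inclusive: $inclusive, picked: $picked"
```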

PR #24 is the new home for those changes that were salvaged.

NicolasBoumal commented on August 19, 2024

The old branch from PR #12 is now deleted (after moving all relevant contents to PR #24).

This does indeed clean up the "double history" issue as pictured in the screenshot above.

However, even after git gc --aggressive --prune=now, the pack size is still about 40 MB (it went down slightly from the ~41 MB). I don't know what I can do about this now. Maybe this time the issue is something in my local copy that no longer exists on the remote.

NicolasBoumal commented on August 19, 2024

This stack overflow discussion has a great command line to get a human-readable list of everything in a repo, sorted by size:

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| sort --numeric-sort --key=2 \
| cut -c 1-12,41- \
| $(command -v gnumfmt || echo numfmt) --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Run on the current version of Manopt, it reports that the largest file is 93 KiB, which strongly suggests that there is something wrong with my local pack file. And indeed, cloning the manopt repo from remote to a separate local folder, I get a small pack file now: the full folder is ~3 MB. I'm closing this issue now.

Side note: tools such as Git Extensions also offer a way to get that list: it's much slower, but it also lets you delete a file from the repo (so, I guess, rewriting history as needed).
