Comments (7)

agitter commented on July 17, 2024

I don't think the problem is the size of a single manuscript file. Rather, the branch also keeps an archived copy of every version of the manuscript we've ever deployed so that we can maintain permalinks. Check out the contents of https://github.com/greenelab/covid19-review/tree/gh-pages/v. We have about 600 past copies of the manuscript archived there.

Bringing in @dhimmel in case the permalinks and archiving are a possible cause.

from covid19-review.

rando2 commented on July 17, 2024

Oh no!!! If that's the case, I wonder if it's time to move to true book-like HTML formatting, with a TOC and "read next section" links at the bottom. It is so hard to load anyways... In terms of how to do that, I can play around and see how the suggestion from this post renders locally!

rando2 commented on July 17, 2024

They really need a 😱 reaction for posts like this... That makes sense and would definitely explain the timeouts!

dhimmel commented on July 17, 2024

Looking at the raw CI logs for this build:

2022-09-26T18:46:40.7358609Z Created deployment for c269dc06246f43081bfbfb4e8ae789a0f745d01b
2022-09-26T18:46:40.7360771Z {"page_url":"https://greenelab.github.io/covid19-review/","status_url":"https://api.github.com/repos/greenelab/covid19-review/pages/deployment/status/c269dc06246f43081bfbfb4e8ae789a0f745d01b","preview_url":""}
2022-09-26T18:46:40.7363072Z 
2022-09-26T18:46:46.0259664Z Current status: deployment_in_progress
...
2022-09-26T18:56:40.2791010Z Current status: deployment_in_progress
2022-09-26T18:56:45.5190618Z Current status: 
2022-09-26T18:56:45.5191563Z Timeout reached, aborting!
2022-09-26T18:56:45.5240173Z ##[error]Timeout reached, aborting!
2022-09-26T18:56:45.8961141Z Deployment cancelled with https://api.github.com/repos/greenelab/covid19-review/pages/deployment/cancel/c269dc06246f43081bfbfb4e8ae789a0f745d01b

So the deployment_in_progress step is likely limited to 10 minutes. What about creating a branch from gh-pages to preserve the existing versioned outputs, and then editing gh-pages to delete most of the versions?
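To check the 10-minute estimate against the log, here is a minimal sketch that subtracts the first status poll from the timeout line. The helper names `seconds_of_day` and `elapsed_seconds` are made up for illustration, and the sketch assumes both timestamps fall on the same UTC day.

```shell
# Convert the time-of-day part of an ISO 8601 timestamp
# (e.g. 2022-09-26T18:46:46.0259664Z) to whole seconds since midnight.
seconds_of_day() {
  t=${1#*T}          # drop the date: 18:46:46.0259664Z
  t=${t%%[.Z]*}      # drop fractional seconds and zone: 18:46:46
  h=${t%%:*}
  rest=${t#*:}
  m=${rest%%:*}
  s=${rest#*:}
  # Strip one leading zero so values like 08 are not read as octal.
  echo $(( ${h#0} * 3600 + ${m#0} * 60 + ${s#0} ))
}

# Elapsed whole seconds between two timestamps on the same UTC day.
elapsed_seconds() {
  echo $(( $(seconds_of_day "$2") - $(seconds_of_day "$1") ))
}

# First "deployment_in_progress" poll vs. the "Timeout reached" line.
elapsed_seconds 2022-09-26T18:46:46.0259664Z 2022-09-26T18:56:45.5191563Z
# prints 599
```

That is one second short of 600, which fits a 10-minute limit on the status polling.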

agitter commented on July 17, 2024

https://github.com/orgs/community/discussions/35197 provides more details. The artifact grew to 10 GB, which leads to the 10-minute timeout @dhimmel detected. We can monitor the artifact size from the Actions run pages, such as https://github.com/greenelab/covid19-review/actions/runs/3129872205

Archiving the old versions of the manuscript files isn't too hard. We already have a Zenodo repository linked to releases of this GitHub repository, so we could create a release from the gh-pages branch and then delete most of the versions.

That would destroy our old permalinks, which is unfortunate. We could manually try to preserve the old versions that correspond to releases (e.g. the arXiv preprints), but would miss some. I don't see a general solution though if we are going to continue hosting the manuscript on GitHub pages.

Maybe we set the old permalinks to redirect to the Zenodo DOI? That would be better than a 404.
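Since GitHub Pages serves static files only, one way to sketch such a redirect is a stub `index.html` with a meta refresh left at each old permalink path. The function name `write_redirect` is hypothetical, and the DOI in the usage comment is a placeholder, not the real Zenodo record for this repository.

```shell
# Write a stub page at <target_dir>/index.html that forwards visitors to <url>.
write_redirect() {
  target_dir=$1
  url=$2
  mkdir -p "$target_dir"
  cat > "$target_dir/index.html" <<EOF
<!DOCTYPE html>
<meta charset="utf-8">
<title>Archived manuscript version</title>
<meta http-equiv="refresh" content="0; url=$url">
<link rel="canonical" href="$url">
<p>This version has been archived. See <a href="$url">$url</a>.</p>
EOF
}

# Hypothetical usage (placeholder commit directory and placeholder DOI):
#   write_redirect v/<commit> https://doi.org/10.5281/zenodo.0000000
```

The visible link is a fallback for readers whose browsers ignore the refresh. A stub like this is well under a kilobyte, so even hundreds of them would add little to the artifact.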

@dhimmel do you think this is a general issue for large Manubot projects worth discussing in the rootstock repo, or does this review just push the Manubot workflow to the extreme?

dhimmel commented on July 17, 2024

> That would destroy our old permalinks, which is unfortunate

Slightly unfortunate, but you could do something in between, like deleting just the images directories.

> We could manually try to preserve the old versions that correspond to releases

Yeah, I don't think Manubot creates permalinks for git tags, but that would be a nice feature if it did.

> do you think this is a general issue for large Manubot projects worth discussing in the rootstock repo

Possibly. I think it's a reason to recommend embedding images by link if you plan to have large images and many commits. The insights from discussion #35197 might be valuable in USAGE, as would whatever you end up deciding about pruning.

agitter commented on July 17, 2024

I'm documenting my process to prune the gh-pages branch here.

Check out gh-pages locally, confirm I have the output from the last commit, and create a local copy for safekeeping.

$ git checkout origin/gh-pages
$ ls v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
images/  index.html  index.html.ots  manuscript.pdf  manuscript.pdf.ots
$ cp -R . ../gh-pages-archive-2022-10-22

Our repository also had Zenodo archiving enabled, so make a tag and release to archive the gh-pages contents before pruning.

$ git tag -a gh-pages-2022-10-22 -m "Archive gh-pages branch 2022-10-22"
$ git push origin gh-pages-2022-10-22

Zenodo created an archive of the release that is 8.2 GB compressed. (I also noticed at https://help.zenodo.org/ that Zenodo now supports metadata in a .zenodo.json file in the GitHub repo. The lack of that was always one of my gripes with archiving GitHub releases on Zenodo, and it is something we may want for this repo.) I downloaded the zip and checked that a few of the versioned PDFs look good. Time to start deleting!

I start by checking the size of the contents and iteratively delete files until it is back to a reasonable size.

$ du -sh .
20G     .
$ rm v/*/images/*
$ du -sh .
14G     .
$ rmdir v/*/images
$ rm v/*/*.pdf
$ du -sh .
9.9G    .

It's still huge even after removing the images and PDFs. Time to remove entire manuscripts arbitrarily.

$ rm -rf v/0*
$ rm -rf v/1*
$ du -sh .
9.8G    .

Removing those HTML files barely changed the total, which is a reminder that the disk usage must be elsewhere. It's in the .git subdirectory, which I am not touching.

$ du -sh v/
827M    v/
$ du -sh .git/
8.9G    .git/
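Because `du -sh .` also counts the `.git` history, it can overstate what gets deployed. A small sketch, assuming the Pages artifact is built from the checked-out files rather than from the history, reports the two numbers separately. The function name `worktree_vs_git_kb` is made up for illustration.

```shell
# Print the size in KiB of the checked-out files and of .git separately.
worktree_vs_git_kb() {
  repo=$1
  total_kb=$(du -sk "$repo" | cut -f1)       # everything, including .git
  git_kb=$(du -sk "$repo/.git" | cut -f1)    # history only
  echo "worktree_kb=$(( total_kb - git_kb )) git_kb=$git_kb"
}

# Usage from the repository root:
#   worktree_vs_git_kb .
```

Tracking the first number after each round of deletions shows progress directly, without the constant offset from the history.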

Let's blast a few more HTML manuscripts. My favorite number is "5" so it stays.

$ rm -rf v/2*
$ rm -rf v/3*
$ rm -rf v/4*
$ rm -rf v/6*
$ rm -rf v/7*
$ rm -rf v/8*
$ rm -rf v/9*
$ du -sh v/
377M    v/
$ ls -l v/*/*.html | wc -l
249

I'm stopping here. If we address the problem below, we will have a reasonable artifact size and many past versions of the manuscript left (HTML only though).

I can restore complete archives from my local copy, and anyone could do this by downloading the zip from Zenodo. I'm only restoring the two versions we refer to in the manual references for now and the latest version. I could restore more later that correspond to releases or other special versions.

$ cp -R ../gh-pages-archive-2022-10-22/v/910dd7b7479f5336a1c911c57446829bef015dbe v/910dd7b7479f5336a1c911c57446829bef015dbe
$ ls v/910dd7b7479f5336a1c911c57446829bef015dbe
$ cp -R ../gh-pages-archive-2022-10-22/v/32afa309f69f0466a91acec5d0df3151fe4d61b5 v/32afa309f69f0466a91acec5d0df3151fe4d61b5
$ ls v/32afa309f69f0466a91acec5d0df3151fe4d61b5
images/  index.html  index.html.ots  manuscript.pdf  manuscript.pdf.ots
$ cp -R ../gh-pages-archive-2022-10-22/v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3 v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
$ ls v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
images/  index.html  index.html.ots  manuscript.pdf  manuscript.pdf.ots
$ du -sh v/
478M    v/
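An alternative to deleting by prefix and then restoring is to start from a keep list and do a dry run first. The sketch below only prints the version directories that are not on the list and deletes nothing. The function name `list_prunable` is hypothetical, and skipping symbolic links is an assumption about how entries such as v/latest are stored.

```shell
# Print the names of directories under <versions_dir> that are NOT in the
# keep list given as the remaining arguments. Nothing is deleted.
list_prunable() {
  versions_dir=$1
  shift
  for d in "$versions_dir"/*/; do
    if [ ! -d "$d" ]; then continue; fi      # no match, or not a directory
    if [ -L "${d%/}" ]; then continue; fi    # skip symbolic links
    name=$(basename "$d")
    keep=no
    for k in "$@"; do
      if [ "$name" = "$k" ]; then keep=yes; fi
    done
    if [ "$keep" = no ]; then echo "$name"; fi
  done
}

# Usage: list everything except the three restored versions and v/latest.
#   list_prunable v \
#     910dd7b7479f5336a1c911c57446829bef015dbe \
#     32afa309f69f0466a91acec5d0df3151fe4d61b5 \
#     02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3 \
#     latest
```

Reviewing the printed list before removing anything makes it harder to delete a version that manual references or releases depend on.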

I noticed I broke the symbolic links for v/latest.

git status excerpt
        deleted:    v/latest/images/4.2-summary-R-M-smallMoleculeDrugs.pdf
        deleted:    v/latest/images/4.3-summary-R-M-biologicsDrugs.pdf
        deleted:    v/latest/images/4.3.1-summary-R-M-moreTocilizumab.pdf
        deleted:    v/latest/images/4.3.2.1-summary-R-U-moreMonoclonal.pdf
        deleted:    v/latest/images/4.3.4.1.1-summary-L-M-DNAVaccine.pdf
        deleted:    v/latest/images/4.3.4.1.2-summary-L-L-RNAVaccine.pdf
        deleted:    v/latest/images/FIgX1.jpg
        deleted:    v/latest/images/N000-overview.pdf
        deleted:    v/latest/images/N000-overview.png
        deleted:    v/latest/images/N001-LifeCyclePlusDrugs.pdf
        deleted:    v/latest/images/N001-LifeCyclePlusDrugs.png
        deleted:    v/latest/images/N002-Vaccines.pdf
        deleted:    v/latest/images/N002-Vaccines.png
        deleted:    v/latest/images/SARS_CoV_2.png
        deleted:    v/latest/images/Summary.pdf
        deleted:    v/latest/images/cell-lines-moi-partB.afdesign
        deleted:    v/latest/images/cell-lines-moi.afdesign
        deleted:    v/latest/images/cell-lines-moi.jpg
        deleted:    v/latest/images/covid-19-review-workflow-figure.pdf
        deleted:    v/latest/images/covid-19-review-workflow-figure.png
        deleted:    v/latest/images/covid-19-review-workflow-figure.svg
        deleted:    v/latest/images/covid-19-review-workflow-horizontal-cropped.pdf
        deleted:    v/latest/images/covid-19-review-workflow-horizontal.pdf
        deleted:    v/latest/images/covid-19-review-workflow-horizontal.png
        deleted:    v/latest/images/covid-19-review-workflow-horizontal.svg
        deleted:    v/latest/images/diagnostics.png
        deleted:    v/latest/images/ebmdatalab-trials-original.png
        deleted:    v/latest/images/genome-structure.png
        deleted:    v/latest/images/github.svg
        deleted:    v/latest/images/interests.png
        deleted:    v/latest/images/orcid.svg
        deleted:    v/latest/images/summary-M-M-Covid19Mechanism.pdf
        deleted:    v/latest/images/thumbnail.png
        deleted:    v/latest/images/twitter.svg
        deleted:    v/latest/manuscript.pdf

I restored those and then committed the other changes. I had to do the taboo git add . because of problems with my other attempts to add by pattern. I made an absolute mess of the commits and of pushing them to origin because I hadn't checked things out locally properly. Eventually, the commit made it.

$ git checkout v/latest/*
$ git checkout v/latest/images/*
$ git add .
$ git commit -m "Prune most old versioned manuscripts"
$ git log
commit 62720cec39d92945ce6733925bb35218947541e4 (HEAD)
Author: Anthony Gitter <[email protected]>
Date:   Sat Oct 22 16:29:34 2022 -0500
    Prune most old versioned manuscripts
$ git checkout --track origin/gh-pages
$ git branch prune-gh-pages 62720cec
$ git checkout prune-gh-pages
$ git branch --set-upstream-to origin/gh-pages prune-gh-pages
Branch 'prune-gh-pages' set up to track remote branch 'gh-pages' from 'origin'.
$ git push origin HEAD:gh-pages
Enumerating objects: 499, done.
Counting objects: 100% (499/499), done.
Delta compression using up to 8 threads
Compressing objects: 100% (250/250), done.
Writing objects: 100% (250/250), 15.17 KiB | 706.00 KiB/s, done.
Total 250 (delta 249), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (249/249), completed with 249 local objects.
To https://github.com/greenelab/covid19-review.git
   483a8dde..62720cec  HEAD -> gh-pages

If you browse gh-pages you'll see the pruned versioned manuscripts. And now the GitHub Pages deploy process works again so https://greenelab.github.io/covid19-review/ shows our latest manuscript!

We still should do this before closing the issue or merging too many more changes to the manuscript:

> embedding images by link if you plan to have large images and many commits

Every time we push to gh-pages, we are creating a new copy of all the images in content/images. For this project that is a lot of copies of a lot of images. We could move these to external-resources even though they are not really external. @rando2 could you work on that? It may be a while before I can do manuscript maintenance again.
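As a rough sketch of what embedding by link could look like, an image stored on another branch can be referenced through GitHub's raw content host instead of being copied into every deployed version. The function name `raw_image_url` and the image path in the usage comment are hypothetical. Whether to point at a branch name or at a fixed commit hash is a choice left open here.

```shell
# Build a raw.githubusercontent.com URL for a file in a GitHub repository.
# Arguments: <owner/repo> <branch-or-commit> <path/in/repo>
raw_image_url() {
  printf 'https://raw.githubusercontent.com/%s/%s/%s\n' "$1" "$2" "$3"
}

# Hypothetical usage (the path on the branch is a placeholder):
#   raw_image_url greenelab/covid19-review external-resources images/example.png
```

A URL with a commit hash keeps old manuscript versions pointing at the image they were built with, while a branch name always resolves to the newest copy.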
