Comments (7)
I don't think the problem is the size of a single manuscript file. Rather, the branch also tracks an archived copy of every version of the manuscript we've ever deployed so we can maintain permalinks. Check out the contents of https://github.com/greenelab/covid19-review/tree/gh-pages/v where we have about 600 past copies of the manuscript archived.
Bringing in @dhimmel in case the permalink and archiving is a possible cause.
from covid19-review.
Oh no!!! If that's the case, I wonder if it's time to move to true book-like HTML formatting, with a TOC and "read next section" links at the bottom. It is so hard to load anyways... In terms of how to do that, I can play around and see how the suggestion from this post builds locally!
from covid19-review.
They really need a 😱 react for posts like this... that makes sense and would definitely explain the timeouts!
from covid19-review.
Looking at the raw CI logs for this build:
2022-09-26T18:46:40.7358609Z Created deployment for c269dc06246f43081bfbfb4e8ae789a0f745d01b
2022-09-26T18:46:40.7360771Z {"page_url":"https://greenelab.github.io/covid19-review/","status_url":"https://api.github.com/repos/greenelab/covid19-review/pages/deployment/status/c269dc06246f43081bfbfb4e8ae789a0f745d01b","preview_url":""}
2022-09-26T18:46:40.7363072Z
2022-09-26T18:46:46.0259664Z Current status: deployment_in_progress
...
2022-09-26T18:56:40.2791010Z Current status: deployment_in_progress
2022-09-26T18:56:45.5190618Z Current status:
2022-09-26T18:56:45.5191563Z Timeout reached, aborting!
2022-09-26T18:56:45.5240173Z ##[error]Timeout reached, aborting!
2022-09-26T18:56:45.8961141Z Deployment cancelled with https://api.github.com/repos/greenelab/covid19-review/pages/deployment/cancel/c269dc06246f43081bfbfb4e8ae789a0f745d01b
So the deployment_in_progress step is likely limited to 10 minutes. What about creating a branch from gh-pages to preserve the existing versioned outputs, then editing gh-pages to delete most of the versions?
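A minimal sketch of the "preserve first, prune later" idea, assuming the remote is named origin and using a hypothetical archive branch name (gh-pages-archive is not something that exists in the repository). It only copies the current tip of gh-pages to a second branch and does not delete anything:

```shell
# Copy the current tip of the remote gh-pages branch to an archive branch.
# The default names "origin" and "gh-pages-archive" are assumptions.
archive_pages_branch() {
    remote=${1:-origin}
    archive_branch=${2:-gh-pages-archive}
    # Make sure the local view of the remote branch is up to date.
    git fetch "$remote" gh-pages &&
    # Create the archive branch at the same commit as the remote gh-pages.
    git branch "$archive_branch" "$remote/gh-pages" &&
    # Publish the archive branch so it survives later pruning of gh-pages.
    git push "$remote" "$archive_branch"
}

# Usage, from inside a clone of the repository:
#   archive_pages_branch origin gh-pages-archive
```

Only the gh-pages branch itself is deployed, so an archive branch like this should not affect the deployed artifact, though it does keep the old files in the repository history.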
from covid19-review.
https://github.com/orgs/community/discussions/35197 provides more details. The artifacts grew to 10 GB in size, which leads to the 10 min timeout @dhimmel detected. We can monitor the artifacts size from the actions pages such as https://github.com/greenelab/covid19-review/actions/runs/3129872205
Archiving the old versions of the manuscript files isn't too hard. We already have a Zenodo repository linked to releases of this GitHub repository, so we could create a release from the gh-pages branch and then delete most of the versions.
That would destroy our old permalinks, which is unfortunate. We could manually try to preserve the old versions that correspond to releases (e.g. the arXiv preprints), but would miss some. I don't see a general solution though if we are going to continue hosting the manuscript on GitHub pages.
Maybe we set the old permalinks to redirect to the Zenodo DOI? That would be better than a 404.
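One way the redirect idea could look, as a rough sketch: replace a pruned version's index.html with a tiny stub page that forwards to the archive. GitHub Pages serves static files only, so this uses an HTML meta refresh rather than a server-side redirect. The function name is hypothetical and the target URL is a placeholder, since the actual Zenodo DOI is not given here:

```shell
# Write a static redirect stub into a versioned directory.
# $1: directory for the old permalink, e.g. v/<commit>
# $2: URL to forward to (placeholder; the real DOI would go here)
write_redirect_stub() {
    version_dir=$1
    target_url=$2
    mkdir -p "$version_dir"
    cat > "$version_dir/index.html" <<EOF
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <meta http-equiv="refresh" content="0; url=$target_url">
    <link rel="canonical" href="$target_url">
    <title>Archived manuscript version</title>
  </head>
  <body>
    <p>This version has been archived. See <a href="$target_url">$target_url</a>.</p>
  </body>
</html>
EOF
}

# Usage (both arguments are placeholders):
#   write_redirect_stub v/<commit> https://doi.org/<zenodo-doi>
```

The stub is under a kilobyte, so keeping one per old version would add very little to the deployed artifact.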
@dhimmel do you think this is a general issue for large Manubot projects worth discussing in the rootstock repo, or does this review just push the Manubot workflow to the extreme?
from covid19-review.
> That would destroy our old permalinks, which is unfortunate

Slightly unfortunate, but you could do something in between, like just deleting the images directory.

> We could manually try to preserve the old versions that correspond to releases

Yeah, I don't think Manubot creates permalinks for git tags, but that would be a nice feature if it did.

> do you think this is a general issue for large Manubot projects worth discussing in the rootstock repo

Possibly. I think it's a reason to recommend embedding images by link if you plan to have large images and many commits. The insights from discussions#35197 might be valuable in USAGE, as well as what you end up deciding in terms of pruning things.
from covid19-review.
I'm documenting my process to prune the gh-pages branch here.
Check out gh-pages locally, confirm I have the output from the last commit, and create a local copy for safekeeping.
$ git checkout origin/gh-pages
$ ls v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
images/ index.html index.html.ots manuscript.pdf manuscript.pdf.ots
$ cp -R . ../gh-pages-archive-2022-10-22
Our repository also has Zenodo archiving enabled, so I made a tag and release to archive the gh-pages contents before pruning.
$ git tag -a gh-pages-2022-10-22 -m "Archive gh-pages branch 2022-10-22"
$ git push origin gh-pages-2022-10-22
Zenodo created an archive of the release that is 8.2 GB compressed. (I also noticed at https://help.zenodo.org/ that Zenodo now supports metadata in a .zenodo.json file in the GitHub repo, which was always one of my gripes with archiving GitHub releases on Zenodo and something we may want for this repo.) I downloaded the zip and checked that a few of the versioned PDFs look good. Time to start deleting!
I start by checking the size of the contents and iteratively delete until it is back to a reasonable size.
$ du -sh .
20G .
$ rm v/*/images/*
$ du -sh .
14G .
$ rmdir v/*/images
$ rm v/*/*.pdf
$ du -sh .
9.9G .
It's still huge even after removing images and PDFs. Time to remove entire manuscripts arbitrarily.
$ rm -rf v/0*
$ rm -rf v/1*
$ du -sh .
9.8G .
Removing those HTML files is a reminder that the disk usage must be elsewhere. It's in the .git subdirectory, which I am not touching.
$ du -sh v/
827M v/
$ du -sh .git/
8.9G .git/
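The two du calls above can be folded into one helper that separates the checked-out files from the .git directory, which makes it harder to mistake Git's object store for deployable content. This is only a sketch with a hypothetical function name. It reports sizes in kilobytes so the subtraction stays simple:

```shell
# Report how much of a clone's disk usage is .git versus checked-out files.
# $1: path to the clone (defaults to the current directory)
repo_disk_breakdown() {
    repo_dir=${1:-.}
    # du -sk prints "<kilobytes><TAB><path>"; keep only the number.
    total_kb=$(du -sk "$repo_dir" | cut -f1)
    git_kb=$(du -sk "$repo_dir/.git" | cut -f1)
    printf 'total_kb=%s\n' "$total_kb"
    printf 'git_kb=%s\n' "$git_kb"
    printf 'worktree_kb=%s\n' "$((total_kb - git_kb))"
}

# Usage, from the root of the gh-pages checkout:
#   repo_disk_breakdown .
```

The worktree_kb figure is the one that matters for the deployment size. For a closer look at the object store itself, git count-objects -vH is another option.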
Let's blast a few more HTML manuscripts. My favorite number is "5" so it stays.
$ rm -rf v/2*
$ rm -rf v/3*
$ rm -rf v/4*
$ rm -rf v/6*
$ rm -rf v/7*
$ rm -rf v/8*
$ rm -rf v/9*
$ du -sh v/
377M v/
$ ls -l v/*/*.html | wc -l
249
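Deleting by leading hex digit works but is arbitrary. An alternative, sketched below, is to pass an explicit list of version directories to keep and remove every other directory. This is not what was run above, and the function name is hypothetical. It skips symbolic links so that a link such as v/latest is never followed or removed:

```shell
# Remove every directory directly under V_DIR except the named ones.
# Usage: prune_versions V_DIR KEEP_NAME...
prune_versions() {
    v_dir=$1
    shift
    for dir in "$v_dir"/*/; do
        dir=${dir%/}
        # Skip when the glob matched nothing, and never touch symlinks.
        if [ ! -d "$dir" ] || [ -L "$dir" ]; then
            continue
        fi
        name=$(basename "$dir")
        keep=no
        for wanted in "$@"; do
            if [ "$name" = "$wanted" ]; then
                keep=yes
            fi
        done
        if [ "$keep" = no ]; then
            rm -rf -- "$dir"
        fi
    done
    return 0
}

# Usage (keeps the listed entries, deletes the other directories under v/):
#   prune_versions v latest 02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
```

Because the function deletes without confirmation, it is best run only after the safekeeping copy and the Zenodo archive have been checked.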
I'm stopping here. If we address the problem below, we have a reasonable artifact size and many past versions of the manuscript left (HTML only though).
I can restore complete archives from my local copy, and anyone could do this by downloading the zip from Zenodo. I'm only restoring the two versions we refer to in the manual references for now and the latest version. I could restore more later that correspond to releases or other special versions.
$ cp -R ../gh-pages-archive-2022-10-22/v/910dd7b7479f5336a1c911c57446829bef015dbe v/910dd7b7479f5336a1c911c57446829bef015dbe
$ ls v/910dd7b7479f5336a1c911c57446829bef015dbe
$ cp -R ../gh-pages-archive-2022-10-22/v/32afa309f69f0466a91acec5d0df3151fe4d61b5 v/32afa309f69f0466a91acec5d0df3151fe4d61b5
$ ls v/32afa309f69f0466a91acec5d0df3151fe4d61b5
images/ index.html index.html.ots manuscript.pdf manuscript.pdf.ots
$ cp -R ../gh-pages-archive-2022-10-22/v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3 v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
$ ls v/02880ba2701ec7fc0d81d37f7df9331d8f4bc4f3
images/ index.html index.html.ots manuscript.pdf manuscript.pdf.ots
$ du -sh v/
478M v/
I noticed I broke the symbolic links for v/latest.
git status excerpt
deleted: v/latest/images/4.2-summary-R-M-smallMoleculeDrugs.pdf
deleted: v/latest/images/4.3-summary-R-M-biologicsDrugs.pdf
deleted: v/latest/images/4.3.1-summary-R-M-moreTocilizumab.pdf
deleted: v/latest/images/4.3.2.1-summary-R-U-moreMonoclonal.pdf
deleted: v/latest/images/4.3.4.1.1-summary-L-M-DNAVaccine.pdf
deleted: v/latest/images/4.3.4.1.2-summary-L-L-RNAVaccine.pdf
deleted: v/latest/images/FIgX1.jpg
deleted: v/latest/images/N000-overview.pdf
deleted: v/latest/images/N000-overview.png
deleted: v/latest/images/N001-LifeCyclePlusDrugs.pdf
deleted: v/latest/images/N001-LifeCyclePlusDrugs.png
deleted: v/latest/images/N002-Vaccines.pdf
deleted: v/latest/images/N002-Vaccines.png
deleted: v/latest/images/SARS_CoV_2.png
deleted: v/latest/images/Summary.pdf
deleted: v/latest/images/cell-lines-moi-partB.afdesign
deleted: v/latest/images/cell-lines-moi.afdesign
deleted: v/latest/images/cell-lines-moi.jpg
deleted: v/latest/images/covid-19-review-workflow-figure.pdf
deleted: v/latest/images/covid-19-review-workflow-figure.png
deleted: v/latest/images/covid-19-review-workflow-figure.svg
deleted: v/latest/images/covid-19-review-workflow-horizontal-cropped.pdf
deleted: v/latest/images/covid-19-review-workflow-horizontal.pdf
deleted: v/latest/images/covid-19-review-workflow-horizontal.png
deleted: v/latest/images/covid-19-review-workflow-horizontal.svg
deleted: v/latest/images/diagnostics.png
deleted: v/latest/images/ebmdatalab-trials-original.png
deleted: v/latest/images/genome-structure.png
deleted: v/latest/images/github.svg
deleted: v/latest/images/interests.png
deleted: v/latest/images/orcid.svg
deleted: v/latest/images/summary-M-M-Covid19Mechanism.pdf
deleted: v/latest/images/thumbnail.png
deleted: v/latest/images/twitter.svg
deleted: v/latest/manuscript.pdf
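Since pruning can leave links pointing at files that no longer exist, a quick scan for dangling symbolic links is a useful check both before committing and after restoring. The sketch below assumes the entries under v/latest are symbolic links, as described above. The function name is hypothetical:

```shell
# Print every symbolic link under a directory whose target is missing.
# $1: directory to scan (defaults to the current directory)
list_dangling_symlinks() {
    # -type l selects the links themselves; "test -e" follows each link,
    # so it fails exactly when the link's target does not exist.
    find "${1:-.}" -type l ! -exec test -e {} \; -print
}

# Usage, from the root of the gh-pages checkout:
#   list_dangling_symlinks v
```

Empty output means every link still resolves.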
I restored those and then committed the other changes. I had to do the taboo git add . because of problems with my other attempts to add by pattern. I made an absolute mess of the commits and of pushing them to origin because I hadn't checked things out locally properly. Eventually, the commit made it.
$ git checkout v/latest/*
$ git checkout v/latest/images/*
$ git add .
$ git commit -m "Prune most old versioned manuscripts"
$ git log
commit 62720cec39d92945ce6733925bb35218947541e4 (HEAD)
Author: Anthony Gitter <[email protected]>
Date: Sat Oct 22 16:29:34 2022 -0500
Prune most old versioned manuscripts
$ git checkout --track origin/gh-pages
$ git branch prune-gh-pages 62720cec
$ git checkout prune-gh-pages
$ git branch --set-upstream-to origin/gh-pages prune-gh-pages
Branch 'prune-gh-pages' set up to track remote branch 'gh-pages' from 'origin'.
$ git push origin HEAD:gh-pages
Enumerating objects: 499, done.
Counting objects: 100% (499/499), done.
Delta compression using up to 8 threads
Compressing objects: 100% (250/250), done.
Writing objects: 100% (250/250), 15.17 KiB | 706.00 KiB/s, done.
Total 250 (delta 249), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (249/249), completed with 249 local objects.
To https://github.com/greenelab/covid19-review.git
483a8dde..62720cec HEAD -> gh-pages
If you browse gh-pages you'll see the pruned versioned manuscripts. And now the GitHub Pages deploy process works again so https://greenelab.github.io/covid19-review/ shows our latest manuscript!
We should still do this before closing the issue or merging too many more changes to the manuscript:

> embedding images by link if you plan to have large images and many commits

Every time we push to gh-pages, we are creating a new copy of all the images in content/images. For this project that is a lot of copies of a lot of images. We could move these to external-resources even though they are not really external. @rando2 could you work on that? It may be a while before I can do manuscript maintenance again.
from covid19-review.