manubot / rootstock
Clone me to create your Manubot manuscript
Home Page: https://manubot.github.io/rootstock/
License: Other
The anchor script (https://github.com/greenelab/manubot-rootstock/blob/master/build/assets/anchors.js) is not yet listed in the licensing section of the README.
I think it would be good to mention this somehow, maybe with "unless noted explicitly" or something similar?
Some consider the lack of page numbers to be disturbing.
Refs jgm/pandoc#265
It could be useful to have a simple way to add info text in a highly visible banner like "Work in progress" or "Published, peer-reviewed version at [...]" to the head of the HTML file with some simple config.
(From #127)
I was browsing recent pandoc commits and saw jgm/pandoc@c7e3c1e, refs jgm/pandoc#3909 and jgm/pandoc#3906.
We should look into WeasyPrint and Prince.
This could help with the lack of SVG image export in wkhtmltopdf as well as some of the aesthetics issues. In addition, our conda install of wkhtmltopdf is Linux only.
I suggest changing the CSS to left justify table captions, since the main text is left justified and the figure captions are as well. For example, only the table caption in this document is center justified: https://greenelab.github.io/meta-review/v/b8eeea542ce238bbcaf2023add2aecb86ef726bd/
It's not immediately obvious where to change the CSS to accomplish this, but I didn't look thoroughly.
I started thinking about more detailed documentation for someone who wanted to create a new manuscript using this repository as a template. They could fork through GitHub, but that would only support a single manuscript per user.
The process I'm trying is roughly:
[...] manubot-rootstock to my desired new manuscript name
[...] origin and manubot-rootstock as upstream
[...] initialize.sh to create the necessary remote branches
[...] deploy.sh
Is there a more streamlined process we could recommend? Am I missing any steps?
I don't think manubot-rootstock is a terrible repo name, but it may not be the best. The two biggest problems I see are that it's:
Here are some other names I jotted down:
Alphabetically sorted so I don't bias others with my ranking. Since this would be a slightly disruptive change, we should only make it if we feel any of these names is much better.
CC @cgreene @agitter @slochower. I do like the name "Manubot" for the overall system and the Python package. That's why all of these names stick to the manu* theme.
Idyll stands for Interactive Document Language and is a "markup language for interactive documents." The current description reads:
Idyll extends the ubiquitous Markdown format to enable the creation of dynamic, interactive narratives for the web. The language and toolchain aim to empower journalists, researchers, and technical experts to create compelling content using familiar tools and processes.
Idyll can be used to create explorable explanations, to power blog engines and content management systems, and to generate dynamic technical reports. The tool can generate standalone webpages or be embedded inside of your existing site.
Taking a look at an example was helpful. See Idyll on GitHub at idyll-lang/idyll.
@cgreene met the Idyll folks recently and wondered whether it'd be helpful for the Deep Review in greenelab/deep-review#842.
This issue is for discussing whether there is synergy between Idyll and Manubot, and whether there's an opportunity to integrate them in some form.
CC @mathisonian @AndrewGYork @marciovm.
@AndrewGYork is also working on interactive papers hosted via GitHub (example).
Based on some of the points already discussed on deep review and greenelab/meta-review#75 (comment), I think adding a few additional variables to the metadata would help Manubot be a little more flexible. Some ideas of what we might want to allow:
For the last three, I think we could implement sensible defaults in the jinja template to use if not specified. For example, corresponding author status may be set to "no" unless it is explicitly set to "yes."
The printed page margin was a bit too small on the top for the Sci-Hub manuscript. PeerJ applied their own banner which overlaps with some of the text. See https://peerj.com/preprints/3100v1.pdf
For example,
The other margins looked fine.
In deep review, the issues and pull requests were a critical part of the manuscript. I'd like to discuss strategies for archiving some of this metadata.
One initial thought would be to have the build script take a snapshot of the issues and pull requests at the time of the build, ideally with some caching. The deploy script could push them to a new branch, perhaps adding a timestamp. I haven't thought through the technical aspects of this. I expect it is feasible using some of the tools or APIs here.
cc @cgreene
In deep review (greenelab/deep-review#845), we had a pair of citations without a ; separator: [@url:https://eprint.iacr.org/2017/281.pdf @tag:Papernot2017_pate]. The second paper was numbered in the reference list but not actually cited in text, which led to inconsistent reference numbering:
The skipped reference number 161 is @tag:Papernot2017_pate. See the permalink for more context. As a reader, I would expect that @tag:Papernot2017_pate is numbered based on the first appearance in the text.
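The expected numbering can be sketched in a few lines of Python. This is not Manubot's implementation: the two regular expressions and the placeholder DOIs (10.1000/a, 10.1000/b) are assumptions made for illustration, and the key pattern simply stops at whitespace, a semicolon, or the closing bracket.

```python
import re

# Assumed key syntax: "@" + source + ":" + identifier, ending at whitespace,
# a semicolon, or the closing bracket.
CITATION_KEY = re.compile(r"@[a-z]+:[^\s;\]]+")
# A bracketed group that contains at least one "@".
BRACKET_GROUP = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")


def number_by_first_appearance(text):
    """Map each citation key to a number in order of first appearance."""
    numbers = {}
    for group in BRACKET_GROUP.finditer(text):
        for key in CITATION_KEY.findall(group.group(1)):
            # setdefault keeps the number assigned at the first appearance.
            numbers.setdefault(key, len(numbers) + 1)
    return numbers


# Placeholder DOIs are hypothetical; the middle group is the one from the issue.
text = (
    "Prior work [@doi:10.1000/a] and a pair without a separator "
    "[@url:https://eprint.iacr.org/2017/281.pdf @tag:Papernot2017_pate] "
    "followed by [@doi:10.1000/a; @doi:10.1000/b]."
)
numbers = number_by_first_appearance(text)
```

Because keys are extracted individually rather than split on a separator, the second key in the separator-free group still receives the next number in sequence.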
Jake VDP wrote an astronomy paper (github source) that published to gh-pages (http://jakevdp.github.io/multiband_LS/) via gh-publisher. While each of those steps is a little clunky, one awesome feature of this page is that it has a "Send Feedback" button which then opens up a GitHub issue! This is a great way to create a dialogue with the manuscript authors and readers.
EDIT: Added link to gh-publisher
Add:
<script src="https://hypothes.is/embed.js" async></script>
Doesn't work natively with the PDF files, sadly.
Currently author parsing is disabled in this repo. I'm thinking of simplifying the TSV format and how it gets added to the manuscript. Basically, here would be the columns:
I was thinking of removing the approve column, and having each author submit a PR to add their name, hence approving.
Unlike the system for the deep review, the build system would not try to condense affiliations or funding across authors. In other words, each author would get their details printed next to their name. There would be more duplication of text, but this system will be more reliable. Additionally, we may eventually move to putting much of this info in tooltips for the HTML version.
@agitter what do you think? Feel free to disagree!
Building on @dhimmel's post on author versus numeric citation styles, another advantage of author-based citations in the current version of Manubot is that it is easier to find where a reference is cited. I can search for Pantcheva, 2018 more easily than 13, for instance, especially if 13 is cited as 12-14 or appears in numeric parts of the text.
A nice feature for numeric citations might be a form of "show context" that some journals use. https://www.nature.com/articles/ncomms12989#references is an arbitrary example. The context consists of snippets of the manuscript where the reference was used plus links back to those locations.
This would also give us one way to address #117. We could assert that the reference number is an increasing function of the reference's first context.
When you cite a news or blog URL, you might want to reference the archive.org snapshot of the URL.
Can the @url: identifier send a request to archive.org and get that URL to cite in Manubot?
See blog post: https://medium.com/@RaoOfPhysics/89bd3f2ce0fd
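One way this lookup might work is through the Wayback Machine availability endpoint. The sketch below is an assumption about how such a lookup could be wired up, not an existing Manubot feature: the function names are hypothetical, and the sample response only illustrates the JSON shape I understand the endpoint to return.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

WAYBACK_API = "https://archive.org/wayback/available"


def availability_query(url):
    """Build the availability request URL for a page."""
    return WAYBACK_API + "?" + urlencode({"url": url})


def closest_snapshot(payload):
    """Return the closest snapshot URL from a parsed response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest", {})
    if closest.get("available"):
        return closest.get("url")
    return None


def archived_url(url, timeout=10):
    """Look up a snapshot for url (performs a network request)."""
    with urlopen(availability_query(url), timeout=timeout) as response:
        return closest_snapshot(json.load(response))


# Illustrative response; the values are made up for this example.
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20170101000000/http://example.com/",
            "timestamp": "20170101000000",
            "status": "200",
        }
    }
}
snapshot = closest_snapshot(sample)
```

Keeping the parsing separate from the request makes it possible to fall back to the original URL when no snapshot exists.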
Currently we cite multiple documents like:
Several groups [@doi:10.1371/journal.pone.0032235 @doi:10.1109/TCBB.2014.2343960 @doi:10.1038/srep11476] initiated
Prior to pandoc, this gets converted to:
Several groups [@1AlhRKQbe; @ZzaRyGuJ; @UpFrhdJf] initiated
Then post pandoc conversion, it will look like:
Several groups [30,192,193] initiated
Note how we have to add semicolons to separate each reference. We figured this out at lierdakil/pandoc-crossref#110. It would be nice to align our format with the pandoc-citeproc format. This presumably would also allow us to make non-bracketed citations like:
@doi:10.1371/journal.pone.0032235 was the first group
This would presumably render to
Qi et al 2012 was the first group
However, I haven't found the actual docs for the markdown citation formatting supported by pandoc-citeproc (docs). Tagging @lierdakil and @slochower in case they have any insights.
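For reference, the pre-pandoc conversion described above can be sketched as follows. The short_id function is a hypothetical stand-in: the real scheme that produces IDs such as 1AlhRKQbe is not shown in this thread, so a truncated SHA-1 hex digest is used purely for illustration.

```python
import hashlib
import re

# Assumed key syntax, ending at whitespace, a semicolon, or the closing bracket.
CITATION_KEY = re.compile(r"@[a-z]+:[^\s;\]]+")
BRACKET_GROUP = re.compile(r"\[([^\[\]]*@[^\[\]]*)\]")


def short_id(key):
    """Hypothetical citation ID: a short hash of the key (not Manubot's scheme)."""
    return hashlib.sha1(key.encode("utf-8")).hexdigest()[:8]


def to_pandoc_citations(text):
    """Rewrite bracketed citation groups as semicolon-separated pandoc keys."""
    def rewrite(match):
        keys = CITATION_KEY.findall(match.group(1))
        if not keys:
            # Leave brackets that contain "@" but no recognizable key untouched.
            return match.group(0)
        return "[" + "; ".join("@" + short_id(key) for key in keys) + "]"
    return BRACKET_GROUP.sub(rewrite, text)


source = (
    "Several groups [@doi:10.1371/journal.pone.0032235 "
    "@doi:10.1109/TCBB.2014.2343960 @doi:10.1038/srep11476] initiated"
)
converted = to_pandoc_citations(source)
```

The same function accepts groups that already use semicolons, so both input styles would produce identical output.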
To work around PDF build issues (#120) and for quicker local development, a BUILD_PDF flag like BUILD_DOCX might be useful.
This would require skipping "manuscript.pdf" in webpage.py. Would that be a problem?
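A rough sketch of how such a flag could be read and used to skip the PDF. Everything here is an assumption: the accepted values, the defaults (PDF on, DOCX off), and the function names are hypothetical, and the real handling of BUILD_DOCX and webpage.py is not shown in this thread.

```python
import os


def flag_enabled(name, default="true", environ=os.environ):
    """Interpret an environment variable as a boolean flag."""
    return environ.get(name, default).strip().lower() in {"1", "true", "yes"}


def outputs_to_publish(environ=os.environ):
    """List the output files a webpage step would expect to find."""
    outputs = ["manuscript.html"]
    # Assumed default: build the PDF unless BUILD_PDF is explicitly disabled.
    if flag_enabled("BUILD_PDF", environ=environ):
        outputs.append("manuscript.pdf")
    # Assumed default: skip the DOCX unless BUILD_DOCX is explicitly enabled.
    if flag_enabled("BUILD_DOCX", default="false", environ=environ):
        outputs.append("manuscript.docx")
    return outputs
```

With this shape, a local run with BUILD_PDF=false would only expect manuscript.html, so the webpage step would have nothing to skip manually.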
@slochower welcome to manubot-rootstock... which is meant to be forked when creating a new manuscript. Still a work in progress.
See previous discussions at greenelab/deep-review#354 (comment) and greenelab/deep-review#558.
It seems like the best way to number and reference tables and figures will be with pandoc-tablenos and pandoc-fignos, which are both Python packages by @tomduck that we can add to the environment:
They can be enabled in the pandoc conversion script with:
--filter pandoc-fignos
--filter pandoc-tablenos
Since we're also using jinja2 templating, we could do the conversion prior to pandoc if there is a compelling reason.
@slochower do you want to submit the PR? I'm thinking the initial use case we should target is markdown tables and figures embedded via absolute URL (let's save the relative image path case for later).
Also @slochower, any idea how figure and table captions work?
CC @agitter.
I've been playing around with manually building a manuscript based off this template and noticed that if I have absolutely zero references in my document, I get a build error. If I add a reference in any section (e.g., putting [@doi:10.1126/science.1127344] as a placeholder in my abstract), then the error goes away.
$ bash build/build.sh
Retrieving and processing reference metadata
Using metadata cache: True
Traceback (most recent call last):
File "references.py", line 111, in <module>
ref_df['standard_citation'], ref_df['citation_id'] = zip(*result)
ValueError: not enough values to unpack (expected 2, got 0)
I haven't debugged the code, but I think result (calculated on line 109, just above the error) is empty when there are no references. Would a simple check if result not None: ... before line 111 be a workaround?
result = ref_df.citation.apply(
get_standard_citatation, cache=metadata_cache, override=overrides)
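The traceback is consistent with result being empty rather than None: zip(*result) yields nothing for an empty input, so unpacking into two names fails. If that reading is right, a None check would presumably not help, and an emptiness check would. A minimal sketch of that guard in plain Python follows; the helper name is hypothetical and the real script works on a pandas column.

```python
def split_citation_pairs(result):
    """Unpack (standard_citation, citation_id) pairs into two tuples.

    zip(*result) yields nothing for an empty input, so the two-name
    unpacking raises ValueError; handle that case explicitly.
    """
    pairs = list(result)
    if not pairs:
        return (), ()
    standard_citations, citation_ids = zip(*pairs)
    return standard_citations, citation_ids


# With no references, both columns are simply empty.
empty_standard, empty_ids = split_citation_pairs([])
```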
(FWIW, I do get the "potentially misformatted references" error in any case, but the build continues successfully after I add the placeholder. The warning comes from the templates in the front matter.)
In several places, the PDF rendering looks (subjectively) worse than the HTML output. (I'm not sure if I'll have time to work on this during the week, but I wanted to drop this here in case someone else has time before me.)
Overall, I think the margins of the PDF could be adjusted. The relatively short title already wraps in the PDF.
There are places where the HTML has spaces between the text and the references, but the PDF output does not. I'm not sure why this happens.
Code style could be formatted as monospaced in the PDF output.
Tables look much better in HTML than PDF (shading and banding).
The SVG example figure is missing (known problem: #14).
It may be nice in the future to produce statistics about how many documents have been authored with Manubot and this rootstock, or to refer to more examples. @dhimmel has https://github.com/dhimmel/rephetio-manuscript/ and examples were listed in #62.
I haven't been able to think of a non-invasive way to track this. Does anyone else have ideas? Is this worthwhile?
It is difficult to read a long manuscript with the current style settings.
It might be useful to build on the work of other projects which convert Markdown into the usual academic style:
https://github.com/ickc/markdown-latex-css
https://github.com/thomaspark/pubcss/ // https://thomaspark.co/project/pubcss/demo/acm-sig-sample-web-theme.html
https://gist.github.com/killercup/5917178
etc
The current default math used in our pandoc build command is severely limited: see the "TeX math in HTML" section of the pandoc demos. Pandoc has support for several more advanced methods for math rendering in HTML.
The question is which one to choose. I've seen MathJax used before in scholarly publishing. However, KaTeX is faster to render. There are also several more options.
@slochower did you look into the math options at all for b03e1c3?
Currently, Manubot uses style.csl, a slightly modified version of proceedings-of-the-royal-society-b.csl. While this style is decent, I have some ideas for an optimal style. And of course, authors can always switch the style to that of whatever journal they'd like.
The style I envision uses numbers for citations, i.e. renders like blah blah [1-5,7]. Non-bracketed citations could show author name like: Pippi, Hippi, et al [7] wrote.
Bibliographic entries would look something like:
10.7287/peerj.preprints.3100v1
Ideally, author names would be in smaller text and hyperlink to ORCID records when available. The smallness of text here is an exaggeration (limited formatting options).
Compared to historical bibliographic formats, the following points are stressed:
There's a webapp to generate a custom CSL style. I've found it a bit difficult to use, but it's probably the way to go.
One question is whether to print out the URL rather than hyperlink the title. The benefit of showing the URL would be for readers who have printed the PDF. However, if a reader is at a computer, they could always go back to the digital version with the hyperlink.
Suggestions welcome.
Have you seen https://github.com/ewanmellor/gh-publisher? What lessons can we learn from them?
EDIT. Example: http://drphilmarshall.github.io/Ideas-for-Citizen-Science-in-Astronomy/
At the OpenCon do-a-thon, we've had 2 users experience potentially faulty substitutions. Rather than rebranding their README to USER/REPO, their README.md is rebranded to USER/USER. Possibly introduced in #84?
The two examples are https://github.com/zambujo/manubot/commit/10397d6a05235c3517ac981b9b3c67920c226b9a and broadwym/manu1@64954e5.
Interestingly, one user did not have the issue: https://github.com/schliebs/open_manuscript/commit/77da6c844ac061061c03b93721e7eade90fabd99, making me wonder whether it's user error or not.
SETUP.md currently uses:
sed "s/greenelab/$OWNER/g" README.md > tmp && mv -f tmp README.md
sed "s/manubot-rootstock/$REPO/g" README.md > tmp && mv -f tmp README.md
@vsmalladi any ideas what could be happening?
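This does not diagnose the USER/USER result, but one way to make the rebranding less order-dependent is to perform both substitutions in a single pass, so the second replacement never sees the output of the first, and to reject empty values up front. A sketch under those assumptions; the function name and the alice/my-paper values are hypothetical.

```python
import re


def rebrand(text, owner, repo):
    """Replace the template owner and repo names in one pass."""
    if not owner or not repo:
        raise ValueError("owner and repo must both be non-empty")
    replacements = {"greenelab": owner, "manubot-rootstock": repo}
    pattern = re.compile("|".join(re.escape(old) for old in replacements))
    # Each match is looked up in the table, so already-replaced text is
    # never scanned again.
    return pattern.sub(lambda match: replacements[match.group(0)], text)


readme = (
    "https://github.com/greenelab/manubot-rootstock and "
    "https://greenelab.github.io/manubot-rootstock/"
)
rebranded = rebrand(readme, "alice", "my-paper")
```

If OWNER and REPO were accidentally set to the same value, this would still produce USER/USER, so printing both variables before substituting may also be worth adding to SETUP.md.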
OpenTimestamps is now on PyPI (announcement). Install with:
pip install opentimestamps-client
We should also update python-bitcoinlib to v0.8.0.
The ReScience journal could be a potential use case for manubot-rootstock. From https://arxiv.org/abs/1707.04393:
The main inconvenience of the GitHub platform is its almost complete lack of support for the publishing steps, once a submission has successfully passed the reviewing process. At this point, the submission consists of an article text in Markdown format plus a set of code and data files in a git repository. The desired archival form is an article in PDF format plus a permanent archive of the submitted code and data, with a Digital Object Identifier (DOI) providing a permanent reference. The Zenodo platform allows straightforward archiving of snapshots of a repository hosted on GitHub, and issues a DOI for the archive. This leaves the task of producing a PDF version of the article, which is currently handled by the managing editor of the submission, in order to ease the technical burden on our authors
I highlight text with my mouse as I'm reading this. Apparently, there are lots of us who do this.
This is really annoying on Manubot HTML outputs because the highlight popup comes up every time. One time I clicked it by mistake and now there's no way for me to get rid of my highlight and I feel like a jackass who highlighted some unimportant text.
It'd be great if I could a) toggle the highlight-popup and b) un-highlight.
Just a suggestion: https://about.gitlab.com/features/gitlab-ci-cd/ :)
Oftentimes, it's important (and required in scholarly publishing) to show the changes between two versions of a manuscript. It would be ideal if Manubot users could "track changes" between two manuscript versions.
Pandoc doesn't have builtin support for diffs: jgm/pandoc#2374. Other options would be:
[...] manuscript.md as a text file (perhaps using diff, prettydiff, or rich-text-diff)
It would be helpful to describe the usage of manual-references.json in references/README.md. I can make a pull request myself (eventually).
As commented by @arielsvn in greenelab/scihub-manuscript#51 (comment):
there seems to be an encoding issue with the bitcoin symbol on the Discussion section. I noticed it on the pdf, and the same happens with the markdown file, at least on my computer.
This is likely due to the Unicode character (₿, U+20BF), a recent addition as part of Unicode 10.0, released June 2017. Note this release has other important symbols/emojis such as 🧟 (Zombie) and 🧖 (Person in Steamy Room).
For me, on Chrome on Ubuntu 17.10, the bitcoin sign renders in the HTML but not the PDF. I'm assuming the PDF gets a certain font embedded on Travis CI, which doesn't have the latest characters. Note that when I generate the PDF locally, the bitcoin signs do render.
So @arielsvn, I think we may want to look into the following solutions:
@arielsvn you probably know best what to do here.
Some of the commands in SETUP.md fail on macOS. IIRC, these commands are:
TRAVIS_ENCRYPT_ID=`grep \
--only-matching --perl-regexp \
--regexp='(?<=encrypted_)[a-zA-Z0-9]+(?=_key)' \
travis-encrypt-file.log`
sed --in-place "s/f2f00aaf6402/$TRAVIS_ENCRYPT_ID/g" deploy.sh
sed --in-place "s/greenelab/$OWNER/g" README.md
sed --in-place "s/manubot-rootstock/$REPO/g" README.md
The issue is likely that the mac versions of these utilities don't support the same long arguments. What a shame.
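Since Python is already part of the environment, one portable alternative would be to do the extraction and the in-place edits there instead of relying on GNU-only long options. This is only a sketch: the function names are hypothetical, and the sample log line is an illustrative guess at what travis-encrypt-file.log contains.

```python
import re
from pathlib import Path

# Same pattern as the grep command above.
ENCRYPT_ID = re.compile(r"(?<=encrypted_)[a-zA-Z0-9]+(?=_key)")


def find_encrypt_id(log_text):
    """Return the first encryption ID found in the log text."""
    match = ENCRYPT_ID.search(log_text)
    if match is None:
        raise ValueError("no encrypted_<id>_key variable found in log")
    return match.group(0)


def replace_in_file(path, old, new):
    """Portable stand-in for `sed --in-place "s/old/new/g" path` with literal text."""
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    path.write_text(text.replace(old, new), encoding="utf-8")


# Illustrative log line; the real file contents may differ.
sample_log = "openssl aes-256-cbc -K $encrypted_0a1b2c3d4e5f_key -iv $encrypted_0a1b2c3d4e5f_iv"
encrypt_id = find_encrypt_id(sample_log)
```

Unlike grep --only-matching, which prints every match, this returns only the first one, which should be enough when the log mentions a single key.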
sh build/build.sh fails on macOS as follows:
ln --symbolic and rm --recursive do not work. When I changed them to ln -s and rm -r, respectively, they work fine.
However, then it complains about pango. I manually installed it using Homebrew and pango was not an issue anymore.
Then the build was completed with no errors but warnings:
WARNING: Ignored `-ms-text-size-adjust: 100%` at 78:5, unknown property.
WARNING: Ignored `-webkit-text-size-adjust: 100%` at 79:5, unknown property.
WARNING: Ignored `-moz-box-sizing: content-box` at 204:5, unknown property.
WARNING: Ignored `-webkit-appearance: button` at 379:5, unknown property.
WARNING: Ignored `cursor: pointer` at 380:5, the property does not apply for the print media.
WARNING: Ignored `cursor: default` at 389:5, the property does not apply for the print media.
WARNING: Ignored `-webkit-appearance: textfield` at 410:5, unknown property.
WARNING: Ignored `-moz-box-sizing: content-box` at 411:5, unknown property.
WARNING: Ignored `-webkit-box-sizing: content-box` at 412:5, unknown property.
WARNING: Ignored `-webkit-appearance: none` at 423:5, unknown property.
WARNING: Invalid or unsupported selector 'button::-moz-focus-inner,
input::-moz-focus-inner ', Unknown pseudo-element: -moz-focus-inner
WARNING: Invalid or unsupported selector '*:not("#mkdbuttons") ', (<FunctionBlock not( … )>, ':not() only accepts a simple selector')
WARNING: Ignored `-webkit-font-smoothing: subpixel-antialiased` at 486:5, unknown property.
WARNING: Ignored `-moz-border-radius: 3px` at 491:5, unknown property.
WARNING: Ignored `-webkit-border-radius: 3px
` at 492:5, unknown property.
WARNING: Ignored `-webkit-font-smoothing: subpixel-antialiased` at 528:5, unknown property.
WARNING: Ignored `cursor: text
` at 529:5, the property does not apply for the print media.
WARNING: Ignored `word-break: break-all` at 733:5, unknown property.
WARNING: Ignored `word-break: break-word` at 734:5, unknown property.
WARNING: Ignored `-webkit-hyphens: auto` at 735:5, unknown property.
WARNING: Ignored `-moz-hyphens: auto` at 736:5, unknown property.
And the generated PDF has squares only.
Do you have any idea on why this might be happening?
Ran into a deploy error when setting up a manuscript at the OpenCon doathon:
bad decrypt
140040671200928:error:0606506D:digital envelope routines:EVP_DecryptFinal_ex:wrong final block length:evp_enc.c:520:
It seems like it would be better to specify the ordering of the markdown files by having a separate file.
As it is now it looks like people would have to rename several files if they wanted to change the ordering or add some content in the middle.
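As a sketch of what this could look like: the build reads an ordering file when present and falls back to filename sorting otherwise. The file name order.txt, its one-name-per-line format, and the function name are all hypothetical.

```python
from pathlib import Path


def ordered_sections(content_dir, order_file="order.txt"):
    """Return section paths in the order listed in a hypothetical order file.

    Without the file, fall back to sorting *.md files by name.
    """
    content_dir = Path(content_dir)
    listing = content_dir / order_file
    if not listing.exists():
        return sorted(content_dir.glob("*.md"))
    # One file name per line; blank lines and "#" comments are ignored.
    names = [
        line.strip()
        for line in listing.read_text(encoding="utf-8").splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    missing = [name for name in names if not (content_dir / name).exists()]
    if missing:
        raise FileNotFoundError(f"listed but not found: {missing}")
    return [content_dir / name for name in names]
```

Adding a section in the middle would then mean inserting one line in the ordering file rather than renaming the files that follow.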
I'm excited to see this standalone manuscript repository!
I have a general question in regards to journal submissions. Many journals require Word or LaTeX formats for submission. Have you thought about how manuscripts written in this markdown format can be submitted to a journal with those requirements? Would one use pandoc outside of the automatic build to do a one-time conversion to Word or LaTeX?
See for example this Sci-Hub Manuscript PDF. The Paper Size according to the PDF's properties is A4, Portrait (8.26 × 11.68 inch). This caused an issue when I printed the PDF where some final lines on a page were omitted.
This StackOverflow notes how to change the page to Letter (8.5 × 11). I just want to confirm this is a change we want to make. I didn't realize there were multiple paper sizes, both prevalent, in this unstandardized world!
The gh-pages branch is responsible for the GitHub Pages site and contains output HTML, PDF, CSS, image, and OTS files. Currently, new manuscript builds overwrite the files, which are in the root directory of this branch:
I propose instead creating a directory structure, so all past outputs on gh-pages are preserved through versioned directories. The version would be the master commit that the build was based on (i.e. $TRAVIS_COMMIT). For example, I commit f165f60 to master. The outputs that currently go to the root directory of gh-pages would instead go to the v/f165f609f33b11fdf71a0db6435d4dd159f23973 directory (v for version). The latest HTML and PDF manuscript would stay available at their current URLs, probably via symbolic links (see here for how symlinks act with GitHub Pages).
We could use redirects, so v/freeze redirects to the latest versioned directory.
The benefits of this change are twofold:
You can view outdated versions of the HTML manuscript. Right now, you can only see the rendered HTML for the latest version.
The OpenTimestamp .ots files need to be upgraded. Until they're upgraded, they depend on a calendar service for verification. Currently, we haven't upgraded timestamps, which creates the possibility that we may be unable to prove existence if the calendar goes down. Note that the timestamps can only be upgraded after the bitcoin transaction confirms, which could be days. That's why we don't specify --wait in our builds. Anyways, previously I was planning on rewriting the gh-pages history to upgrade timestamps in past commits. However, rewriting history is dangerous. It would be preferable to be able to upgrade past timestamps without rewriting history, which this proposal would enable.
The main disadvantage I can think of is repository size, since more files are being tracked. However, I'm not sure it'd be any bigger, since all files are currently in the git history at some point. According to this:
even if you have multiple files with the same contents but different names or in different locations or from different commits only one copy would ever be saved but with several pointers to it in each commit tree.
Shallow cloning would lose its savings, but I'm not sure we care.
One final point to consider is that a single commit will sometimes be deployed multiple times (say if the CI build is rerun). They will not always be the same. For the same source commit, I think we'd use the latest build.
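The proposed layout could be sketched like this. It is a simplification under stated assumptions: the function name is hypothetical, plain copies stand in for the symbolic links or redirects discussed above, and a rerun of the same commit simply replaces the earlier build.

```python
import shutil
from pathlib import Path


def deploy_version(output_dir, pages_dir, commit):
    """Copy build outputs into v/<commit> and refresh the top-level copies."""
    output_dir, pages_dir = Path(output_dir), Path(pages_dir)
    version_dir = pages_dir / "v" / commit
    # A rerun of the same commit replaces the earlier build (latest wins).
    if version_dir.exists():
        shutil.rmtree(version_dir)
    shutil.copytree(output_dir, version_dir)
    # Keep the latest outputs at their current URLs. Plain copies are used
    # here instead of symbolic links, which is a simplification.
    for path in version_dir.iterdir():
        if path.is_file():
            shutil.copy2(path, pages_dir / path.name)
    return version_dir
```

In the deploy script, commit would come from $TRAVIS_COMMIT, and older v/ directories would be left untouched so their .ots files can be upgraded later without rewriting history.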
I propose we symlink github-pandoc.css to the output/ directory so that local building and viewing of webpage/index.html or output/manuscript.html (I know those are symlinks of each other) loads the CSS. Viewing the HTML from either webpage/ or output/ currently can't find the CSS because the browser follows the symlink into output/ and therefore doesn't find webpage/github-pandoc.css. Does that make sense? A simple ln -s ../webpage/github-pandoc.css in output/ fixes the issue.
This CircleCI blog describes Markdown Proofer for validating YAML blocks in Markdown files. It is written in Go, which we could get from conda, but it may not cleanly integrate into our test environment. I'm also uncertain whether it could be applied directly to YAML files like metadata.yaml.
Nevertheless I thought it was worth monitoring.
This page provides some nice examples of the CSL metadata for different document types. Would be nice to add to docs.
Described in Formatting Open Science: agilely creating multiple document formats for academic manuscripts with Pandoc Scholar:
In this article we demonstrate the feasibility of writing scientific manuscripts in plain markdown (MD) text files, which can be easily converted into common publication formats, such as PDF, HTML or EPUB, using Pandoc. The simple syntax of Markdown assures the long-term readability of raw files and the development of software and workflows. We show the implementation of typical elements of scientific manuscripts—formulas, tables, code blocks and citations—and present tools for editing, collaborative writing and version control. We give an example on how to prepare a manuscript with distinct output formats, a DOCX file for submission to a journal, and a LATEX/PDF version for deposition as a PeerJ preprint. Further, we implemented new features for supporting ‘semantic web’ applications, such as the ‘journal article tag suite’—JATS, and the ‘citation typing ontology’—CiTO standard.
The GitHub repo for this project is pandoc-scholar/pandoc-scholar. Created by @tarleb.
Let's see if there's anything from Pandoc Scholar we should incorporate here or learn from.
Quoting from https://greenelab.github.io/scihub-manuscript/
0000-0002-9925-9623 · Department of Applied Bioinformatics, Institute of Cell Biology and Neuroscience, Goethe University Frankfurt · Funded by nan
Would be better to have jinja2 omit blank fields entirely. In other words remove " · Funded by nan"
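The "nan" most likely comes from a missing value in the author table being rendered as text. In the template itself this would be a jinja2 conditional; the Python sketch below only illustrates the logic, and the function names and the set of values treated as blank are assumptions.

```python
import math


def is_blank(value):
    """Treat None, empty strings, NaN, and the literal string 'nan' as blank."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return str(value).strip().lower() in {"", "nan"}


def author_line(orcid=None, affiliation=None, funding=None):
    """Join only the fields that are present, separated by a middle dot."""
    parts = []
    if not is_blank(orcid):
        parts.append(str(orcid))
    if not is_blank(affiliation):
        parts.append(str(affiliation))
    if not is_blank(funding):
        parts.append(f"Funded by {funding}")
    return " · ".join(parts)


line = author_line(
    "0000-0002-9925-9623",
    "Goethe University Frankfurt",
    float("nan"),
)
```

Building the line from a list of present fields also avoids stray separators when the ORCID or affiliation is missing.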
I just found out about Manubot, can you tell the differences between Manubot and alternatives, like:
At the moment, PDFs get pushed to PeerJ. But you could use the GitHub-Zenodo integration to snapshot the whole repo and give it a DOI.