manubot's Introduction

Python utilities for Manubot: Manuscripts, open and automated


Manubot is a workflow and set of tools for the next generation of scholarly publishing. This repository contains a Python package with several Manubot-related utilities, as described in the usage section below. Package documentation is available at https://manubot.github.io/manubot (auto-generated from the Python source code).

The manubot cite command-line interface retrieves and formats bibliographic metadata for user-supplied persistent identifiers like DOIs or PubMed IDs. The manubot process command-line interface prepares scholarly manuscripts for Pandoc consumption. The manubot process command is used by Manubot manuscripts, which are based on the Rootstock template, to automate several aspects of manuscript generation. The manubot ai-revision command is used to automatically revise a manuscript based on a set of AI-generated suggestions. See Rootstock's manuscript usage guide for more information.

Note: If you want to experience Manubot by editing an existing manuscript, see https://github.com/manubot/try-manubot. If you want to create a new manuscript, see https://github.com/manubot/rootstock.

To cite the Manubot project or for more information on its design and history, see:

Open collaborative writing with Manubot
Daniel S. Himmelstein, Vincent Rubinetti, David R. Slochower, Dongbo Hu, Venkat S. Malladi, Casey S. Greene, Anthony Gitter
PLOS Computational Biology (2019-06-24) https://doi.org/c7np
DOI: 10.1371/journal.pcbi.1007128 · PMID: 31233491 · PMCID: PMC6611653

The Manubot version of this manuscript is available at https://greenelab.github.io/meta-review/.

Installation

If you are using the manubot Python package as part of a manuscript repository, installation of this package is handled through Rootstock's environment specification. For other use cases, this package can be installed via pip.

Install the latest release version from PyPI:

pip install --upgrade manubot

Or install from the source code on GitHub, using the version specified by a commit hash:

COMMIT=d2160151e52750895571079a6e257beb6e0b1278
pip install --upgrade git+https://github.com/manubot/manubot@$COMMIT

The --upgrade argument ensures pip updates an existing manubot installation if present.

Some functions in this package require Pandoc, which must be installed separately on the system. The pandoc-manubot-cite filter depends on Pandoc as well as panflute (a Python package). Users must install a compatible version of panflute based on their Pandoc version. For example, on a system with Pandoc 2.9, install the appropriate panflute like pip install panflute==1.12.5.

Usage

Installing the Python package creates the manubot command line program. Here is the usage information as per manubot --help:

usage: manubot [-h] [--version] {process,cite,webpage,ai-revision} ...

Manubot: the manuscript bot for scholarly writing

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

subcommands:
  All operations are done through subcommands:

  {process,cite,webpage,ai-revision}
    process             process manuscript content
    cite                citekey to CSL JSON command line utility
    webpage             deploy Manubot outputs to a webpage directory tree
    ai-revision         revise manuscript content with language models

Note that all operations are done through the following sub-commands.

Process

The manubot process program is the primary interface to using Manubot. There are two required arguments: --content-directory and --output-directory, which specify the respective paths to the content and output directories. The content directory stores the manuscript source files. Files generated by Manubot are saved to the output directory.

One common setup is to create a directory for a manuscript that contains both the content and output directory. Under this setup, you can run Manubot using:

manubot process \
  --skip-citations \
  --content-directory=content \
  --output-directory=output

See manubot process --help for documentation of all command line arguments:

usage: manubot process [-h] --content-directory CONTENT_DIRECTORY
                       --output-directory OUTPUT_DIRECTORY
                       [--template-variables-path TEMPLATE_VARIABLES_PATH]
                       --skip-citations [--cache-directory CACHE_DIRECTORY]
                       [--clear-requests-cache] [--skip-remote]
                       [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Process manuscript content to create outputs for Pandoc consumption. Performs
bibliographic processing and templating.

options:
  -h, --help            show this help message and exit
  --content-directory CONTENT_DIRECTORY
                        Directory where manuscript content files are located.
  --output-directory OUTPUT_DIRECTORY
                        Directory to output files generated by this script.
  --template-variables-path TEMPLATE_VARIABLES_PATH
                        Path or URL of a file containing template variables
                        for jinja2. Serialization format is inferred from the
                        file extension, with support for JSON, YAML, and TOML.
                        If the format cannot be detected, the parser assumes
                        JSON. Specify this argument multiple times to read
                        multiple files. Variables can be applied to a
                        namespace (i.e. stored under a dictionary key) like
                        `--template-variables-path=namespace=path_or_url`.
                        Namespaces must match the regex `[a-zA-
                        Z_][a-zA-Z0-9_]*`.
  --skip-citations      Skip citation and reference processing. Support for
                        citation and reference processing has been moved from
                        `manubot process` to the pandoc-manubot-cite filter.
                        Therefore this argument is now required. If citation-
                        tags.tsv is found in content, these tags will be
                        inserted in the markdown output using the reference-
                        link syntax for citekey aliases. Appends
                        content/manual-references*.* paths to Pandoc's
                        metadata.bibliography field.
  --cache-directory CACHE_DIRECTORY
                        Custom cache directory. If not specified, caches to
                        output-directory.
  --clear-requests-cache
  --skip-remote         Do not add the rootstock repository to the local git
                        repository remotes.
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level for stderr logging

Manual references

Manubot has the ability to rely on user-provided reference metadata rather than generating it. manubot process searches the content directory for files containing manually-provided reference metadata that match the glob manual-references*.*. These files are stored in the Pandoc metadata bibliography field, such that they can be loaded by pandoc-manubot-cite.
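
For reference, the matching behavior is just a filename glob over the content directory; a minimal sketch of the equivalent lookup, assuming a Rootstock-style content directory name:

# Sketch: list the files `manubot process` would treat as manual references,
# using the same glob pattern over a Rootstock-style content directory.
from pathlib import Path

manual_reference_paths = sorted(Path("content").glob("manual-references*.*"))
print(manual_reference_paths)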

Cite

manubot cite is a command line utility to produce bibliographic metadata for citation keys. The utility outputs metadata as CSL JSON items by default, or produces formatted references when a rendered output format (such as plain, markdown, html, or docx) is selected via --format.

Citation keys should be in the format prefix:accession. For example, the following command generates Markdown-formatted references for four persistent identifiers:

manubot cite --format=markdown \
  doi:10.1098/rsif.2017.0387 pubmed:29424689 pmc:PMC5640425 arxiv:1806.05726
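
The same lookup can be scripted from Python. Here is a minimal sketch that assumes your installed manubot version exposes citekey_to_csl_item under manubot.cite (older releases used manubot.cite.util.citation_to_citeproc instead, as seen in an issue below):

# Sketch: retrieve CSL JSON for a single citation key from Python.
# Assumes manubot.cite exposes citekey_to_csl_item in the installed version.
import json

from manubot.cite import citekey_to_csl_item

csl_item = citekey_to_csl_item("doi:10.1098/rsif.2017.0387")
print(json.dumps(csl_item, indent=2))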

The following terminal recording demonstrates the main features of manubot cite (for a slightly outdated version):

manubot cite demonstration

Additional usage information is available from manubot cite --help:

usage: manubot cite [-h] [--output OUTPUT]
                    [--format {csljson,cslyaml,plain,markdown,docx,html,jats} | --yml | --txt | --md]
                    [--csl CSL] [--bibliography BIBLIOGRAPHY]
                    [--no-infer-prefix] [--allow-invalid-csl-data]
                    [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
                    citekeys [citekeys ...]

Generate bibliographic metadata in CSL JSON format for one or more citation
keys. Optionally, render metadata into formatted references using Pandoc. Text
outputs are UTF-8 encoded.

positional arguments:
  citekeys              One or more (space separated) citation keys to
                        generate bibliographic metadata for.

options:
  -h, --help            show this help message and exit
  --output OUTPUT       Specify a file to write output, otherwise default to
                        stdout.
  --format {csljson,cslyaml,plain,markdown,docx,html,jats}
                        Format to use for output file. csljson and cslyaml
                        output the CSL data. All other choices render the
                        references using Pandoc. If not specified, attempt to
                        infer this from the --output filename extension.
                        Otherwise, default to csljson.
  --yml                 Short for --format=cslyaml.
  --txt                 Short for --format=plain.
  --md                  Short for --format=markdown.
  --csl CSL             URL or path with CSL XML style used to style
                        references (i.e. Pandoc's --csl option). Defaults to
                        Manubot's style.
  --bibliography BIBLIOGRAPHY
                        File to read manual reference metadata. Specify
                        multiple times to load multiple files. Similar to
                        pandoc --bibliography.
  --no-infer-prefix     Do not attempt to infer the prefix for citekeys
                        without a known prefix.
  --allow-invalid-csl-data
                        Allow CSL Items that do not conform to the JSON
                        Schema. Skips CSL pruning.
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level for stderr logging

Pandoc filter

This package creates the pandoc-manubot-cite Pandoc filter, providing access to Manubot's cite-by-ID functionality from within a Pandoc workflow.

Options are set via Pandoc metadata fields listed in the docs.

usage: pandoc-manubot-cite [-h] [--input [INPUT]] [--output [OUTPUT]]
                           target_format

Pandoc filter for citation by persistent identifier. Filters are command-line
programs that read and write a JSON-encoded abstract syntax tree for Pandoc.
Unless you are debugging, run this filter as part of a pandoc command by
specifying --filter=pandoc-manubot-cite.

positional arguments:
  target_format      output format of the pandoc command, as per Pandoc's --to
                     option

options:
  -h, --help         show this help message and exit
  --input [INPUT]    path read JSON input (defaults to stdin)
  --output [OUTPUT]  path to write JSON output (defaults to stdout)

Other Pandoc filters exist that do something similar: pandoc-url2cite, pandoc-url2cite-hs, & pwcite. Currently, pandoc-manubot-cite supports the most types of persistent identifiers. We're interested in creating as much compatibility as possible between these filters and their syntaxes.
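
For a concrete picture of where the filter sits, here is a rough sketch of a pandoc invocation driven from Python; the filenames are placeholders, and whether you add --citeproc (Pandoc >= 2.11) or --filter=pandoc-citeproc (older Pandoc) depends on your Pandoc version:

# Sketch: run pandoc with the pandoc-manubot-cite filter via subprocess.
# Requires pandoc and pandoc-manubot-cite on PATH; manuscript.md is a placeholder.
import subprocess

subprocess.run(
    [
        "pandoc",
        "--filter=pandoc-manubot-cite",
        "--citeproc",  # use --filter=pandoc-citeproc on Pandoc < 2.11
        "--output=manuscript.html",
        "manuscript.md",
    ],
    check=True,
)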

Manual references

Manual references are loaded from the references and bibliography Pandoc metadata fields. If a manual reference filename ends with .json or .yaml, it's assumed to contain CSL Data (i.e. Citation Style Language JSON). Otherwise, the format is inferred from the extension and converted to CSL JSON using the pandoc-citeproc --bib2json utility. The standard citation key for manual references is inferred from the CSL JSON id or note field. When no prefix is provided, such as doi:, url:, or raw:, a raw: prefix is automatically added. If multiple manual reference files load metadata for the same standard citation id, precedence is assigned according to descending filename order.

Webpage

The manubot webpage command populates a webpage directory with Manubot output files.

usage: manubot webpage [-h] [--checkout [CHECKOUT]] [--version VERSION]
                       [--timestamp] [--no-ots-cache | --ots-cache OTS_CACHE]
                       [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Update the webpage directory tree with Manubot output files. This command
should be run from the root directory of a Manubot manuscript that follows the
Rootstock layout, containing `output` and `webpage` directories. HTML and PDF
outputs are copied to the webpage directory, which is structured as static
source files for website hosting.

options:
  -h, --help            show this help message and exit
  --checkout [CHECKOUT]
                        branch to checkout /v directory contents from. For
                        example, --checkout=upstream/gh-pages. --checkout is
                        equivalent to --checkout=gh-pages. If --checkout is
                        omitted, no checkout is performed.
  --version VERSION     Used to create webpage/v/{version} directory.
                        Generally a commit hash, tag, or 'local'. When
                        omitted, version defaults to the commit hash on CI
                        builds and 'local' elsewhere.
  --timestamp           timestamp versioned manuscripts in webpage/v using
                        OpenTimestamps. Specify this flag to create timestamps
                        for the current HTML and PDF outputs and upgrade any
                        timestamps from past manuscript versions.
  --no-ots-cache        disable the timestamp cache.
  --ots-cache OTS_CACHE
                        location for the timestamp cache (default:
                        ci/cache/ots).
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level for stderr logging

AI-assisted academic authoring

The manubot ai-revision command uses large language models from OpenAI to automatically revise a manuscript and suggest text improvements.

usage: manubot ai-revision [-h] --content-directory CONTENT_DIRECTORY
                           [--model-type MODEL_TYPE]
                           [--model-kwargs key=value [key=value ...]]
                           [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]

Revise manuscript content using AI models to suggest text improvements.

options:
  -h, --help            show this help message and exit
  --content-directory CONTENT_DIRECTORY
                        Directory where manuscript content files are located.
  --model-type MODEL_TYPE
                        Model type used to revise the manuscript. Default is
                        GPT3CompletionModel. It can be any subclass of
                        manubot_ai_editor.models.ManuscriptRevisionModel
  --model-kwargs key=value [key=value ...]
                        Keyword arguments for the revision model (--model-
                        type), with format key=value.
  --log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}
                        Set the logging level for stderr logging

The usual call is:

manubot ai-revision --content-directory content/

The parameters --model-type and --model-kwargs are mainly useful for debugging. For example, since the tool splits the text into paragraphs, you might want to check that paragraphs were detected correctly. The tool incurs a cost when using the OpenAI API, so checking this first can be worthwhile for text with a complicated structure.

manubot ai-revision \
  --content-directory content/ \
  --model-type DummyManuscriptRevisionModel \
  --model-kwargs add_paragraph_marks=true

Development

Environment

Create a development environment using:

conda create --name manubot-dev --channel conda-forge \
  python=3.11 pandoc=2.11.3.1
conda activate manubot-dev  # assumes conda >= 4.4
pip install --editable ".[webpage,dev]"

Commands

Below are some common commands used for development. They assume the working directory is set to the repository's root, and the conda environment is activated.

# run the test suite
pytest

# install pre-commit git hooks (once per local clone).
# The pre-commit checks declared in .pre-commit-config.yaml will now
# run on changed files during git commits.
pre-commit install

# run the pre-commit checks (required to pass CI)
pre-commit run --all-files

# commit despite failing pre-commit checks (will fail CI)
git commit --no-verify

# regenerate the README codeblocks for --help messages
python manubot/tests/test_readme.py

# generate the docs
portray as_html --overwrite --output_dir=docs

# process the example testing manuscript
manubot process \
  --content-directory=manubot/process/tests/manuscripts/example/content \
  --output-directory=manubot/process/tests/manuscripts/example/output \
  --skip-citations \
  --log-level=INFO

Release instructions

PyPI

This section is only relevant for project maintainers. GitHub Actions deploys releases to PyPI.

To create a new release, bump the __version__ in manubot/__init__.py. Then, set the TAG and OLD_TAG environment variables:

TAG=v$(python setup.py --version)

# fetch tags from the upstream remote
# (assumes upstream is the manubot organization remote)
git fetch --tags upstream main

# get previous release tag, can hardcode like OLD_TAG=v0.3.1
OLD_TAG=$(git describe --tags --abbrev=0)

The following commands can help draft release notes:

# check out a branch for a pull request as needed
git checkout -b "release-$TAG"

# create release notes file if it doesn't exist
touch "release-notes/$TAG.md"

# commit list since previous tag
echo $'\n\nCommits\n-------\n' >> "release-notes/$TAG.md"
git log --oneline --decorate=no --reverse $OLD_TAG..HEAD >> "release-notes/$TAG.md"

# commit authors since previous tag
echo $'\n\nCode authors\n------------\n' >> "release-notes/$TAG.md"
git log $OLD_TAG..HEAD --format='%aN <%aE>' | sort --unique >> "release-notes/$TAG.md"

After a commit with the above updates is part of upstream:main, for example after a PR is merged, use the GitHub interface to create a release with the new "Tag version". Monitor GitHub Actions and PyPI for successful deployment of the release.

Goals & Acknowledgments

Our goal is to create scholarly infrastructure that encourages open science and assists reproducibility. Accordingly, we hope for the Manubot software and philosophy to be adopted widely, by both academic and commercial entities. As such, Manubot is free/libre and open source software (see LICENSE.md).

We would like to thank the contributors and funders whose support makes this project possible.


manubot's Issues

Error getting citation data for doi:10.6084/m9.figshare.5346577

I am getting a "The resource you are looking for doesn't exist" error when trying to fetch one of the citations present in manubot-rootstock:

>>> from manubot.cite.util import citation_to_citeproc
>>> citation_to_citeproc("doi:10.6084/m9.figshare.5346577")
ERROR:root:Error fetching metadata for doi:10.6084/m9.figshare.5346577.
Invalid response from https://data.datacite.org/10.6084%2Fm9.figshare.5346577:
The resource you are looking for doesn't exist.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/username/anaconda3/envs/manubot/lib/python3.6/site-packages/manubot/cite/util.py", line 225, in citation_to_citeproc
    citeproc = citeproc_retriever(identifier)
  File "/home/username/anaconda3/envs/manubot/lib/python3.6/site-packages/manubot/cite/doi.py", line 46, in get_doi_citeproc
    raise error
  File "/home/username/anaconda3/envs/manubot/lib/python3.6/site-packages/manubot/cite/doi.py", line 42, in get_doi_citeproc
    citeproc = response.json()
  File "/home/username/anaconda3/envs/manubot/lib/python3.6/site-packages/requests/models.py", line 897, in json
    return complexjson.loads(self.text, **kwargs)
  File "/home/username/anaconda3/envs/manubot/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/home/username/anaconda3/envs/manubot/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/home/username/anaconda3/envs/manubot/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Linux Mint
Anaconda Python 3.6.
Manubot 0.2.2.

NCBI Citation Exporter only covers PubMed Central records

Currently, we use the NCBI Citation Exporter to retrieve bibliographic metadata for PubMed IDs:

https://github.com/greenelab/manubot/blob/855c8491d6e82c88dd126fda901e52c59c78b0d2/manubot/metadata.py#L51-L72

However, it turns out this service only includes records for articles in PubMed Central: see https://github.com/ncbi/citation-exporter/issues/3. In other words, you can query PubMed IDs, but you'll get "No resolvable IDs found" if the record isn't also in PMC.

Therefore, we should probably switch the existing function to get_pmc_citeproc and create a new function for get_pubmed_citeproc that calls the E-utilities API and creates a CSL item.

If the NCBI Citation Exporter team expands their scope to include all PubMed records, this would no longer be necessary. Will wait for potential developments in https://github.com/ncbi/citation-exporter/issues/3.

Document standalone citation processing

The Manubot package can be broadly useful for citation processing even if users do not need the full functionality. We could document how to map citation identifiers to CSL JSON items with standalone examples like the test cases in test_citations.py.

See greenelab/deep-review#886 for another potential use case.

Well defined transformation pipeline for citation identifiers

We have several types of citation identifiers that potentially require different processing pipelines:

  1. citations extracted from a manuscript
  2. citations extracted from a manual references file. This category could be subdivided based on what CSL JSON field the citation identifier was extracted from
  3. citations passed to the manubot cite command

We should think about which processing steps we want to apply in each case. Ideally, we could have modular functions that we pipe together to make it clear and straightforward what transformations will be applied.

Furthermore, we should consider a common nomenclature to use for identifiers based on which transformation steps have been applied. #113 improves the nomenclature but more is possible.

TypeError: argument of type 'WindowsPath' is not iterable

I got the following error when running pytest:

>           needquote = (" " in arg) or ("\t" in arg) or not arg
E           TypeError: 'WindowsPath' object is not iterable
TypeError: argument of type 'WindowsPath' is not iterable
...\Miniconda3\envs\manubot-dev\lib\subprocess.py:461: TypeError

I fixed this by converting path to str:
Line 461: needquote = (" " in str(arg)) or ("\t" in str(arg)) or not str(arg)
Line 465: for c in str(arg):

.. according to this suggestion.

My programming background is pretty limited and I am not sure whether this solution fits for all, thus I am sharing it here (and not directly committing).
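
For what it's worth, the usual caller-side workaround is to convert pathlib.Path arguments to strings before handing them to subprocess, rather than patching subprocess.py; a minimal sketch:

# Sketch: cast Path objects to str when building the argument list,
# so Windows command-line quoting in subprocess works as expected.
import subprocess
from pathlib import Path

args = ["pandoc", "--output", Path("output") / "manuscript.html", "manuscript.md"]
subprocess.run([str(arg) for arg in args], check=True)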

Retrieving citation metadata for Wikidata scholarly article records

Currently, Manubot can produce CSL bibliographic metadata from several resources including DOIs, PubMed, PubMed Central, and arXiv. I was talking to @Daniel-Mietchen at FORCE2018 and the possibility of using Wikidata as a database of metadata came up.

The big benefit is Wikidata provides a single database that anyone can edit. This would be most helpful for articles that aren't in the aforementioned databases or are but have incorrect metadata. Currently, users override faulty metadata in manual-references.json. However, each Manubot instance must repeat the same process of adding the corrected metadata. Wikidata could potentially act as a singular resource to avoid this repetition.

As an example "scholarly article" record on Wikidata, see https://www.wikidata.org/wiki/Q18507561. @Daniel-Mietchen also mentioned WDscholia/scholia#444 (comment) by @fnielsen.

So in short, the question is what's the best way for our Python utility to retrieve metadata for a Wikidata record? We will have to transform any output into CSL JSON, which has a different schema than Wikidata articles. Are there multiple Wikidata APIs to choose from?
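
As a starting point for experimentation, the raw entity JSON is available without authentication via Special:EntityData; the sketch below only reads the English label, and mapping the full record onto CSL JSON remains the open question:

# Sketch: fetch the entity JSON for a Wikidata item and print its English label.
# Converting the claims into CSL JSON fields is not shown here.
import requests

qid = "Q18507561"
url = f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
entity = requests.get(url).json()["entities"][qid]
print(entity["labels"]["en"]["value"])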

Remove extraneous fields from CSL references

Some of our methods to generate CSL items (reference metadata) produce many extraneous fields. This is most acute for Crossref DOIs, which contain many fields in addition to those part of the CSL specification. Here are some examples:

    "content-domain": {
      "domain": [],
      "crossmark-restriction": false
    },
    "link": [
      {
        "URL": "https://syndication.highwire.org/content/doi/10.1126/science.aaf5675",
        "content-type": "unspecified",
        "content-version": "vor",
        "intended-application": "similarity-checking"
      }
    ],

This causes our CSL references.json file to be unnecessarily large. Some of the fields are informative, but they are unnecessary for the purpose of creating the bibliography.

See the machine-readable schema for CSL here, which includes what fields are supported.

My proposal would be to filter fields using the CSL Data Schema (possibly as an optional step, but enabled by default). Potentially there would even be a way to automatically detect and delete fields that don't conform to the schema, to avoid downstream issues.
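
A rough sketch of what schema-based pruning could look like, assuming the CSL Data JSON Schema has been downloaded to a local file (the real schema describes an array, so the per-item fields live under items/properties):

# Sketch: drop CSL item fields that are not declared in the CSL Data schema.
# The schema filename is a placeholder for a locally downloaded copy.
import json

with open("csl-data.json") as read_file:
    allowed_fields = set(json.load(read_file)["items"]["properties"])

def prune_csl_item(csl_item: dict) -> dict:
    return {key: value for key, value in csl_item.items() if key in allowed_fields}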

shortDOI citation support

Supporting citations of shortDOIs could be a nice feature. I see the major benefits as saving characters for excessively long DOIs and enabling a way to directly cite DOIs with forbidden characters such as 10.1016/S0933-3657(96)00367-3. At least one user has attempted this workaround in Benjamin-Lee/deep-rules@c76ee4d / Benjamin-Lee/deep-rules#117 (comment).

It seems like we could support a few different syntaxes:

  1. @doi:10/b6vnmd
  2. @doi:b6vnmd
  3. @shortdoi:b6vnmd
  4. @shortdoi:10/b6vnmd

I didn't see much shortDOI documentation online, but there is some provided when viewing a shortened DOI result:

Your request was processed. The previously-created shortcut for 10.1016/S0933-3657(96)00367-3 is the handle:
10/b6vnmd
The shortcut HTTP URI is:
http://doi.org/b6vnmd
This shortcut will return the same results as http://dx.doi.org/10.1016/S0933-3657(96)00367-3, and doi:10/b6vnmd can be used in place of doi:10.1016/S0933-3657(96)00367-3.

Given the documentation, it seems that option 1 is the most canonical method. However, method 3 & 4 could help avoid user confusion.


Note that DOI content negotiation for crossref DOIs appears to work:

curl --location --header "Accept: application/vnd.citationstyles.csl+json" https://doi.org/b6vnmd

Citation strings that are segments of other citation strings can cause errors

The following source:

To coordinate this effort, we developed a manuscript writing process using the Markdown language, the GitHub software development platform [@url:https://github.com/greenelab/deep-review/], and our new Manubot tool [@url:https://github.com/greenelab/manubot-rootstock @url:https://github.com/greenelab/manubot] for automating manuscript generation.

was converted by manubot to:

To coordinate this effort, we developed a manuscript writing process using the Markdown language, the GitHub software development platform [@1Dv0Jpu5J], and our new Manubot tool [@cTN2TQIL-rootstock @cTN2TQIL] for automating manuscript generation.

See how @url:https://github.com/greenelab/manubot-rootstock became @cTN2TQIL-rootstock because the citation id for a citation string that is a prefix of it was substituted.

Automatically generating JSON CSL (citation metadata) for legal cases

Currently, we support citation of DOIs, arXiv IDs, PubMed IDs, and URLs. For the Sci-Hub paper it would be nice to cite legal documents. Currently, if we wanted to cite a legal document, we'd do it by URL. Hopefully Greycite could dig up correct metadata. If not we'd have to manually create the JSON CSL.

The first question is do cases and their documents have standard identifiers? Let's focus on U.S. cases. The case I'm interested in is 1:15-cv-04282-RWS (Elsevier Inc. et al v. Sci-Hub et al). PlainSite has a nice landing page for this case with PDFs for some documents available. It appears to me that there is a government database called PACER. You must pay to get documents out of PACER. However, since the documents are public domain, anyone can distribute them freely. Now, I don't think we can use Greycite with PACER URLs, since PACER is login-walled. Is there a public database with pages for all US court documents?

@fbennett I saw you wrote about CSL citation of legal cases in 2011. Any insights? FYI, see the deep review to get an idea of what this codebase intends to enable. In short, citing by standard identifiers that trigger automated reference/bibliography creation.

Some APIs:

Update. See the court listener page for the Sci-Hub case and OASIS Legal Citation Markup (LegalCiteM) TC.

Court Listener has pages for individual documents / entries. For example, https://www.courtlistener.com/docket/4355308/1/elsevier-inc-v-sci-hub/. For now I will use these URLs as the unique identifier for cases.

Support citation key aliases

I've built a similar workflow based on Pandoc, make, citation-js, the command line script wcite and its Pandoc filter pwcite. [p]wcite supports citation key aliases in the special Pandoc metadata field citekeys. Here is a minimal example document:

---
citekeys:
  Vrand04: Q18507561
...

Wikidata is a collaborative knowledge base [@Vrand04].

In manubot Markdown this would need to be written as:

Wikidata is a collaborative knowledge base [@wikidata:Q18507561].

Apart from the prefix (this is another issue), manubot should also be able to process this document:

---
citekeys:
  Vrand04: wikidata:Q18507561
...

Wikidata is a collaborative knowledge base [@Vrand04].

This would need to be documented at https://github.com/manubot/rootstock/blob/master/USAGE.md#citations

URL to citation

In kipoi/website#77 @dhimmel and @Avsecz discussed a potential new Manubot feature. Per @dhimmel:

One potential Manubot feature that could be helpful would be "URL to citation" that could perform the following conversion:

https://doi.org/10.1101/gr.227819.117 ⟶ doi:10.1101/gr.227819.117
http://dx.doi.org/10.1101/gr.227819.117 ⟶ doi:10.1101/gr.227819.117
https://doi.org/gd54x8 ⟶ doi:10.1101/gr.227819.117
https://arxiv.org/pdf/1603.09123.pdf ⟶ arxiv:1603.09123

Would also hopefully work with other types of URLs.
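
A sketch of what such a conversion could look like for a couple of the URL patterns above; the function name and coverage are hypothetical, and shortDOI URLs like https://doi.org/gd54x8 would need extra handling:

# Sketch: convert a few common URL forms into Manubot citekeys.
# Coverage is intentionally minimal; unrecognized URLs fall back to url:.
import re

def url_to_citekey(url: str) -> str:
    match = re.match(r"https?://(dx\.)?doi\.org/(?P<doi>10\..+)", url)
    if match:
        return "doi:" + match.group("doi")
    match = re.match(r"https?://arxiv\.org/(abs|pdf)/(?P<arxiv_id>[\d.]+?)(\.pdf)?$", url)
    if match:
        return "arxiv:" + match.group("arxiv_id")
    return "url:" + url

print(url_to_citekey("https://doi.org/10.1101/gr.227819.117"))  # doi:10.1101/gr.227819.117
print(url_to_citekey("https://arxiv.org/pdf/1603.09123.pdf"))   # arxiv:1603.09123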

Specifying --template-variables-path multiple times

Presently, you can pass manubot a JSON file via the --template-variables-path argument to use for templating. To retain provenance, one preferred way to use this argument is to pass a versioned URL (e.g. with a git commit hash) that points to the JSON file.

However, a single manuscript likely draws from many analyses, which may each have their own JSON outputs. Would it make sense to be able to specify --template-variables-path multiple times to load multiple JSON files for templating?

I think so. My main question is whether each JSON should be namespaced. For example,

--template-variables-path analysis_a=path_a.json
--template-variables-path analysis_b=path_b.json

Namespacing makes it safe for multiple JSON files to contain the same keys. Or should we forgo namespacing (which would be backwards compatible) but risk potential collisions? Namespaces would also add an extra level when specifying a template variable, resulting in longer source.

@agitter what do you think?

Open built HTML and PDF automatically

It'd be great to automatically open the .html output in the user's default browser upon completion of the build. I'm sure this should be fairly easy to accomplish in Python somehow.

Even better would be to detect if it is already open in a tab, and automatically refresh that tab. That way, the user could get an essentially auto-updated output preview by doing a "watch" build, and keeping Chrome/Firefox open on the side to see the updated file.
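
A minimal sketch of the first part using only the standard library (the output path follows the Rootstock layout); live-reloading an already open tab would need more machinery:

# Sketch: open the built HTML manuscript in the user's default browser.
import webbrowser
from pathlib import Path

webbrowser.open(Path("output/manuscript.html").resolve().as_uri())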

Automatically populating PMID & PMCID fields when generating CSL items for DOIs

Would it be nice for us to automatically fill in PMID and PMCID for DOI CSL items, when possible?

For example, we'd also probably change the default bibliographic style to be like:

Sci-Hub provides access to nearly all scholarly literature
Daniel S Himmelstein, Ariel Rodriguez Romero, Jacob G Levernier, Thomas Anthony Munro, Stephen Reid McLaughlin, Bastian Greshake Tzovaras, Casey S Greene
eLife (2018-03-01) https://doi.org/ckcj
DOI: 10.7554/elife.32822 · PMID: 29424689 · PMCID: PMC5832410

I am thinking this could be helpful for some journals which like to show all IDs when possible. It could also be generally useful and make readers more cognizant of IDs.
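
One possible source for the mapping is the NCBI PMC ID Converter service; the endpoint and response field names below are assumptions based on that API and should be verified before relying on them:

# Sketch: look up the PMID and PMCID for a DOI via the PMC ID Converter API.
# The tool/email values are placeholders; confirm the endpoint and fields first.
import requests

params = {
    "ids": "10.7554/elife.32822",
    "format": "json",
    "tool": "manubot",
    "email": "user@example.com",
}
url = "https://www.ncbi.nlm.nih.gov/pmc/utils/idconv/v1.0/"
record = requests.get(url, params=params).json()["records"][0]
print(record.get("pmid"), record.get("pmcid"))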

Expand flexibility of manual citations

Currently, users can supply manual bibliographic metadata in manual-references.json, which must consist of CSL JSON where each item has an extra standard_citation field.

This can be restrictive in a couple ways. Oftentimes, users may have existing bibliographic metadata in an alternative format, such as .bib files. Additionally, since standard_citation is not a standard CSL JSON field, users must always modify the manual metadata from however it was generated. Finally, when wanting to do a raw citation, they must place raw: before their citation id.

I propose the following changes to enable more input formats than just CSL JSON and relax how standard_citation must be specified:

  1. use pandoc-citeproc --bib2json to convert bibliographies from a variety of formats to CSL JSON
  2. allow specifying multiple manual reference metadata files, possibly of varying types. For example, a user could pass a .bib file as well as a CSL file.
  3. if standard_citation is not specified, fallback to using id to set standard_citation. id is built into CSL JSON and .bib files will not be able to set standard_citation (only id).
  4. when inferring standard_citation from id, if no citation source prefix is supplied (e.g. doi:, pmid:), assume raw: citation.

These features would help address feedback @tpoisot provided us, where his group would like to use existing bibliographic metadata files without any special edits in a Manubot based workflow. @agitter and others, what do you think?

Some of this work is implemented in #99, which proposes a pandoc-filter for cite-by-id. However, I may split it out into a distinct pull request.

Choose service for docs generation

We should choose a service/product to help us write and generate the docs for all of Manubot. It can be a full service that manages the writing of the docs and also hosts it (like readthedocs) or it can just be some kind of tool that generates a static docs website from some simple markdown.

Here are the most popular ones, it seems:

readthedocs
sphinx
gitbook

We could even just do it "manually" with Jekyll, and it might not be much more difficult than a docs-specific generation tool.

ISBN citation of books

While many scholarly books have DOIs, there are lots of books out there with just ISBNs. As a standardized identifier, they seem appropriate for Manubot book citations. We discussed the issue in greenelab/deep-review#387 and found several ISBN lookup services.

However, none of them are aware of the 2012 Open Access book, indicating poor coverage.

Update: I think 9780262302524 is the wrong ISBN, but was listed on the MIT Press website.

@agapow commented:

My impression of worldcat and isbndb is that there's quite a lag before titles get listed (and they may never get listed). Not a solution, but we probably need another way.

So the purpose of this issue is to note that ISBN citation would be valuable and to track any potential solutions.

Add spell-checker

Continuous integration spell-checking would be useful. You'd need a project custom dictionary for words that the spell-checker doesn't know.

pandoc-citeproc fails on empty issued date-parts (duplicate)

I'm getting

Error parsing references: Could not parse RefDate
Error running filter pandoc-citeproc:
Filter returned error status 1
Traceback (most recent call last):
  File "/home/robert/openclimatedata/global-emissions/venv/bin/manubot", line 11, in <module>
    sys.exit(main())
  File "/home/robert/openclimatedata/global-emissions/venv/lib/python3.6/site-packages/manubot/command.py", line 141, in main
    function(args)
  File "/home/robert/openclimatedata/global-emissions/venv/lib/python3.6/site-packages/manubot/cite/cite_command.py", line 98, in cli_cite
    format=args.format,
  File "/home/robert/openclimatedata/global-emissions/venv/lib/python3.6/site-packages/manubot/cite/cite_command.py", line 60, in call_pandoc
    process.check_returncode()
  File "/usr/lib/python3.6/subprocess.py", line 369, in check_returncode
    self.stderr)
subprocess.CalledProcessError: Command '['pandoc', '--filter', 'pandoc-citeproc', '--output', '-', '--to', 'markdown_strict', '--wrap', 'none']' returned non-zero exit status 83.

for

manubot cite --render --format markdown doi:10.5194/gmd-11-369-2018-supplement

Only running cite gives:

[
  {
    "publisher": "Copernicus GmbH",
    "DOI": "10.5194/gmd-11-369-2018-supplement",
    "source": "Crossref",
    "issued": {
      "date-parts": [
        []
      ]
    },
    "URL": "https://doi.org/gfddtf",
    "id": "GbPj2EzP",
    "type": "entry"
  }
]

Maybe it could fail more gracefully for such (obviously lacking) data.

Setting time zone to get the correct build date

I'm getting...

"This manuscript was automatically generated from xxx@355c47e on August 6, 2017."

...for dates in the future (today is August 5th, to be clear). I was particularly confused because builds around 1 PM local time today reported tomorrow's date. From what I can gather, the date template in the front matter is replaced by

# Add date to metadata
today = datetime.date.today()
metadata['date-meta'] = today.isoformat()
stats['date'] = today.strftime('%B %e, %Y')

...which I assume is using the time zone set by the system. I think we can fix this by setting

before_install:
- export TZ=Australia/Canberra

...in travis.yaml but I haven't checked. However, it remains unclear which time zone to set. Ideally there could be a way to set the container time zone from the GitHub/Travis account. What are your thoughts?
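
One way to avoid depending on the CI machine's local clock is to compute the date in an explicit zone inside the Python code itself; a sketch assuming Python >= 3.9 for zoneinfo (the zone name is just an example):

# Sketch: build-date computation pinned to an explicit time zone.
import datetime
from zoneinfo import ZoneInfo

today = datetime.datetime.now(ZoneInfo("Australia/Canberra")).date()
metadata = {"date-meta": today.isoformat()}
stats = {"date": today.strftime("%B %e, %Y")}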

Raw citation option to directly cite a CSL entry in manual-references.json

Currently, you can manually specify CSL for a reference in manual-references.json. However, the reference still has to be identified in terms of a standard identifier, such as a URL, DOI, or other resource ID.

This hasn't been a huge issue since most references have a URL of some sort. However, that may not always be the case, such as a personal communication or some physical resource. This issue proposes adding a way to add a "raw" citation that gets directly looked up in manual-references.json. Raw citations must have their CSL defined in manual-references.json.

One reason we didn't add this feature initially is that we wanted to encourage citing things by their standard identifiers. We didn't want users to fall back to citing things without standard identifiers. However, I think the benefits of raw citations outweigh the drawbacks. In addition to allowing citation of URL-less records, it could help us with testing, where we want to evaluate citation of many different types of CSL entries in manual-references.json.

Two citation string implementations come to mind: @raw:raw-id-here or @raw-id-here. The manual-references.json CSL would then have to set id: raw-id-here or standard_citation: "raw:raw-id-here".

@agitter what do you think?

requests-cache does not work fully

This build is for greenelab/scihub-manuscript@ace58a1, which does not change any citations. However, all the DataCite DOI requests failed:

Citeproc retrieval failed for:
doi:10.5061/dryad.q447c/1
doi:10.5061/dryad.q447c/2
doi:10.5281/zenodo.472493
doi:10.6084/m9.figshare.1186832.v23
doi:10.6084/m9.figshare.4542433.v6
doi:10.6084/m9.figshare.4816720.v1
doi:10.6084/m9.figshare.5231245.v1

One of the failing API requests responded:

<h2>This website is under heavy load (queue full)</h2><p>We're sorry, too many people are accessing this website at the same time. We're working on this problem. Please try again later.</p>

The real issue here is why this API call is being made in the first place. requests-cache should be caching these calls, which succeeded in the past.

However, we see these two lines in the log:

requests-cache starting with 0 cached responses
requests-cache finished with 113 cached responses

Maybe the Travis CI cache got reset here. There is also another issue I've noticed where the number of cached requests far exceeds the number of citations, indicating that some requests are being performed and cached multiple times.

`get_citation_strings` seems to be incorrectly matching non-citations with `@`

I am having a problem with non-citations containing @ being matched and then failing. This is surprising because the citation_pattern regex does not match my string (correctly).

I have a file 00-tmp.md that contains the single line:

parm@Frosst is a force field.

If I run manubot:

$ manubot --content-directory=./ --output-directory=./
## ERROR
Citation not splittable: @Frosst

As expected, if I delete the @ sign then this goes away.

I think that is_valid_citation_string is being called, stripping the @ then looking to split at : which isn't in my string, hence the error. Here's the surprising part (to me). The things sent to is_valid_citation_string come from looking for the citation_pattern regex in the text and my string shouldn't match the regex, right?

https://github.com/greenelab/manubot/blob/27456594c407f3fed35804db4bcf3b5bba88a716/manubot/manuscript.py#L11-L17

Passing PMCID to PMID citation parser

I wanted to use Manubot to cite PMCID PMC6063279 and instead received a citation for PMID 6063279. I had accidentally written

manubot cite pmid:PMC6063279

Apparently the Entrez eutils takes the PMCID, strips PMC, and then returns the XML for that numeric PMID. This seems like an error on the part of eutils. Do we want to try to prevent this type of user error? Or should we instead leave the Manubot behavior but notify the eutils maintainers?
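
If we do want to guard against it, a lightweight check before querying eutils could look like this sketch (a hypothetical helper, not the package's current behavior):

# Sketch: reject accessions that look like PMCIDs when a PMID is expected.
def validate_pmid(accession: str) -> str:
    if accession.upper().startswith("PMC"):
        raise ValueError(
            f"{accession!r} looks like a PMCID; use pmc:{accession} instead of pmid:{accession}"
        )
    if not accession.isdigit():
        raise ValueError(f"PMIDs must be numeric, got {accession!r}")
    return accession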

Enhancement: Journal widgets

@dhimmel and @zietzm and I were discussing this over lunch. We could have convenient javascript widgets/embeds that people could embed on their websites that show some information about a manubot repository manuscript (or multiple).

If you're not familiar with the concept of widgets, take a look at the soundcloud one.

It could look something like this (very rough mockup image, not reproduced here).

There could be multiple sizes/types for different contexts and needed brevity. The theme could also be easily customizable to suit the user's particular webpage.

To use it, it would just be a small script that you'd copy and paste into your website code (like the google analytics one) with a manubot repository as an argument, and then would scrape the relevant or desired metadata from that repository, and display it nicely in a badge inline on the webpage.

This could be useful for individuals wanting to show off their new manubot manuscript, say in a blog post. It could even be useful for organizations trying to start their own journals, who want a pre-built widget that takes a list of paper repos as an argument and shows a searchable list of paper badges.

Something to think about in the long term future.

Manubot cite: ability to output formatted bibliography

Suggested by @slochower in #42 (comment):

A super handy addition would be an option to output a formatted citation using the built-in style.csl. One challenge is that we currently leverage Pandoc to tie together the JSON for the references and the CSL. I suppose we could make a wrapper that writes the JSON to a file, writes a temporary Markdown file with the citation ids, calls Pandoc to render the temporary file, and then prints the rendered Markdown to stdout. That may be too clumsy to implement, though.

Brilliant! I agree this would be super useful, for cases ranging from prototyping to just wanting to generate a reference.

One option would be to use citeproc-py. I am also guessing there is a pandoc / pandoc-citeproc solution that is more direct than having to mess with creating a source markdown file. However, that would create a non-python dependency, which may be difficult for many casual users.

manubot cite date error with PMID citation

I get an error with the following command

manubot cite pmid:29028984

Error:

conda\envs\manubot-dev\lib\site-packages\manubot\cite\pubmed.py", line 177, in extract_publication_date_parts
    date_parts.append(month_abbrev_to_int[month])
KeyError: '03'

This is manubot v0.2.0 from PyPI in the manubot-dev environment. I didn't try debugging yet.

@dhimmel can you reproduce the error?
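
The traceback suggests the month arrived as a zero-padded number rather than an abbreviation; a tolerant parser could accept both, as in this sketch:

# Sketch: accept numeric months ("03") as well as abbreviations ("Mar").
month_abbrev_to_int = {
    "Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
    "Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

def month_to_int(month: str) -> int:
    return int(month) if month.isdigit() else month_abbrev_to_int[month]

print(month_to_int("03"), month_to_int("Mar"))  # 3 3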

bioRxiv CSL metadata does not set container-title

As I noted in greenelab/meta-review#18, bioRxiv references show Cold Spring Harbor Laboratory instead of bioRxiv in the reference list. This is because the JSON file from CrossRef has "container-title":[].

I presume bioRxiv supplies the citation information for the DOI. Would it be worthwhile to contact them to see if we can correct the references by fixing the upstream data?
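
In the meantime, a client-side fix-up along these lines could paper over the missing field; the DOI-prefix check is a heuristic, not how Manubot currently handles this:

# Sketch: set container-title for bioRxiv preprints when Crossref leaves it empty.
def fix_biorxiv_container_title(csl_item: dict) -> dict:
    is_biorxiv = csl_item.get("DOI", "").startswith("10.1101/")
    if is_biorxiv and not csl_item.get("container-title"):
        csl_item["container-title"] = "bioRxiv"
    return csl_item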

Citation metadata retrieval using Zotero / MediaWiki infrastructure

Was talking with @adam3smith at FORCE2018. He is an expert with CSL styles, having written hundreds himself.

We were discussing metadata retrieval and he discussed the work going at Zotero, which is bundled into https://github.com/zotero/translation-server. MediaWiki also has a derivative of this, updated less frequently, called Citoid.

translation-server can extract bibliographic metadata from many types of resources / URLs. However, there is no public instance (we'd have to run one). Citoid doesn't update as frequently but has a public API.

No immediate plans to use this infrastructure, but wanted to jot these notes down before I forget. Thanks @adam3smith for all the amazing info!

Migrating repo to the manubot organization

We are now in possession of the @manubot account, which will be the organization where we want to move all Manubot related repositories.

However, we have to be careful that we migrate repos from the @greenelab to @manubot organization in such a way that is not disruptive.

See the GitHub docs on Transferring a repository:

If the transferred repository contains a GitHub Pages site, then links to the Git repository on the Web and through Git activity are redirected. However, we don't redirect GitHub Pages associated with the repository.

All links to the previous repository location are automatically redirected to the new location. When you use git clone, git fetch, or git push on a transferred repository, these commands will redirect to the new repository location or URL. However, to avoid confusion, we strongly recommend updating any existing local clones to point to the new repository URL. You can do this by using git remote on the command line:

Parsing of URL reference appears to mangle specific URL

I tried to cite [@url:https://openforcefield.org].

The bibliography entry is rendered with mangled metadata (screenshot not reproduced here).

In references.json, I see:

  {
    "type": "webpage",
    "title": "Open Force Field Initiative",
    "abstract": "An open source, open science, and open data approach to better force fields",
    "URL": "//openforcefield.org/",
    "language": "en-us",
    "author": [
      {
        "family": "newpixcom",
        "given": ""
      }
    ],
    "accessed": {
      "date-parts": [
        [
          "2019",
          4,
          19
        ]
      ]
    },
    "id": "OhpH7vfg"

It seems that the URL field is being incorrectly parsed.

(There may be misconfiguration on their end, I'm not sure, but I think that Manubot's processing of the URL is also incorrect.)

Unrecognized command line arguments: --log-level must be specified before subcommand

I'm mystified why this is happening, but I tried running manubot commit 33e512d with the meta-review manuscript repository and I'm getting a --log-level error. I've no idea why this is happening and at this point, I'd like to know if this is reproducible for others. I can run the command without --log-level.

$ manubot process   \
--content-directory=content \
--output-directory=output  \
--template-variables-path=analyses/deep-review-contrib/deep-review-stats.json  \
--cache-directory=ci/cache   \
--log-level=INFO

usage: manubot [-h] [--version]
               [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}]
               {process,cite} ...
manubot: error: unrecognized arguments: --log-level=INFO

General feedback tidbits

I recently used Manubot to write a fairly complex document, where I asked for feedback from many people and went through many versions, and it illuminated a few pain points. I'm listing my overall impressions here, although an Issue may not be the best venue to discuss these items. Many of the issues here are not Manubot's fault, but they might crop up when people use Manubot in similar circumstances, and therefore it could be useful to know about and document them.

  • Because I was sensitive to the overall length of my document, I decided to use LaTeX to generate a PDF instead of wkhtmltopdf. I spent about 4 hours (across a few weeks) working on the template. Although I'm familiar with LaTeX, I had to do a lot of reading to figure out where to place the variables that pandoc expected -- this was frustrating but is not really our problem. A second, more insidious problem, was that Manubot (I think it was Manubot, but I didn't fully debug this) used a unicode character in one of the citations (maybe it was ö), which led to mysterious pdflatex failures until I switched to xelatex (again necessitating messing with the template). That took at least 2 hours to really figure out what was going on. Another downside is the long compilation times (30 seconds - 1 minute).

  • I never reliably got LaTeX working on Travis. I think this was a font problem. With pdflatex, I loaded a Helvetica clone via \usepackage{helvet} but in xelatex I loaded the system version of Helvetica. I wasn't sure how I could access Helvetica on Travis and was pretty sure I shouldn't put it there, so builds always failed. I could have probably done something different, like just use another font, but it wasn't worth the trouble for me. It was also a pain figuring out which LaTeX packages to install on Travis and I don't think this will work out of the box unless someone installs texlive-full on Travis.

  • I sunk a few hours into using the Word export option and Word templates before I abandoned the effort. At one point, I tried writing a Lua filter that would call a Python script to convert SVGs to PNGs for the Word export, but in the end the whole thing was super fragile and didn't work very reliably. That was frustrating, but again is not a Manubot problem per se. I also wanted figure wrapping, which didn't work in Word.

  • Sometimes, while sending around versions for feedback to different people, I wanted to highlight a particular sentence or phrase. Bold and italics worked fine in the Markdown and persisted in the PDF and HTML copies, but I ran into a bit of trouble if I wanted to highlight in red. (I had already used bold and italics in my document as formatting choices for headings, so asking someone to look at "the bold phrase" was ambiguous). I could use something like \textcolor{red}{text} and that would work in the PDF output but would be completely absent in the HTML output. Then I discovered "native" AST spans, which are handy but require some customization of the output formats (e.g., have a CSS rule to detect and color the text).

  • Along those lines, it is not easy to divide the output with Manubot. For example, sometimes I just wanted to send one section of my document for someone to review (I split each section into its own Markdown document). I could send them the single, raw Markdown file, but there is no easy way to tell Manubot to only render file 01.introduction.md and 08.review-this-section-please.md or whatever.

  • Integrating feedback from people was not ideal. Many people used annotations or stickies or something like that to markup the PDF versions. I usually ended up retyping what they said. It would be great to have a link on the HTML version that could make a repository Issue with the selected text. I think we talked about this at one point, but I can't find it now, perhaps it can be integrated with hypothesis.

  • One great feature of Manubot was that I was able to write a core text and then make a new branch for each customization I needed.

  • Ran into the lack of @isbn identifier, which we've discussed before.

  • I used a more compact CSL (superscript numbers) but ran into inconsistent journal names -- inconsistent abbreviations and inconsistent punctuation. This is something I've seen with all the other citation managers too, unfortunately.

  • Many people like to see the changes since the last time they looked at the document. I spent some time working on a way to compute diffs, extending what I worked on earlier. Briefly, I'd check out an older version of the manuscript using git, compile that to intermediary .tex files, then run latexdiff. Unfortunately, big edits (especially ones with replaced figures) resulted in confusing and ugly diffs. They were basically only useful to see and think -- okay, this section has changed a lot. I could probably further tune with latexdiff, but I didn't have the inclination. I would like a smooth way to automatically build diffs between any arbitrary versions in the future, though.

Overall, writing with Manubot was reasonably smooth and enjoyable, but I can say that it definitely involved more tinkering than I would have spent using Word or Overleaf, simply because I have existing templates for those systems and I know many of their quirks. However, I think the issues with Manubot and particularly with the pandoc integration will go down over time. On a personal level, I found that using semantic line feeds helped me focus on making each sentence better but led me to create paragraphs that didn't flow as well. I also found myself compiling to PDF or HTML often to check how the document looked.

Setting up a Zotero translation-server instance for citation metadata

In #70, we discussed using some of Zotero's infrastructure for generating citation metadata. Specifically, the translation-server, which has translators for many webpages.

I touched base with the translation-server community in zotero/translation-server#51 and there didn't seem to be any public instances. Therefore, we could either host a public instance for Manubot users or have Manubot users set up their own instances. While installing the nodejs package is not difficult, it does create a node dependency in an otherwise Python repository.

Therefore @dongbohu and I decided to spin up a translation-server at https://translate.manubot.org. It's currently hosted in a Google Cloud instance and we're still finalizing the setup. I'll let @dongbohu comment with the configuration details. While it'd be great to have a reproducible way to exactly configure this instance (perhaps via Terraform), we opted for the convenience of a traditional instantiation process.

Bibliographic metadata for URLs: alternatives to Greycite

We currently use Greycite to retrieve metadata for URL citations:

https://github.com/greenelab/manubot/blob/855c8491d6e82c88dd126fda901e52c59c78b0d2/manubot/metadata.py#L75-L102

Greycite does a good job identifying metadata (authors, date, title, publisher), but has frequent outages. Greycite has been down for several weeks due to a System Error. When this happens, we fall back to creating CSL metadata with just the URL.

Unfortunately, Greycite does not have an open source codebase and is only minimally maintained. Some details of its operations are available as a preprint. Therefore, we'd like to explore alternatives.

We should see whether we can get away without any external dependencies. Perhaps we can create a Python module, executed at runtime with its results cached, that replicates Greycite's functionality.
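
As a rough illustration of what such a module might do (this is a sketch, not an implemented design), it could read common HTML meta tags -- citation_title, og:title, citation_author, and so on -- and map them onto minimal CSL fields:

    # Sketch: derive minimal CSL JSON for a URL from common HTML <meta> tags,
    # as a stand-in for Greycite. The tag coverage here is illustrative only.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class MetaTagParser(HTMLParser):
        def __init__(self):
            super().__init__()
            self.meta = {}

        def handle_starttag(self, tag, attrs):
            if tag != "meta":
                return
            attrs = dict(attrs)
            key = attrs.get("name") or attrs.get("property")
            if key and "content" in attrs:
                self.meta.setdefault(key, attrs["content"])

    def url_to_csl(url):
        parser = MetaTagParser()
        with urlopen(url) as response:
            parser.feed(response.read().decode("utf-8", errors="replace"))
        meta = parser.meta
        csl = {"type": "webpage", "URL": url}
        title = meta.get("citation_title") or meta.get("og:title")
        if title:
            csl["title"] = title
        if "citation_author" in meta:
            csl["author"] = [{"literal": meta["citation_author"]}]
        return csl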

Do not extract or replace citation strings inside code or code_blocks

Right now our manuscript parsing implementation is unaware of code and code blocks. As a result, citation strings inside of code or code blocks will be extracted and replaced. This is not the desired behavior -- text formatted as code should not get modified.

For example:

@tag:citation should get extracted and replaced.
However, `@tag:citation` should not.

To solve this issue, we need to be able to easily separate parts of the source markdown that are code or code blocks.
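
A minimal sketch of one possible approach: mask fenced code blocks and inline code spans before searching for citation tags, so their offsets are preserved but their contents are never matched. The regular expressions here are simplified and would not cover every Markdown edge case:

    # Sketch: find citation tags while ignoring code blocks and inline code spans.
    # The regexes are simplified; a robust solution would use a real Markdown parser.
    import re

    CODE_PATTERN = re.compile(
        r"```.*?```"      # backtick-fenced code blocks
        r"|~~~.*?~~~"     # tilde-fenced code blocks
        r"|`[^`\n]+`",    # inline code spans
        re.DOTALL,
    )
    CITATION_PATTERN = re.compile(r"@[a-zA-Z0-9][\w:.#$%&\-+?<>~/]*")

    def extract_citations(markdown):
        # Blank out code regions so the remaining text keeps its offsets
        masked = CODE_PATTERN.sub(lambda match: " " * len(match.group(0)), markdown)
        return CITATION_PATTERN.findall(masked)

    text = "Cite @tag:citation here, but not `@tag:citation` in code."
    print(extract_citations(text))  # ['@tag:citation']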

manubot process / build.sh requires CI folder?

The regular build or the webpage.py build seems to fail if the CI folder is not present in the directory. This should either not cause a failure, or at least produce a more deliberate and understandable error.

Users who are just building locally and are not interested in publishing to gh-pages might appreciate this. I ran into it while testing the plugins in my local test directory (stripped of all the non-essential stuff).

Add link checker functionality

As part of the CI process, embedded URLs in a manuscript could be checked to highlight linked figures / tables / objects / references which are no longer available at that URL.

(This is the question I asked at the FORCE18 talk)
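
A rough sketch of what such a check could look like (URL extraction is simplified here; a production version would probably walk the Pandoc AST instead of using a regex):

    # Sketch: report manuscript URLs that no longer resolve. The URL regex is
    # deliberately crude and will miss or mangle some link syntaxes.
    import re
    import sys
    import requests

    URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+")

    def check_links(markdown_text):
        broken = []
        for url in sorted(set(URL_PATTERN.findall(markdown_text))):
            try:
                response = requests.head(url, allow_redirects=True, timeout=10)
                if response.status_code >= 400:
                    broken.append((url, response.status_code))
            except requests.RequestException as error:
                broken.append((url, str(error)))
        return broken

    if __name__ == "__main__":
        text = open(sys.argv[1], encoding="utf-8").read()
        for url, status in check_links(text):
            print(f"BROKEN {status}: {url}")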

Where do we go for help?

Can you add something to suggest where we should go to make inquiries about how to use Manubot? These are not really "issues" like bug reports or feature requests, though if you want other sorts of inquiries filed as issues, that is possible too. The alternative would be a mailing list, discussion forum, or Gitter (not @agitter) somewhere.

CSL JSON type for "computerProgram" not allowed

Does Manubot not accept the CSL type "computerProgram"? I get this type as output directly from Zotero, and the citation is correct; it is for a computer program. Looking at the schema, I see nothing related to a computer program. @dhimmel, what do you suggest for a type here? I've included the stanza from manual-references.json at the end of this post.

The error returned by manubot is slightly cryptic:

## WARNING
'type' is a required property
requried element missing at: 0/type
## WARNING
'type' is a required property
requried element missing at: 0/type
## WARNING
'type' is a required property
requried element missing at: 0/type
## WARNING
'type' is a required property
requried element missing at: 0/type
## WARNING
'type' is a required property

Maybe it could say that it does not recognize the provided type.

Entry in manual-references.json:

  {
    "type": "computerProgram",
    "standard_citation": "raw:amber2018",
    "title": "AMBER 2018",
    "creators": [
      {
        "firstName": "D.A.",
        "lastName": "Case",
        "creatorType": "programmer"
      },
[...many more creators...]

This was lifted directly from a JSON export via the Zotero desktop client.
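
Until the schema question is settled, one workaround is to massage the Zotero export into fields that pass the CSL Data schema. The sketch below maps creators to a CSL author list and uses "book" as a stand-in type, since the schema in use here has no computer-program type; the choice of stand-in type is an assumption, not a recommendation:

    # Sketch: convert a Zotero-style entry (type/creators/firstName/lastName)
    # into CSL JSON fields that the CSL Data schema accepts. The "book" type is
    # only a placeholder for software entries.
    def zotero_to_csl(entry):
        return {
            "standard_citation": entry["standard_citation"],
            "type": "book",  # placeholder: no computerProgram type in the schema
            "title": entry["title"],
            "author": [
                {"given": creator.get("firstName", ""), "family": creator.get("lastName", "")}
                for creator in entry.get("creators", [])
            ],
        }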

CSL Data: empty date-parts in issued causing pandoc-citeproc to crash

Some Crossref DOI records have their issued date-parts set to null, such as 10.22541/au.149693987.70506124. In these situations, DOI Content Negotiation for CSL JSON returns

    "issued": {
      "date-parts": [
        [
          null
        ]
      ]
    },

In the past, before we pruned invalid CSL using the CSL Data JSON Schema, we addressed this case with the following:

https://github.com/greenelab/manubot/blob/693fbb7758b5922add30ecaa6e30acb98426a977/manubot/cite/citeproc.py#L88-L95

Hence, we'd remove issued if the first element of the date-parts list in Python was None. In #49, we switched to using the JSON Schema to remove invalid fields and removed custom CSL fixing logic. Our hope was that CSL that passed the JSON Schema would be compliant with pandoc-citeproc.

The JSON Schema currently specifies that elements in date-parts arrays must be strings or numbers (excluding null), but it does not specify a minItems of 1. Hence, our CSL Data pruning removes the null item but keeps the empty list:

    "issued": {
      "date-parts": [
        []
      ]
    },

This causes pandoc-citeproc to crash (as currently happening with manubot cite --render doi:10.22541/au.149693987.70506124 and greenelab/meta-review#101 (comment)):

Error parsing references: Could not parse RefDate
Error running filter pandoc-citeproc:
Filter returned error status 1

This issue was discovered in greenelab/meta-review#101 (comment). The corresponding WIP PR to fix it is #65 (pending a solution).
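
Independent of whatever #65 settles on, one possible stopgap is an extra cleanup pass after schema pruning that drops any date field whose date-parts end up empty, so pandoc-citeproc never sees them (a sketch):

    # Sketch: after schema-based pruning, remove date fields such as "issued"
    # whose date-parts arrays are empty, to avoid the RefDate parsing crash.
    def remove_empty_dates(csl_item):
        for key in list(csl_item):
            value = csl_item[key]
            if isinstance(value, dict) and "date-parts" in value:
                date_parts = value["date-parts"]
                if not date_parts or not any(date_parts):
                    del csl_item[key]
        return csl_item

    item = {"title": "Example", "issued": {"date-parts": [[]]}}
    print(remove_empty_dates(item))  # {'title': 'Example'}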

Manubot logo

I have sketches here for a Manubot logo. I showed them to @dhimmel, and we have our preferences, but we'd like to get the input of others. I'm not a graphic designer per se, but I've picked up some knowledge about logo best practices:

  • it should be simple, not too ornate or complex
  • a child should be able to draw the gist of it
  • it should work well with text or without text
  • it should work well as a favicon (small icon size)
  • it should work well in monochrome/print
  • ideally, it should convey some of the meaning behind the product and give some indication of what it is, what it does, or what general industry it belongs to

[Image: hand-drawn logo sketches]

These are just hand sketches, of course. Once we pick one, I'll translate it into a better-looking vector format. We can also combine some of these ideas with each other, such as placing the folded "paper"-looking M inside the -<M>- idea.

I looked through the meta review for keywords that represented the project as a whole, such as edit, collaborate, ideas, version control, code, manuscript, document, writing, etc. For some inspiration, I looked at font-awesome icons related to these keywords, such as the following icons:

[Image: Font Awesome icons related to these keywords]

Naturally, I think a real logo designer who is an expert in that specific field could do a better job, but a truly good one would be quite expensive.
