podoc / podoc

[EXPERIMENTAL] pandoc-compatible, Jupyter-aware document conversion library in Python

License: BSD 3-Clause "New" or "Revised" License

Makefile 0.30% Python 97.40% Jupyter Notebook 2.14% Shell 0.15%

podoc's People

Contributors: rossant

podoc's Issues

make test fails

Hi, I am very happy to discover this project which would suit my personal workflow very well :)

I failed to install it though! I am on Ubuntu 16.04 with Python 3.5. My pandoc might be a bit old (1.16), but I don't think that is the problem.

I installed the requirements specified with pip, and ran python3 setup.py develop --user, but make test gives me this output:

flake8 podoc
py.test podoc --cov podoc --cov-report term-missing
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/_pytest/config.py", line 329, in _getconftestmodules
    return self._path2confmods[path]
KeyError: local('/export/home1/users/ldog/glouvel/install/python3/podoc/podoc')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/_pytest/config.py", line 360, in _importconftest
    return self._conftestpath2mod[conftestpath]
KeyError: local('/export/home1/users/ldog/glouvel/install/python3/podoc/podoc/conftest.py')

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/_pytest/config.py", line 366, in _importconftest
    mod = conftestpath.pyimport()
  File "/usr/local/lib/python3.5/dist-packages/py/_path/local.py", line 668, in pyimport
    __import__(modname)
  File "/export/home1/users/ldog/glouvel/install/python3/podoc/podoc/__init__.py", line 20, in <module>
    from .core import Podoc  # noqa
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 664, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 634, in _load_backward_compatible
  File "/usr/local/lib/python3.5/dist-packages/_pytest/assertion/rewrite.py", line 213, in load_module
    py.builtin.exec_(co, mod.__dict__)
  File "/export/home1/users/ldog/glouvel/install/podoc/podoc/core.py", line 16, in <module>
  File "/export/home1/users/ldog/glouvel/install/python3/podoc/podoc/utils.py", line 198, in <module>
    PANDOC_API_VERSION = get_pandoc_api_version()
  File "/export/home1/users/ldog/glouvel/install/python3/podoc/podoc/utils.py", line 195, in get_pandoc_api_version
    return json.loads(pypandoc.convert_text('', 'json', format='markdown'))['pandoc-api-version']
TypeError: list indices must be integers or slices, not str
ERROR: could not load /export/home1/users/ldog/glouvel/install/python3/podoc/podoc/conftest.py

Makefile:18: recipe for target 'test' failed
make: *** [test] Error 4

It looks like a path error so I tried export PYTHONPATH=~/install/python3 where I cloned the repository, but it doesn't help.

Any hint on what I am doing wrong? :)

Refactor all of the conversion unit tests

  • Use the long list of CommonMark spec examples
  • Ensure that the CommonMark (str) <-> AST (dict) <-> JSON (str) conversion diagram is commutative (6 tests per example)
  • Generate all tests automatically with py.test
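The commutativity check could be factored into one helper that py.test parametrization then fans out over the CommonMark spec examples. A minimal sketch (the converter functions here are toy stand-ins, not podoc's actual API):

```python
import json

# Toy stand-ins for the real CommonMark reader/writer (hypothetical names).
def markdown_to_ast(s):
    return {"blocks": [{"t": "Para", "c": s}]}

def ast_to_markdown(ast):
    return ast["blocks"][0]["c"]

def ast_to_json(ast):
    return json.dumps(ast, sort_keys=True)

def json_to_ast(s):
    return json.loads(s)

def check_commutative(md):
    """Check that the CommonMark <-> AST <-> JSON diagram commutes."""
    ast = markdown_to_ast(md)
    assert ast_to_markdown(ast) == md                   # md -> ast -> md
    assert json_to_ast(ast_to_json(ast)) == ast         # ast -> json -> ast
    assert markdown_to_ast(ast_to_markdown(ast)) == ast # round-trip stability
    return True
```

With py.test, a `@pytest.mark.parametrize` decorator over the list of spec examples would then generate one test per example automatically.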

Automatic tests

  • Unit tests should be automatically created from the files in the test_files/ directories
  • JSON output should be checked against pandoc JSON output

Take inspiration from RMarkdown/knitr

RMarkdown is a great way to write dynamic technical documents in a clean, VCS-friendly markup format (Markdown). One can integrate bits of R code inline or in blocks, and use dedicated software to execute them, interactively or not.

It would be great to have something like this in Python, thanks to podoc. The Jupyter Notebook could also be made to work with this format.

There are a few aspects.

Code block syntax

Here is an example of a "R Code Chunk":

```{r, echo=FALSE}
summary(cars)
```

This uses a custom language string in a regular Markdown CodeBlock. The string specifies the language as well as cell-level metadata, for example whether to render the code source and/or the output.

Currently, podoc uses a similar trick when converting markdown <-> notebook. Here is an example of a code cell with Python source, stdout output, and a result:

```python
print("hello world")
3 * 3
```

```stdout
hello world
```

```result
9
```
This is not ideal, for several reasons:

  • There is an ambiguity with Python code blocks: are they executable code chunks, or static Markdown code blocks that happen to be in the Python language? We have no way to know; RMarkdown resolves this with eval=FALSE.
  • We have no way to specify whether to include the output or not in the Markdown document. For example, we may want to strip all code cell outputs in the document. Or we may want to show the output without the underlying code (equivalent to echo=FALSE in RMarkdown).

Instead of using this ad-hoc syntax in podoc, we could use the same syntax as RMarkdown, just replacing r with python.
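Parsing such a chunk header is straightforward. A minimal sketch (hypothetical helper, not podoc code):

```python
import re

# Matches headers like "{python, echo=FALSE}" or "{r}".
CHUNK_HEADER = re.compile(r"^\{(?P<lang>\w+)(?:[,\s]+(?P<opts>.*))?\}$")

def parse_chunk_header(header):
    """Return (language, options) for an RMarkdown-style chunk header,
    or None if the header does not match."""
    m = CHUNK_HEADER.match(header.strip())
    if not m:
        return None
    opts = {}
    if m.group("opts"):
        for pair in m.group("opts").split(","):
            key, sep, value = pair.strip().partition("=")
            opts[key] = value if sep else True  # bare flags become True
    return m.group("lang"), opts
```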

We also need to support plots here. Currently, there is WIP support for this in podoc. Plots are saved in external PNG files, and automatically integrated as base64 contents in the notebook.

Inline code syntax

RMarkdown supports inline expressions like `r 1+1`, which are replaced by their result (2 here); podoc could support `python 1+1` in the same way...
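One possible implementation of the inline substitution, assuming spans of the form `python EXPR` and a trusted document (eval is unsafe on untrusted input):

```python
import re

def render_inline_code(text, namespace=None):
    """Replace `python EXPR` inline code spans by the evaluated result.
    WARNING: uses eval, so only run on documents you trust."""
    namespace = dict(namespace or {})
    def repl(m):
        return str(eval(m.group(1), namespace))
    return re.sub(r"`python ([^`]+)`", repl, text)
```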

Jupyter Notebook

The Jupyter Notebook could be made to work directly with this format. You would open a Markdown document in this format, and code chunks would be executable code cells. The output/plots would be automatically saved in Markdown when running the notebook.

Command-line interface

You could have a way to run such an executable Markdown document from the command-line, for example with podoc file.md -o file.md --run which could use IPython under the hood.

cc @fperez @bnaul @takluyver @mgeier @willingc @ellisonbg

URLChecker plugin

Filter checking all hypertext links and creating a report of the valid and broken links.
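A sketch of the two halves, with hypothetical names: collecting Link targets from a pandoc-style AST (testable offline), and checking one URL over HTTP:

```python
from urllib.request import Request, urlopen
from urllib.error import URLError

def iter_links(ast):
    """Yield Link targets from a pandoc-style AST, where a Link node's
    content is [attr, inlines, [url, title]]."""
    if isinstance(ast, dict):
        if ast.get("t") == "Link":
            yield ast["c"][-1][0]
        for value in ast.values():
            yield from iter_links(value)
    elif isinstance(ast, list):
        for item in ast:
            yield from iter_links(item)

def check_url(url, timeout=5):
    """Return 'ok' or a description of the failure for one URL."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as resp:
            return "ok" if resp.status < 400 else f"HTTP {resp.status}"
    except (URLError, ValueError) as exc:
        return f"broken: {exc}"
```

The report would simply map each collected URL to its check result.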

CLI tool with click

We could use the same basic options as pandoc, for consistency:

  -f FORMAT, -r FORMAT  --from=FORMAT, --read=FORMAT
  -t FORMAT, -w FORMAT  --to=FORMAT, --write=FORMAT
  -o FILENAME           --output=FILENAME
                        --data-dir=DIRECTORY

When no output is specified, stdout is used.
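A minimal click sketch of that option surface (the conversion itself is a placeholder, not the real Podoc pipeline):

```python
import sys
import click

@click.command()
@click.argument("input_file", required=False, type=click.Path(exists=True))
@click.option("-f", "-r", "--from", "--read", "from_", metavar="FORMAT",
              help="Source format.")
@click.option("-t", "-w", "--to", "--write", "to", metavar="FORMAT",
              help="Target format.")
@click.option("-o", "--output", metavar="FILENAME",
              help="Output file; stdout when omitted.")
def cli(input_file, from_, to, output):
    """Convert INPUT_FILE between formats (placeholder implementation)."""
    converted = f"[{from_} -> {to}] {input_file}"  # stand-in for the real conversion
    if output:
        with open(output, "w") as fh:
            fh.write(converted)
    else:
        sys.stdout.write(converted)
```

Note the explicit `from_`/`to` parameter names, since `from` is a Python keyword and click would otherwise infer the name from the longest long option.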

Polish the internal conversion API

  • XXPlugin
    • load(s): from file/path/string to object
    • dump(s): from object to file/string
    • read(obj): from object to AST
    • write(ast): from AST to object
  • Most of the time, object is a string, but it could be a dict, a Notebook instance, etc.
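The interface could look like this (a sketch; the defaults assume the object is a string, and subclasses override read/write):

```python
class FormatPlugin:
    """Base class sketch for a podoc format plugin (hypothetical names)."""

    def loads(self, s):
        """String -> object; identity when the object is a string."""
        return s

    def dumps(self, obj):
        """Object -> string."""
        return obj

    def load(self, path):
        """File -> object."""
        with open(path) as f:
            return self.loads(f.read())

    def dump(self, obj, path):
        """Object -> file."""
        with open(path, "w") as f:
            f.write(self.dumps(obj))

    def read(self, obj):
        """Object -> AST (the reader)."""
        raise NotImplementedError

    def write(self, ast):
        """AST -> object (the writer)."""
        raise NotImplementedError
```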

PromptPlugin

Filter transforming a code block containing interactive input and output. There are several options:

  • Transforming to a code block with different input/output formats
  • Removing the output
  • Evaluating the input and adding the output
  • Putting the output in a paragraph
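For the first two options, a sketch that splits a >>>-style interactive block into its input and output parts (prompt strings would be configurable; names hypothetical):

```python
def split_prompt_block(text, prompt=">>> ", continuation="... "):
    """Split an interactive code block into (input, output) strings."""
    inputs, outputs = [], []
    for line in text.splitlines():
        if line.startswith(prompt) or line.startswith(continuation):
            inputs.append(line[len(prompt):])  # strip the 4-char prompt
        else:
            outputs.append(line)
    return "\n".join(inputs), "\n".join(outputs)
```

Removing the output is then just discarding the second element; re-evaluating would feed the first element to a kernel.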

podoc and nbconvert

Hi @fperez @ellisonbg @damianavila @Carreau @minrk @bollwyvl @odewahn

I'd like to let you know of a project I've been working on. As it is somewhat related to nbconvert, I thought I'd get in touch with you so that, at the very least, you're aware of these ongoing efforts. I haven't had time to make much progress on this project lately, but I plan to get back to it within the next couple of months (I'll need this library for a few other projects). At this point the code is in a very early stage.

podoc

This pure Python library is called podoc. It provides a unified processing pipeline for converting text documents between markup languages. It is tightly linked to pandoc, but it is not exactly a pandoc clone; it's more like a pandoc companion. It will never support the wide range of formats supported by pandoc. Instead, it will be largely compatible with pandoc so that you can convert one document in a pandoc-supported format into another podoc-supported format, and reciprocally.

pandoc is not a dependency, and the most critical features will be available without it. However, you'll have many more conversion options if you have pandoc installed.

At first, the major supported formats will be Markdown/CommonMark, Jupyter Notebook, and O'Reilly Atlas. As such, podoc will eventually replace ipymd (editing Markdown files in the Notebook). Other formats could be implemented as well, notably ODT (already implemented in ipymd), LaTeX, HTML, and so on. podoc can leverage the many existing libraries for these conversions (mistune, CommonMark.py, odfpy, and so on).

A nice consequence of the compatibility with pandoc is that you should be able to open any document supported by pandoc directly in the Notebook! For example, you could read and edit Word, HTML, EPUB etc. documents in the Notebook without any manual conversion.

The key idea of podoc is to use an internal representation of text documents that is independent from any format. This is exactly how pandoc works. In fact, this internal representation is exactly the same as pandoc's internal AST. This is how we achieve compatibility with pandoc.

podoc is based around a very simple core. There is a pipeline that takes a document through various stages: preprocessors, reader, filters, writer, postprocessors. Processors can update documents in their own formats. The reader parses a document and outputs a pandoc AST, and reciprocally for the writer. Filters can manipulate ASTs directly. These components can be arbitrarily combined.

By default, the pipeline does nothing, and you need to activate components to get something that is not useless. There is a simple plugin architecture to activate these elements. For example, a "format" is just a plugin that activates a particular reader/writer.
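The stages described above can be sketched as plain callables (a hypothetical API, not the actual Podoc class):

```python
class Pipeline:
    """Minimal sketch of the podoc pipeline stages."""

    def __init__(self, reader, writer,
                 preprocessors=(), filters=(), postprocessors=()):
        self.reader, self.writer = reader, writer
        self.preprocessors = list(preprocessors)
        self.filters = list(filters)
        self.postprocessors = list(postprocessors)

    def convert(self, doc):
        for p in self.preprocessors:    # transforms in the source format
            doc = p(doc)
        ast = self.reader(doc)          # source format -> AST
        for f in self.filters:          # AST -> AST transforms
            ast = f(ast)
        out = self.writer(ast)          # AST -> target format
        for p in self.postprocessors:   # transforms in the target format
            out = p(out)
        return out
```

A "format" plugin would then just register a particular reader/writer pair on this pipeline.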

Components

To give you an idea of the possibilities enabled by this architecture, here are some ideas of components that could be included in podoc (besides the obvious readers/writers for the various supported formats):

  • Atlas: filter replacing code blocks in a given language by executable <pre> HTML code blocks, and LaTeX equations by <span> HTML blocks.
  • CodeEval: preprocessor evaluating code enclosed in particular markup syntax (as provided by a regular expression or Jinja blocks). This allows for literate programming, using Python or any other language.
  • EquationImage: filter that replaces LaTeX equations by PNG images.
  • Macros: macro preprocessor based on regular expressions. The macro substitutions can be listed in the document metadata or in the user's config file. For example, one could define LaTeX macros for common mathematical symbols.
  • Prompt: filter transforming a code block containing interactive input and output. This is used in the Jupyter Notebook format. There are several options:
    • Using a >>> prompt, In [1], or anything else
    • Stripping out the output
    • Evaluating the input and adding the output
    • Putting the output in a paragraph
  • UrlChecker: filter that finds all broken hypertext links and generates a report.

It is expected that users will write their own plugins for their own purposes.

Typical use-case

As an example, here is something I'd like to be able to do with podoc.

I write a technical Markdown document which contains mathematical equations in LaTeX. I can insert Python expressions inline with a Jinja-like syntax, for example the result of this operation was {{ f(x) }} which is executed in some interactive Python context. There are also code blocks with rich output and plots. The figures can be saved inline or in external file images.
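The inline-expression part could work roughly like this (assumes a trusted document, since it relies on eval; names hypothetical):

```python
import re

def expand_expressions(text, namespace):
    """Replace {{ expr }} spans by the evaluated result in the given
    interactive namespace. WARNING: eval-based, trusted input only."""
    def repl(m):
        return str(eval(m.group(1), dict(namespace)))
    return re.sub(r"\{\{\s*(.+?)\s*\}\}", repl, text)
```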

I can insert little exercises with some code block to be completed by the user, with the solution hidden by default which appears by clicking on a button (when converting to HTML). Exercises are indicated by code metadata in YAML (as contributed by @bollwyvl in ipymd).

The whole Markdown document can be transparently edited either in a text editor or in the Jupyter Notebook, where the code blocks are executable. Then, I can convert this document into PDF, HTML, ODT/docx (for those many publishers stuck in the 90's), or any other format supported by pandoc. If I want to, the code is executed during the conversion in a Python namespace that I can specify. All URLs are automatically checked, and a report is generated with the broken links. I can publish the document on my Pelican-powered blog, on GitHub/nbviewer, on Atlas, etc. I can also publish an interactive document with thebe/mybinder etc.

If I'm writing a book, I can organize the contents within many Markdown documents and write a Python script to generate the book automatically (cf. Atlas).

The functionality for doing all of this already exists in multiple projects; what's missing is, I think, something that combines all of these tools into a unified and customizable Python pipeline.

To do

The architecture is in place, the test framework is working. For testing, the plan is to have a set of test documents that are converted by every implemented plugin and compared with the AST "ground truth."

What's missing now is the set of plugins implementing the formats (focusing on Markdown, Jupyter Notebook, and Atlas for now). Most of the code is already in ipymd, except the conversion between Markdown and the pandoc AST. This should be relatively straightforward with a good Markdown parser (not sure which to use exactly, I used mistune in ipymd but there are alternatives).

That's all, feel free to ping other persons that might be interested. There might be some overlap with nbconvert and other projects so I'm interested to see how we could perhaps work together toward the same goal.

Resource management

What is already done:

  • When reading a notebook, a resources dictionary is created. It maps resource filenames to binary data (such as plotting PNG images).
  • When writing to a notebook, output code blocks that link to images are parsed, the linked images are read in memory, and saved as base64 data directly in the notebook.

Things to do:

The logic to save a resources dictionary to actual files is not implemented yet. For example, when converting a notebook to Markdown, the plot images are not currently saved.

We need to implement this in a generic way, since the conversion is done via notebook -> ast -> markdown. Further, while the Podoc.convert() method knows about the original file path, the conversion functions only work on in-memory documents and know nothing about the underlying files and paths.
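The missing piece could be a small generic helper that the conversion layer calls once a target path is known (hypothetical function, not existing podoc code):

```python
import os

def save_resources(resources, output_dir):
    """Write a {filename: bytes} resources dict to output_dir and
    return the list of written paths."""
    os.makedirs(output_dir, exist_ok=True)
    paths = []
    for filename, data in resources.items():
        path = os.path.join(output_dir, filename)
        with open(path, "wb") as f:
            f.write(data)
        paths.append(path)
    return paths
```

Podoc.convert() would derive output_dir from the output document's path and pass only the resources dict down to the in-memory conversion functions.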

Is podoc suitable for Hydrogen-Markdown to Jupyter export?

At the moment my favourite workflow for interactive notebooks is not Jupyter.
I use Atom editor + Hydrogen to write either:

  • Markdown + code block insertions for different Jupyter kernels (commonly Python). I deploy it with Hydrogen to see results instantly, right in the document, then Knitty-Stitch it to .md.md, .html, or pdf as a static output archive.
  • The same as before, but as Python code split into cells via Hydrogen # %% markers, with Markdown insertions as needed.

And now I think it would be nice to export it not to static format like pdf but to dynamic Jupyter notebook format.

Is it possible with podoc?

To write a Markdown-Stitch document with code blocks, then stitch it so that some code blocks are converted to a markdown/html mixture (mostly or only markdown, actually) while others remain simple code examples. Some code blocks are marked for Jupyter (like ``` {.python .iamacell}, which is pandoc's default markdown output). Then convert the document to the Jupyter notebook format so that only the marked code blocks become Jupyter cells (the others are left as plain markdown code blocks), with a way to specify the language of the exported notebook (Python or R, for example).

One project seems able to do this, but does not support metadata and kernel specifications; another seems to support metadata and kernel specifications, but not pandoc code block attributes...

PS
Yep, this imagined workflow uses Jupyter three times, but differently... It bugs me a bit... But I'd like to use the pandoc ecosystem and filters for document processing, so I'd like to stick to Stitch.

CodeEval plugin

Preprocessor evaluating code enclosed in particular markup syntax (as provided by a regular expression). This allows for literate programming, using Python or any other language.

To-do list

  • setup.py
  • Travis
  • Logging
  • Start Podoc class
  • Plugin system
  • AST class
  • JSON plugin
  • Rename examples to test_files
  • Load all native plugins when importing podoc
  • Register a list of tuples (plugin_class, file_extensions)
  • Function to get a plugin from its name (defined as the class name in lowercase)
  • Function open_file(path, plugin_name=None) (use the extension if None)
  • Function test_file_path(test_name, plugin_name)
  • Function open_test_file(test_name, plugin_name)
  • Automatic plugin conversion tests
  • Rename pandoc to json
  • Rename register_*() methods
  • pandoc wrapper in utils

Macros plugin

Macro preprocessor based on regular expressions. The macro substitutions can be listed in the macros metadata array in the document.
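A sketch, assuming the macros metadata maps regex patterns to replacement strings (hypothetical helper):

```python
import re

def apply_macros(text, macros):
    """Apply regex-based macro substitutions, e.g. from the document's
    'macros' metadata, to the raw source text."""
    for pattern, replacement in macros.items():
        text = re.sub(pattern, replacement, text)
    return text
```

For example, a LaTeX shorthand for the reals could be declared once in the metadata and expanded everywhere.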

Document-level and block-level metadata

pandoc supports document-level YAML metadata. Metadata blocks can be anywhere in the document, delimited by --- at the start and --- or ... at the end. All blocks are automatically merged.

Jupyter Notebook supports document-level and cell-level metadata (for example, information about slides, or whether a cell is scrolled or not).

I don't think that pandoc supports block-level metadata. I am not aware of a Markdown engine that supports block-level metadata.

The question is: should podoc support only document-level metadata, or also block-level metadata?

If we support block-level metadata, we break compatibility with pandoc and other Markdown engines.

If we don't, supporting slides and other specific formats might be harder.
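For the document-level case, merging all delimited blocks could look like this (simplified sketch: a real implementation would parse the blocks with PyYAML; here only flat key: value pairs are handled):

```python
import re

# A block starts with --- on its own line and ends with --- or ...
METADATA_BLOCK = re.compile(r"^---\n(.*?)\n(?:---|\.\.\.)$", re.S | re.M)

def merge_metadata(text):
    """Collect every metadata block in the document and merge the
    key: value pairs, later blocks overriding earlier ones."""
    merged = {}
    for block in METADATA_BLOCK.findall(text):
        for line in block.splitlines():
            key, sep, value = line.partition(":")
            if sep:
                merged[key.strip()] = value.strip()
    return merged
```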

cc @willingc @bollwyvl
