I imagine that pandoc-crossref is inserting something like this into the AST:
[ Figure
    ( "fig:mech" , [] , [] )
    (Caption
       Nothing [ Plain [ Str "The" , Space , Str "caption" ] ])
    [ Plain
        [ Image
            ( "" , [] , [ ( "style" , "height:12.09cm" ) , ( "alt" , "alt text" ) ] )
            [ Str "scheme" ]
            ( "myfig.jpg" , "" )
        ]
    ]
]
The problem is that pandoc's markdown writer will render this as HTML. And then, if you try to go from that markdown to docx, the raw HTML will disappear.
Why does the markdown writer use raw HTML here? I'm not sure. You can disable raw HTML, though, with `-t markdown-raw_html`, and then you'll get something like
:::: {#fig:mech .figure}
![mech scheme.](Ch3/./img/mech.jpg){style="height:12.09cm"}
::: caption
Figure 1: mech scheme.
:::
::::
and that, I think, will go through to docx.
I think the markdown writer should probably just generate a standard `implicit_figures`-style figure here, so let's consider this a change request for the markdown writer.
from pandoc.
OK, I see what is going on here.
The HTML you display above was probably the result of rendering this AST element (inserted by pandoc-crossref):
[ Figure
    ( "fig:mech" , [] , [] )
    (Caption
       Nothing
       [ Plain
           [ Str "Figure"
           , Space
           , Str "1:"
           , Space
           , Str "mech"
           , Space
           , Str "scheme."
           ]
       ])
    [ Plain
        [ Image
            ( "" , [] , [ ( "style" , "height:12.09cm" ) ] )
            [ Str "mech" , Space , Str "scheme." ]
            ( "Ch3/./img/mech.jpg" , "" )
        ]
    ]
]
In deciding whether to use an implicit figure, the markdown writer tries to determine whether this representation would capture all of the information in this Figure element. One case in which it wouldn't is the case where the image has an image description/alt text that is different from the figure's caption. (An implicit figure just takes the caption from what would otherwise be the image's alt text.) So the writer tests for this. Notice that the caption and the image description are almost the same in this case: the difference is that the caption also includes the label "Figure 1:". Anyway, it's because of that that we fall back to raw HTML.
I suppose one way around this would be to just check that the suffix of the Caption matches the image description. This might lead to some false positives, but it's probably fairly reliable.
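A minimal sketch of that suffix check, written in Python rather than pandoc's Haskell. It is not the markdown writer's actual code: inlines are modeled as plain string tokens, and `split_label` and `caption_matches_alt` are hypothetical names used only for illustration.

```python
from typing import List, Optional, Tuple

# Assumption: inlines are modeled as plain tokens. Pandoc's real AST uses
# Str/Space constructors, so this only illustrates the comparison itself.
Inlines = List[str]


def split_label(caption: Inlines, alt: Inlines) -> Optional[Tuple[Inlines, Inlines]]:
    """Return (label, rest) if `alt` is a suffix of `caption`, else None.

    `label` is whatever precedes the matching suffix, e.g. ["Figure", "1:"].
    An empty `alt` is treated as "no match", since it would match any caption.
    """
    if not alt or len(alt) > len(caption):
        return None
    cut = len(caption) - len(alt)
    if caption[cut:] != alt:
        return None
    return caption[:cut], caption[cut:]


def caption_matches_alt(caption: Inlines, alt: Inlines) -> bool:
    """True when the caption equals the alt text, optionally preceded by a label."""
    return split_label(caption, alt) is not None


# The caption and image description from the AST above.
caption = ["Figure", "1:", "mech", "scheme."]
alt = ["mech", "scheme."]
print(caption_matches_alt(caption, alt))
print(split_label(caption, alt))
```

Returning the label separately leaves the caller free to keep or drop the "Figure 1:" part when building the implicit figure. A caption that merely happens to end with the alt text would also pass, which is the false-positive risk mentioned above.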
> chapter-wise bibliography

I don't necessarily see how that would prevent you from using `native` instead of `markdown` as an intermediary format. Does it? Because if not, that's an overall more robust approach: roundtrips via Markdown are not guaranteed, while `native` is essentially a snapshot of the AST.
In retrospect I don't think this is a problem for pandoc-crossref, so you can cancel any request you made there.
What I'm not sure about is what we should do in the case where the suffix matches. Should the image description in the implicit figure include the "Figure 1:" part or not? If it does, then we might get bad results in formats that add a figure number (e.g. latex/pdf).
I'm a bit confused by the premise: converting to Markdown through pandoc-crossref then converting the output to docx. I don't know what you're trying to do, but it sounds like using native/json as intermediary format would resolve this, no?
Honestly, Markdown-to-Markdown conversions were never a target, and in Pandoc, Markdown is not guaranteed to round-trip in the first place. I could make a patch changing the alt text to match the caption though 🤷
> Would you suggest keeping native as the intermediate format even with the new patch?

I don't know the particulars of your setup, so it's up to you. If you don't really care about the intermediate format, `native` or `json` would be the best choice if it works, as they're guaranteed to preserve the AST. OTOH, if you want to do some postprocessing on the intermediate files (not with pandoc filters), use whatever you can postprocess 🤷
As `native` preserves the whole AST, it also preserves the result of `--citeproc`. So it shouldn't need any qualifiers. For example, the command `pandoc --citeproc -t native /tmp/test.md | pandoc -f native -t docx -o /tmp/test.docx` produces a docx (the screenshot is not preserved here).
`test.md` is as follows:
---
references:
- type: article-journal
  id: WatsonCrick1953
  author:
  - family: Watson
    given: J. D.
  - family: Crick
    given: F. H. C.
  issued:
    date-parts:
    - - 1953
      - 4
      - 25
  title: 'Molecular structure of nucleic acids: a structure for
    deoxyribose nucleic acid'
  title-short: Molecular structure of nucleic acids
  container-title: Nature
  volume: 171
  issue: 4356
  page: 737-738
  DOI: 10.1038/171737a0
  URL: https://www.nature.com/articles/171737a0
  language: en-GB
---

@WatsonCrick1953
Meanwhile, turns out I forgot to update some tests, so that CI build failed. Anyway, I'll just cut a release I guess, and we can do another one if this doesn't work out for some reason. For future reference, 0.3.17.1 (artefacts not yet built, but this time CI should finish fine 🤞)
Please report this to pandoc-crossref instead.
> Please report this to pandoc-crossref instead.

Thank you for your instruction! Do you suggest pandoc-crossref should not generate `<figure` in markdown output format in the first place? Does pandoc ignore `<figure` in markdown input format?
OK. Actually, this may point to something that can be done in pandoc.
> What I'm not sure about is what we should do in the case where the suffix matches. Should the image description in the implicit figure include the "Figure 1:" part or not? If it does, then we might get bad results in formats that add a figure number (e.g. latex/pdf).

It seems to me that pandoc-crossref, when invoked, is responsible for naming the figures (and the tables, and the equations). Would this information help with the decision? :D
I think we need feedback from @lierdakil on this.
> I'm a bit confused by the premise: converting to Markdown through pandoc-crossref then converting the output to docx. I don't know what you're trying to do, but it sounds like using native/json as intermediary format would resolve this, no?

The reason my workflow depends (or depended) on intermediate Markdown files is chapter-wise references (it should be "chapter-wise bibliography", if I remember correctly) :D. A few years ago I read about this idea in the Google discussion group (I cannot find the post now, since the group is not accessible...).
Anyway, probably worth making the change regardless.
This should work: lierdakil/pandoc-crossref@5f2b087
There is a bit of a twist, however. In some cases, pandoc-crossref will add attributes on the Figure element. If that happens, the resulting figure is impossible to represent in Markdown any more, so Pandoc will go back to representing it as raw HTML (if enabled) or nested divs. This does require explicit opt-in via pandoc-crossref configuration, and I don't really see a workaround, so I'm inclined to leave it be.
@jiucenglou if you could test this commit for your use case and report back, that would be nice. Automatic builds will (edit: well, should, can't promise that, CI is a bit flaky) become available at the following links once CI finishes (in an hour or two probably):
- https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-Linux-20240504-5f2b087.tar.xz
- https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-macOS-20240504-5f2b087.tar.xz
- https://github.com/lierdakil/pandoc-crossref/releases/download/nightlies/pandoc-crossref-master-Windows-20240504-5f2b087.7z
P.S. I'll make a release proper probably tomorrow lest I forget.
Many thanks! Using `native` as the intermediate format, with the commands below, seems to work very well:
./pandoc -F pandoc-crossref Ch3/Ch3.md --resource-path=Ch3 -t native -o Ch3/Ch3_tmp.txt
./pandoc -f native Ch3/Ch3_tmp.txt -o Ch3/Ch3_tmp.docx
I can test and report back. Would you suggest keeping native as the intermediate format even with the new patch?
In my real use case, the two command lines look like
"${Pandoc}" "${Header}" "${TmpMd2}" -F pandoc-crossref --citeproc --csl="${CiteStyle}" -t markdown-citations -o "${TmpMd3}" --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
"${Pandoc}" "${Header}" "${TmpMd3}" --fail-if-warnings -L Dry12_for_docx.lua -L skip_placeholder.lua -L mhchem.lua --reference-doc="${RefWordDocx}" -s -o "${MSWord}"
I mean, the first run has a `-t markdown-citations` option to generate the chapter-wise bibliography. Could you advise whether `-t native` can work, or whether I should use something like `-t native-citations`? Many thanks!
As shown below, I tried `native` on my real use case and got `couldn't read native` on my second run.
I could not produce a minimal working example for the time being, but will post again if I manage to get one.
"${Pandoc}" "${Header}" "${TmpMd2}" -F pandoc-crossref --citeproc --csl="${CiteStyle}" -t markdown-citations -o "${TmpMd3}" --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
"${Pandoc}" "${Header}" "${TmpMd3}" --fail-if-warnings -L Dry12_for_docx.lua -L skip_placeholder.lua -L mhchem.lua --reference-doc="${RefWordDocx}" -s -o "${MSWord}"
"${Pandoc}" "${Header}" "${TmpMd2}" -F pandoc-crossref --citeproc --csl="${CiteStyle}" -t native -o "${TmpMd3}" --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
"${Pandoc}" "${Header}" "${TmpMd3}" -f native --fail-if-warnings -L Dry12_for_docx.lua -L skip_placeholder.lua -L mhchem.lua --reference-doc="${RefWordDocx}" -s -o "${MSWord}"
Thanks @lierdakil - it looks like this isn't going to require pandoc changes, so I'll close this issue.