Comments (22)

jgm commented on June 14, 2024

I imagine that pandoc-crossref is inserting something like this into the AST:

[ Figure
    ( "fig:mech" , [] , [] )
    (Caption
       Nothing [ Plain [ Str "The" , Space , Str "caption" ] ])
    [ Plain
        [ Image
            ( "" , [] , [ ( "style" , "height:12.09cm" ) , ( "alt" , "alt text" ) ] )
            [ Str "scheme" ]
            ( "myfig.jpg" , "" )
        ]
    ]
]

The problem is that pandoc's markdown writer will render this as HTML. And then, if you try to go from that markdown to docx, the raw HTML will disappear.

Why does the markdown writer use raw HTML here? I'm not sure. You can disable raw HTML, though, with -t markdown-raw_html and then you'll get something like

:::: {#fig:mech .figure}
![mech scheme.](Ch3/./img/mech.jpg){style="height:12.09cm"}

::: caption
Figure 1: mech scheme.
:::
::::

and that, I think, will go through to docx.

I think the markdown writer should probably just generate a standard implicit_figures style figure here, so let's consider this a change request for the markdown writer.

from pandoc.

jgm commented on June 14, 2024

OK, I see what is going on here.

The HTML you display above was probably the result of rendering this AST element (inserted by pandoc-crossref):

[ Figure
    ( "fig:mech" , [] , [] )
    (Caption
       Nothing
       [ Plain
           [ Str "Figure"
           , Space
           , Str "1:"
           , Space
           , Str "mech"
           , Space
           , Str "scheme."
           ]
       ])
    [ Plain
        [ Image
            ( "" , [] , [ ( "style" , "height:12.09cm" ) ] )
            [ Str "mech" , Space , Str "scheme." ]
            ( "Ch3/./img/mech.jpg" , "" )
        ]
    ]
]

In deciding whether to use an implicit figure, the markdown writer tries to determine whether this representation would capture all of the information in this Figure element. One case in which it wouldn't is the case where the image has an image description/alt text that is different from the figure's caption. (An implicit figure just takes the caption from what would otherwise be the image's alt text.) So the writer tests for this. Notice that the caption and the image description are almost the same in this case: the difference is that the caption also includes the label "Figure 1:". Anyway, it's because of that that we fall back to raw HTML.

I suppose one way around this would be to just check that the suffix of the Caption matches the image description. This might lead to some false positives, but it's probably fairly reliable.
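The difference between the check described above and the proposed suffix check can be sketched roughly as follows. This is only an illustration and not pandoc's actual code: the inline lists are simplified to plain lists of words (Space elements dropped), and the function names `exact_match` and `suffix_match` are hypothetical.

```python
from typing import List


def exact_match(caption: List[str], alt: List[str]) -> bool:
    """Check as described in the thread: caption and image description must be identical."""
    return caption == alt


def suffix_match(caption: List[str], alt: List[str]) -> bool:
    """Proposed relaxation: the caption only has to end with the image description."""
    if len(alt) > len(caption):
        return False
    return caption[len(caption) - len(alt):] == alt


# Simplified word lists taken from the AST shown in this comment.
caption = ["Figure", "1:", "mech", "scheme."]
alt = ["mech", "scheme."]

print(exact_match(caption, alt))   # False: the "Figure 1:" label breaks equality
print(suffix_match(caption, alt))  # True: the caption ends with the description

# A possible false positive: an unrelated description that happens to be a suffix.
print(suffix_match(["Another", "scheme."], ["scheme."]))  # True
```

The last call shows the kind of false positive mentioned above: a short description that merely coincides with the end of a different caption would also pass.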

lierdakil commented on June 14, 2024

> chapter-wise bibliography

I don't see how that would prevent you from using native instead of markdown as an intermediary format. Does it? Because if not, that's an overall more robust approach: round trips via Markdown are not guaranteed, whereas native is essentially a snapshot of the AST.

jgm commented on June 14, 2024

In retrospect I don't think this is a problem for pandoc-crossref, so you can cancel any request you made there.

jgm commented on June 14, 2024

What I'm not sure about is what we should do in the case where the suffix matches. Should the image description in the implicit figure include the "Figure 1:" part or not? If it does, then we might get bad results in formats that add a figure number (e.g. latex/pdf).

lierdakil commented on June 14, 2024

I'm a bit confused by the premise: converting to Markdown through pandoc-crossref then converting the output to docx. I don't know what you're trying to do, but it sounds like using native/json as intermediary format would resolve this, no?

lierdakil commented on June 14, 2024

Honestly, Markdown-to-Markdown conversions were never a target, and in Pandoc, Markdown is not guaranteed to round-trip in the first place. I could make a patch changing the alt text to match the caption though 🤷

lierdakil commented on June 14, 2024

> Would you suggest I keep using native as the intermediate format even with the new patch?

I don't know the particulars of your setup, so it's up to you. If you don't really care about the intermediate format, native or json would be the best choice if it works, as they're guaranteed to preserve the AST. OTOH, if you want to do some postprocessing on the intermediate files (not with pandoc filters), use whatever you can postprocess 🤷

lierdakil commented on June 14, 2024

As native preserves the whole AST, it also preserves the result of --citeproc, so it shouldn't need any qualifiers. For example, the command

pandoc --citeproc -t native /tmp/test.md | pandoc -f native -t docx -o /tmp/test.docx

produces the following docx:
[image]

test.md is as follows:

---
references:
- type: article-journal
  id: WatsonCrick1953
  author:
  - family: Watson
    given: J. D.
  - family: Crick
    given: F. H. C.
  issued:
    date-parts:
    - - 1953
      - 4
      - 25
  title: 'Molecular structure of nucleic acids: a structure for
    deoxyribose nucleic acid'
  title-short: Molecular structure of nucleic acids
  container-title: Nature
  volume: 171
  issue: 4356
  page: 737-738
  DOI: 10.1038/171737a0
  URL: https://www.nature.com/articles/171737a0
  language: en-GB
---

@WatsonCrick1953

lierdakil commented on June 14, 2024

Meanwhile, turns out I forgot to update some tests, so that CI build failed. Anyway, I'll just cut a release I guess, and we can do another one if this doesn't work out for some reason. For future reference, 0.3.17.1 (artefacts not yet built, but this time CI should finish fine 🤞)

jgm commented on June 14, 2024

Please report this to pandoc-crossref instead.

jiucenglou commented on June 14, 2024

> Please report this to pandoc-crossref instead.

Thank you for the pointer! Do you suggest that pandoc-crossref should not generate <figure> in markdown output in the first place? Does pandoc ignore <figure> in markdown input?

jgm commented on June 14, 2024

OK. Actually, this may point to something that can be done in pandoc.

jiucenglou commented on June 14, 2024

> What I'm not sure about is what we should do in the case where the suffix matches. Should the image description in the implicit figure include the "Figure 1:" part or not? If it does, then we might get bad results in formats that add a figure number (e.g. latex/pdf).

It seems to me that pandoc-crossref, when invoked, is responsible for naming the figures (and the tables, and the equations). Would this information help with the decision? :D

jgm commented on June 14, 2024

I think we need feedback from @lierdakil on this.

jiucenglou commented on June 14, 2024

> I'm a bit confused by the premise: converting to Markdown through pandoc-crossref then converting the output to docx. I don't know what you're trying to do, but it sounds like using native/json as intermediary format would resolve this, no?

My workflow depends (or depended) on intermediate Markdown files because of chapter-wise references (chapter-wise bibliography, if I remember correctly) :D. A few years ago I read about this idea in the Google discussion group (I cannot find the post now, since the group is not accessible).

lierdakil commented on June 14, 2024

Anyway, probably worth making the change regardless.

This should work: lierdakil/pandoc-crossref@5f2b087

There is a bit of a twist, however. In some cases, pandoc-crossref will add attributes on the Figure element. If that happens, the resulting figure is impossible to represent in Markdown any more, so Pandoc will go back to representing it as raw HTML (if enabled) or nested divs. This does require explicit opt-in via pandoc-crossref configuration, and I don't really see a workaround, so I'm inclined to leave it be.
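Putting together the conditions mentioned in this thread, the choice of output form can be pictured roughly like this. It is a simplified sketch, not pandoc's Markdown writer: the function name `choose_figure_rendering` and the returned labels are hypothetical, captions and descriptions are plain word lists, and it is an assumption that a bare identifier on the Figure (such as "fig:mech") does not count as an extra attribute.

```python
from typing import List, Tuple

# (identifier, classes, key-value pairs), mirroring the attribute triple in the AST
Attr = Tuple[str, List[str], List[Tuple[str, str]]]


def choose_figure_rendering(
    fig_attr: Attr,
    caption: List[str],
    alt: List[str],
    raw_html_enabled: bool,
) -> str:
    """Pick an output form for a Figure, following the rules discussed in the thread."""
    _identifier, classes, keyvals = fig_attr
    # Assumption: only classes and key-value pairs block the implicit figure form.
    has_extra_attrs = bool(classes) or bool(keyvals)
    if not has_extra_attrs and caption == alt:
        return "implicit figure"
    # Fallback: raw HTML if the extension is enabled, otherwise nested divs.
    return "raw HTML" if raw_html_enabled else "nested divs"


words = ["mech", "scheme."]
print(choose_figure_rendering(("fig:mech", [], []), words, words, True))
print(choose_figure_rendering(("fig:mech", ["wide"], []), words, words, True))
print(choose_figure_rendering(("fig:mech", ["wide"], []), words, words, False))
```

The class name "wide" is made up for the example; which attributes pandoc-crossref actually adds depends on its configuration.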

@jiucenglou if you could test this commit for your use case and report back, that would be nice. Automatic builds will (edit: well, should, can't promise that, CI is a bit flaky) become available at the following links once CI finishes (in an hour or two probably):

P.S. I'll make a release proper probably tomorrow lest I forget.

jiucenglou commented on June 14, 2024

> chapter-wise bibliography
>
> I don't see how that would prevent you from using native instead of markdown as an intermediary format. Does it? Because if not, that's an overall more robust approach: round trips via Markdown are not guaranteed, whereas native is essentially a snapshot of the AST.

Many thanks! Using native as an intermediate format with the command lines below seems to work very well:

./pandoc  -F pandoc-crossref Ch3/Ch3.md --resource-path=Ch3 -t native -o Ch3/Ch3_tmp.txt
./pandoc -f native  Ch3/Ch3_tmp.txt -o Ch3/Ch3_tmp.docx
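For scripting the same two steps, a small wrapper along these lines might help. It is a minimal sketch under stated assumptions: `native_pipeline` and `run_pipeline` are hypothetical helper names, `pandoc` and `pandoc-crossref` are assumed to be on the PATH, and only flags already used in this thread appear in the argument lists.

```python
import subprocess
from typing import List, Tuple


def native_pipeline(
    src: str, tmp: str, out: str, resource_path: str
) -> Tuple[List[str], List[str]]:
    """Build the argument lists for the two pandoc runs, with native in between."""
    first = [
        "pandoc", "-F", "pandoc-crossref", src,
        f"--resource-path={resource_path}",
        "-t", "native", "-o", tmp,
    ]
    second = ["pandoc", "-f", "native", tmp, "-o", out]
    return first, second


def run_pipeline(src: str, tmp: str, out: str, resource_path: str) -> None:
    """Run both steps in order; raises CalledProcessError if either step fails."""
    for argv in native_pipeline(src, tmp, out, resource_path):
        subprocess.run(argv, check=True)


first, second = native_pipeline(
    "Ch3/Ch3.md", "Ch3/Ch3_tmp.txt", "Ch3/Ch3_tmp.docx", "Ch3"
)
print(" ".join(first))
print(" ".join(second))
```

Calling `run_pipeline` with the same four arguments would execute the two commands; the snippet above only prints them, so it can be tried without pandoc installed.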

jiucenglou commented on June 14, 2024

> Anyway, probably worth making the change regardless.
>
> This should work: lierdakil/pandoc-crossref@5f2b087
>
> There is a bit of a twist, however. In some cases, pandoc-crossref will add attributes on the Figure element. If that happens, the resulting figure is impossible to represent in Markdown any more, so Pandoc will go back to representing it as raw HTML (if enabled) or nested divs. This does require explicit opt-in via pandoc-crossref configuration, and I don't really see a workaround, so I'm inclined to leave it be.
>
> @jiucenglou if you could test this commit for your use case and report back, that would be nice. Automatic builds will (edit: well, should, can't promise that, CI is a bit flaky) become available at the following links once CI finishes (in an hour or two probably):
>
> P.S. I'll make a release proper probably tomorrow lest I forget.

I can test and report back. Would you suggest I keep using native as the intermediate format even with the new patch?

jiucenglou commented on June 14, 2024

> chapter-wise bibliography
>
> I don't see how that would prevent you from using native instead of markdown as an intermediary format. Does it? Because if not, that's an overall more robust approach: round trips via Markdown are not guaranteed, whereas native is essentially a snapshot of the AST.

In my real use case, the two command lines look like

"${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t markdown-citations  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
"${Pandoc}"  "${Header}"  "${TmpMd3}"  --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"

That is, the first run uses -t markdown-citations to generate the chapter-wise bibliography. Could you suggest whether -t native will work, or whether I should use something like -t native-citations? Many thanks!

jiucenglou commented on June 14, 2024

As shown below, I tried native on my real use case and got a "couldn't read native" error on the second run.
I could not produce a minimal working example for the time being, but will post again if I manage to.

      "${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t markdown-citations  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
      "${Pandoc}"  "${Header}"  "${TmpMd3}"  --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"
      "${Pandoc}"  "${Header}"  "${TmpMd2}"  -F pandoc-crossref  --citeproc    --csl="${CiteStyle}"  -t native  -o "${TmpMd3}"  --wrap=preserve --resource-path=$(dirname "${TmpMd2}")
      "${Pandoc}"  "${Header}"  "${TmpMd3}"  -f native --fail-if-warnings  -L Dry12_for_docx.lua  -L skip_placeholder.lua  -L mhchem.lua  --reference-doc="${RefWordDocx}"  -s -o "${MSWord}"

jgm commented on June 14, 2024

Thanks @lierdakil - it looks like this isn't going to require pandoc changes, so I'll close this issue.
