transpect / docx2hub

Converts Microsoft docx to flat hub XML

License: BSD 2-Clause "Simplified" License

XProc 10.78% XSLT 89.18% Shell 0.04%

docx mathml msword office ooxml word

docx2hub's Issues

empty index terms

It appears that inline markup in indexterms results in empty index entries such as <indexterm/>. A minimal sample is attached to this ticket. The first indexterm is empty, the second looks exactly like the first one, but contains no other inline markup.

Processing error

Hi,

I followed these instructions: http://transpect.github.io/getting-started.html

While running:

./calabash/calabash.sh -o result=test1.xml docx2hub/xpl/docx2hub.xpl docx=test1.docx

on a sample docx containing just 3 pages, it ran for about 15 minutes at 100% CPU and then failed with this error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  Internal Error (safepoint.cpp:310), pid=859, tid=0x00007fb7ff990700
#  guarantee(PageArmed == 0) failed: invariant
#
# JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
# Core dump written. Default location: /home/klo/trans_project/core or core.859
#
# An error report file with more information is saved as:
# /home/klo/trans_project/hs_err_pid859.log
[thread 140428747605760 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
./calabash/calabash.sh: line 126:   859 Aborted                 (core dumped) $JAVA -cp "$CLASSPATH" -Dfile.encoding=UTF-8 "-Dxml.catalog.files=$CATALOGS" -Dxml.catalog.staticCatalog=1 -Duser.language=$UI_LANG $SYSPROPS -Xmx$HEAP -Xss1024k com.xmlcalabash.drivers.$DRIVER -Xtransparent-json -E org.xmlresolver.Resolver -U org.xmlresolver.Resolver $SAXON_PROCESSOR -c $CFG $PIPERACK_PORT "$@"

Are there some other prerequisites not mentioned in the guide?

! LaTeX Error: There's no line here to end.

I have a docx file that led to the following pdflatex error (I can provide the file by private email).

Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 mun-the-ra-pie

Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 nicht ge-eig-net

Underfull \hbox (badness 10000) in paragraph at lines 36--36
[]|\OT1/cmr/bx/n/10.95 nicht

Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 quan-ti-fi-

Underfull \hbox (badness 10000) in paragraph at lines 36--36
[]|\OT1/cmr/m/n/10.95 Idelalisib/Rituximab f[]uhrt zu ei-ner

Underfull \hbox (badness 3354) in paragraph at lines 36--36
\OT1/cmr/m/n/10.95 Verl[]angerung der pro-gres-si-ons-frei-en und des

Underfull \hbox (badness 3536) in paragraph at lines 36--36
\OT1/cmr/m/n/10.95 Ge-samt[]uberlebenszeit so-wie zu ei-ner Stei-ge-

Underfull \hbox (badness 10000) in alignment at lines 36--36
[][][]

! LaTeX Error: There's no line here to end.

See the LaTeX manual or LaTeX Companion for explanation.
Type H for immediate help.
...

l.40 \newline

Convey information when a footnote label rendered in the text differs from the label in the note

In OOXML: <w:footnoteReference w:customMarkFollows="1" w:id="…"/>, followed by w:t with the label that is rendered in the text.
It is not clear whether DocBook’s label attribute is intended to convey the label rendered in the note or in the text. Since we only use phrase[@role='hub:identifier'] to mark up the note label, we could use @label for the differing text label. Otherwise we can maybe use @xreflabel.
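
To make the case concrete, here is a minimal Python sketch that finds footnote references with a custom mark and reads the label rendered in the text. It is not part of docx2hub. It assumes that the w:t carrying the label sits in the same w:r as the w:footnoteReference; the function name `custom_marks` and the sample markup are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def custom_marks(root):
    """Return (footnote id, text label) pairs for references that set
    w:customMarkFollows. Assumption: the label's w:t is in the same run."""
    marks = []
    for run in root.iter(f"{{{W}}}r"):
        ref = run.find(f"{{{W}}}footnoteReference")
        if ref is None:
            continue
        if ref.get(f"{{{W}}}customMarkFollows") not in ("1", "true", "on"):
            continue
        label = "".join(t.text or "" for t in run.findall(f"{{{W}}}t"))
        marks.append((ref.get(f"{{{W}}}id"), label))
    return marks

# Hypothetical sample: one custom-marked reference, one ordinary reference.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>Text</w:t></w:r>
  <w:r><w:footnoteReference w:customMarkFollows="1" w:id="2"/><w:t>*</w:t></w:r>
  <w:r><w:footnoteReference w:id="3"/></w:r>
</w:p>"""

marks = custom_marks(ET.fromstring(SAMPLE))
```

The collected label could then be written to whichever attribute is chosen (@label or @xreflabel).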

possibly problematic mapping changes, for ex. Wingdings F0E0→U+1F86A

Wingdings F0E0 used to be mapped to a plain right arrow, U+2192. This was done because in most cases, authors use these similar glyphs inconsistently. They don’t care whether they select a right arrow from symbol or from Wingdings. In some cases, as in differently sized box letters, the differences matter. But in most cases, the more or less fancy arrows, boxes, and circles of Wingdings should be converted to the most common Unicode symbol. In the case of F0E0, this is → rather than 🡪.
The newly introduced mapping is highly problematic for our doc→docx conversions that use Cambria for the mapped glyphs by default (unless declared otherwise, in the linked case: use Segoe UI Symbol instead of Cambria; in the case of 🡪 U+1F86A, this glyph doesn’t seem to exist in the default fonts that all users of recent MS Office versions have installed).
Most of these mappings have not been done for purity, they have been done for legacy doc file migration.
All the replacement font instructions have been eliminated.
We cannot use the new mappings in production. You need to introduce a mapping representation that allows us to map either to modern MS Office fonts or to exact Unicode match (if available).
This is a very sensitive area. We only have poor and accidental test coverage for the mappings that are used in doc→docx conversions.
Therefore the new mappings will be used in conversions because they appear to be compatible with the test system.
We need to fix this very quickly or roll back to the old mappings and create a branch for the new mappings and an option to select MS Office font compatibility mappings.

Mapping an empty line to a new section

I would like to map an empty line in the source document to produce a new section in tex (I am using docx2tex), or anything that could be post processed, \newline would be fine too. I noticed that this section from the docx file is completely omitted, hence it does not appear in the 24.docx2hub_join-runs.xml file at all.

So I would like to replace any w:p whose only child is w:pPr with something. An example of such a paragraph:

    <w:p w14:paraId="1E4C5E8A" w14:textId="77777777" w:rsidR="00AE18CA" w:rsidRPr="003A1575" w:rsidRDefault="00AE18CA" w:rsidP="00F03C13">
      <w:pPr>
        <w:spacing w:after="0" w:line="360" w:lineRule="auto"/>
        <w:ind w:firstLine="284"/>
        <w:jc w:val="both"/>
        <w:rPr>
          <w:rFonts w:ascii="Constantia" w:hAnsi="Constantia" w:cs="Courier New"/>
          <w:szCs w:val="24"/>
        </w:rPr>
      </w:pPr>
    </w:p>
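
One possible way to pre-process document.xml before conversion, sketched in Python. This is not a docx2hub feature; the placeholder text `EMPTY-LINE` and both function names are assumptions chosen for illustration. The idea is to give each paragraph that has no children other than w:pPr a run with marker text that a later step (for example a docx2tex configuration) could match.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def is_empty_paragraph(p):
    """True if the w:p element has no children other than w:pPr."""
    return all(child.tag == f"{{{W}}}pPr" for child in p)

def mark_empty_paragraphs(root, marker="EMPTY-LINE"):
    """Append a run with placeholder text to every empty paragraph.
    Returns the number of paragraphs that were changed."""
    count = 0
    # Materialize the list first so the tree is not modified while iterating.
    for p in list(root.iter(f"{{{W}}}p")):
        if is_empty_paragraph(p):
            r = ET.SubElement(p, f"{{{W}}}r")
            t = ET.SubElement(r, f"{{{W}}}t")
            t.text = marker
            count += 1
    return count

# Hypothetical sample: one empty paragraph, one paragraph with text.
SAMPLE = f"""<w:body xmlns:w="{W}">
  <w:p><w:pPr><w:jc w:val="both"/></w:pPr></w:p>
  <w:p><w:pPr/><w:r><w:t>Text</w:t></w:r></w:p>
</w:body>"""

root = ET.fromstring(SAMPLE)
changed = mark_empty_paragraphs(root)
texts = [t.text for t in root.iter(f"{{{W}}}t")]
```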

restore mapping of (for example) Wingdings char 00E0

All mappings from Wingdings characters in the 0000…00FF range seem to have disappeared from Wingdings.xml.
Characters in the 0000 range are typically equivalent to the characters in the F000 range.
Please include the mappings for the 00 variants, too, at least for the characters that were included in previous versions of the mapping.
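
The equivalence of the two ranges could be expressed as a fallback lookup, sketched here in Python. The table and the function name `lookup` are hypothetical and say nothing about the actual format of Wingdings.xml; the only mapping pair used is F0E0 to U+2192, which was the earlier mapping for that character.

```python
# Hypothetical mapping table keyed by code point in the F000 range.
F000_MAP = {0xF0E0: "\u2192"}

def lookup(codepoint, table=F000_MAP):
    """Look up a symbol-font code point, treating 00xx as an alias of F0xx."""
    if codepoint in table:
        return table[codepoint]
    if 0x0000 <= codepoint <= 0x00FF:
        # Fall back to the F000 variant of the same character.
        return table.get(0xF000 + codepoint)
    return None
```

With such a fallback, `lookup(0x00E0)` and `lookup(0xF0E0)` return the same replacement character, and unmapped code points return `None`.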

Unable to checkout 'd71a11f6cd39649f0c633eb7099869ad0aa78899' in submodule path 'docx2hub'

git clone https://github.com/transpect/docx2tex --recursive

.........
remote: Total 288 (delta 0), reused 0 (delta 0), pack-reused 287
Receiving objects: 100% (288/288), 94.07 KiB | 0 bytes/s, done.
Resolving deltas: 100% (90/90), done.
Checking connectivity... done.
Submodule path 'xslt-util': checked out '2f8c5ec6b9f7b12331338915e85c57f50ab792dc'
Unable to checkout 'd71a11f6cd39649f0c633eb7099869ad0aa78899' in submodule path 'docx2hub'

...leading to the following later error


cp: ‘/home/ajung/Downloads/myelodysplastische-syndrome-23032016t125157.docx’ and ‘/data/home/ajung/Downloads/myelodysplastische-syndrome-23032016t125157.docx’ are the same file
ERROR: xpl/docx2tex.xpl:100:67:err:XS0052:Cannot import: http://transpect.io/docx2hub/xpl/docx2hub.xpl
ERROR:     cause: I/O error reported by XML parser processing http://transpect.io/docx2hub/xpl/docx2hub.xpl: http://transpect.github.io/docx2hub/xpl/docx2hub.xpl
ERROR: It is a static error if the URI of a p:import cannot be retrieved or if, once retrieved, it does not point to a p:library, p:declare-step, or p:pipeline.
ERROR: Underlying exception: I/O error reported by XML parser processing http://transpect.io/docx2hub/xpl/docx2hub.xpl: http://transpect.github.io/docx2hub/xpl/docx2hub.xpl

WMF equations are not Converted

Sample Wmf.docx
A math object is not converted when it is in WMF image format.
Instead of MathML output, I get just this:

<para><mediaobject><imageobject><imagedata fileref="SampleWmf.docx.tmp/word/media/image1.wmf" css:width="117.5pt" css:height="56.1pt"/></imageobject></mediaobject></para>

keep inline styles in index terms

Inline styles in index terms are not kept. After the mode docx2hub:remove-redundant-run-atts, the inline styles still exist:

<w:r>
  <w:fldChar xml:id="fldChar_d3804e2370" w:fldCharType="begin"/>
</w:r>
<w:r>
  <w:instrText xml:space="preserve"> XE "α</w:instrText>
</w:r>
<w:r role="ZFTiefgestellt">
  <w:instrText>1</w:instrText>
</w:r>
<w:r>
  <w:instrText xml:space="preserve">-Adrenozeptor-Antagonist" </w:instrText>
</w:r>
<w:r>
  <w:fldChar xml:id="fldChar_d3804e2381" w:fldCharType="end"/>
</w:r>

However, after the mode docx2hub:join-instrText-runs just the string value of the node is present:

<w:r role="ZFTiefgestellt">
  <w:instrText docx2hub:fldChar-start-id="fldChar_d3804e2370" docx2hub:field-function-name="XE" docx2hub:field-function-args="&#34;α1-Adrenozeptor-Antagonist&#34;"><quot>"</quot>α1-Adrenozeptor-Antagonist<quot>"</quot>
</w:instrText>
</w:r>
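
A rough Python sketch of what keeping the styles could mean: collect the field instruction as a list of (role, text) segments instead of a single string value. This is not how the docx2hub:join-instrText-runs mode is implemented. The function name `field_segments` is made up, nested fields are not handled, and the unprefixed `role` attribute is simply read as it appears in the intermediate markup above.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def field_segments(root):
    """Collect (role, text) pairs for w:instrText between a 'begin' and an
    'end' w:fldChar. Simplification: nested fields are not handled."""
    segments = []
    inside = False
    for run in root.iter(f"{{{W}}}r"):
        fld = run.find(f"{{{W}}}fldChar")
        if fld is not None:
            kind = fld.get(f"{{{W}}}fldCharType")
            if kind == "begin":
                inside = True
            elif kind == "end":
                inside = False
            continue
        if inside:
            for instr in run.findall(f"{{{W}}}instrText"):
                segments.append((run.get("role"), instr.text or ""))
    return segments

# Sample modelled on the markup quoted above.
SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> XE "α</w:instrText></w:r>
  <w:r role="ZFTiefgestellt"><w:instrText>1</w:instrText></w:r>
  <w:r><w:instrText xml:space="preserve">-Adrenozeptor-Antagonist" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""

segments = field_segments(ET.fromstring(SAMPLE))
```

The subscript role stays attached to the "1" segment, so a later step could still turn it into inline markup inside the index term.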

Need strict docx → transitional docx conversion step

Attached .docx is saved as “strict docx.” Strict docx is one of the more brainfucked concepts on top of the other ISO/IEC 29500-1-related madness. It was decided that really standards-compliant OOXML files would have the same namespace prefixes, but different namespace URIs. That makes all XML-based processing tools moot.

So we either need a preprocessing step within docx2hub that replaces the namespaces when creating the single tree or a standalone strict→transitional step.

Simply manipulating namespaces in the single tree document may not be enough for cases where we first unzip the docx file, create a single tree and only selectively overwrite some of the unzipped files with manipulated chunks. Then some files in the archive will have xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" while others have xmlns:w="http://purl.oclc.org/ooxml/wordprocessingml/main". This will probably lead to an error when opening the manipulated docx.

https://social.technet.microsoft.com/Forums/en-US/e969fc0a-9fcd-4efe-bf6d-79ea8c34360f/what-is-the-default-file-format-for-saving-in-ms-office-2013-is-it-still-the-transitional-ooxml-or?forum=officeitpro
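
A minimal Python sketch of the namespace replacement, assuming it is applied to the text of every XML part in the archive so that no part keeps the strict URI. The table contains only the WordprocessingML pair quoted above; a real conversion would need the other strict namespaces as well, and strict files may differ from transitional ones in further respects that this sketch ignores. The names `STRICT_TO_TRANSITIONAL` and `to_transitional` are hypothetical.

```python
# Incomplete table: only the pair mentioned in this issue.
STRICT_TO_TRANSITIONAL = {
    "http://purl.oclc.org/ooxml/wordprocessingml/main":
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}

def to_transitional(xml_text, table=STRICT_TO_TRANSITIONAL):
    """Replace quoted strict namespace URIs with their transitional ones."""
    for strict, transitional in table.items():
        # Match the URI including its quotes so longer URIs are not clipped.
        xml_text = xml_text.replace(f'"{strict}"', f'"{transitional}"')
    return xml_text

strict_doc = (
    '<w:document xmlns:w="http://purl.oclc.org/ooxml/wordprocessingml/main"/>'
)
converted = to_transitional(strict_doc)
```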

Sections

According to my understanding, docx sections (<w:sectPr>) are not currently supported. What would be the effort of implementing this feature?

"Getting started" not working

I followed the instructions on the getting-started page.

The first Calabash example fails:

 ./calabash/calabash.sh -o result=MyXMLfile.xml docx2hub/xpl/docx2hub.xpl docx=MyWordfile.docx

Exception in thread "main" java.lang.NoClassDefFoundError: javax/activation/DataSource
	at java.base/java.lang.Class.getDeclaredMethods0(Native Method)
	at java.base/java.lang.Class.privateGetDeclaredMethods(Class.java:3167)
	at java.base/java.lang.Class.getMethodsRecursive(Class.java:3308)
	at java.base/java.lang.Class.getMethod0(Class.java:3294)
	at java.base/java.lang.Class.getMethod(Class.java:2107)
	at com.xmlcalabash.core.XProcRuntime.initializeSteps(XProcRuntime.java:347)
	at com.xmlcalabash.core.XProcRuntime.<init>(XProcRuntime.java:296)
	at com.xmlcalabash.drivers.Main.run(Main.java:100)
	at com.xmlcalabash.drivers.Main.main(Main.java:83)
Caused by: java.lang.ClassNotFoundException: javax.activation.DataSource
	at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:604)
	at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
	at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
	... 9 more

 ls -la
total 36
drwx------   7 ajung ajung  4096 Nov 29 16:44 .
drwx------ 155 ajung ajung 12288 Nov 29 16:43 ..
drwx------  10 ajung ajung  4096 Nov 29 16:43 calabash
drwx------   8 ajung ajung  4096 Nov 29 16:44 docx2hub
drwx------   9 ajung ajung  4096 Nov 29 16:44 htmlreports
drwx------   2 ajung ajung  4096 Nov 29 16:44 xmlcatalog
drwx------  27 ajung ajung  4096 Nov 29 16:44 xslt-util

broken conversion???

When converting, just after the docx has been unzipped, I get this error:
...
[info] Unzip finished successfully.
Message: Mode: insert-xpath
ERROR: file:/xxxxxxxxxx/docx2hub/xpl/mathtype2mml.xpl:119:49:Undeclared input port 'additional-font-maps' on step tr:mathtype2mml named convert-wmf at file:/xxxxxxx/docx2hub/xpl/mathtype2mml.xpl:119

Note that I call docx2hub:convert passing in
<p:with-option name="mathtype2mml" select="'no'"/>

Treat built-in Word styles consistently no matter the locale

Attached is a minimal document sent by 关宗江 via email.

It contains a Heading 1 paragraph that will display as such when opened in an English-language Word. However, it contains w:styleId="1" instead of w:styleId="Heading1".

If we used the information given in <w:name w:val="heading 1"/>, we could generate a css:rule/@name that will be processed correctly downstream in conversion processes, in particular when using the default docx2tex configuration.
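
A small Python sketch of the idea: read styles.xml and derive a locale-independent name for each style from w:name instead of relying on w:styleId. The normalization rule (capitalize the first letter of each word and drop spaces, so "heading 1" becomes "Heading1") is an assumption for illustration, not the rule docx2hub uses for css:rule/@name. The function name `style_names` and the second sample style are made up.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def style_names(styles_root):
    """Map each w:styleId to a name derived from w:name/@w:val."""
    result = {}
    for style in styles_root.iter(f"{{{W}}}style"):
        style_id = style.get(f"{{{W}}}styleId")
        name_el = style.find(f"{{{W}}}name")
        if style_id is None or name_el is None:
            continue
        name = name_el.get(f"{{{W}}}val", "")
        # Assumed normalization: "heading 1" -> "Heading1".
        result[style_id] = "".join(
            part[:1].upper() + part[1:] for part in name.split()
        )
    return result

# Hypothetical styles.xml fragment with a localized styleId.
SAMPLE = f"""<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="1"><w:name w:val="heading 1"/></w:style>
  <w:style w:type="paragraph" w:styleId="a"><w:name w:val="Normal"/></w:style>
</w:styles>"""

names = style_names(ET.fromstring(SAMPLE))
```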

51 times nested INCLUDEPICTURE triggers the "docx2hub:nest-field-functions may be looping" error

In the attached docx file, there is a field code {INCLUDEPICTURE "008.tif"}. In document.xml, <w:instrText>INCLUDEPICTURE "B08.tif"</w:instrText> is repeated 51 times, nested in properly balanced begin/end fldChars. This caused join-runs.xsl to raise said looping error.

9_2962e_CDV_en_2.docx
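
For diagnosing such files before conversion, a short Python sketch can report how deeply the fields in a document.xml are nested and whether begin and end are balanced. It is a standalone check, not part of join-runs.xsl; the names `max_field_depth` and `nested_field` and the generated sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def max_field_depth(root):
    """Return (maximum nesting depth, balanced?) of w:fldChar begin/end."""
    depth = max_depth = 0
    for fld in root.iter(f"{{{W}}}fldChar"):
        kind = fld.get(f"{{{W}}}fldCharType")
        if kind == "begin":
            depth += 1
            max_depth = max(max_depth, depth)
        elif kind == "end":
            depth -= 1
    return max_depth, depth == 0

def nested_field(levels):
    """Build a hypothetical paragraph with `levels` nested fields."""
    begin = '<w:r><w:fldChar w:fldCharType="begin"/></w:r>'
    end = '<w:r><w:fldChar w:fldCharType="end"/></w:r>'
    body = "<w:r><w:instrText>INCLUDEPICTURE</w:instrText></w:r>"
    return f'<w:p xmlns:w="{W}">' + (begin + body) * levels + end * levels + "</w:p>"

result = max_field_depth(ET.fromstring(nested_field(51)))
```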

Quoted colons should not split indexterms

For example, FÖD 1987\:48 should become <primary>FÖD 1987:48</primary>. Currently, it will be split into <primary>FÖD 1987\</primary><secondary>48</secondary>. Reported by @sgmlguru.
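
The expected behaviour can be sketched in Python: split only on colons that are not preceded by a backslash, then unescape the quoted colons. This illustrates the rule, not the XSLT used in docx2hub; the function name `split_index_term` is made up, and an escaped backslash directly before a colon is not handled.

```python
import re

def split_index_term(text):
    """Split an index entry into levels on unquoted colons only."""
    # The negative lookbehind skips colons that are preceded by a backslash.
    levels = re.split(r"(?<!\\):", text)
    # Turn the quoted colons into plain colons.
    return [level.replace("\\:", ":") for level in levels]

primary_only = split_index_term("FÖD 1987\\:48")
two_levels = split_index_term("primary:secondary")
```

Here `primary_only` has a single level containing the colon, while `two_levels` is split into primary and secondary as before.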