transpect / docx2hub
Converts Microsoft docx to flat hub XML
License: BSD 2-Clause "Simplified" License
It appears that inline markup in indexterms results in empty index entries such as <indexterm/>. A minimal sample is attached to this ticket. The first indexterm is empty; the second looks exactly like the first one but contains no other inline markup.
Hi,
I followed these instructions: http://transpect.github.io/getting-started.html
While running:
./calabash/calabash.sh -o result=test1.xml docx2hub/xpl/docx2hub.xpl docx=test1.docx
on a sample docx containing just 3 pages, it ran for about 15 minutes at 100% CPU and then failed with this error:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# Internal Error (safepoint.cpp:310), pid=859, tid=0x00007fb7ff990700
# guarantee(PageArmed == 0) failed: invariant
#
# JRE version: Java(TM) SE Runtime Environment (8.0_101-b13) (build 1.8.0_101-b13)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.101-b13 mixed mode linux-amd64 compressed oops)
# Core dump written. Default location: /home/klo/trans_project/core or core.859
#
# An error report file with more information is saved as:
# /home/klo/trans_project/hs_err_pid859.log
[thread 140428747605760 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
./calabash/calabash.sh: line 126: 859 Aborted (core dumped) $JAVA -cp "$CLASSPATH" -Dfile.encoding=UTF-8 "-Dxml.catalog.files=$CATALOGS" -Dxml.catalog.staticCatalog=1 -Duser.language=$UI_LANG $SYSPROPS -Xmx$HEAP -Xss1024k com.xmlcalabash.drivers.$DRIVER -Xtransparent-json -E org.xmlresolver.Resolver -U org.xmlresolver.Resolver $SAXON_PROCESSOR -c $CFG $PIPERACK_PORT "$@"
Are there some other prerequisites not mentioned in the guide?
I have a docx file that leads to the following pdflatex error (I can provide the file by private email).
Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 mun-the-ra-pie
Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 nicht ge-eig-net
Underfull \hbox (badness 10000) in paragraph at lines 36--36
[]|\OT1/cmr/bx/n/10.95 nicht
Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 quan-ti-fi-
Underfull \hbox (badness 10000) in paragraph at lines 36--36
[]|\OT1/cmr/m/n/10.95 Idelalisib/Rituximab f[]uhrt zu ei-ner
Underfull \hbox (badness 3354) in paragraph at lines 36--36
\OT1/cmr/m/n/10.95 Verl[]angerung der pro-gres-si-ons-frei-en und des
Underfull \hbox (badness 3536) in paragraph at lines 36--36
\OT1/cmr/m/n/10.95 Ge-samt[]uberlebenszeit so-wie zu ei-ner Stei-ge-
Underfull \hbox (badness 10000) in alignment at lines 36--36
[][][]
! LaTeX Error: There's no line here to end.
See the LaTeX manual or LaTeX Companion for explanation.
Type H for immediate help.
...
l.40 \newline
In OOXML: <w:footnoteReference w:customMarkFollows="1" w:id="…"/>, followed by w:t with the label that is rendered in the text.
It is not clear whether DocBook’s label attribute is intended to convey the label rendered in the note or in the text. Since we only use phrase[@role='hub:identifier'] to mark up the note label, we could use @label for the differing text label. Otherwise we can maybe use @xreflabel.
Wingdings F0E0 used to be mapped to a plain right arrow, U+2192. This was done because in most cases, authors use these similar glyphs inconsistently. They don’t care whether they select a right arrow from symbol or from Wingdings. In some cases, as in differently sized box letters, the differences matter. But in most cases, the more or less fancy arrows, boxes, and circles of Wingdings should be converted to the most common Unicode symbol. In the case of F0E0, this is → rather than 🡪.
The newly introduced mapping is highly problematic for our doc→docx conversions that use Cambria for the mapped glyphs by default (unless declared otherwise, in the linked case: use Segoe UI Symbol instead of Cambria; in the case of 🡪 U+1F86A, this glyph doesn’t seem to exist in the default fonts that all users of recent MS Office versions have installed).
Most of these mappings were not made for purity; they were made for legacy doc file migration.
All the replacement font instructions have been eliminated.
We cannot use the new mappings in production. You need to introduce a mapping representation that allows us to map either to modern MS Office fonts or to an exact Unicode match (if available).
This is a very sensitive area. We only have poor and accidental test coverage for the mappings that are used in doc→docx conversions.
Therefore the new mappings will end up being used in conversions, because to the test system they appear to be compatible.
We need to fix this very quickly, or roll back to the old mappings, create a branch for the new mappings, and add an option to select MS Office font compatibility mappings.
I would like to map an empty line in the source document to a new section in TeX (I am using docx2tex), or to anything that could be post-processed; \newline would be fine too. I noticed that this part of the docx file is omitted completely: it does not appear in the 24.docx2hub_join-runs.xml file at all.
So I would like to replace any w:p that has exactly one child, namely w:pPr, with something.
<w:p w14:paraId="1E4C5E8A" w14:textId="77777777" w:rsidR="00AE18CA" w:rsidRPr="003A1575" w:rsidRDefault="00AE18CA" w:rsidP="00F03C13">
<w:pPr>
<w:spacing w:after="0" w:line="360" w:lineRule="auto"/>
<w:ind w:firstLine="284"/>
<w:jc w:val="both"/>
<w:rPr>
<w:rFonts w:ascii="Constantia" w:hAnsi="Constantia" w:cs="Courier New"/>
<w:szCs w:val="24"/>
</w:rPr>
</w:pPr>
</w:p>
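The selection condition described above (a w:p whose only child is w:pPr) can be illustrated with a minimal Python sketch. This is not docx2hub code, which is written in XSLT and XProc; the function names and the sample document are made up for illustration. In XPath, the same condition could be written roughly as `w:p[count(*) = 1][w:pPr]`.

```python
import xml.etree.ElementTree as ET

# Transitional WordprocessingML namespace
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def is_empty_paragraph(p):
    """Return True if the w:p element has exactly one child and it is w:pPr."""
    children = list(p)
    return len(children) == 1 and children[0].tag == f"{{{W_NS}}}pPr"


def find_empty_paragraphs(xml_text):
    """Return all w:p elements in the document that consist only of w:pPr."""
    root = ET.fromstring(xml_text)
    return [p for p in root.iter(f"{{{W_NS}}}p") if is_empty_paragraph(p)]


# Hypothetical two-paragraph sample: the first is "empty", the second has a run.
SAMPLE = f"""<w:body xmlns:w="{W_NS}">
  <w:p><w:pPr><w:jc w:val="both"/></w:pPr></w:p>
  <w:p><w:pPr/><w:r><w:t>Text</w:t></w:r></w:p>
</w:body>"""

if __name__ == "__main__":
    print(len(find_empty_paragraphs(SAMPLE)))
```

Once such paragraphs can be identified, they could be replaced by a marker element that a later step turns into a section break or a \newline; which marker fits best depends on the downstream configuration.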
All mappings from Wingdings characters in the 0000…00FF range seem to have disappeared from Wingdings.xml.
Characters in the 0000 range are typically equivalent to the characters in the F000 range.
Please include the mappings for the 00 variants, too, at least for the characters that were included in previous versions of the mapping.
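One way to cover both ranges without duplicating entries is to normalize a code point to its F0xx form before the lookup. The following Python sketch only illustrates that idea; it says nothing about how Wingdings.xml is actually structured, the function names are hypothetical, and the single mapping entry (F0E0 to U+2192) is the older mapping mentioned in the discussion above.

```python
def to_private_use(code_point):
    """Shift a code point from the 0000-00FF range to F000-F0FF.

    Assumption: a symbol-font character at 00xx is equivalent to F0xx.
    Code points outside 0000-00FF are returned unchanged.
    """
    if 0x00 <= code_point <= 0xFF:
        return 0xF000 + code_point
    return code_point


def lookup(mapping, code_point):
    """Look up a symbol-font code point, accepting both 00xx and F0xx forms."""
    return mapping.get(to_private_use(code_point))


# Illustrative mapping with one entry: Wingdings F0E0 -> U+2192 (right arrow)
MAPPING = {0xF0E0: "\u2192"}

if __name__ == "__main__":
    print(lookup(MAPPING, 0xE0) == lookup(MAPPING, 0xF0E0))
```

Because the issue says the two ranges are only "typically" equivalent, a real implementation would probably need a way to override individual 00xx entries.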
git clone https://github.com/transpect/docx2tex --recursive
.........
remote: Total 288 (delta 0), reused 0 (delta 0), pack-reused 287
Receiving objects: 100% (288/288), 94.07 KiB | 0 bytes/s, done.
Resolving deltas: 100% (90/90), done.
Checking connectivity... done.
Submodule path 'xslt-util': checked out '2f8c5ec6b9f7b12331338915e85c57f50ab792dc'
Unable to checkout 'd71a11f6cd39649f0c633eb7099869ad0aa78899' in submodule path 'docx2hub'
...leading to the following later error
cp: ‘/home/ajung/Downloads/myelodysplastische-syndrome-23032016t125157.docx’ and ‘/data/home/ajung/Downloads/myelodysplastische-syndrome-23032016t125157.docx’ are the same file
ERROR: xpl/docx2tex.xpl:100:67:err:XS0052:Cannot import: http://transpect.io/docx2hub/xpl/docx2hub.xpl
ERROR: cause: I/O error reported by XML parser processing http://transpect.io/docx2hub/xpl/docx2hub.xpl: http://transpect.github.io/docx2hub/xpl/docx2hub.xpl
ERROR: It is a static error if the URI of a p:import cannot be retrieved or if, once retrieved, it does not point to a p:library, p:declare-step, or p:pipeline.
ERROR: Underlying exception: I/O error reported by XML parser processing http://transpect.io/docx2hub/xpl/docx2hub.xpl: http://transpect.github.io/docx2hub/xpl/docx2hub.xpl
Sample Wmf.docx
The math object is not converted when it is in WMF image format.
Instead of MathML output, I get just this:
<para><mediaobject><imageobject><imagedata fileref="SampleWmf.docx.tmp/word/media/image1.wmf" css:width="117.5pt" css:height="56.1pt"/></imageobject></mediaobject></para>
Inline styles in index terms are not kept. After the mode docx2hub:remove-redundant-run-atts, the inline styles still exist:
<w:r>
<w:fldChar xml:id="fldChar_d3804e2370" w:fldCharType="begin"/>
</w:r>
<w:r>
<w:instrText xml:space="preserve"> XE "α</w:instrText>
</w:r>
<w:r role="ZFTiefgestellt">
<w:instrText>1</w:instrText>
</w:r>
<w:r>
<w:instrText xml:space="preserve">-Adrenozeptor-Antagonist" </w:instrText>
</w:r>
<w:r>
<w:fldChar xml:id="fldChar_d3804e2381" w:fldCharType="end"/>
</w:r>
However, after the mode docx2hub:join-instrText-runs, just the string value of the node is present:
<w:r role="ZFTiefgestellt">
<w:instrText docx2hub:fldChar-start-id="fldChar_d3804e2370" docx2hub:field-function-name="XE" docx2hub:field-function-args=""α1-Adrenozeptor-Antagonist""><quot>"</quot>α1-Adrenozeptor-Antagonist<quot>"</quot>
</w:instrText>
</w:r>
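To make the expectation concrete: instead of concatenating only the string values, a joining step could keep each fragment together with the role of its run. The Python sketch below shows this on the runs quoted above. It is not the docx2hub implementation (which is XSLT); the function name and the list-of-tuples output are assumptions, and nested fields are not handled.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# The runs from the example above, wrapped in a hypothetical w:p
SAMPLE = f"""<w:p xmlns:w="{W_NS}">
  <w:r><w:fldChar xml:id="fldChar_d3804e2370" w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> XE "α</w:instrText></w:r>
  <w:r role="ZFTiefgestellt"><w:instrText>1</w:instrText></w:r>
  <w:r><w:instrText xml:space="preserve">-Adrenozeptor-Antagonist" </w:instrText></w:r>
  <w:r><w:fldChar xml:id="fldChar_d3804e2381" w:fldCharType="end"/></w:r>
</w:p>"""


def collect_field_parts(xml_text):
    """Return (role, text) pairs for all instrText fragments inside a field.

    Simplification: assumes one non-nested field per paragraph.
    """
    root = ET.fromstring(xml_text)
    parts = []
    inside_field = False
    for run in root.iter(W + "r"):
        fld_char = run.find(W + "fldChar")
        if fld_char is not None:
            kind = fld_char.get(W + "fldCharType")
            if kind == "begin":
                inside_field = True
            elif kind == "end":
                inside_field = False
            continue
        for instr in run.findall(W + "instrText"):
            if inside_field:
                # role is unnamespaced, as in the intermediate format shown above
                parts.append((run.get("role"), instr.text or ""))
    return parts


if __name__ == "__main__":
    for role, text in collect_field_parts(SAMPLE):
        print(repr(role), repr(text))
```

With the role kept per fragment, the subscript "1" could later be emitted as inline markup inside the index term instead of being flattened into plain text.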
Attached .docx is saved as “strict docx.” Strict docx is one of the more brainfucked concepts on top of the other ISO/IEC 29500-1-related madness. It was decided that really standards-compliant OOXML files would have the same namespace prefixes, but different namespace URIs. That makes all XML-based processing tools moot.
So we either need a preprocessing step within docx2hub that replaces the namespaces when creating the single tree or a standalone strict→transitional step.
Simply manipulating namespaces in the single tree document may not be enough for cases where we first unzip the docx file, create a single tree and only selectively overwrite some of the unzipped files with manipulated chunks. Then some files in the archive will have xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" while others have xmlns:w="http://purl.oclc.org/ooxml/wordprocessingml/main". This will probably lead to an error when opening the manipulated docx.
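A standalone strict→transitional step would have to rewrite the namespace declarations in every XML part of the archive, not only in the single tree. The following Python sketch shows the core of such a rewrite for one part. It is a simplification under stated assumptions: the table contains only the two wordprocessingml URIs quoted above (a real step would need entries for the other strict namespaces as well), the function name is hypothetical, and only double-quoted declarations are handled.

```python
import xml.etree.ElementTree as ET

# Only the pair mentioned in the text; further strict namespaces would be added here.
STRICT_TO_TRANSITIONAL = {
    "http://purl.oclc.org/ooxml/wordprocessingml/main":
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main",
}


def strict_to_transitional(xml_text):
    """Replace strict namespace URIs with their transitional counterparts.

    Plain string replacement on quoted URIs; attribute values or content that
    differ between strict and transitional files are not converted.
    """
    for strict, transitional in STRICT_TO_TRANSITIONAL.items():
        xml_text = xml_text.replace(f'"{strict}"', f'"{transitional}"')
    return xml_text


# Hypothetical minimal strict document part
STRICT_SAMPLE = (
    '<w:document xmlns:w="http://purl.oclc.org/ooxml/wordprocessingml/main">'
    "<w:body><w:p/></w:body></w:document>"
)

if __name__ == "__main__":
    converted = strict_to_transitional(STRICT_SAMPLE)
    print(ET.fromstring(converted).tag)
```

Applying the same function to every part after unzipping would avoid the mixed-namespace archive described above, at the cost of touching files that are otherwise left unmodified.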
According to my understanding, docx sections (<w:sectPr>) are not currently supported. How much effort would it take to implement this feature?
I followed the instructions on the getting-started page.
The first Calabash example fails:
./calabash/calabash.sh -o result=MyXMLfile.xml docx2hub/xpl/docx2hub.xpl docx=MyWordfile.docx
Exception in thread "main" java.lang.NoClassDefFoundError: javax/activation/DataSource
at java.base/java.lang.Class.getDeclaredMethods0(Native Method)
at java.base/java.lang.Class.privateGetDeclaredMethods(Class.java:3167)
at java.base/java.lang.Class.getMethodsRecursive(Class.java:3308)
at java.base/java.lang.Class.getMethod0(Class.java:3294)
at java.base/java.lang.Class.getMethod(Class.java:2107)
at com.xmlcalabash.core.XProcRuntime.initializeSteps(XProcRuntime.java:347)
at com.xmlcalabash.core.XProcRuntime.<init>(XProcRuntime.java:296)
at com.xmlcalabash.drivers.Main.run(Main.java:100)
at com.xmlcalabash.drivers.Main.main(Main.java:83)
Caused by: java.lang.ClassNotFoundException: javax.activation.DataSource
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:604)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
... 9 more
ls -la
total 36
drwx------ 7 ajung ajung 4096 Nov 29 16:44 .
drwx------ 155 ajung ajung 12288 Nov 29 16:43 ..
drwx------ 10 ajung ajung 4096 Nov 29 16:43 calabash
drwx------ 8 ajung ajung 4096 Nov 29 16:44 docx2hub
drwx------ 9 ajung ajung 4096 Nov 29 16:44 htmlreports
drwx------ 2 ajung ajung 4096 Nov 29 16:44 xmlcatalog
drwx------ 27 ajung ajung 4096 Nov 29 16:44 xslt-util
When converting, just after the docx has been unzipped, I get this error:
...
[info] Unzip finished successfully.
Message: Mode: insert-xpath
ERROR: file:/xxxxxxxxxx/docx2hub/xpl/mathtype2mml.xpl:119:49:Undeclared input port 'additional-font-maps' on step tr:mathtype2mml named convert-wmf at file:/xxxxxxx/docx2hub/xpl/mathtype2mml.xpl:119
Note that I call docx2hub:convert passing in
<p:with-option name="mathtype2mml" select="'no'"/>
Attached is a minimal document sent by 关宗江 via email.
It contains a Heading 1 paragraph that is displayed as such when opened in an English version of Word. However, it contains w:styleId="1" instead of w:styleId="Heading1".
If we used the information given in <w:name w:val="heading 1"/>, we could generate a css:rule/@name that will be processed correctly downstream in conversion processes, in particular when using the default docx2tex configuration.
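The lookup that this proposal relies on, from w:styleId to the value of w:name, can be sketched in a few lines of Python. This is an illustration only, not how docx2hub builds css:rule/@name; the function name and the reduced styles fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = f"{{{W_NS}}}"

# Hypothetical, heavily reduced styles.xml with the localized style ID "1"
STYLES = f"""<w:styles xmlns:w="{W_NS}">
  <w:style w:type="paragraph" w:styleId="1">
    <w:name w:val="heading 1"/>
  </w:style>
</w:styles>"""


def style_names_by_id(styles_xml):
    """Map each w:styleId to the w:val of its w:name child."""
    root = ET.fromstring(styles_xml)
    names = {}
    for style in root.iter(W + "style"):
        style_id = style.get(W + "styleId")
        name = style.find(W + "name")
        if style_id is not None and name is not None:
            names[style_id] = name.get(W + "val")
    return names


if __name__ == "__main__":
    print(style_names_by_id(STYLES))
```

A rule name derived from "heading 1" rather than from the ID "1" would then be independent of the Word localization that produced the file.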
In the attached docx file, there is a field code {INCLUDEPICTURE "008.tif"}. In document.xml, <w:instrText>INCLUDEPICTURE "B08.tif"</w:instrText> is repeated 51 times, nested in properly balanced begin/end fldChars. This caused join-runs.xsl to raise said looping error.
For example, FÖD 1987\:48 should become <primary>FÖD 1987:48</primary>. Currently, it is split into <primary>FÖD 1987\</primary><secondary>48</secondary>. Reported by @sgmlguru.
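The expected behaviour can be expressed as "split on colons that are not preceded by a backslash, then unescape". The Python sketch below shows that rule with a regular expression. It is not the docx2hub fix itself, and the function name is hypothetical.

```python
import re


def split_index_levels(entry):
    """Split an index entry into levels at unescaped colons.

    A colon preceded by a backslash is treated as a literal colon.
    """
    # (?<!\\): matches a colon that is not directly preceded by a backslash
    parts = re.split(r"(?<!\\):", entry)
    return [part.replace("\\:", ":") for part in parts]


if __name__ == "__main__":
    print(split_index_levels("FÖD 1987\\:48"))
    print(split_index_levels("Primary:Secondary"))
```

The first level of the result corresponds to primary, the second to secondary, and so on.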