mwilliamson / python-mammoth

Convert Word documents (.docx files) to HTML

License: BSD 2-Clause "Simplified" License

Makefile 0.33% Python 99.67%

python-mammoth's Issues

mammoth.convert_to_html got an error

The code:

with open("test2.docx", "rb") as f:
    print mammoth.convert_to_html(f).value

and the result:

Traceback (most recent call last):
File "D:/myp/02_project/170601_test_mammoth/test_mammoth/main.py", line 15, in <module>
main()
File "D:/myp/02_project/170601_test_mammoth/test_mammoth/main.py", line 11, in main
print mammoth.convert_to_html(f).value
File "E:\python\lib\site-packages\mammoth\__init__.py", line 12, in convert_to_html
return convert(*args, output_format="html", **kwargs)
File "E:\python\lib\site-packages\mammoth\__init__.py", line 26, in convert
return options.read_options(kwargs).bind(lambda convert_options:
File "E:\python\lib\site-packages\mammoth\results.py", line 15, in bind
result = func(self.value)
File "E:\python\lib\site-packages\mammoth\__init__.py", line 27, in <lambda>
docx.read(fileobj).map(transform_document).bind(lambda document:
File "E:\python\lib\site-packages\mammoth\docx\__init__.py", line 26, in read
]).bind(lambda referents:
File "E:\python\lib\site-packages\mammoth\results.py", line 15, in bind
result = func(self.value)
File "E:\python\lib\site-packages\mammoth\docx\__init__.py", line 27, in <lambda>
_read_document(zip_file, body_readers, notes=referents[0], comments=referents[1])
File "E:\python\lib\site-packages\mammoth\docx\__init__.py", line 59, in _read_document
comments=comments,
File "E:\python\lib\site-packages\mammoth\docx\document_xml.py", line 16, in read_document_xml_element
return body_reader.read_all(body_element.children)
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 42, in read_all
result = self._read_all(elements)
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 474, in _read_xml_elements
return _ReadResult.concat(lists.map(read, elements))
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 469, in read
return handler(element)
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 120, in paragraph
_read_xml_elements(element.children),
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 474, in _read_xml_elements
return _ReadResult.concat(lists.map(read, elements))
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 469, in read
return handler(element)
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 93, in run
_read_xml_elements(element.children).map(add_complex_field_hyperlink),
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 474, in _read_xml_elements
return _ReadResult.concat(lists.map(read, elements))
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 469, in read
return handler(element)
File "E:\python\lib\site-packages\mammoth\docx\body_xml.py", line 148, in read_fld_char
complex_field_stack.pop()
IndexError: pop from empty list

Question: How to get highlighted or shaded text?

Hello,

Thanks for python-mammoth. I have just started trying it and find that it has potential.

How can we create a custom style map for highlighted or shaded text? From examining document.xml, I see the following for shaded text:
<w:pPr><w:shd w:val="clear" w:color="auto" w:fill="FF0000"/> ;
and this one for highlighted text: <w:highlight w:val="lightGray"/>

I have tried the following code without success:

style_map = """
p[style-name='highlight'] => p.myclass1:fresh
p[style-name='shd'] => p.myclass2:fresh
"""
Thanks,
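
For anyone debugging a similar case, it can help to confirm where the highlight and shading actually live in document.xml. Both are direct formatting elements rather than named styles, which is presumably why a style-name selector does not match them. The following is a minimal Python 3 sketch using only the standard library; the sample XML and the find_formatting helper are made up for illustration and are not part of Mammoth.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace, written in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hypothetical fragment modelled on the elements quoted in the question.
SAMPLE = """
<w:body xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:p>
    <w:pPr><w:shd w:val="clear" w:color="auto" w:fill="FF0000"/></w:pPr>
    <w:r><w:t>shaded paragraph</w:t></w:r>
  </w:p>
  <w:p>
    <w:r>
      <w:rPr><w:highlight w:val="lightGray"/></w:rPr>
      <w:t>highlighted run</w:t>
    </w:r>
  </w:p>
</w:body>
"""


def find_formatting(xml_text):
    """Return (kind, value, text) for shaded paragraphs and highlighted runs."""
    root = ET.fromstring(xml_text)
    found = []
    for paragraph in root.iter(W + "p"):
        # Paragraph shading sits in the paragraph properties (w:pPr).
        shd = paragraph.find(W + "pPr/" + W + "shd")
        if shd is not None:
            text = "".join(t.text or "" for t in paragraph.iter(W + "t"))
            found.append(("shd", shd.get(W + "fill"), text))
        # Highlighting sits in the run properties (w:rPr).
        for run in paragraph.iter(W + "r"):
            highlight = run.find(W + "rPr/" + W + "highlight")
            if highlight is not None:
                text = "".join(t.text or "" for t in run.iter(W + "t"))
                found.append(("highlight", highlight.get(W + "val"), text))
    return found


for item in find_formatting(SAMPLE):
    print(item)
```

This only locates the formatting; turning it into HTML classes would still need support on the Mammoth side or a post-processing step.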

not working for docx to html conversion

mammoth sample-04.docx my.html
Unsupported break type: page
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:instrText
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:fldChar
An unrecognised element was ignored: w:tblPrEx
An unrecognised element was ignored: w:trPr
An unrecognised element was ignored: w:tblPrEx
An unrecognised element was ignored: w:tblPrEx
An unrecognised element was ignored: w:tblPrEx
Unrecognised paragraph style: Legal notice (Style ID: Legalnotice)
Unrecognised paragraph style: Title (Style ID: Title)
Unrecognised paragraph style: Subtitle (Style ID: Subtitle)
Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo)
Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription)
Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo)
Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo)
Unrecognised paragraph style: Contributor (Style ID: Contributor)
Unrecognised paragraph style: Contributor (Style ID: Contributor)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo)
Unrecognised paragraph style: Contributor (Style ID: Contributor)
Unrecognised paragraph style: Contributor (Style ID: Contributor)
Unrecognised paragraph style: Contributor (Style ID: Contributor)
Unrecognised paragraph style: Contributor (Style ID: Contributor)
Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo)
Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription)
Unrecognised paragraph style: Title page info (Style ID: Titlepageinfo)
Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription)
Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: Title page info description (Style ID: Titlepageinfodescription)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: Subtitle (Style ID: Subtitle)
Unrecognised paragraph style: toc 1 (Style ID: TOC1)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 1 (Style ID: TOC1)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 1 (Style ID: TOC1)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 2 (Style ID: TOC2)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 1 (Style ID: TOC1)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 1 (Style ID: TOC1)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: toc 1 (Style ID: TOC1)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: Legal notice (Style ID: Legalnotice)
Unrecognised run style: Ref term (Style ID: Refterm)
Unrecognised paragraph style: Definition Term (Style ID: DefinitionTerm0)
Unrecognised paragraph style: Definition (Style ID: Definition)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: List Continue (Style ID: ListContinue)
Unrecognised paragraph style: List Bullet 2 (Style ID: ListBullet2)
Unrecognised paragraph style: List Continue 2 (Style ID: ListContinue2)
Unrecognised run style: Ref term (Style ID: Refterm)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code (Style ID: Code)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Code small (Style ID: Codesmall)
Unrecognised paragraph style: Example (Style ID: Example)
Unrecognised paragraph style: Example (Style ID: Example)
Unrecognised paragraph style: Example small (Style ID: Examplesmall)
Unrecognised paragraph style: Example small (Style ID: Examplesmall)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised run style: Element (Style ID: Element)
Unrecognised run style: Element (Style ID: Element)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised run style: Attribute (Style ID: Attribute)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised run style: Datatype (Style ID: Datatype)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised run style: Keyword (Style ID: Keyword)
Unrecognised run style: Keyword (Style ID: Keyword)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised run style: Variable (Style ID: Variable)
Unrecognised paragraph style: Ref (Style ID: Ref)
Unrecognised run style: Ref term (Style ID: Refterm)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised run style: Hyperlink (Style ID: Hyperlink)
Unrecognised paragraph style: AppendixHeading1 (Style ID: AppendixHeading1)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: List Bullet (Style ID: ListBullet)
Unrecognised paragraph style: AppendixHeading1 (Style ID: AppendixHeading1)
Unrecognised paragraph style: AppendixHeading1 (Style ID: AppendixHeading1)
root@surjit:/home/rahul# mammoth sample-04.doc my.html
Traceback (most recent call last):
File "/usr/local/bin/mammoth", line 100, in <module>
main()
File "/usr/local/bin/mammoth", line 35, in main
output_format=args.output_format,
File "/usr/local/lib/python2.7/dist-packages/mammoth/__init__.py", line 17, in convert
return docx.read(fileobj).map(transform_document).bind(lambda document:
File "/usr/local/lib/python2.7/dist-packages/mammoth/docx/__init__.py", line 24, in read
zip_file = zipfile.ZipFile(fileobj)
File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
self._RealGetContents()
File "/usr/lib/python2.7/zipfile.py", line 811, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
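
The second command above passes sample-04.doc rather than sample-04.docx. A legacy .doc file is not a zip archive, which would explain the BadZipfile error. As a rough Python 3 sketch (the looks_like_docx helper is a hypothetical name, not a Mammoth function), a caller could check the input before handing it to the converter:

```python
import io
import zipfile


def looks_like_docx(fileobj):
    """Return True if fileobj is a zip archive that contains word/document.xml."""
    if not zipfile.is_zipfile(fileobj):
        return False
    fileobj.seek(0)
    with zipfile.ZipFile(fileobj) as archive:
        return "word/document.xml" in archive.namelist()


# Build a tiny stand-in archive in memory so the example needs no files on disk.
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")

# Arbitrary non-zip bytes standing in for a legacy .doc file.
not_a_docx = io.BytesIO(b"these bytes are not a zip archive")

print(looks_like_docx(fake_docx))
print(looks_like_docx(not_a_docx))
```

A check like this gives a clearer message than the raw traceback, but it does not convert .doc files; those would need to be saved as .docx first.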

Some images don't come through, with no resulting img link

We are parsing hundreds of documents and Mammoth is really great, thanks!

One thing, though: we have a lot of cases where the original Word document has a '.wmf' image, but Mammoth doesn't detect it. There is no HTML link to the image in the generated results, and the image isn't written to file by the image conversion function (or inline if this function is deactivated).

If I unzip the word file, the image files are there as '.wmf'.

Is there a way to have these images handled?
Where in the Mammoth code is the check for image types please?

thanks a lot!

<i> instead of <em> tag

Hi, is there any possibility of mapping italic text to an <i> tag instead of an <em> tag?
Thanks

Option to preserve existing element style_id as class name

Is there an option to preserve an existing .docx style_id (the cleaned up & normalized .docx style name) as a class name on the resulting HTML element, regardless of whether this is defined in a style map?

For example, I'd like to be able to convert various .docx files that use unknown and differing sets of styles, to HTML that preserves style names (whatever they may be) as HTML element class names (on both block and inline elements).

find bold paragraphs

I’m trying to find a style mapping that turns all-bold paragraphs into headings:

p:fresh > b => h2

But I get the error "Did not understand this style mapping, so ignored it: p:fresh > b => h2"
(The same without :fresh)

Where’s my mistake?

Support Formatting of CSL_Citations

MS Word citation manager plugins (such as Mendeley, Papers, etc.) use a free, open citation language (CSL, http://citationstyles.org/) that embeds reference information in the document.xml container. Properly supporting this standard would be a very useful feature, since most academic writers use citation managers and not native Word references.

installed with pip; mammoth not recognized

I followed the instructions at https://pypi.python.org/pypi/mammoth to install mammoth. When I attempt to run it from the command line, I get: mammoth not recognized.

I am not familiar with how to get this recognized.

Finish class LineBreak(Element)

With option to add class for break type

Mammoth error when document has certain links

We have cases where the links in documents are sometimes broken, which we fix automatically another way. The problem is that Mammoth stops on the error and doesn't process the document.

Attached are a sample document and the error we see.

Is there a way we can have Mammoth continue running please, so that it processes the full document?

thanks!

Stuff2.docx
log.txt

Docx images are linked, not embedded

Python 2.7
Traceback error:
File "build\bdist.win32\egg\mammoth\docx\document_xml.py", line 233, in _read_blip
relationship_id = element.attributes["r:embed"]
KeyError: 'r:embed'

I have a Word document that was originally created from an HTML document (I know, and now I need HTML again), so the images were created from the href file links. When I try to convert the Word document, the KeyError for r:embed occurs. I think this is linked to the a:blip tag in the XML, which has an r:link attribute referencing an image relationship rather than an r:embed attribute.

In one case, this is an example (from document.xml):

<pic:blipFill><a:blip r:link="rId8"><a:extLst><a:ext uri="{28A0092B-C50C-407E-A947-70E740481C1C}"><a14:useLocalDpi xmlns:a14="http://schemas.microsoft.com/office/drawing/2010/main" val="0"/></a:ext></a:extLst></a:blip>

And from my document.xml.rels file:

<Relationship Id="rId8" Type="http://schemas.openxmlformats.org/officeDocument/2006/relationships/image" Target="file:///\\serverpath\Images\addins_go.png" TargetMode="External"/>

I'd like to have the image files linked to a file path again in an image tag if that is possible.

If you have any suggestions or need further information, I'll try to get back to you in a timely manner.
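
To illustrate the distinction described above, here is a small Python 3 sketch that reads an a:blip element and falls back to r:link when r:embed is absent. It uses only the standard library; blip_relationship is a hypothetical helper and does not reflect how Mammoth reads images.

```python
import xml.etree.ElementTree as ET

# Relationships namespace used by the r:embed and r:link attributes.
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Cut-down version of the a:blip element quoted in the report.
LINKED_BLIP = (
    '<a:blip xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" '
    'xmlns:r="%s" r:link="rId8"/>' % R_NS
)


def blip_relationship(blip):
    """Return ("embed", id) or ("link", id) for a blip, or None if neither is set."""
    embed_id = blip.get("{%s}embed" % R_NS)
    if embed_id is not None:
        return ("embed", embed_id)
    link_id = blip.get("{%s}link" % R_NS)
    if link_id is not None:
        return ("link", link_id)
    return None


print(blip_relationship(ET.fromstring(LINKED_BLIP)))
```

With the relationship ID in hand, the external Target from document.xml.rels could be used as the src of an img tag, assuming the path is reachable from wherever the HTML is viewed.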

chart r:id

Hello. Very glad to use your wonderful mammoth tool.
I have a problem with charts. I understand that mammoth does not convert charts today,
but I need to put the chart r:id from the docx into the HTML.

Are there any ways to do this?

Not for Windows

I assume that mammoth is not available for Windows. Pip happily installs it, but it creates an executable in the Scripts directory which doesn't have a .exe file extension and appears to be quite clearly not any kind of Windows executable.

If it isn't, could you add a system requirements section to your readme?

mammoth: error: argument --output-dir: not allowed with argument output

root@DS:/var/www/html/ds/filebox/12/33/docx# mammoth test.docx index.html --output-dir=images
usage: mammoth [-h] [--output-dir [OUTPUT_DIR]]
[--output-format {markdown,html}] [--style-map STYLE_MAP]
path [output]
mammoth: error: argument --output-dir: not allowed with argument output

I can't see where the problem is.
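
The wording of this error is what argparse prints when two arguments from a mutually exclusive group are given together. The Python 3 sketch below reproduces the behaviour under that assumption; the parser is a stand-in written for illustration (the prog name mammoth-sketch is made up) and is not Mammoth's actual CLI code.

```python
import argparse


def build_parser():
    """Build a stand-in parser where output and --output-dir exclude each other."""
    parser = argparse.ArgumentParser(prog="mammoth-sketch")
    parser.add_argument("path")
    group = parser.add_mutually_exclusive_group()
    # A positional can only join such a group if it is optional (nargs="?").
    group.add_argument("output", nargs="?")
    group.add_argument("--output-dir")
    return parser


# Accepted: an input path plus an output directory, with no output file.
args = build_parser().parse_args(["test.docx", "--output-dir=images"])
print(args.path, args.output, args.output_dir)

# Rejected: an output file and an output directory together.
try:
    build_parser().parse_args(["test.docx", "index.html", "--output-dir=images"])
except SystemExit as exit_info:
    print("parser exited with status", exit_info.code)
```

If the real CLI is set up this way, dropping index.html from the command and keeping only --output-dir=images should avoid the error.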

Cross-references don't work in some cases

For some documents, cross-references seem to work just fine, but for others they don't come through.

Attached is a new word document where I created a cross reference. The generated HTML doesn't convert these to a link.

test-crossref.docx

thanks!

Allow underscores and hyphens in class names

At the moment these are not supported in style maps.

It would be nice if a custom style map does not return an error for valid chars.

    Did not understand this style mapping, so ignored it:
    p[style-name='15 RomText Open'] => p.15-RomText-Open

Language support

In my projects, the language of a document (or of a paragraph or text run, such as a quotation) is often important for the conversion. I don't know if this information is in DOCX at all, but if it is, it would be nice to be able to use it optionally as "lang" attributes.

Save original graphic size

What is the best way to get original image size? Getting that information post-processing is a pain in the butt especially for those emf base64 graphics.

It would be hugely helpful if we could save the original sizes (Not the scaled canvas) to the img height and width attribute, (semantic info)

--output-image-size=original
--output-image-size=none (Standard behaviour)

<img src="data:image/x-emf;base64,AZX.." width="150pt" height="50pt">

I don't really mind which units they are saved in. I guess it would be great to use points, as there would be no confusion about DPI. (MS Office seems to process everything at 72 DPI if we went with pixels.)

We can find the original dimensions inside the shape property: <a:ext cx="1905000" cy="635000"/>. These use English Metric Units (914400 EMUs per inch), so it should be straightforward to convert them.

Alternatively, all the sizing info could be passed to the image class, so that it is easier for people to create custom image converters.

What do you think?
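
The conversion mentioned above is simple arithmetic: with 914400 EMUs per inch and 72 points per inch, one point is 12700 EMUs. A minimal Python 3 sketch follows; the function names are hypothetical and nothing here is part of Mammoth's image API.

```python
EMUS_PER_INCH = 914400
POINTS_PER_INCH = 72


def emu_to_points(emu):
    """Convert English Metric Units to points (1 pt = 12700 EMU)."""
    return emu * POINTS_PER_INCH / EMUS_PER_INCH


def img_size_attributes(cx, cy):
    """Build width/height attribute values in points from an a:ext size."""
    return {
        "width": "%gpt" % emu_to_points(cx),
        "height": "%gpt" % emu_to_points(cy),
    }


# Values taken from the <a:ext cx="1905000" cy="635000"/> example above.
print(img_size_attributes(1905000, 635000))
```

For the a:ext values quoted above this gives 150pt by 50pt, which matches the img example earlier in the post.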

Support for equations

Currently it ignores equations with the following warning
An unrecognised element was ignored: {http://schemas.openxmlformats.org/officeDocument/2006/math}oMath

Is it possible to add support for equations?

Thanks for all your work. :)

wmz image files do not generate IMG tags in generated HTML

Hi There,

We have some image files, which when I unzip the docx file, are 'wmz'. Mammoth doesn't seem to even detect these as images.

Note, the unzipped docx has the PNG versions of these WMZ files included.

Would it be possible to get support for these perhaps?

thanks!

Any support for fonts?

Hi There,

We love Mammoth! Absolutely great stuff.

We have some old legacy documents where the authors tended to use a font (Courier New) to indicate code, rather than a code style. I know Mammoth isn't really designed to care about fonts, but by any chance is it possible to map fonts to styles as with standard style mapping?

thanks a lot!

Table header rows don't come through for tables

Attached is a sample document with a table. Mammoth produces a table like this ...

The following variable…	Must be set to…

`SET STUFF`	Stuff 1
`SET STUFF 2`	Stuff 2

So it doesn't specify that the first row is actually a header row.

Could Mammoth perhaps set the header row tag for this scenario?

thanks!
table-headers.docx

How to convert a .docx that contains formulae to HTML? It seems that it is not supported. Can you give me any other advice? Thank you

How can I convert a .docx that contains formulae to HTML? It seems that it is not supported.
Can you give me any other advice? Thank you.

Bookmarks in documents - propose they are span tags instead of anchors

If I have a simple docx with a link to a bookmark that is also inside that document, then in Word, clicking on the link goes to the text in the document.

Upon conversion, the location of the bookmark in the document is converted to another anchor tag, and the explicit link to the bookmark is the same tag. It doesn't have the same semantics as the original document. Was this intentional?

I propose that the actual bookmark itself is converted to a span tag, with an attribute of "data-mammoth-style":"bookmark" attached so that these elements can be found effectively if necessary. In the converted HTML, clicking on the link will go to the location of the span tag, like the original document.

How to convert a .docx that contains formulae to HTML? It seems that it is not supported.

Style names with non-ASCII characters not recognized in Python 2

Hi,
I am really happy that I discovered mammoth today, it's a great tool!

I have a problem with one detail, though: I want to convert a docx file from a German user, and need to define some custom styles. And of course, some of the styles contain an umlaut...
Note: I'm using python2.7

Concrete example: The style for displaying a numbered listing is called "Aufzählung", and a variant of this, where the numbers are Roman I II III instead of Arabic 1 2 3, is called "Aufzählung Römisch". By default, mammoth does not recognize that paragraphs with these styles need to be a listing.

So I define my custom style_map:

style_map = """
p[style-name='Aufzählung'] => ol > li:fresh
p[style-name='Aufzählung Römisch'] => ol.roman > li:fresh
"""

Only to find that paragraphs with those styles are still not recognized as listings. If I rename the styles to plain ASCII, e.g. "Listing Roman" in my sample and in the style_map, all is fine.

Some debugging led me to https://github.com/mwilliamson/python-mammoth/blob/master/mammoth/document_matchers.py#L60, where this Warning happens:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal.
In my test, second, which comes from the sample Word file, is the unicode string u'Aufz\xe4hlung', and first is the byte string 'Aufz\xc3\xa4hlung' - so comparing fails. If first were also decoded to unicode using utf-8, they would be recognized as equal:

>>> 'Aufz\xc3\xa4hlung'.decode('utf-8').upper() == u'Aufz\xe4hlung'.upper()
True

(Again, note: If I test this using python3.5, it all works. But my application is still using python2.7, where this problem occurs.)

If I make the following code changes: master...syslabcom:master, then my sample Word file works fine (I get my <ol> tags like I want). But obviously, this code will break under Python 3, since basestring and unicode have been removed.

Therefore:

Do you care about supporting this use-case also under Python 2?
And if yes, what would you think is a sensible approach for supporting both 2 and 3? I'm willing to contribute, but my experience with Python 3 so far is rather limited.
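
One possible shape for a fix that avoids basestring and unicode is to normalise both names to text before comparing, decoding only values that are byte strings. The sketch below is written for Python 3 and is an illustration of that idea, not the change proposed in the linked branch; to_text and style_names_match are hypothetical names.

```python
def to_text(value, encoding="utf-8"):
    """Decode byte strings to text; leave text values untouched."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def style_names_match(first, second):
    """Compare two style names case-insensitively after normalising to text."""
    return to_text(first).upper() == to_text(second).upper()


# The byte string and text string from the report compare equal once decoded.
print(style_names_match(b"Aufz\xc3\xa4hlung", "Aufz\xe4hlung"))
```

This assumes the style map source is encoded as UTF-8; another encoding would need to be passed in explicitly.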

Getting BadZipFile issue

Hey,
Trying to implement mammoth module but getting
zipfile.BadZipFile: File is not a zip file

Error in image converting

Output
File "C:\Python34\lib\site-packages\mammoth\images.py", line 3, in convert_image
attributes = func(image).copy()
File "C:\Python34\lib\site-packages\mammoth\conversion.py", line 154, in _convert_image
with image.open() as image_bytes:
File "C:\Python34\lib\site-packages\mammoth\docx\document_xml.py", line 227, in open_image
image_file = docx_file.open(image_path)
File "C:\Python34\lib\zipfile.py", line 1148, in open
zinfo = self.getinfo(name)
File "C:\Python34\lib\zipfile.py", line 1084, in getinfo
'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'word\\\\media/image1.jpg' in the archive"

Testfile
Testcontent.docx (Dropbox)
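
The member name in the KeyError mixes a backslash with a forward slash, which is what Windows path joining produces, whereas names inside a zip archive use forward slashes. That is only a guess at the cause, but the Python 3 sketch below shows the difference and one way to build archive paths independently of the operating system. zip_member_path is a hypothetical helper, not a Mammoth function.

```python
import ntpath
import posixpath


def zip_member_path(base_dir, target):
    """Join a relationship target onto base_dir using zip-style forward slashes."""
    target = target.replace("\\", "/")
    return posixpath.normpath(posixpath.join(base_dir, target))


# Windows-style joining yields the mixed separators seen in the traceback.
print(ntpath.join("word", "media/image1.jpg"))

# POSIX-style joining yields a name in the form zip archives use.
print(zip_member_path("word", "media/image1.jpg"))
```

Using posixpath explicitly keeps the result the same on every platform, unlike os.path, which follows the host operating system.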

Problem in converting to image

If a doc file uses MathML equations, Mammoth is not able to convert them into images.

Docx emz image files converted to x-emf by Mammoth, when docx zip has the PNG file

Hi There,

If I unzip the docx file, I have some images of 'emz' format. These seem to get created by Mammoth as 'x-emf'.

I can convert these with unoconv to PNG, but given the docx zip file actually already has the associated PNG files for all images, is there a reason Mammoth doesn't pass through that PNG by default please?

thanks!

[Enhancement Request] Including Outline Numbered

Hey, great package, I can't tell you how much it's helped me out.
The dream document I'd love to be able to convert is in a massive outline format. Just wondering if you looked at Word's Numbered Outline layout as a possibility, or if it would be impossible. Or maybe there's a way to do it in the code, as is? Thanks!

Math/Ml issue

Hey. I am using MathML equations directly in my doc file.
<math xmlns="http://www.w3.org/1998/Math/MathML"><msubsup><mo>∫</mo><mrow><mi>cos</mi><mfenced><mrow><mi>tan</mi><mfenced><mrow/></mfenced></mrow></mfenced></mrow><mrow><mi>sin</mi><mfenced><mrow><msub><mi>log</mi><mrow/></msub><mfenced><mrow/></mfenced></mrow></mfenced></mrow></msubsup></math>

I want this text not to be converted to HTML; that is, I want this data to remain as it is. How can I do that?
Thank You!

'_io.BytesIO' object has no attribute 'name'

Looks like line 31 of docx/__init__.py was changed a couple of revs back to be:

body_readers = _body_readers(getattr(fileobj, "name"), zip_file)

We're calling mammoth.convert_to_html with an io.BytesIO stream, which doesn't have a name attribute, so the code bombs out with the AttributeError in the issue title. Is there an intentional design decision here to restrict the code to objects with name attributes?
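
For context, the two-argument form of getattr raises AttributeError when the attribute is missing, while the three-argument form returns a default. The Python 3 sketch below shows the difference on an in-memory stream; source_name is a hypothetical helper, and whether a missing name is acceptable to the rest of the code is a separate question.

```python
import io


def source_name(fileobj):
    """Return the file object's name if it has one, otherwise None."""
    return getattr(fileobj, "name", None)


stream = io.BytesIO(b"in-memory bytes standing in for a docx")

# Two-argument getattr fails on objects without a name attribute.
try:
    getattr(stream, "name")
except AttributeError as error:
    print("AttributeError:", error)

# Three-argument getattr falls back to the default instead.
print(source_name(stream))
```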

Support w:sym elements

There's currently no support for the arrows that Word generates automatically when typing -->.

It should be converted to HTML arrows like →

AttributeError: 'Tab' object has no attribute 'children'

I have a pretty simple .docx which generates this error:

  File "/Users/greg/coding/code4sa/za-parliament-scrapers/za_parliament_scrapers/questions.py", line 100, in extract_content_from_document
    text = mammoth.extract_raw_text(f).value
  File "/Users/greg/coding/code4sa/pmg-cms-2/env/lib/python2.7/site-packages/mammoth/__init__.py", line 27, in extract_raw_text
    return docx.read(fileobj).map(_extract_raw_text_from_element)
  File "/Users/greg/coding/code4sa/pmg-cms-2/env/lib/python2.7/site-packages/mammoth/results.py", line 10, in map
    return Result(func(self.value), self.messages)
  File "/Users/greg/coding/code4sa/pmg-cms-2/env/lib/python2.7/site-packages/mammoth/__init__.py", line 34, in _extract_raw_text_from_element
    text = "".join(map(_extract_raw_text_from_element, element.children))
  File "/Users/greg/coding/code4sa/pmg-cms-2/env/lib/python2.7/site-packages/mammoth/__init__.py", line 34, in _extract_raw_text_from_element
    text = "".join(map(_extract_raw_text_from_element, element.children))
  File "/Users/greg/coding/code4sa/pmg-cms-2/env/lib/python2.7/site-packages/mammoth/__init__.py", line 34, in _extract_raw_text_from_element
    text = "".join(map(_extract_raw_text_from_element, element.children))
  File "/Users/greg/coding/code4sa/pmg-cms-2/env/lib/python2.7/site-packages/mammoth/__init__.py", line 34, in _extract_raw_text_from_element
    text = "".join(map(_extract_raw_text_from_element, element.children))
AttributeError: 'Tab' object has no attribute 'children'
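The traceback suggests the recursion assumes every element has children. A minimal sketch of a recursion that tolerates leaf elements follows; the Text, Tab and Paragraph classes are simplified stand-ins written for this example, not mammoth's real document classes.

```python
class Text(object):
    """Stand-in leaf element holding a string."""
    def __init__(self, value):
        self.value = value


class Tab(object):
    """Stand-in leaf element with no children attribute."""


class Paragraph(object):
    """Stand-in container element."""
    def __init__(self, children):
        self.children = children


def extract_raw_text(element):
    """Concatenate the text of an element tree, treating tabs as leaves."""
    if isinstance(element, Text):
        return element.value
    if isinstance(element, Tab):
        return "\t"
    # Fall back to an empty list for any other element without children.
    children = getattr(element, "children", [])
    return "".join(map(extract_raw_text, children))


sample = Paragraph([Text("a"), Tab(), Paragraph([Text("b")])])
sample_text = extract_raw_text(sample)
```

The getattr fallback means any other leaf type is skipped rather than raising AttributeError.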

How to set table class using style maps?

Hello,

<w:tbl>
            <w:tblPr>
                <w:tblStyle w:val="tableone"/>
                <w:tblW w:w="0" w:type="auto"/>
                <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1"
                    w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
            </w:tblPr>
            <w:tblGrid>

With the above document excerpt, I am trying to set a class on the table styled with tableone style name.
I am using the following style_map but the output is not giving me the desired result, which should be
<table class="mytableone"></table>, as specified in the style map below:

style_map = """
tbl[style-name='tableone'] => table.mytableone:fresh
table[style-name='tableone'] => table.mytableone:fresh
"""

None of the above produces the needed output.

What is the best way to set table classes using Mammoth's style map settings?

Thanks for your reply.

Can convert_to_html save images in a separate dir?

The CLI can do it but I don't see the option when called from the library. What do you recommend if there isn't an option?
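One possible approach from the library is the convert_image argument together with mammoth.images.img_element, which the README describes for custom image handling. The sketch below writes each image to a directory and points src at the saved file. ImageSaver and convert_with_saved_images are hypothetical names, and the numbering scheme and the relative src are assumptions to adapt.

```python
import os


class ImageSaver(object):
    """Write each image to output_dir and return attributes for its <img>."""

    def __init__(self, output_dir):
        self._output_dir = output_dir
        self._count = 0

    def __call__(self, image):
        # Derive an extension from the content type, e.g. "image/png" -> "png".
        extension = image.content_type.partition("/")[2] or "bin"
        self._count += 1
        filename = "{0}.{1}".format(self._count, extension)
        if not os.path.isdir(self._output_dir):
            os.makedirs(self._output_dir)
        with image.open() as image_bytes:
            data = image_bytes.read()
        with open(os.path.join(self._output_dir, filename), "wb") as target:
            target.write(data)
        # Assumption: the HTML will be served from output_dir's location.
        return {"src": filename}


def convert_with_saved_images(docx_path, output_dir):
    """Convert a docx file to HTML, saving its images under output_dir."""
    import mammoth  # imported here so ImageSaver itself has no dependencies

    with open(docx_path, "rb") as docx_file:
        result = mammoth.convert_to_html(
            docx_file,
            convert_image=mammoth.images.img_element(ImageSaver(output_dir)),
        )
    return result.value
```

The returned src is a bare filename, so it only resolves if the HTML is placed next to the images; prefix it with a path or URL if they live elsewhere.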

Numbered lists where the start number is not 1

Hi there,

We have some legacy documents, where the authors have started a numbered list at "1", then entered a bulleted list, table, then another numbered list item where the number is set to '2'. When parsing with Mammoth, this second numbered list item is set to "1".

I tried not setting freshness ...

p[style-name='Numbered List'] => ol > li

But no luck.

Is there a way to persist the numbering from the word document please?

thanks!

Markdown + output_dir?

Is it possible to export images and also convert the docx file to markdown? I'm only able to convert to html when using a different output_dir.

Example:

mammoth a_doc_file.docx --output-dir=media --output-format=markdown

How to convert text alignment to html

I want to convert text alignment to HTML, but it is not supported.

<w:p w:rsidR="006F2D0A" w:rsidRDefault="006F2D0A" w:rsidP="006F2D0A">
<w:pPr>
<w:jc w:val="center"/>
</w:pPr>
<w:r>
<w:t>textAlign</w:t>
</w:r>
</w:p>

Pre converted html tags in doc file

Is there an option to skip HTML conversion for HTML statements that are already present in the input docx file?
That is, if the input file contains a few HTML tags, can we avoid the conversion for those statements?

element.name error

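I had to change def read to the version below to get python-mammoth working. It's line 284 in body_xml.

def read(element):
    # Only look up a handler when the element actually has a name.
    if hasattr(element, 'name'):
        handler = handlers.get(element.name)
    else:
        # Give nameless elements a placeholder name and no handler.
        setattr(element, 'name', 'noname')
        handler = None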

Question: How to convert embedded x-emf images?

Hello,

How can we convert embedded .x-emf images to png or jpg? Is there any option/setting to output the embedded images to png or jpg instead of .x-emf?

Currently when I convert docx files, I get some images in the output-dir with .x-emf format and would need to convert them to png or jpg during docx conversion process.

Thanks for your help.

Underlines and Header note

The README states that underlines are supported, but if I add an underline in Word via the underline button, it does not transfer over to the HTML. Also, H1 to H4 work, but H5 and H6 are dropped. Again, the README states this is implemented.

Unicode Error

Hi,

First of all, congratulations on mammoth. It is really a great tool. Unfortunately, when I run mammoth with my document I get the following error:

UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 42056: character maps to <undefined>

Do you have any idea what the issue could be and how I could fix it? I run mammoth on Windows 10.

Update:
In particular, the issue occurs if you use the Wingdings font with the "§" character.

Moreover, I found that symbols such as arrows are not exported correctly. Here I get the message:
An unrecognised element was ignored: w:sym
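A 'charmap' error like the one above typically means the output is being encoded with a legacy Windows code page that has no arrow character. The sketch below assumes cp1252 is the code page involved, which the report does not confirm, and shows that writing the HTML explicitly as UTF-8 avoids the failure. The helper names are hypothetical.

```python
import io

html = u"<p>A \u2192 B</p>"


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def write_html(path, text):
    """Write text to path as UTF-8, regardless of the console encoding."""
    with io.open(path, "w", encoding="utf-8") as target:
        target.write(text)


fails_in_cp1252 = not can_encode(html, "cp1252")
works_in_utf8 = can_encode(html, "utf-8")
```

Writing the result to a file this way, instead of printing it to the console, sidesteps the console's encoding entirely.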

Custom List Styles

Hi, I have another question. What's the best way to get different numbering styles in lists? Several lists in the documents I'm working with are using letters or Roman numerals, but it looks like the only options the code understands are numbered or not. Thanks so much for the help!

Feature Request: Support .xlsx

Cool script! Any chance you'd think about .xlsx as well?

Styles are lost after a successful HTML conversion?

Hello
Thank you very much for providing this functionality, but I am now having a problem. I converted a docx file to HTML and the conversion succeeded, but the styles from the original Word document are gone. The converted HTML contains only p tags and strong tags. I hope you can help me, thanks.

target frame in a href?

Hi Michael,

Is it possible to add a switch to the CLI so that, when parsing a Word document, a href links get target=_blank or any other target frame? It seems there is no option in Word for Mac to add a target frame when editing a hyperlink.
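In the absence of such a switch, one workaround is to post-process the generated HTML. The sketch below is a plain string substitution, not a mammoth feature. It assumes hyperlinks appear in the output as an a tag whose first attribute is href, which should be verified against real output.

```python
def add_link_target(html, target="_blank"):
    """Add a target attribute to every <a href="..."> in html.

    Hypothetical helper: a simple substitution, not an HTML parser.
    """
    replacement = '<a target="{0}" href="'.format(target)
    return html.replace('<a href="', replacement)


before = '<p><a href="http://example.com">link</a> <a id="note"></a></p>'
after = add_link_target(before)
```

Anchors without an href, such as bookmark targets, are left untouched because they do not match the substituted prefix.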
