go-docx

Replace placeholders inside docx documents with speed and confidence.

This project provides a simple and clean API for replacing user-defined placeholders, without the uncertainty that the placeholders may be ripped apart by the WordprocessingML engine used to create the document.

  • Simple: The API exposed is kept to a minimum in order to stick to the purpose.
  • Fast: go-docx is fast since it operates directly on the byte contents instead of mapping the XMLs to a custom data struct.
  • Zero dependencies: go-docx is built with the Go stdlib only, no external dependencies.

➤ Purpose

The task at hand was to replace a set of user-defined placeholders inside a docx document archive with calculated values. All current implementations in Golang which solve this problem use a naive approach by attempting to strings.Replace() the placeholders.

Due to the nature of the WordprocessingML specification, a placeholder which is defined as {the-placeholder} may be ripped apart inside the resulting XML. The placeholder may then be in two fragments for example {the- and placeholder} which are spaced apart inside the XML.

The naive approach therefore does not always work. The purpose of this library is to provide a way to replace placeholders even if they are fragmented.
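To make the failure concrete, here is a minimal sketch using only the standard library. The XML snippets are simplified assumptions and naiveReplace is a hypothetical helper, not part of go-docx. It shows that a plain string replace succeeds on an intact placeholder but leaves a fragmented one untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// Both snippets render the same visible text, "{the-placeholder}".
// The XML is simplified for illustration; real documents carry more markup.
const (
	intact     = `<w:r><w:t>{the-placeholder}</w:t></w:r>`
	fragmented = `<w:r><w:t>{the-</w:t></w:r><w:r><w:t>placeholder}</w:t></w:r>`
)

// naiveReplace is the plain string-replace approach (hypothetical helper).
func naiveReplace(content string) string {
	return strings.ReplaceAll(content, "{the-placeholder}", "value")
}

func main() {
	// The intact placeholder is found and replaced.
	fmt.Println(naiveReplace(intact) != intact)
	// The fragmented placeholder never appears as one contiguous string,
	// so the input comes back unchanged.
	fmt.Println(naiveReplace(fragmented) == fragmented)
}
```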

➤ Getting Started

All you need is to go get github.com/lukasjarosch/go-docx

package main

import (
	docx "github.com/lukasjarosch/go-docx"
)

func main() {
	// replaceMap is a key-value map in which the keys
	// are the placeholders without their delimiters
	replaceMap := docx.PlaceholderMap{
		"key":                         "REPLACE some more",
		"key-with-dash":               "REPLACE",
		"key-with-dashes":             "REPLACE",
		"key with space":              "REPLACE",
		"key_with_underscore":         "REPLACE",
		"multiline":                   "REPLACE",
		"key.with.dots":               "REPLACE",
		"mixed-key.separator_styles#": "REPLACE",
		"yet-another_placeholder":     "REPLACE",
	}

	// read and parse the template docx
	doc, err := docx.Open("template.docx")
	if err != nil {
		panic(err)
	}

	// replace the keys with values from replaceMap
	err = doc.ReplaceAll(replaceMap)
	if err != nil {
		panic(err)
	}

	// write out a new file
	err = doc.WriteToFile("replaced.docx")
	if err != nil {
		panic(err)
	}
}

Placeholders

Placeholders are delimited with { and }; nesting of placeholders is not possible. The delimiters can be changed using ChangeOpenCloseDelimiter().

Styling

This lib treats a placeholder as just a list of fragments. When detecting the placeholders inside the XML, it looks for the OpenDelimiter and CloseDelimiter. The first fragment found (e.g. {foo of the placeholder {foo-bar}) is replaced with the value from the PlaceholderMap.

This means that, technically, you can style only the OpenDelimiter inside the Word document and the whole value will carry that style after replacing. I do not recommend doing that, as the WordprocessingML spec is somewhat fragile in this case. It is best to just style the whole placeholder.

But, for whatever reason there might be, you can do that.

➤ Terminology

To not cause too much confusion, here is a list of terms which you might come across.

  • Parser: Every file which this lib handles (document, footers and headers) has its own parser attached, since everything is relative to the underlying byte slice (i.e. the file).

  • Position: A Position is just a Start and End offset, relative to the byte slice of the document of a parser.

  • Run: Describes the pair <w:r> and </w:r> and thus has two Positions for the open and close tag. Since they are Positions, they have a Start and End Position which point to < and > of the tag. A run also consists of a TagPair.

  • Placeholder: A Placeholder is basically just a list of PlaceholderFragments representing a full placeholder extracted by a Parser.

  • PlaceholderFragment: A PlaceholderFragment is a parsed fragment of a placeholder, since placeholders will most likely be ripped apart by WordprocessingML. The Placeholder {foo-bar-baz} might ultimately consist of 5 fragments ({, foo-, bar-, baz, }). The fragment is at the heart of replacing. It knows which Run it belongs to and has methods for manipulating these byte offsets. Additionally, it has a Position which describes the offset inside the TagPair, since fragments don't always start at the beginning of one (e.g. <w:t>some text {fragment-start</w:t>).
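The relationship between these terms can be sketched as a handful of Go types. The type names follow the terminology above, but every field, method and helper here (Tags, Text, tagAt, runAt, example) is an assumption made for illustration and may differ from the library's actual definitions. In particular, End is treated as exclusive in this sketch.

```go
package main

import (
	"fmt"
	"strings"
)

// Position is a byte range inside one file. In this sketch End is exclusive.
type Position struct{ Start, End int }

// TagPair holds the positions of an opening tag and its closing tag.
type TagPair struct{ OpenTag, CloseTag Position }

// Run describes <w:r>...</w:r> together with the <w:t>...</w:t> pair inside it.
type Run struct {
	Tags TagPair
	Text TagPair
}

// PlaceholderFragment is one piece of a placeholder inside a single run.
// Position is relative to the start of the run's text content.
type PlaceholderFragment struct {
	Run      *Run
	Position Position
}

// Placeholder is the ordered list of fragments that form one placeholder.
type Placeholder struct{ Fragments []PlaceholderFragment }

// Text returns the part of doc that this fragment covers.
func (f PlaceholderFragment) Text(doc string) string {
	base := f.Run.Text.OpenTag.End // first byte after "<w:t>"
	return doc[base+f.Position.Start : base+f.Position.End]
}

// Text joins all fragments back into the full placeholder.
func (p Placeholder) Text(doc string) string {
	var sb strings.Builder
	for _, f := range p.Fragments {
		sb.WriteString(f.Text(doc))
	}
	return sb.String()
}

// tagAt locates tag in doc at or after from. It assumes the tag is present.
func tagAt(doc, tag string, from int) Position {
	i := from + strings.Index(doc[from:], tag)
	return Position{Start: i, End: i + len(tag)}
}

// runAt builds a Run for the first <w:r> found at or after from.
func runAt(doc string, from int) *Run {
	rOpen := tagAt(doc, "<w:r>", from)
	tOpen := tagAt(doc, "<w:t>", rOpen.End)
	tClose := tagAt(doc, "</w:t>", tOpen.End)
	rClose := tagAt(doc, "</w:r>", tClose.End)
	return &Run{Tags: TagPair{rOpen, rClose}, Text: TagPair{tOpen, tClose}}
}

// example builds the placeholder of a simplified two-run document.
func example() (string, Placeholder) {
	doc := `<w:r><w:t>some text {key-</w:t></w:r><w:r><w:t>with-dashes}</w:t></w:r>`
	first := runAt(doc, 0)
	second := runAt(doc, first.Tags.CloseTag.End)
	return doc, Placeholder{Fragments: []PlaceholderFragment{
		{Run: first, Position: Position{Start: 10, End: 15}}, // "{key-" after "some text "
		{Run: second, Position: Position{Start: 0, End: 12}}, // "with-dashes}"
	}}
}

func main() {
	doc, p := example()
	fmt.Println(p.Text(doc))
}
```

Keeping the fragment position relative to the text content, rather than absolute, mirrors the description above: the fragment only needs its Run to be resolved to an absolute offset.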

➤ How it works

This section will give you a short overview of what's actually going on. And honestly, it's a much needed reference for my future self :D.

Overview

The project relies on some invariants of the WordprocessingML spec, which defines the docx structure. A good overview of the spec can be found at officeopenxml.com.

Since this project aims to work only on text within the document, it currently focuses only on the runs (<w:r> elements). Text is enclosed in a text element (<w:t>) inside a run, so finding all runs inside the docx is the first step. Keep in mind that a run does not need to have a text element; it can also contain an image, for example. But all text literals will always be inside a run, within their respective text tags.

To illustrate that, here is how this might look inside the document.xml.

<w:p>
    <w:r>
        <w:t>{key-with-dashes}</w:t>
    </w:r>
</w:p>

One can clearly see that replacing the {key-with-dashes} placeholder is quite simple. Just do a strings.Replace(), right? Wrong!

Although this might work 70-80% of the time, it will not work reliably. The reason is how the WordprocessingML spec is set up: it fragments text literals based on many different circumstances.

For example, if you added half of the placeholder, saved and quit Word, and then added the second half of the placeholder, it might happen (in order to preserve session history) that the placeholder looks something like this (simplified).

<w:p>
    <w:r>
        <w:t>{key-</w:t>
    </w:r>
    <w:r>
        <w:t>with-dashes}</w:t>
    </w:r>
</w:p>

As you can clearly see, doing a simple replace doesn't do it in this case.

Premises

In order to achieve the goal of reliably replacing values inside a docx archive, the following premises are considered:

  • Text literals are always inside <w:t> tags
  • <w:t> tags only occur inside <w:r> tags
  • All placeholders are delimited with predefined runes ({ and } in this case)
  • Placeholders cannot be nested (e.g. {foo {bar}})

Order of operations

Here I will outline what happens in order to achieve the said goal.

  1. Open the specified *.docx file and extract all files in which replacement should take place. Currently, the files extracted are word/document.xml, word/footer<X>.xml and word/header<X>.xml. Any content which resides in other files requires a modification.

  2. First XML pass. Iterate over a given file (e.g. the document.xml) and find all <w:r> and </w:r> tags inside the bytes of the file. Remember the positions given by the custom io.Reader implementation. Note: singleton tags (e.g. <w:r/>) are handled correctly.

  3. Second XML pass. Basically the same as the first pass, just this time the text tags (<w:t>) inside the found runs are extracted.

  4. Placeholder extraction. At this point all text literals are known by their offset inside the file. Using the premise that no placeholder nesting is allowed, the placeholder fragments can be extracted from the text runs. At the end a placeholder may be described by X fragments. The result of the extraction is the knowledge of which placeholders are located inside the document and at which positions the fragments start and end.

  5. Replacement. Making use of the positions, this is the stage where all the placeholders are replaced by their expected values given in a PlaceholderMap. The process can roughly be outlined in two steps:

    • The first fragment of the placeholder (e.g. {foo-) is replaced by the actual value. This also explains why one only has to style the first fragment inside the document. As you cannot see the fragments it is still a good idea to style the whole placeholder as needed.
    • All other fragments of the placeholders are cut out, removing the leftovers.

All the steps taken in 5. require cumbersome shifting of the offsets. This is the tricky part where the most debugging happened (gosh, so many offsets). The given explanation is definitely enough to grasp the concept, leaving out the messy bits.
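The two replacement steps can be sketched on a byte slice as follows. This is not the library's implementation: span and replaceFragments are hypothetical names, and instead of shifting offsets after every change the sketch processes fragments from back to front, so that the offsets of earlier fragments stay valid. The value is inserted verbatim, without XML escaping.

```go
package main

import (
	"fmt"
	"strings"
)

// span is a half-open byte range [start, end) inside the document.
type span struct{ start, end int }

// replaceFragments writes value in place of the first fragment and cuts out
// all remaining ones. frags must be in document order and must not overlap.
// Hypothetical helper: working back to front avoids recomputing offsets.
func replaceFragments(doc []byte, frags []span, value string) []byte {
	out := append([]byte(nil), doc...) // work on a copy
	for i := len(frags) - 1; i >= 0; i-- {
		repl := "" // later fragments are removed
		if i == 0 {
			repl = value // the first fragment receives the value
		}
		f := frags[i]
		tail := append([]byte(repl), out[f.end:]...)
		out = append(out[:f.start], tail...)
	}
	return out
}

// locate returns the span of the first occurrence of part in doc.
// It assumes part is present.
func locate(doc, part string) span {
	i := strings.Index(doc, part)
	return span{start: i, end: i + len(part)}
}

func main() {
	doc := `<w:r><w:t>{key-</w:t></w:r><w:r><w:t>with-dashes}</w:t></w:r>`
	frags := []span{locate(doc, "{key-"), locate(doc, "with-dashes}")}
	fmt.Println(string(replaceFragments([]byte(doc), frags, "value")))
}
```

Because the value lands inside the run of the first fragment, it inherits that run's styling, which matches the styling note in the list above.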

➤ License

This software is licensed under the MIT license.

go-docx's Issues

PlaceholderMap (or Word) does not adhere to '\n' character

It does not seem to be possible to add a 'newline' character (\n or \r\n if you will) within a table cell of a Word table.

Current formatting:

var location string
for _, host := range res.HostInfo {
	for i := range host.Ports {
		location += host.Address + ":" + host.Ports[i].PortNumber + " (" + host.Ports[i].Proto + ")" + "\n" // also tried \r\n and \t
	}
}

Output seems only to add some kind of whitespace within the cell:

locations1

This might be the dumbest issue ever written, so sorry beforehand. Is there any way to enforce this newline?

Error if placeholder has leading or trailing whitespace(s) inside of delimiters

Example:

  • In my docx template I have the placeholders { organization} (pay attention to the leading whitespace) and {address}
  • My placeholder map has organization and address fields

Then the program fails with the error:

...: not all placeholders were replaced, want=2, have=1

I ran it with a debugger to see what is going on under the hood. I found out that the stripXmlTags method deletes whitespace from the placeholder, but the docx.Open() method behaves differently and does not trim whitespace.

This results in a placeholder mismatch and the error being thrown.

function to get all placeholders in file

I am working on a tool that uses docx files as templates, and this package has saved me a lot of effort, so thanks for your work.
I think it would be better if there was a function that returns an array with all placeholders in the template or, even better, a map from each placeholder to its number of occurrences.
Thanks in advance.

Inserting of new runs / paragraphs / docx-files

Only replacing placeholders might not always be enough. One might face the task to insert a sub-template into the document.

This currently cannot be done since replacements will always occur inside <w:t> tags but new content will most likely be added on a paragraph (<w:p>) level.

Is it possible to avoid line break collapse?

Hello,
I'm trying to use a value that's a literal string with some line breaks in it. Currently the line breaks get collapsed and the string no longer looks the way I expect it to.

docx.PlaceholderMap{
   "PLACEHOLDER_KEY":    `here is my string
with different 
items on 
multiple lines
`,
}

turns into
here is my string with different items on multiple lines

Is it possible to preserve line breaks in placeholder values?

Documents corrupted

On certain Microsoft Office versions, the rendered documents can only be opened after repairing them.

This issue needs to be investigated and resolved

Many paragraphs.

Hi,
Is it possible to replace the placeholder with an array of strings, where each array member would be in a new paragraph? Maybe it's possible to do it some other way?
If so, maybe you can give an example.

Document tag truncated

The generated document is invalid if the template file contains a spell-checking mark (curly red underline).

image

Word error message is as follows:
image

Data in word\document.xml is truncated:
w:rsidRDefault="00DB7D3C"/><w:sectPr w:rsidR="00DB7D3C"><w:pgSz w:w="11906" w:h="16838"/><w:pgMar w:top="1417" w:right="1417" w:bottom="1134" w:left="1417" w:header="708" w:footer="708" w:gutter="0"/><w:cols w:space="708"/><w:docGrid w:linePitch="360"/></w:sectPr></w:body></w:docum

Please see content of template.docx and test.docx file attached.

test.docx
template.docx
