stadelmanma / sablon

This project forked from senny/sablon

Ruby Document Template Processor based on docx templates and Mail Merge fields.

License: MIT License

Ruby 100.00%

sablon's Issues

Add processing of endnotes and footnotes

This can be done simply by adding footnotes.xml and endnotes.xml to the files sablon parses. The content inside each <w:footnote> or <w:endnote> tag has the same structure as document.xml.

I also want to add footnotes programmatically using HTML. I'll probably make pseudo-elements <footnote id="#"> and <footnoteref id="#">, assuming Nokogiri doesn't blow up on a fake element. If it does, I'll use a <div> with a class of "footnote" instead.

Refactor template.rb so processors can be "registered"

This would work like the content module where document processors can be registered and given a pattern and precedence. Using my current template.rb setup it would look something like:

register_processor(%r{word/document\.xml}, Processor::Document, 0)
register_processor(%r{word/(?:header|footer)\d*\.xml}, Processor::Document, 100)
register_processor(%r{word/footnotes\.xml}, Processor::Document, 200)
register_processor(%r{word/endnotes\.xml}, Processor::Document, 300)
register_processor(%r{word/numbering\.xml}, Processor::Numbering, 400)
register_processor(/\[Content_Types\]\.xml/, Processor::ContentType, 500)

I don't know if this is the best approach for files like Numbering and ContentType because those might be better served in a more dynamic fashion. For example, numbering is only messed with when lists are added and never formally processed. The same logic could be applied to *.rels files and content types with some refactoring.

The main reasoning behind this is that it will allow the end user to extend sablon with new processors much more easily than they can now. Instead of monkey patching, we can have a more formal API that, while not super easy to access, is still workable.
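
A minimal sketch of what such a registry could look like, assuming a plain module that keeps entries sorted by precedence. The names ProcessorRegistry, processor_for and processing_order are hypothetical, and symbols stand in for the real processor classes so the snippet runs on its own:

```ruby
# Hypothetical registry sketch; none of these names exist in sablon today.
module ProcessorRegistry
  Entry = Struct.new(:pattern, :processor, :precedence)

  @entries = []

  class << self
    # Register a processor for every zip entry whose name matches +pattern+.
    # Entries with a lower precedence value are handled first.
    def register_processor(pattern, processor, precedence)
      @entries << Entry.new(pattern, processor, precedence)
      @entries.sort_by!(&:precedence)
      self
    end

    # Return the first matching processor, or nil when nothing claims the entry.
    def processor_for(entry_name)
      entry = @entries.find { |e| e.pattern.match?(entry_name) }
      entry && entry.processor
    end

    # Order entry names by the precedence of the processor that claims them,
    # dropping names that no registered pattern matches.
    def processing_order(entry_names)
      @entries.flat_map do |entry|
        entry_names.select { |name| entry.pattern.match?(name) }
      end.uniq
    end
  end
end

# Symbols stand in for Processor::Document and Processor::Numbering here.
ProcessorRegistry.register_processor(%r{word/numbering\.xml}, :numbering, 400)
ProcessorRegistry.register_processor(%r{word/document\.xml}, :document, 0)
```

With this shape, an end user would add a processor with a single register_processor call instead of reopening template.rb.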

Expand HTML processing capabilities

Add support for the <span> tag as a method to format individual runs with inline style. Allow existing tags to support things specified by a style="" attribute.

A span tag will create a new run using the <w:r> tags and its content will go in a <w:t> tag
Add support for basic inline styles:
text-align: left, right, center, justify (justify => both internally), use <w:jc> element
font-style: normal, italic, bold (bold is not a regular CSS option)
- <w:b /> for bold, <w:i /> for italics
font-weight: 0 < normal < 400, bold > 400; bold & bolder = bold, normal & lighter = normal
color: (pass the supplied value directly into the XML, use hexadecimal without the #)
- <w:color w:val="FFF200"/> is the proper tag
background-color: (pass the supplied value directly into the XML)
- <w:shd w:val="clear" w:color="auto" w:fill="FFFF00"/> is the proper tag
text-decoration: underline, use the <w:u w:val="single"/> tag

Need to add support for the <sup> and <sub> tags.

It would be great to add support for the <table> tags

Paragraph styles go in <w:pPr> elements; content alignment can only be applied at the paragraph level.
Run styles go in <w:rPr> elements.
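
The style mapping above could be sketched roughly as follows. The helper names style_to_wordml and bold_weight? are hypothetical, and two details are assumptions on my part: a leading # is stripped from color values, and a numeric font-weight of exactly 400 is treated as normal:

```ruby
# Hypothetical helper: translate a style="" attribute into WordML property tags.
# Returns { run: [...], paragraph: [...] } for <w:rPr> and <w:pPr> respectively.
ALIGNMENTS = {
  'left' => 'left', 'right' => 'right', 'center' => 'center', 'justify' => 'both'
}.freeze

# Assumption: keywords bold/bolder and numeric weights above 400 mean bold.
def bold_weight?(value)
  return true if %w[bold bolder].include?(value)

  value.match?(/\A\d+\z/) && value.to_i > 400
end

def style_to_wordml(style)
  props = { run: [], paragraph: [] }
  style.to_s.split(';').each do |declaration|
    name, value = declaration.split(':', 2).map(&:strip)
    next if name.nil? || name.empty? || value.nil? || value.empty?

    keyword = value.downcase
    case name.downcase
    when 'text-align'
      jc = ALIGNMENTS[keyword]
      props[:paragraph] << %(<w:jc w:val="#{jc}"/>) if jc
    when 'font-style'
      props[:run] << '<w:i/>' if keyword == 'italic'
      props[:run] << '<w:b/>' if keyword == 'bold'
    when 'font-weight'
      props[:run] << '<w:b/>' if bold_weight?(keyword)
    when 'color'
      # Assumption: strip a leading "#" and pass the rest through unchanged.
      props[:run] << %(<w:color w:val="#{value.delete('#')}"/>)
    when 'background-color'
      props[:run] << %(<w:shd w:val="clear" w:color="auto" w:fill="#{value.delete('#')}"/>)
    when 'text-decoration'
      props[:run] << '<w:u w:val="single"/>' if keyword == 'underline'
    end
  end
  props
end
```

Keeping the mapping in one case statement means a new style only needs one extra branch, and the run/paragraph split lets the same helper serve both <span> and block-level tags.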

Paragraph Information:
http://officeopenxml.com/WPparagraph.php
Text Information:
http://officeopenxml.com/WPtextFormatting.php
Table information:
http://officeopenxml.com/WPtable.php

Task List:

Implement support for the <span> tag
Implement support for a limited set of attributes specified by the style= attribute
- Do this in a flexible way so it can be relatively easy to add additional styles in the future
- Allow it to function on both the run and paragraph levels
Implement support for the <sup> and <sub> tags
- Ideally these would still work inside of a <span> tag
Implement support for parsing of HTML <table> tags

Implement a simple DOM to use with a template

Currently the environment instance serves as a poor man's document model by collecting various helper classes like numbering, footnotes and bookmarks. This system is inherently hard to scale. Some work on my live branch has laid a foundation for this, such as storing all of the XML files in a template in memory instead of simple sequential processing.

The general implementation would be as follows:

Add a lib/sablon/document_object_model directory
In that directory, class files could be written to handle specific aspects of the document, such as numbering.xml. A generic "document" class would also exist.
The files that get shuttled off to the various DOM subclasses would be defined in a single "initialization" method of some sort.
A general DOM class instance would also exist that does the "thinking" and would contain methods such as add_image, add_relationship, add_bookmark, add_list_definition, etc.
- The implementation details of these methods may be split between the "DOM" class and the file-specific subclass. Adding relationships would be an example: the DOM class would need to pick the proper file (i.e. document.xml => document.xml.rels) because a file-specific subclass would have no knowledge of the source of the relationship to be added.

Examples of files that might be repurposed into "dom classes" are lib/sablon/numbering.rb and lib/sablon/relationships.rb
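
As a rough illustration of the relationship case, here is a sketch in which the DOM object picks the .rels part for a given source file. The class and method names are hypothetical; the only external fact relied on is the Open Packaging Conventions layout, where relationships for word/document.xml live in word/_rels/document.xml.rels:

```ruby
# Hypothetical sketch of a template-wide "DOM" object; names are illustrative.
class DocumentObjectModel
  def initialize
    # One list of relationships per .rels part, created on first use.
    @relationships = Hash.new { |hash, rels_path| hash[rels_path] = [] }
  end

  # Map a part name to the .rels part that stores its relationships,
  # e.g. word/document.xml => word/_rels/document.xml.rels
  def rels_path_for(part_name)
    dir, file = File.split(part_name)
    File.join(dir, '_rels', "#{file}.rels")
  end

  # Record a relationship originating from +source_part+ and return its id.
  # A real implementation would also need to account for ids already
  # present in the template's .rels file.
  def add_relationship(source_part, type, target)
    rels = @relationships[rels_path_for(source_part)]
    id = "rId#{rels.length + 1}"
    rels << { id: id, type: type, target: target }
    id
  end

  def relationships_for(source_part)
    @relationships[rels_path_for(source_part)]
  end
end
```

The file-specific subclass never needs to know where a relationship came from; the DOM object resolves the source part to the right .rels file before delegating.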

Allow images to be imported with a partial

This could potentially be a bear to implement depending on how easily I can locate the image within Word's directory structure. Checking out some of the sablon forks that implemented image substitution within a merge field would be a good place to start.

Requires #2

Current issues when using partials

This is a checklist of issues that are encountered when importing content from a partial. They will be dealt with in individual issues and PRs.

Text header styles are not transferred across, inline style behaves just fine.
- Ideally we would only bring across the "Heading #" identifier and let the host document decide what style it should be.
Images won't be handled properly because the media folder isn't looked into
- I'll need to be careful injecting content like this in case the internal node IDs Word seems to use conflict.

Implement MS word comments in HTML?

Use a basic WordML xml file as partial

This is the most basic level of functionality but lays the foundation for higher level functionality.

I want partials to be named with a leading underscore to match the Rails ERB convention. They will be called in the document by name, omitting the leading underscore, in a special merge field: «partial:filename». Partials will be assumed local to the file they are within; down the road I will add the ability to check a 'templates/shared' folder if the local lookup fails.

I still need to do the merge field substitution on this partial so it may be easier to recursively call the merge parser and then inject the final product into the document. In theory that would allow for nested partials.
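
The lookup rules could be sketched as below. The merge field prefix and the shared folder come from the description above; the helper names, the .xml extension and the allowed characters in a partial name are assumptions:

```ruby
# Hypothetical lookup helpers for «partial:filename» merge fields.
PARTIAL_FIELD = /\Apartial:(?<name>[\w-]+)\z/

# Candidate paths in lookup order: next to the template first, then shared.
def partial_candidates(field, template_dir, shared_dir = 'templates/shared')
  match = PARTIAL_FIELD.match(field)
  return [] unless match

  file = "_#{match[:name]}.xml" # leading underscore, as with Rails partials
  [File.join(template_dir, file), File.join(shared_dir, file)]
end

# Return the first candidate that exists on disk, or nil when none does.
def find_partial(field, template_dir, shared_dir = 'templates/shared')
  partial_candidates(field, template_dir, shared_dir).find { |path| File.file?(path) }
end
```

Nested partials would then fall out of running the merge parser on the located file before injecting its result, as long as something guards against a partial including itself.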

In the current implementation of Sablon, a WordML injection replaces the entire paragraph with the content. This will also occur when using a partial.

Reference issue on original project: senny#40

Refactor how footnotes are handled

Currently the way I handle footnotes is half baked. I allow them inside the regular HTML insertion content because I couldn't think of a better way at the time. I have now thought of a better way.

The new method will use a content wrapper, Sablon::Content::Footnote. Keys in the context hash will be of the form footnote:name, or the value will already be wrapped in that content type. The value of the key can be one of three types; any other type will raise an error. Examples below.

# New footnote insertion format examples
context = {
  # This first "plain text" format is inserted directly as the footnote text
  # with no changes (i.e. String insertion).
  'footnote:address' => '123 Example Dr. Orlando, FL',
  #
  # Insertion of HTML content (WordML would work exactly the same)
  # If content type is missing the logic falls back to Sablon::Content::Content#wrap
  'footnote:reference' => {
    content_type: :html,
    content: '<em>Title</em> - author name'
  },
  'footnote:reference2' => {
    content: Sablon.content(:html, '<em>Title</em> - author name')
  },
  'footnote:reference3' => Sablon.content(:footnote, Sablon.content(:html, '<em>Title</em> - author name'))
}

As seen in the example above, footnote content gets wrapped twice: first so Sablon knows what to do with it, and second to define how the actual content going inside the w:footnote tags is structured. The final result, regardless of the starting point, is a WordML insertion into the footnotes.xml file. When starting with plain text the full XML structure is generated, subbing in the content. When starting from HTML or WordML the footnote reference run is added automatically if it is missing. If needed, a wrapping paragraph tag will be added, along with a pStyle attribute with a value of "FootnoteText" if pStyle is missing. This means that if the user supplies a <p> or <div> tag when defining the footnote, they will need to set the pStyle attribute appropriately, as it is only checked after being converted to WordML.
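
The double wrapping could be normalized along these lines. This is only a sketch: Wrapped is a stand-in struct so the snippet runs without Sablon, the helper names are hypothetical, and defaulting a missing content type to :string is a simplification of the fallback to the wrap logic described above:

```ruby
# Stand-in for Sablon's content wrappers so the sketch runs on its own.
Wrapped = Struct.new(:type, :body)

# Normalize any of the three accepted value types into
# Wrapped(:footnote, Wrapped(<inner type>, <content>)).
def normalize_footnote(value)
  case value
  when String
    Wrapped.new(:footnote, Wrapped.new(:string, value))
  when Hash
    inner = value.fetch(:content)
    unless inner.is_a?(Wrapped)
      # Assumption: a missing content type is treated as plain text here.
      inner = Wrapped.new(value.fetch(:content_type, :string), inner)
    end
    Wrapped.new(:footnote, inner)
  when Wrapped
    raise ArgumentError, 'expected footnote content' unless value.type == :footnote

    value
  else
    raise ArgumentError, "unsupported footnote value: #{value.class}"
  end
end

# Collect footnotes from a context hash, keyed by name without the prefix.
# Values already wrapped as footnotes are accepted under any key.
def extract_footnotes(context)
  context.each_with_object({}) do |(key, value), notes|
    name = key.to_s[/\Afootnote:(.+)\z/, 1]
    name ||= key.to_s if value.is_a?(Wrapped) && value.type == :footnote
    notes[name] = normalize_footnote(value) if name
  end
end
```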

# sample XML for a footnote
<w:footnotes>
  ...
  <w:footnote w:id="3">
    <w:p>
      <w:pPr>
        <w:pStyle w:val="FootnoteText"/>
      </w:pPr>
      <w:r>
        <w:rPr>
          <w:rStyle w:val="FootnoteReference"/>
        </w:rPr>
        <w:footnoteRef/>
      </w:r>
      <w:r>
        <w:t xml:space="preserve"> Footnote text content </w:t>
      </w:r>
    </w:p>
  </w:footnote>
  ...
</w:footnotes>
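
For the plain-text case, generating that structure could look roughly like this. The function name plain_text_footnote is hypothetical, and the escaping is a minimal stand-in for whatever XML escaping the real code would use:

```ruby
# Hypothetical builder for the plain-text case: wraps a string in the full
# <w:footnote> structure shown above.
def plain_text_footnote(id, text)
  # Escape the characters that are unsafe inside XML text content.
  escaped = text.gsub(/[&<>]/, '&' => '&amp;', '<' => '&lt;', '>' => '&gt;')
  <<~XML
    <w:footnote w:id="#{Integer(id)}">
      <w:p>
        <w:pPr>
          <w:pStyle w:val="FootnoteText"/>
        </w:pPr>
        <w:r>
          <w:rPr>
            <w:rStyle w:val="FootnoteReference"/>
          </w:rPr>
          <w:footnoteRef/>
        </w:r>
        <w:r>
          <w:t xml:space="preserve"> #{escaped}</w:t>
        </w:r>
      </w:p>
    </w:footnote>
  XML
end
```

The HTML and WordML cases would instead check the converted output for a missing reference run or pStyle and patch those in.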

Footnote insertion into a document will take two forms. First, if the name is used directly in a merge field of the document itself, a footnote reference tag will be inserted; this will be the "standard" way supported in the upstream package. In my fork I'll reimplement the tag. The main benefit of this route is that I'll have all of my footnotes defined prior to processing the document.xml file. This means I can use the context to resolve footnote ids at parse time instead of afterwards, allowing me to remove the @env.footnotes.update_refereces call in converter.rb.

# Example of reference tag XML inserted into document.xml
<w:r>
  <w:rPr>
    <w:rStyle w:val="FootnoteReference"/>
  </w:rPr>
  <w:footnoteReference w:id="4"/>
</w:r>

I'm not sure how best to handle using the same footnote more than once. Currently I duplicate the footnote and allow the insertion. A better option might be to throw an error since it is technically disallowed by MS Word. Reuse of footnotes would require deliberate duplication in the context.
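
If the error route is taken, the check could be as small as the sketch below. FootnoteTracker and its methods are hypothetical names, and it assumes footnote ids have already been assigned by name before document.xml is processed:

```ruby
# Hypothetical guard that rejects a second reference to the same footnote.
class FootnoteTracker
  class DuplicateReference < StandardError; end

  # +ids_by_name+ maps footnote names to the ids assigned in footnotes.xml.
  def initialize(ids_by_name)
    @ids_by_name = ids_by_name
    @used = {}
  end

  # Return the id for +name+, raising if it was already referenced once.
  def reference!(name)
    id = @ids_by_name.fetch(name) { raise KeyError, "unknown footnote: #{name}" }
    raise DuplicateReference, "footnote #{name} is already referenced" if @used[name]

    @used[name] = true
    id
  end
end
```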

Closing note: This should also be applicable to endnotes with a few tweaks to element names, styles, etc. I also might be able to implement this as a "configuration only" option with the right changes to implement a basic DOM.

Change Numbering class from a singleton into being registered on the Context object

Implement code coverage

This would be a nice-to-have feature, not a requirement.

Implement equations

This will be highly dependent on how complex the math markup is. I'll probably create a pseudo tag <math> or <equation> to maintain semantic elements.

Use a regular docx file as a partial, only for simple text

Using WordML markup is simple but generating and maintaining the WordML xml snippets will be a major hassle. Instead we should be able to pull content from an existing docx file to greatly simplify the maintenance burden.

Word documents are very complex and the entire document's xml code should not be injected into the document calling the partial. The desired content will be indicated by two special merge fields: «beginPartial» and «endPartial».

Only content in between those nodes will be kept, not including the nodes themselves. This should be essentially an equivalent XML stream to that handled by direct use of WordML. During initial implementation it may be beneficial to run docx and xml injection side-by-side to test for consistency.
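
The slicing step could be sketched as follows, working on a list of paragraph XML strings for simplicity. This assumes each marker merge field sits in its own paragraph and can be recognized by the MERGEFIELD instruction text; the function name is hypothetical, and a real version would walk XML nodes rather than match strings:

```ruby
# Patterns for the marker fields, matched against the field instruction text.
BEGIN_MARK = /MERGEFIELD\s+"?beginPartial\b/
END_MARK   = /MERGEFIELD\s+"?endPartial\b/

# Keep only the paragraphs strictly between the two marker paragraphs.
def partial_body(paragraphs)
  start = paragraphs.index { |para| para.match?(BEGIN_MARK) }
  stop  = paragraphs.index { |para| para.match?(END_MARK) }
  raise ArgumentError, 'missing beginPartial or endPartial field' unless start && stop
  raise ArgumentError, 'endPartial appears before beginPartial' unless start < stop

  paragraphs[(start + 1)...stop]
end
```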

Requires #1

Make Sablon use live XML instead of strings until files are written

Using live XML instead of strings when working with sablon is the first step towards a formal document object model and will allow a high degree of flexibility, because references to given portions of the document (i.e. bookmarks or footnote refs) can be maintained. This removes the need for an "ast_to_docx" hook to do things like update my footnote references. This conversion will take some effort, but I think it is worthwhile, especially when we want to venture into other territories like adding media, docx partials, etc.

Basically instead of having the to_docx method return a string it would return an XML node or NodeSet that would be injected into the document at the proper location.

Implement a way to add footnotes, table and figure references

Maybe done using bookmarks? Probably needs stuff in a rels file as well

Allow headers and footer text content to be filled by a partial

Simple text injection into headers and footers would be very handy. Additional features such as setting up the page numbering and such would be an additional win but may be more complicated than it's worth.

This will take some experimentation because I am not sure exactly how content in headers and footers gets broken down.