mc-pohp-utils

Utility for publishing collections of book-length interview documents on the web. For now the main bit of functionality is docx2html.py, which translates docx-formatted interviews into HTML.

Setup

$ git clone https://github.com/milesefron/mc-pohp-utils.git
$ cd mc-pohp-utils
$ pip install -r requirements.txt

Running

This program looks for .docx files (assuming that these will be in the format of Miller Center POHP interviews). It will translate each .docx file it finds into a corresponding .html file, in the same directory as the .docx file.

There are three ways to run this program.

First, with no arguments like so:

$ python docx2html.py

In this case, the program will look for .docx files in $HOME/poh. Note that it does not walk the directory tree; it only lists the files directly inside that directory.

The second method is to supply a different directory:

$ python docx2html.py /path/to/directory-full-of-docx-files

Lastly, you can point the program at a single .docx file:

$ python docx2html.py path-to-docx-file.docx

When the program is given a directory, it will ignore any files that don't have a .docx file extension. If there are Word documents that aren't bona fide POHP interviews, the results will be unpredictable and probably not useful.
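The three invocation modes can be sketched roughly as follows. This is a minimal illustration, not the code in docx2html.py: the function names find_docx_files and html_path_for are hypothetical, and the exact-case match on the .docx extension is an assumption.

```python
from pathlib import Path

# Default location described above: $HOME/poh
DEFAULT_DIR = Path.home() / "poh"


def find_docx_files(target=None):
    """Return the .docx files for a directory, a single file, or the default.

    Hypothetical helper: the real program may organize this differently.
    """
    path = Path(target) if target is not None else DEFAULT_DIR
    if path.is_file():
        # Single-file mode: accept it only if it looks like a Word document.
        return [path] if path.suffix == ".docx" else []
    # Shallow listing only: iterdir() does not descend into subdirectories.
    return sorted(
        p for p in path.iterdir() if p.is_file() and p.suffix == ".docx"
    )


def html_path_for(docx_path):
    """Name the output file: same directory, same stem, .html extension."""
    return Path(docx_path).with_suffix(".html")
```

Calling find_docx_files() with no argument corresponds to the default invocation, while passing a directory or a file path corresponds to the second and third modes.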

There is also a -v (or --verbose) flag. It only has an effect if something goes wrong while parsing or processing a document, in which case the full stack trace is shown.
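For readers who want to see how such a command line might be wired up, here is a minimal sketch using the standard argparse module. Only the optional path argument and the -v/--verbose flag come from the description above; build_parser, process_safely, and the message wording are assumptions made for illustration.

```python
import argparse
import traceback


def build_parser():
    """Build a parser with an optional path and a -v/--verbose flag."""
    parser = argparse.ArgumentParser(
        description="Translate POHP .docx interviews into HTML."
    )
    parser.add_argument(
        "path", nargs="?", default=None,
        help="a directory of .docx files or a single .docx file",
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true",
        help="show stack traces when a document fails to process",
    )
    return parser


def process_safely(path, convert, verbose=False):
    """Run convert(path); show the stack trace only in verbose mode."""
    try:
        convert(path)
        return True
    except Exception as exc:  # report the failure instead of crashing
        print(f"Problem processing {path}: {exc}")
        if verbose:
            traceback.print_exc()
        return False
```

Here convert stands in for whatever function does the actual translation; it is passed in so the error handling can be read on its own.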

Understanding

This program is intended for use by Miller Center POHP staff. We often need to translate multiple long-form interview files from docx to html. This only works because the interviews follow a highly structured format. This software has been tested on over 100 interviews and works well on them. But it is somewhat brittle and its output should be checked before publishing results, especially in these cases:

  • Redactions. Be sure that all redactions seem to have been rendered properly.
  • Name headings. If interview personnel have unusually formatted names (e.g. with multiple titles or middle names), check closely that they appear correctly.
  • Starting line. Most interviews have frontmatter before the interview itself, which should be skipped by this program. But that means the software needs to detect where the interview proper begins. Again, this worked in our training examples, but a spot check is still a good idea.

If you need to alter the way the program handles interviews, you'll want to edit one of two files in this repo:

  • docx2html.py, which is where the heavy lifting happens
  • interview_utils.py, which contains some basic helper code.

Troubleshooting

One thing to be aware of: this program tries to figure out who is speaking by checking the text at the beginning of each line:

  • is it bolded?
  • does it follow a name-ish regex?
  • does it end with a colon?

That allows us to identify Thomas Jefferson as the speaker when a line starts like so (using HTML as a proxy for Word):

<p><b>Thomas Jefferson:</b> Hi, I am the president.</p>

This works in almost all cases. However, every now and then you'll see something like this instead:

<p><b>Thomas Jefferson</b>: Hi, I am the president.</p>

i.e. the colon is OUTSIDE of the boldface. We originally tried to resolve these automatically, but found that there were unpredictable edge cases that made this too risky. As such, you may see the program bomb out with an error like this:

/Users/foo/poh/zzz_donezo/Alpha_Beta.docx
FOUND NON-BOLDED COLON IN NAME SLUG.  You should open the docx file and edit the bolding in this line...
Chanin: Oh yes. A fourth type of meeting included very large sessions in the East Room with the President speaking on a particular issue. At the deputy level this required you to watch the logistics. We had an excellent bunch of people to handle this area on our staff. As you know the mesh of logistics with politics happens very quickly.

This means that you should open the docx file /Users/foo/poh/zzz_donezo/Alpha_Beta.docx, find the line shown, and move the colon inside the boldface.

These should be rarities, but now you've been warned.
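To make the speaker check and the colon error easier to picture, here is a minimal sketch of that logic. It is an illustration under stated assumptions, not the code in docx2html.py: a paragraph is modeled as a list of (text, is_bold) pairs rather than real Word runs, and NAME_RE, NonBoldedColonError, and find_speaker are hypothetical names. The actual name-ish regex may well be different.

```python
import re

# Hypothetical "name-ish" pattern: capitalized words that may contain
# periods, apostrophes, or hyphens. The real regex is not shown in this README.
NAME_RE = re.compile(r"^[A-Z][\w.'\-]*(?: [A-Z][\w.'\-]*)*$")


class NonBoldedColonError(ValueError):
    """Raised when a bold name is followed by a colon outside the boldface."""


def find_speaker(runs):
    """Return the speaker's name for one paragraph, or None.

    runs is a list of (text, is_bold) pairs in document order.
    """
    if not runs:
        return None
    first_text, first_bold = runs[0]
    if not first_bold:
        return None
    slug = first_text.strip()
    if slug.endswith(":"):
        name = slug[:-1].strip()
        return name if NAME_RE.match(name) else None
    # Bold name without a colon: is the colon sitting just outside the bold?
    if NAME_RE.match(slug) and len(runs) > 1:
        if runs[1][0].lstrip().startswith(":"):
            raise NonBoldedColonError(
                "FOUND NON-BOLDED COLON IN NAME SLUG: " + slug
            )
    return None
```

With this model, [("Thomas Jefferson:", True), (" Hi, I am the president.", False)] yields the speaker, while moving the colon into the second, non-bold run raises the error instead of guessing.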

Addendum: Uploading PDF Materials to AWS

This repo also contains a program that will upload PDF materials (transcripts and/or briefing books) to their relevant locations in AWS for web display. The program is upload-materials.py, and it is invoked in much the same way as docx2html.py. For instance, invoking it like so:

$ python upload-materials.py

will look for PDF files in $HOME/poh, sending any file whose name includes interview.pdf or backgroundmaterials.pdf to its respective AWS S3 bucket. As the program uploads the files, it logs each uploaded file's public URL for inclusion on the web.

You can also feed this program a single file to upload, as in:

$ python upload-materials.py /path/to/file.backgroundmaterials.pdf

or

$ python upload-materials.py /path/to/file.interview.pdf

Lastly, you can give this program a directory to process, just as in the default invocation:

$ python upload-materials.py /path/to/directory/full/of/files

This sends any files in that directory whose names include interview.pdf or backgroundmaterials.pdf to their respective AWS S3 buckets, logging each public URL as it goes.
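The routing step can be sketched as below. Everything specific here is an assumption: the bucket names are placeholders (the real ones are not given in this README), matching on the end of the filename is inferred from the example paths, and the URL shape is the generic virtual-hosted S3 form rather than anything confirmed for this project. The upload itself is deliberately left out.

```python
from pathlib import Path
from urllib.parse import quote

# Placeholder bucket names, for illustration only.
BUCKETS = {
    "interview.pdf": "example-pohp-transcripts",
    "backgroundmaterials.pdf": "example-pohp-briefing-books",
}


def bucket_for(filename):
    """Pick a bucket from the end of the filename, or None to skip the file."""
    name = Path(filename).name
    for suffix, bucket in BUCKETS.items():
        if name.endswith(suffix):
            return bucket
    return None


def public_url(bucket, key):
    """Build a virtual-hosted-style S3 URL for an object key."""
    return f"https://{bucket}.s3.amazonaws.com/{quote(key)}"


def plan_uploads(paths):
    """Return (path, bucket, url) for every path that has a matching bucket."""
    plan = []
    for path in paths:
        bucket = bucket_for(path)
        if bucket is not None:
            plan.append((path, bucket, public_url(bucket, Path(path).name)))
    return plan
```

Separating the routing decision from the upload makes it possible to preview which files would go where before anything is sent.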
