
cm-uclii's Introduction

cm-uclii

Data and progress tracking for table extraction and semantically guided content enhancement

cm-uclii's People

Contributors

jkbcm

Forkers

jkbcm ikivanc

cm-uclii's Issues

Processing multi-row column headers discards all but one row

Many tables in this corpus, and in similar papers, use multiple rows of column headers to provide nested (tree-structured) headings.

Example:
nestedtableheadingexample_svg

Currently, only one row of column headers is preserved by norma. This affects the HTML and CSV outputs.

In this example, the upper row of headings (MUSIC | CBTE | CBTT | Total), which is intended as column headings, is missing from both output formats for the table:

HTML:
nestedtableheadingexample_html

CSV:
[extract from start of file]

"Demographic variable ","(n = 19) ","(n = 18) ","(n = 18) ","(N = 55) "
Age,40.37 (9.64),40.72 (11.02),41.39 (12.73),40.61 (10.99)
Gender (female),7 (37%),12 (67%),6 (32%),25 (45%)
Indigenous status (Aboriginal),0 (0%),2 (11%),0 (0%),2 (4%)

Cause:

The current table output format uses a simple HTML4/XHTML structure:

<table>
 <caption />
 <tr><th></th> ... </tr>
 <tr><td></td> ... </tr>
 ...
</table>

Only one header row (i.e., a row of the form <tr><th></th> ... </tr>) is included. It is added in the svg2xml module, in the addHeader method of TableContentCreator.java.

One solution, which would also simplify future development, is to use the higher-level table-structuring elements introduced in HTML4. Specifically, the HTML4 DTD defines a table as:

<!ELEMENT TABLE - -
     (CAPTION?, (COL*|COLGROUP*), THEAD?, TFOOT?, TBODY+)>

So <thead> could be used to group multiple header rows semantically. Downstream processing could then distinguish header rows from rows of observation data without needing extra attributes on each row.
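
As an illustration of the proposed structure, here is a minimal sketch that builds a table with two header rows inside <thead>, using only the JDK's DOM classes. The class TheadTableSketch and its methods are hypothetical names and are not part of norma or svg2xml. The empty first cell in the upper header row is an assumption about how the nested headings line up, and the caption is omitted for brevity.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper for illustration; not existing norma/svg2xml code.
class TheadTableSketch {

    /** Builds <table><thead>...</thead><tbody>...</tbody></table>. */
    static Element buildTable(Document doc, List<List<String>> headerRows,
                              List<List<String>> bodyRows) {
        Element table = doc.createElement("table");
        table.appendChild(buildSection(doc, "thead", "th", headerRows));
        table.appendChild(buildSection(doc, "tbody", "td", bodyRows));
        return table;
    }

    /** Wraps each row in <tr> and each cell in the given cell element. */
    static Element buildSection(Document doc, String sectionName, String cellName,
                                List<List<String>> rows) {
        Element section = doc.createElement(sectionName);
        for (List<String> row : rows) {
            Element tr = doc.createElement("tr");
            for (String text : row) {
                Element cell = doc.createElement(cellName);
                cell.setTextContent(text);
                tr.appendChild(cell);
            }
            section.appendChild(tr);
        }
        return section;
    }

    /** Serializes an element to a string without the XML declaration. */
    static String serialize(Element element) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(element), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** The table from this issue, with both header rows kept. */
    static Element example() {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        List<List<String>> headers = List.of(
            // Assumption: the upper row has an empty cell above the first column.
            List.of("", "MUSIC", "CBTE", "CBTT", "Total"),
            List.of("Demographic variable", "(n = 19)", "(n = 18)", "(n = 18)", "(N = 55)"));
        List<List<String>> body = List.of(
            List.of("Age", "40.37 (9.64)", "40.72 (11.02)", "41.39 (12.73)", "40.61 (10.99)"));
        return buildTable(doc, headers, body);
    }

    public static void main(String[] args) {
        System.out.println(serialize(example()));
    }
}
```

With this layout, every row inside <thead> is a header row by position alone, so no per-row marker attribute is needed.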

Impact: Downstream processing would need a small refactor to handle tables that mix bare <tr> rows with the grouping elements <thead>, <tbody> and <tfoot>. This would affect svg2xml, html and possibly other modules.
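
The kind of refactor described above might look like the following sketch, which reads rows from either the current flat layout or the grouped layout. TableRowReader is a hypothetical class, not code from the svg2xml or html modules. It assumes well-formed XHTML with lower-case tag names, and it assumes that in the flat layout a header row is a row whose cells are all <th>.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical reader for illustration; not existing svg2xml/html module code.
class TableRowReader {

    final List<Element> headerRows = new ArrayList<>();
    final List<Element> bodyRows = new ArrayList<>();
    final List<Element> footerRows = new ArrayList<>();

    /** Sorts the rows of a <table> into header, body and footer rows. */
    static TableRowReader read(Element table) {
        TableRowReader reader = new TableRowReader();
        for (Element child : childElements(table)) {
            switch (child.getTagName()) {
                case "thead":
                    reader.headerRows.addAll(childElements(child));
                    break;
                case "tbody":
                    reader.bodyRows.addAll(childElements(child));
                    break;
                case "tfoot":
                    reader.footerRows.addAll(childElements(child));
                    break;
                case "tr":
                    // Flat layout: classify the row by its cells.
                    (isHeaderRow(child) ? reader.headerRows : reader.bodyRows).add(child);
                    break;
                default:
                    break; // caption, col, colgroup carry no rows
            }
        }
        return reader;
    }

    /** Assumption: a flat-layout header row has at least one cell and only <th> cells. */
    static boolean isHeaderRow(Element tr) {
        List<Element> cells = childElements(tr);
        if (cells.isEmpty()) {
            return false;
        }
        for (Element cell : cells) {
            if (!"th".equals(cell.getTagName())) {
                return false;
            }
        }
        return true;
    }

    /** Returns the direct element children of a node, skipping text nodes. */
    static List<Element> childElements(Element parent) {
        List<Element> result = new ArrayList<>();
        NodeList nodes = parent.getChildNodes();
        for (int i = 0; i < nodes.getLength(); i++) {
            if (nodes.item(i).getNodeType() == Node.ELEMENT_NODE) {
                result.add((Element) nodes.item(i));
            }
        }
        return result;
    }

    /** Parses a well-formed XHTML fragment and returns its root element. */
    static Element parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml))).getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Because the flat layout is still accepted, existing output files would remain readable while modules migrate to the grouped form.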
