gawati / gawati-data

Gawati Data server is a component of the Gawati application, a legal data exchange platform.

License: Other

XQuery 99.79% HTML 0.21%

akoma-ntoso portal xml xquery

gawati-data's Issues

Gawati Content Extraction and Relation with Gawati AKN metadata

Content Extraction and Mapping with Akoma Ntoso Documents

Gawati separates the metadata of a document from the actual document itself.
By actual document we mean the PDF, Word, or HTML legal document uploaded by the user into the system.

Relation between AKN and PDF documents

Currently the relation between the AKN metadata document and the PDF (for example) is made via an XML reference.

The AKN metadata document provides a standard way to reference the document itself, via the <an:identification> block. If you open any AKN document you will see this part of the document at akomaNtoso->(doc type)->meta->identification:

<an:identification source="#gawati">
        <an:FRBRWork>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/!main"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961"/>
            <an:FRBRdate name="Work Date" date="1961-05-18"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRcountry value="za" showAs="South Africa"/>
            <an:FRBRnumber value="gn_no_47-1961" showAs="GN No. 47/1961"/>
            <an:FRBRprescriptive value="false"/>
            <an:FRBRauthoritative value="false"/>
        </an:FRBRWork>
        <an:FRBRExpression>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@"/>
            <an:FRBRdate name="Expression Date" date="1961-05-18"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRlanguage language="eng"/>
        </an:FRBRExpression>
        <an:FRBRManifestation>
            <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.xml"/>
            <an:FRBRuri value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/.akn"/>
            <an:FRBRdate name="Manifestation Date" date="2016-03-30"/>
            <an:FRBRauthor href="#author"/>
            <an:FRBRformat value="xml"/>
        </an:FRBRManifestation>
    </an:identification>

Just a quick explanation here of work / expression and manifestation. The Work refers to the Legislation in general - in this case the "Republic of South Africa (Temporary Provisions) Act 1961". The act can have multiple amendments over the years and can be published in different languages and formats - the "Work" encompasses everything. The expression is a specific published version of the Act at a specific point in time, indicated by the Expression date (FRBRExpression->FRBRdate). The manifestation is even more specific and talks about a specific format.

The XML document above is typically referenced via the Expression IRI (Internationalized Resource Identifier):

  <an:FRBRthis value="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main"/>

If you look further down the AKN document you will find a reference to a PDF file:

        <an:body>
            <an:book refersTo="#mainDocument">
                <an:componentRef src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"
                    alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf" GUID="#embedded-doc-1"
                    showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"/>
            </an:book>
        </an:body>

Here the <an:componentRef> provides a platform-independent way to resolve the PDF document. Let's examine it closely:

        <an:componentRef
            src="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf"
            alt="akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf"
            GUID="#embedded-doc-1"
            showAs="The Republic of South Africa (Temporary Provisions) Act, 1961"
        />

The first attribute, @src, is the FRBRManifestation IRI of the PDF. In Gawati the binary document (the PDF) is stored on the file system, in a folder derived from that IRI.

So with an IRI like:

/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf

the folder part would be:

/akn/za/act/1961-05-18/gn_no_47-1961/eng@/

The @alt attribute identifies the actual file name:

akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf

So the actual path of the file within the file system repository of PDFs would be:

/akn/za/act/1961-05-18/gn_no_47-1961/eng@/akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf
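
The resolution rule described above can be sketched in a few lines of Python. This is only an illustration: the function name pdf_repository_path is hypothetical, and the result is relative to wherever the PDF repository root actually lives on disk.

```python
from pathlib import PurePosixPath


def pdf_repository_path(src: str, alt: str) -> str:
    """Join the folder part of @src with the file name given in @alt."""
    # The parent of ".../eng@/!main.pdf" is the folder ".../eng@".
    folder = PurePosixPath(src).parent
    return str(folder / alt)


print(pdf_repository_path(
    "/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf",
    "akn_za_act_1961-05-18_gn_no_47-1961_eng_main.pdf",
))
```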

Content Extraction

By Content Extraction we mean extracting the content from the PDF (eventually other formats) and associating it with the AKN metadata of the same document. We want to get this extracted content into an XML document, so we can search the AKN metadata and correlate it easily with the extracted content by keeping them in the same context. The link between the extracted content and the AKN metadata document is, again, the AKN IRI of the document.

A specific document structure will be used to hold the extracted content:

<document xmlns="http://gawati.org/ns/text/1.0">
   <source>
      <work iri="/akn/za/act/1961-05-18/gn_no_47-1961/!main" />
      <expression iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main" />
      <manifestation iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf" />
   </source>
   <text>
      <page no="1">
         page 1 text in terms of words, semantic lines, sentences etc...
      </page>
      <page no="2">
         page 2 text...
      </page>
      <page no="3">
        page 3 text ....
      </page>
      <page no="4">
        page 4 text...
      </page>
    .....
   </text>
</document>

Here the <source> element provides a reference point to the AKN metadata document. Each of the @iri attributes on <work ... />, <expression ... /> and <manifestation ... /> refers to the <FRBRthis value=... /> within <FRBRWork>, <FRBRExpression> and <FRBRManifestation> respectively.

   <source>
      <work iri="/akn/za/act/1961-05-18/gn_no_47-1961/!main" />
      <expression iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main" />
      <manifestation iri="/akn/za/act/1961-05-18/gn_no_47-1961/eng@/!main.pdf" />
   </source>

The actual extracted text is within the <text> element, and the content is split by page number. The page number is an important structural aspect when the source document is in a binary format like PDF or Word, where page numbers are significant (if the source document were in HTML or XML, the page number would be irrelevant). Having the page number allows us to link from search results of the content directly to the specific page in the PDF.
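
As a rough sketch, a document in this structure could be assembled from a list of per-page texts as follows. The function name build_text_document is hypothetical, and how the page texts are obtained from the PDF is left out entirely.

```python
import xml.etree.ElementTree as ET

NS = "http://gawati.org/ns/text/1.0"  # namespace taken from the structure above


def build_text_document(work, expression, manifestation, pages):
    """Return the extracted-text XML as a string; `pages` is a list of page texts."""
    ET.register_namespace("", NS)  # serialize with a default namespace
    doc = ET.Element(f"{{{NS}}}document")
    source = ET.SubElement(doc, f"{{{NS}}}source")
    for name, iri in (("work", work),
                      ("expression", expression),
                      ("manifestation", manifestation)):
        ET.SubElement(source, f"{{{NS}}}{name}", iri=iri)
    text = ET.SubElement(doc, f"{{{NS}}}text")
    # Page numbers start at 1, matching the @no attribute in the structure.
    for number, content in enumerate(pages, start=1):
        page = ET.SubElement(text, f"{{{NS}}}page", no=str(number))
        page.text = content
    return ET.tostring(doc, encoding="unicode")
```

Keeping one <page> element per page preserves the page number that search results need in order to link back into the PDF.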

dateTimes in documents need to be time zone agnostic

Refer to this thread, specifically:

So this is the mix so to speak:
Partners are in different time zones to each other (and to ALL)
ALL is in a different time zone.
It might seem that recording date-time in UTC and capturing the time-zone offset should suffice for safely syncing data across servers. But then I realized that time zone information itself changes more often than imagined (see for instance https://www.timeanddate.com/news/time/ : there were 22 changes to time zones in 2017 alone).

So to be safe the document needs to capture created & modified dates in terms of 3 data points to be always functional:

dateTime (in local Date Time of the partner)

timezone Offset to UTC (in hours)

timezoneLocation (the nearest timezone location to where the data is being entered .. see the interactive map here: https://momentjs.com/timezone/ )

We then use this information to calculate the corresponding dateTime in UTC for cross geographical dateTime comparisons ...

The other option is to store the dateTime (1) as UTC itself...and then render that to local dateTime wherever the user sees the created / modified date on the UI...

We need to change how created and modified dates are captured in the gawati documents. Currently they use a 'datetime with timezone' format, but don't capture timezone location info. Similarly, date comparisons must be made in UTC and not in the native local times (e.g. when ordering documents by date).
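
A minimal sketch of the three data points and the UTC comparison, assuming illustrative field names (local_datetime, offset_hours, timezone_location) that are not part of any existing gawati schema. The UTC value is derived from the recorded offset only; the location is carried along so it can be re-interpreted if time zone rules change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class DocumentTimestamp:
    """The three data points proposed above (field names are illustrative)."""
    local_datetime: str      # local dateTime of the partner, ISO 8601, no offset
    offset_hours: float      # timezone offset to UTC, in hours
    timezone_location: str   # nearest timezone location, kept for reference

    def to_utc(self) -> datetime:
        """Attach the recorded offset, then convert to UTC."""
        local_zone = timezone(timedelta(hours=self.offset_hours))
        local = datetime.fromisoformat(self.local_datetime).replace(tzinfo=local_zone)
        return local.astimezone(timezone.utc)


nairobi = DocumentTimestamp("2018-03-01T10:00:00", 3, "Africa/Nairobi")
lagos = DocumentTimestamp("2018-03-01T09:30:00", 1, "Africa/Lagos")

# Ordering by UTC puts the Nairobi entry first, although its local clock time is later.
ordered = sorted([lagos, nairobi], key=DocumentTimestamp.to_utc)
```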

Presenting PDF files (and other doc types) in the browser

PDF files (and other formats like DOCX) pose a challenge for presenting content online. PDF viewers for browsers are complex software by themselves and there is no consistent standard for presenting PDFs across mobile and desktop browsers. Formats like DOCX can be converted to PDF and made available for presentation.

Approach 1 - present pdf directly

Large PDF files load slowly, because even viewing the first few pages requires loading the full PDF document into the browser. Currently we follow this approach.

An alternative is to process the PDF files into linearized PDFs, using something like qpdf.

This still presents a problem of loading a single pdf.
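
For reference, linearizing with qpdf could be wrapped like this. It is a sketch that assumes the qpdf command-line tool is installed and on the PATH; the helper names are hypothetical.

```python
import subprocess


def linearize_command(source_pdf: str, target_pdf: str) -> list:
    """Build the qpdf invocation that writes a linearized copy of source_pdf."""
    return ["qpdf", "--linearize", source_pdf, target_pdf]


def linearize(source_pdf: str, target_pdf: str) -> None:
    """Run qpdf; raises CalledProcessError if qpdf exits with an error."""
    subprocess.run(linearize_command(source_pdf, target_pdf), check=True)
```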

Approach 2 - convert a pdf to an image at runtime

PDF (or a specific page of a pdf) can be converted to an image at runtime and presented online. This allows on demand request of pages, and pages themselves are just images so they can be loaded across devices without a problem. This implies using an intermediate service to process the PDF page request into an image.

Approach 3 - preprocess the PDF into images

Convert the PDF into images in advance and serve the images when requested via the browser. The complete PDF can be made available for download. This approach is similar to Approach 2, but simpler because there is no intermediate service that processes the PDF. The downside is that disk-space usage immediately doubles, as the images are essentially duplicates of the file.
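
Approaches 2 and 3 both come down to rendering PDF pages as images, either one page on request or all pages in advance. The sketch below uses pdftoppm from poppler-utils as one possible renderer; that choice, the 100 dpi default and the helper names are assumptions rather than anything decided in this issue.

```python
import subprocess


def page_image_command(pdf: str, page: int, out_prefix: str, dpi: int = 100) -> list:
    """Approach 2: render a single page as PNG (-f/-l select first and last page)."""
    return ["pdftoppm", "-f", str(page), "-l", str(page),
            "-r", str(dpi), "-png", pdf, out_prefix]


def all_pages_command(pdf: str, out_prefix: str, dpi: int = 100) -> list:
    """Approach 3: render every page as PNG in one pass."""
    return ["pdftoppm", "-r", str(dpi), "-png", pdf, out_prefix]


def render(command: list) -> None:
    """Run the renderer; raises CalledProcessError if it fails."""
    subprocess.run(command, check=True)
```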

Approach 4 - using specialized tools that convert PDF to HTML "lookalikes"

See http://coolwanglu.github.io/pdf2htmlEX/

Analytics: Extract Law metadata

There is a large amount of African Legislation online on the ILO NATLEX portal. This is a UN agency website, which has curated different African legislation subject-wise and with keywords.

For the Gawati Project (https://www.gawati.org) we are trying to build a repository of African Legislation which can be searched in one place, but has been curated from different sources. ILO NATLEX is one such source.

ILO NATLEX site : http://www.ilo.org/dyn/natlex/natlex4.home?p_lang=en

We are interested only in African countries, and these can be found under country profiles:

http://www.ilo.org/dyn/natlex/natlex4.byCountry?p_lang=en

Here if you see Zimbabwe for example:
http://www.ilo.org/dyn/natlex/natlex4.countrySubjects?p_lang=en&p_country=ZWE

Each of the items here leads to a document citation:

E.g. clicking general provisions:
http://www.ilo.org/dyn/natlex/natlex4.listResults?p_lang=en&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87

If I click special economic zones act

http://www.ilo.org/dyn/natlex/natlex4.detail?p_lang=en&p_isn=104410&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87

This refers to a “Special Economic Zones Act” of the country; the page includes a link to the PDF.

There is important metadata here about the document:

Name, Country, Type, Official Date (Adopted On), ISN Number and citation text + PDF itself.
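
Two of these values, the ISN number and the country, are already present in the detail URL itself. A small sketch of reading them from the query string (the function name natlex_ids is hypothetical; the remaining fields would have to come from the page content):

```python
from urllib.parse import parse_qs, urlparse


def natlex_ids(detail_url: str) -> dict:
    """Read the ISN number and country code from a NATLEX detail URL."""
    query = parse_qs(urlparse(detail_url).query)
    return {"isn": query["p_isn"][0], "country": query["p_country"][0]}


print(natlex_ids(
    "http://www.ilo.org/dyn/natlex/natlex4.detail?p_lang=en&p_isn=104410"
    "&p_country=ZWE&p_count=395&p_classification=01&p_classcount=87"
))
```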

We need to extract this information into the official Akoma Ntoso XML format used by gawati.

Steps to Take

1 Akoma Ntoso XML

Here are sample documents in Akoma Ntoso XML format: the first archive (a) has the XML documents, and the second (b) has the PDF documents described by (a).
https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_xml_sample-1.2.zip
https://github.com/gawati/gawati-data-xml/releases/download/1.2/akn_pdf_sample-1.2.zip

2 Downloading Source Documents

For the ILO site, you first need to gather the raw data.
So, by African country (let's start with one country, Zimbabwe), download the source metadata (as shown earlier) and the associated PDF document.

3 Processing Downloaded Data

Next step is to process the downloaded raw meta-data, and convert that to Akoma Ntoso format. The PDF file need not be converted, but needs to be associated with the corresponding Akoma Ntoso document, as shown in “1 Akoma Ntoso XML” above.
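
Part of that conversion is deriving the AKN IRIs for each downloaded record. The sketch below follows the IRI pattern visible in the gawati sample documents. The function name akn_iris is hypothetical, and how a NATLEX record maps to the number segment is an open assumption. Note also that the samples use two-letter country codes such as "za", while NATLEX URLs use three-letter codes such as "ZWE".

```python
def akn_iris(country: str, doc_type: str, date: str, number: str,
             language: str = "eng") -> dict:
    """Derive the FRBRthis values and the PDF IRI for one document."""
    work = f"/akn/{country}/{doc_type}/{date}/{number}"
    expression = f"{work}/{language}@"
    return {
        "work": f"{work}/!main",
        "expression": f"{expression}/!main",
        "manifestation": f"{expression}/!main.xml",
        "pdf": f"{expression}/!main.pdf",   # value for componentRef/@src
    }


iris = akn_iris("za", "act", "1961-05-18", "gn_no_47-1961")
```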

Automatic tagging of text

We want to tag text automatically using known available algorithms.

The full text of the PDF is available in the akn_ft/ collection; the metadata of the PDF documents is in the akn/ collection.

We need to have a service that accepts document text (for an IRI) from the metadata and fulltext collections and returns a weighted set of probable tags for the text.

Some weightage will need to be given to the source of the provided text, e.g. a tag that occurs in the document title should have a higher weightage than text appearing in the body of the document.

These probable tags should then be saved back on the metadata document in the akn/ collection.

So to implement the service:

Get the data first: XML to text of the document in /akn_ft, and XML to text of the document in /akn. A plain XML-to-text conversion of the document in /akn will not be effective because a lot of the data is in attributes, so a custom XQuery that selectively picks text data should be written to send text to the service. To prototype and test the service you only need /akn_ft, since the data is already in text form there. Once the idea is validated, the extractor for /akn metadata can be added.
Implement the service: a Node.js backend which accepts the text or tagged text and returns probable tags.
Service response acceptor: receives the generated tags.
UI: needs to be implemented within a panel on gawati-editor-ui (this can be done later, after the service is implemented).
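
To make the weighting idea concrete, here is a deliberately simple sketch. It scores single-word tags by weighted term frequency, which is a stand-in for whichever tagging algorithm is eventually chosen. The weights, the vocabulary and the function name probable_tags are all hypothetical, and it is written in Python purely for illustration, not as the proposed Node.js service.

```python
import re
from collections import Counter

# Hypothetical weights: a hit in the title counts three times a hit in the body.
WEIGHTS = {"title": 3.0, "body": 1.0}


def probable_tags(sections: dict, vocabulary: list, top: int = 5) -> list:
    """Score each candidate tag by weighted term frequency across sections."""
    scores = Counter()
    for source, text in sections.items():
        words = Counter(re.findall(r"[a-z]+", text.lower()))
        for tag in vocabulary:
            scores[tag] += WEIGHTS.get(source, 1.0) * words[tag]
    # Keep only tags that actually occurred, highest score first.
    return [(tag, score) for tag, score in scores.most_common(top) if score > 0]


tags = probable_tags(
    {"title": "Electricity Regulations",
     "body": "These regulations amend the tariff for electricity supply. "
             "The tariff applies from January."},
    ["electricity", "tariff", "labour"],
)
```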

Performance: Use Sort Indexes to Improve performance

When the number of documents in the system increases, there is a significant degradation of the "Recent Documents" listing. This is not evident when the system has an optimal amount of memory, however during testing by allocating smaller amounts of RAM to the JVM, this problem became evident.

We need to apply sort indexes to the data, to improve the efficiency of pure listing queries (where data is not filtered).
(see https://exist-db.org/exist/apps/fundocs/view.html?uri=http://exist-db.org/xquery/sort&location=java:org.exist.xquery.modules.sort.SortModule )

Current TTFB for listings with over 50,000 docs is 66 seconds.

PDF full text search integrated with XML search

Currently only the Akoma Ntoso XML metadata document is searchable in Gawati.
We want to have the PDF also searchable.
Currently:

PDFs are stored on the file system – they are identified by the name and path.
Akoma Ntoso XML metadata for gawati is stored in the XML database: documents are identified by the Akoma Ntoso IRI metadata, primarily FRBRExpression/FRBRthis/@value [1]. The XML document has info about its corresponding PDF document via the componentRef element [2].
[1] --

<an:identification source="#gawati">
                <an:FRBRWork>
                    <an:FRBRthis value="/akn/mr/act/1951-11-16/gn_no_214-1951/!main"/>
                    <an:FRBRuri value="/akn/mr/act/1951-11-16/gn_no_214-1951"/>
                    <an:FRBRdate name="Work Date" date="1951-11-16"/>
                    <an:FRBRauthor href="#author"/>
                    <an:FRBRcountry value="mr" showAs="Mauritania"/>
                    <an:FRBRnumber value="gn_no_214-1951" showAs="GN No. 214/1951"/>
                    <an:FRBRprescriptive value="false"/>
                    <an:FRBRauthoritative value="false"/>
                </an:FRBRWork>
                <an:FRBRExpression>
                    <an:FRBRthis value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main"/>
                    <an:FRBRuri value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@"/>
                    <an:FRBRdate name="Expression Date" date="1951-11-16"/>
                    <an:FRBRauthor href="#author"/>
                    <an:FRBRlanguage language="eng"/>
                </an:FRBRExpression>
                <an:FRBRManifestation>
                    <an:FRBRthis value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main.xml"/>
                    <an:FRBRuri value="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/.akn"/>
                    <an:FRBRdate name="Manifestation Date" date="2016-03-04"/>
                    <an:FRBRauthor href="#author"/>
                    <an:FRBRformat value="xml"/>
                </an:FRBRManifestation>
            </an:identification>

[2] –

<an:body>
            <an:book refersTo="#mainDocument">
                <an:componentRef src="/akn/mr/act/1951-11-16/gn_no_214-1951/eng@/!main.pdf" alt="akn_mr_act_1951-11-16_gn_no_214-1951_eng_main.pdf" GUID="#embedded-doc-1" showAs="Electricity (Amendment) Regulations, 1951 (Amended)"/>
            </an:book>
        </an:body>

We want to index the PDF and connect it with the already indexed Akoma Ntoso (AKN) XML metadata.
To index the PDF we have:
PDF to XML: a tool which produces a generic XML file (page by page) out of OCR-ed PDF documents, which can then be indexed and searched.

Step 1 )

Iterate through each AKN document and, for each PDF associated with it, run PDF to XML. Connect the produced XML to the Akoma Ntoso XML document by introducing the FRBRExpression/FRBRthis/@value into it, so that it can be used as metadata to connect the 2 documents. Give the produced document a coherent naming convention in line with how the Akoma Ntoso metadata XML documents are named. The produced document has to be stored in the same collection in the XML db as the AKN XML metadata documents.
Once this is done, move to Step 2.
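
The per-document part of Step 1 needs two things from each AKN file: the bridge value FRBRExpression/FRBRthis/@value and the componentRef pointing at the PDF. A sketch of pulling both out follows. The function name bridge_info is hypothetical, and elements are matched by local name so that the sketch does not depend on the exact namespace URI bound to the an: prefix.

```python
import xml.etree.ElementTree as ET


def local_name(tag: str) -> str:
    """Strip the '{namespace}' part that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def bridge_info(akn_xml: str) -> dict:
    """Return the expression IRI and the PDF reference of one AKN document."""
    info = {}
    for element in ET.fromstring(akn_xml).iter():
        name = local_name(element.tag)
        if name == "FRBRExpression":
            for child in element:
                if local_name(child.tag) == "FRBRthis":
                    info["expression_iri"] = child.get("value")
        elif name == "componentRef":
            info["pdf_src"] = element.get("src")
            info["pdf_alt"] = element.get("alt")
    return info
```

The returned expression_iri is the value to write into the produced full-text document, while pdf_src and pdf_alt locate the PDF to run PDF to XML on.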

Step 2)

Add index configurations for the produced XML documents. You will need to index the pages for full text, and the bridge metadata FRBRthis/@value with a range index (see https://exist-db.org/exist/apps/doc/indexing.xml ).

Step 3)

Create a search service that searches the full text for a particular IRI. You will need to add this service to https://github.com/gawati/gawati-data/ . You can find many existing services defined in https://github.com/gawati/gawati-data/blob/dev/services/services.xql and https://github.com/gawati/gawati-data/blob/dev/services/services-json.xql (Note: JSON and XML are just output formats in eXist; the internal format is always XML. You set the output method to json and the output media type to JSON, and the service will output JSON instead of XML).

Step 4)

Once the service is implemented, integrate it into the UI (https://github.com/gawati/gawati-portal-ui ). Implement a search on the document page, e.g. https://alldev.gawati.org/#/doc/_lang/en/_iri/akn/ng/act/2014-09-08/hb_1302471/eng@/!main , which allows searching within the document. Add a tab called "Search" after "Metadata" which provides a search box and shows the full text search results in the tab.

Deprecate gawati-data-xml

gawati-data-xml (https://github.com/gawati/gawati-data-xml) will be deprecated. Currently gawati-data-xml is an eXist app package which, while easy to set up, poses a risk of accidental deletion of data if the package is uninstalled by mistake.

Portal data will be stored in /db/docs/gawati-data with the same user ownership as gawati-data. In this way the data is disconnected from the application installation. /db/docs/gawati-data will be set up in the post-process script of gawati-data.

Upgrade to eXist 4.2

Current version used is eXist 3.4.1

We need to upgrade to eXist 4.2

For existing instances of gawati to upgrade in place, see http://exist-db.org/exist/apps/wiki/blogs/eXist//eXistdb400

services and services-json have services with identical signatures

There are these 2 services in services-json:


declare
    %rest:POST("{$json}")
    %rest:path("/gw/doc/exists")
    %rest:produces("application/json")
    %output:media-type("application/json")  
    %output:method("json")
function services-json:exists-xml($json) {
        services:exists-xml($json)
};

declare
    %rest:POST("{$json}")
    %rest:path("/gw/doc/sync")
    %rest:produces("application/json")
    %output:media-type("application/json")  
    %output:method("json")
function services-json:sync-xml($json) {
        services:sync-xml($json)
};

They have identical signatures to services in services.xql.

We cannot have 2 services with the same signature, as it corrupts the RESTXQ service stack.

Services in services-json which are just JSON equivalents of services in services.xql need paths that ALWAYS end with json to avoid such problems.

The above need to become /gw/doc/sync/json and /gw/doc/exists/json.
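
A check for such clashes could be scripted outside eXist. The sketch below is an assumption-laden helper, not part of gawati-data: it scans XQuery source text with regular expressions rather than a real parser, and treats the HTTP method plus %rest:path as the signature, ignoring %rest:produces.

```python
import re
from collections import defaultdict

# Everything between "declare" and the function name is taken as its annotations.
DECLARATION = re.compile(r"declare\b(.*?)\bfunction\s+([\w.-]+:[\w.-]+)", re.S)
METHOD = re.compile(r"%rest:(GET|POST|PUT|DELETE|HEAD|OPTIONS)\b")
PATH = re.compile(r'%rest:path\("([^"]+)"\)')


def endpoints(source: str):
    """Yield ((method, path), function name) for each RESTXQ function found."""
    for annotations, name in DECLARATION.findall(source):
        method = METHOD.search(annotations)
        path = PATH.search(annotations)
        if method and path:
            yield (method.group(1), path.group(1)), name


def duplicates(*sources: str) -> dict:
    """Map each (method, path) declared more than once to the functions using it."""
    seen = defaultdict(list)
    for source in sources:
        for key, name in endpoints(source):
            seen[key].append(name)
    return {key: names for key, names in seen.items() if len(names) > 1}
```

Calling duplicates() with the text of services.xql and services-json.xql would list every path that still needs the json suffix.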