
oracc-rest's Issues

Figure out how to stop Apache auto updating ES

The oracc2 Apache server seems to auto-update Elasticsearch, which causes issues with the ICU analysis plugin and stops the search working as intended. It would be good to find out how to prevent this so that normal running can continue as planned.

Sort suggestions and completions

At the moment, we return them in alphabetical(?) order. This means that some unexpected suggestions appear near the top of the list (especially terms starting with parentheses). It would be more useful to put the most frequently occurring terms first.

CORS problems

When trying to combine this with the Angular front-end, we have been getting failures due to browser implementations of the Same Origin Policy, based on CORS (Cross-Origin Resource Sharing). From looking into it a bit, this looks like it should be expected, as Angular and Flask are running on different ports and are therefore considered different origins.

What we need to do, in brief, is to add the appropriate metadata/headers to Flask's HTTP responses which will allow Angular to display the response data. The Flask-CORS library should do this in principle, but it doesn't seem to work for some reason. We have therefore switched to using pure Flask (no-rest branch) instead of Flask-restful, to try this in more detail (a minimal Flask-CORS sketch is included below). We have it working on @raquel-ucl's machine, but not on the Azure VMs we were testing on (it's not clear why it fails there).

Some ideas of what could be going wrong / what we should do:

  • It's possible that the Angular request is "preflighted", in which case we also need to handle the OPTIONS call that precedes the GET.
    • We should check the request sent by Angular (or received by Flask(-restful)) to see its headers; even something as simple as the Content-Type being non-standard would trigger the preflight process.
  • We might need to set up Flask-CORS more carefully, not just with the default options.
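For reference, the basic Flask-CORS setup is only a couple of lines; this is a sketch, assuming the Flask app object is called app (the catch-all resource pattern and origin below are placeholders that should be tightened in production):

```python
from flask import Flask
from flask_cors import CORS

app = Flask(__name__)

# Allow cross-origin requests on all routes. Flask-CORS also answers
# preflighted OPTIONS requests automatically, which may be exactly what
# the default setup is missing. Restrict "origins" to the Angular
# frontend's host rather than "*" in production.
CORS(app, resources={r"/*": {"origins": "*"}})
```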

Improve documentation on how to set up a local dev environment

  • The current README needs to be updated to make it clearer how to set up a local dev environment.
  • We should add more details about Multipass and document any extra steps that differ from the prod server.
  • We can have a section on running the Flask app in debug mode, but the main focus should be on getting it up and running in Multipass, as that mimics the prod environment more closely.

Write shell script for ingest

For scheduling cron jobs it would be good to have an ingest script that essentially just follows the ingest commands we run manually.

Further tests on non-ascii processing

Some remaining things that were not checked when support for non-ASCII synonyms was introduced (from #27):

  • Make sure that the preprocessing does not remove any non-ASCII characters! (e.g. that Unicode sequences are understood correctly)
  • Test that other fields like cf.sort behave as before, i.e. don't use the new analyzer

See also #18 for a description of the original task.
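A regression test along these lines would cover the first point; this is only a sketch, with process_entry standing in for whatever preprocessing function break_down.py actually exposes:

```python
from break_down import process_entry  # hypothetical import


def test_non_ascii_preserved():
    # Unicode characters such as š must survive preprocessing intact.
    entry = {"cf": "Gilgameš", "gw": "hero"}
    processed = process_entry(entry)
    assert processed["cf"] == "Gilgameš"
```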

Include "written as" in the returned information

The summary for each glossary entry contains a part showing how the word was written (abbreviated "wr."). This is particularly useful for Sumerian, but less so for other languages. For example:

wr. a₂

(to find this, go to http://build-oracc.museum.upenn.edu/neo/sux and look at any entry (or search); the "written as" part is just before the senses)

@stinney suggested this is something that would be useful to display in search results.

For Sumerian words, that part of the summary is built from information in the bases section of an entry. Glossaries for other languages are likely to lack these sections, in which case I'm not sure where the "written as" part comes from.

Update documentation for oracc2

  • The current documentation references the build-oracc server, including in the config file examples.
  • This needs to be updated to use oracc2.

Match versions of Elasticsearch

  • The version of ES used in our GitHub Actions is not the same version that is used on the Oracc server (even though the documentation says it is).
  • For clarity, we should either update ES or update the documentation to reflect the current state of things.
  • Possibly linked to #28.

Serve over HTTPS

Now that the frontend is running over HTTPS, the backend also needs to.
We can either do that in the Flask app (if we know the location of the relevant keys), or configure the backend to be served through Apache on the server.
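If we go the Flask route, a minimal sketch looks like this (the certificate and key paths are placeholders for wherever the server keeps them):

```python
from flask import Flask

app = Flask(__name__)  # stand-in for the real app object

if __name__ == "__main__":
    # Werkzeug's built-in server accepts an ssl_context of
    # (certificate, private key) file paths.
    app.run(host="0.0.0.0", port=5000,
            ssl_context=("/path/to/fullchain.pem", "/path/to/privkey.pem"))
```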

Location of wsgi file

  • At the moment the oraccflask.wsgi file lives on the Ubuntu Oracc server in a /var/www directory.
  • For ease of use, we should look into whether it is possible to store this file alongside the rest of the Flask project files.
  • To test this, we will need to update the Apache config that references the WSGI file and move the file to where the rest of the Flask files live.

Improve documentation for dev environment

  • It should be possible to set up a dev environment with a local version of some data (or at least some fake data).
  • At the moment there is little documentation describing how to get started with Flask API development, and no instructions for setting up a dev environment.
  • I think the API currently only works on the server, although I could be wrong about that, so it needs looking into.

Update ports for local and production version

  • Now that we have a local version of the app available, we need to expose the appropriate port so that the Angular application can talk to the API both locally and in production.
  • If we update the port that the API runs on, we need to make sure that the production version also works with this port, and update all of the documentation.

Create some tests

Copied from oracc/elastic-search-poc#5:

@raquel-ucl commented:

We can have a few tests that run ingestions and queries in Travis:
https://docs.travis-ci.com/user/database-setup/#ElasticSearch

Things we can test (to be expanded):

  • Glossary data is broken down correctly from the JSON
  • Data is uploaded correctly to ES and is accessible (partly done in #20)
  • The Flask server returns expected results, with the correct headers for CORS.

This may need a little refactoring and/or patching so we can run test uploads/searches on a separate index. I'm not sure whether something like ElasticMock is useful to us, but possibly worth looking at as an alternative.
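As a starting point for the separate-index idea, here is a sketch of a pytest fixture against a throwaway index (the index name and document fields are assumptions):

```python
import pytest
from elasticsearch import Elasticsearch

TEST_INDEX = "oracc-test"  # hypothetical throwaway index


@pytest.fixture
def es():
    client = Elasticsearch()
    yield client
    # Clean up the test index whether or not the test created it.
    client.indices.delete(index=TEST_INDEX, ignore=[404])


def test_upload_and_search(es):
    es.index(index=TEST_INDEX, body={"cf": "lugal", "gw": "king"})
    es.indices.refresh(index=TEST_INDEX)  # make the document searchable
    result = es.search(index=TEST_INDEX,
                       body={"query": {"match": {"gw": "king"}}})
    assert result["hits"]["total"]["value"] == 1
```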

Incompatibilities with Elasticsearch 7

ES6 is still supported, but it's harder to install (e.g. it can't easily be installed through Homebrew), so we should update the code for ES7. From memory, the main issue is the creation of the index, which no longer accepts a type; see the sketch after the subtasks.

Subtasks:

  • Check availability and installation of ES7
  • Read in data successfully
  • Query data successfully
  • Fix deprecation warnings
  • Update on Oracc server
  • Update README to mention correct version
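Sketch of the ES7-style calls (the index name and field mappings are illustrative):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# ES7 removed mapping types: "properties" now sits directly under
# "mappings" rather than under a named type.
es.indices.create(index="oracc", body={
    "mappings": {
        "properties": {
            "cf": {"type": "text"},
            "gw": {"type": "text"},
        }
    }
})

# Indexing likewise drops the doc_type argument.
es.index(index="oracc", body={"cf": "lugal", "gw": "king"})
```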

Ingestion of large files

Very large glossaries may cause the ingestion to fail by running out of memory. This seems to happen during loading of the JSON file.

There are several alternatives to consider:

  • Using the read_json method from Pandas to read the file in chunks (example, docs), but this fails with a cryptic error (possibly because the glossary file is not line-delimited JSON, i.e. not every line is a complete JSON object).
  • Using ijson to read the file iteratively.
  • Using a more efficient JSON parser, such as yajl-py, NANA or UltraJSON.
  • Preprocessing the file to remove any sections that we do not use (such as the summaries), perhaps by a JSON processor like jq.
  • Modifying the ingestion so that we do it in two passes. In more detail:
    • Remove the instances and summaries from the file (as above). Keep the instances in a separate file.
    • Change the ingestion (break_down.py) so that we don't include the instances field in each new entry, but keep track of the xis for each entry (so we know the instance to which it will need to be linked).
    • Index this data into ElasticSearch.
    • In a second phase, load the instances, and use the Update API or Update by Query (python version here) to add the new field to the existing entries.

The last option will involve the most changes. Using a different JSON reader will be easier, but will still require some changes to create and write the new entries one by one rather than in bulk, although nothing major.
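For the ijson option, the iterative read is only a few lines; a sketch, assuming the glossary has a top-level "entries" array (the filename and JSON path would need adjusting to the real layout):

```python
import ijson


def iter_entries(path):
    # Stream entries one at a time instead of loading the whole
    # glossary into memory; "entries.item" is the assumed JSON path.
    with open(path, "rb") as f:
        for entry in ijson.items(f, "entries.item"):
            yield entry


for entry in iter_entries("gloss-sux.json"):
    print(entry.get("cf"))
```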

Return all results

The searches as written don't always return all the results. This is because of the default sizes returned by ElasticSearch (10 results, 3 inner hits). We can override the first by using the scan interface, and so get more than 10 "outer" results (documents, i.e. glossaries), but there is no equivalent scanning process for inner hits.

Alternatively, we can specify explicit sizes for outer and inner results, which is fragile (depends on the size of the database), but could be determined after an initial counting query.

The best solution is to change how we ingest the data, and have one document per entry (rather than one document per glossary), avoiding the nesting altogether (oracc/website#16).
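For reference, the scan helper from the Python client pages through all outer hits transparently (the index and field names below are assumptions):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch()

# scan() repeatedly scrolls the query so that every matching document
# is returned, lifting the default cap of 10 outer results. Inner hits
# within each document are still truncated.
query = {"query": {"match": {"gw": "water"}}}
for hit in scan(es, query=query, index="oracc"):
    print(hit["_source"])
```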

Change multi-word querying behaviour to AND

If a user searches for "a b", the results will match a or b. This was the original Elasticsearch behaviour and was preserved after #20. It would be better to only return results that match both a and b, so that the search matches the usual behaviour, where adding more terms narrows down the set of results.

This is very simple (a sketch of the query change follows the list):

  • Change the query
  • Update the tests to match
  • Update the readme
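The change amounts to setting the match query's operator (the field name here is an assumption):

```python
# Switching the operator from the default "or" to "and" makes a
# multi-word query match only entries containing every term.
query = {
    "query": {
        "match": {
            "gw": {
                "query": "water skin",
                "operator": "and",
            }
        }
    }
}
```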

Support partial matching

It looks like we are only matching on full words. For example, searching for Gil will only return results that have exactly Gil in one of the searched fields.
Instead, we would like to return entries containing e.g. Gilgamesh. There should be an ElasticSearch option or different query type for this.
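One candidate, as a sketch (field name assumed): a wildcard query, which matches terms containing the search string anywhere, at the cost of slower queries than an edge-n-gram analyzer but without reindexing:

```python
# Matches any entry whose "cf" field contains a term with "gil" in it,
# e.g. "gilgamesh". Leading wildcards are slow on large indices; an
# n-gram analyzer would be the more scalable alternative.
query = {"query": {"wildcard": {"cf": "*gil*"}}}
```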

Support ASCII representations of non-ASCII characters

The old Oracc search let you use ASCII synonyms for some non-ASCII characters, e.g. Gilgamesz instead of Gilgameš. We should modify the search so that it allows these substitutions. This could be done in at least two ways:

  • Given a query string, generate all possible Unicode strings it can represent (according to these substitution rules), and return results for all of them; we may be able to run a single ElasticSearch query for all the variants, otherwise we should run multiple queries and combine all the results.
  • Modify ElasticSearch (I think the analyzer?) so that it recognises, for example, sz and š as equivalent.

The second seems more robust and generalisable for future extensions, but it may or may not be simple...
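The second option maps onto Elasticsearch's mapping character filter; a sketch of the index settings (the analyzer name and the exact substitution list are illustrative):

```python
settings = {
    "analysis": {
        "char_filter": {
            "ascii_substitutions": {
                # Fold ASCII digraphs into their Unicode equivalents at
                # both index and search time, so that "Gilgamesz" and
                # "Gilgameš" analyze to the same tokens.
                "type": "mapping",
                "mappings": ["sz => š", "SZ => Š"],
            }
        },
        "analyzer": {
            "oracc_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "char_filter": ["ascii_substitutions"],
            }
        },
    }
}
```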

Ingest catalogue data

Period information is already in the glossaries ES DB, but location and genre (and other possibly interesting bits like sub/super genre) are not linked to individual glossary entries. Those are in the catalogue.json file Steve sent. We need to:

  • Check all entries in our ES glossary are linked to a P-object in catalogue.json.
  • Add the instance IDs to the ES glossary entries.
  • Add the catalogue "members" (i.e. P-objects).
  • Change the search_all endpoint to also return genre and location (and maybe other things) by joining catalogue entries and glossary entries.

Support partial return of results

We need to be able to retrieve results lazily, i.e. a "page" at a time, to avoid long loading times. See also oracc/oracc-search-front-end#11.

ElasticSearch offers several options for getting paged results:

  • The from and size fields; this has a limit (by default 10000, which we may exceed) past which it becomes disallowed, or at least very inefficient.
  • Scrolling; this is not meant to be used for real-time requests, but rather for processing large amounts of data on the back-end.
  • The search_after option seems like the best solution. It requires keeping track of the last result (glossary entry), but does not look too complex to implement. Performance is not clear, as it seems that the search is repeated each time, but that might also be true of the other options.

Regardless of the choice, we will also need to extend the search endpoints to accept a field on which to sort.
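A sketch of the search_after flow, sorting on cf.sort (which may need a unique tiebreaker field added):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "size": 100,
    # search_after needs a deterministic sort; cf.sort alone may have
    # ties, in which case a second, unique sort field should be added.
    "sort": [{"cf.sort": "asc"}],
    "query": {"match_all": {}},
}

page = es.search(index="oracc", body=body)
hits = page["hits"]["hits"]
while hits:
    for hit in hits:
        print(hit["_source"])
    # Resume from the sort values of the last result on this page.
    body["search_after"] = hits[-1]["sort"]
    page = es.search(index="oracc", body=body)
    hits = page["hits"]["hits"]
```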

Improve ingest logging

  • Steve would like us to keep permanent logs for the Elasticsearch ingest operation, so we can keep track of when the ingest happened and what the result was (a minimal sketch follows).
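A minimal sketch with the standard logging module (the log path is a placeholder):

```python
import logging

# Append timestamped records to a permanent log file, assuming
# /var/log/oracc exists and is writable by the ingest user.
logging.basicConfig(
    filename="/var/log/oracc/ingest.log",
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)

logging.info("Ingest started")
# ... run the ingest itself here ...
logging.info("Ingest finished")
```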

Stop the API running in dev mode

  • According to the documentation, the Flask API is currently run in development mode with flask run & on the server.
  • Ideally, the API should not be constantly running in dev mode, as this makes it prone to crashing.
  • Look into how to deploy the API in a production setup instead (a sketch follows).
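Given that an oraccflask.wsgi file already exists, serving through Apache/mod_wsgi is the obvious production setup; here is a minimal sketch of such a file, assuming the Flask app object is named app in a module called app (both names, and the path, are assumptions):

```python
# oraccflask.wsgi
import sys

# Make the project importable; the path is a placeholder.
sys.path.insert(0, "/path/to/oracc-rest")

# mod_wsgi looks for a callable named `application`.
from app import app as application
```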

Run local version of the server in dev mode

  • We are currently testing the app functionality using the deployed version of the API.
  • It would be better to have a completely local version of the API running so that we can test it more easily.
  • This would also require having some dummy data available.
  • The local version could be served from e.g. a Multipass instance.

Set up CI

Travis should be enough. Hopefully Elasticsearch is not too tedious to configure.

Update README

  • There are instances of outdated information in the README:
  • The guidance for querying the API directly on the server (line 122 of the README) should be curl -k https://localhost:5000/search/water-skin.
  • This is to reflect that we are now using HTTPS.
