Sources for Oracc.
Please refer to this setup guide for setting it up locally and this setup guide for setting it up on a server.
The oracc2 Apache server seems to auto-update Elasticsearch, which causes issues with the ICU Analysis plugin and stops the search working as intended. It would be good to find out how to disable this so normal running can continue as planned.
Currently everything is ingested as a string in the produced JSON. The `icount` field should be numeric, but we may also want to convert other fields.
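A minimal sketch of what the conversion could look like during ingestion. Only `icount` is named in the issue; the helper name and the entry layout are assumptions about how the ingest code is structured.

```python
# Hypothetical helper: coerce selected string fields of a glossary
# entry to numbers before uploading to Elasticsearch.
NUMERIC_FIELDS = ["icount"]  # extend with other fields to convert

def coerce_types(entry):
    """Return the entry with numeric fields converted from str to int."""
    for field in NUMERIC_FIELDS:
        if field in entry and isinstance(entry[field], str):
            entry[field] = int(entry[field])
    return entry

entry = coerce_types({"cf": "water", "icount": "42"})
```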
To be used for the "information page search" mode in the website (oracc/oracc-search-front-end#5).
At the moment, we return them in alphabetic(?) order. This means that some unexpected suggestions come near the top of the list (especially terms starting with parentheses). It would be more useful to have the most frequently-occurring terms first.
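A sketch of frequency-first ordering, assuming each suggestion carries an occurrence count (e.g. derived from `icount`); the real data structure may differ. Ties fall back to alphabetical order, which also pushes parenthesised terms down the list.

```python
def rank_suggestions(suggestions):
    """Order (term, frequency) suggestion pairs: most frequent first,
    alphabetical as a tie-break. The pair layout is an assumption."""
    return sorted(suggestions, key=lambda s: (-s[1], s[0]))

ranked = rank_suggestions([("(w)abalu", 2), ("water", 40), ("wind", 40)])
```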
When trying to combine this with the Angular front-end, we have been getting failures due to browser implementations of the Same Origin Policy, based on CORS (Cross-Origin Resource Sharing). From looking into it a bit, this looks like it should be expected, as Angular and Flask are running on different ports and are therefore considered different origins.
What we need to do, in brief, is add the appropriate metadata/headers to Flask's HTTP responses which will allow Angular to display the response data. The Flask-CORS library should do this in principle, but it doesn't seem to work for some reason. We have therefore switched to using pure Flask (`no-rest` branch) instead of Flask-RESTful, to investigate this in more detail. We have it working on @raquel-ucl's machine, but not on the Azure VMs we were testing (it's not clear why it's not working there).
Some ideas of what could be going wrong / what we should do:

- check the `OPTIONS` call that precedes the `GET`
- a non-standard `Content-Type` would trigger the preflight process

For scheduling cron jobs, it would be good to have an ingest script which essentially just follows the ingest commands we run manually.
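Whatever Flask-CORS is failing to do, the core of a CORS fix is attaching a few response headers. A framework-agnostic sketch of that logic, so it can be checked independently of Flask; the Angular dev-server origin and the helper name are assumptions.

```python
def add_cors_headers(headers, origin="http://localhost:4200"):
    """Attach the headers a browser needs to accept a cross-origin
    response (a sketch of what Flask-CORS would normally do for us).
    The default origin assumes a local Angular dev server."""
    headers["Access-Control-Allow-Origin"] = origin
    # Needed when a non-standard Content-Type triggers a preflight OPTIONS:
    headers["Access-Control-Allow-Headers"] = "Content-Type"
    headers["Access-Control-Allow-Methods"] = "GET, OPTIONS"
    return headers

resp_headers = add_cors_headers({"Content-Type": "application/json"})
```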
Some remaining things that were not checked when support for non-ASCII synonyms was introduced (from #27):

- make `cf.sort` behave as before, i.e. don't use the new analyzer

See also #18 for a description of the original task.
The summary for each glossary entry contains a part showing how the word was written (abbreviated "wr."). This is particularly useful for Sumerian, but less so for other languages. For example:
wr. a₂
(to find this, go to http://build-oracc.museum.upenn.edu/neo/sux and look at any entry (or search); the "written as" part is just before the senses)
@stinney suggested this is something that would be useful to display in search results.
For Sumerian words, that part of the summary is built from information in the `bases` section of an entry. Glossaries for other languages are likely to lack these sections, in which case I'm not sure where the "written as" part comes from.
Some references still point to the `build-oracc` server, including the config file examples, and should be updated to `oracc2`; the same goes for the `/test` route.

Now that the frontend is running over HTTPS, the backend also needs to be.
We can either do that in the Flask app (if we know the location of the relevant keys), or configure the backend to be served through Apache on the server.
Production and development servers should run the contents of the master and development branch, respectively.
Copied from oracc/elastic-search-poc#5:
@raquel-ucl commented:
We can have a few tests that run ingestions and queries in travis:
https://docs.travis-ci.com/user/database-setup/#ElasticSearch
Things we can test (to be expanded):
This may need a little refactoring and/or patching so we can run test uploads/searches on a separate index. I'm not sure whether something like ElasticMock is useful to us, but possibly worth looking at as an alternative.
ES6 is still supported, but it's harder to install (e.g. it can't easily be installed through Homebrew), so we should update the code for ES7. From memory, the main issue is the creation of the index, which no longer accepts a type.
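A sketch of the mapping-body change: in ES6 mappings were nested under a document type name, while ES7 drops that level. The index name, type name, and field below are illustrative assumptions, not the project's actual mapping.

```python
# ES6-style index body: mappings nested under a type name ("glossary"
# here is hypothetical).
es6_body = {
    "mappings": {
        "glossary": {"properties": {"icount": {"type": "integer"}}}
    }
}

# ES7-style index body: the type level is gone.
es7_body = {
    "mappings": {"properties": {"icount": {"type": "integer"}}}
}

# With elasticsearch-py 7.x this would be passed as, e.g.:
#     es.indices.create(index="oracc", body=es7_body)
```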
Subtasks:
Very large glossaries may cause the ingestion to fail by running out of memory. This seems to happen during loading of the JSON file.
There are several alternatives to consider:
- use the `read_json` method from Pandas to read the file in chunks (example, docs), but this fails with a cryptic error (possibly because not every line in the glossary file is a complete JSON object)
- change the splitting script (`break_down.py`) so that we don't include the `instances` field in each new entry, but keep track of the `xis` for each entry (so we know the instance to which it will need to be linked)

The last option will involve the most changes. Using a different JSON reader will be easier, but will still require some changes to create and write the new entries one-by-one rather than in bulk, although nothing major.
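As a sketch of the "different JSON reader" route: if the glossary can be split into a stream of concatenated JSON objects, the standard library's `raw_decode` can yield them one at a time instead of loading everything at once. This is an assumption about the file layout; for a single huge JSON document a streaming parser such as ijson would be needed instead.

```python
import json

def iter_json_objects(text):
    """Yield successive JSON objects from a string containing a
    stream of concatenated objects, without parsing them all up front."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # Skip whitespace between objects.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

entries = list(iter_json_objects('{"cf": "a"} {"cf": "b"}'))
```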
The searches as written don't always return all the results. This is because of the default sizes used by ElasticSearch (10 results, 3 inner hits). We can override the first by using the `scan` interface, and so get more than 10 "outer" results (documents, i.e. glossaries), but there is no equivalent scanning process for inner hits.
Alternatively, we can specify explicit sizes for outer and inner results, which is fragile (depends on the size of the database), but could be determined after an initial counting query.
The best solution is to change how we ingest the data, and have one document per entry (rather than one document per glossary), avoiding the nesting altogether (oracc/website#16).
If a user searches for "a b", the results will match a or b. This was the original Elasticsearch behaviour and was preserved after #20. It would be better to only return results that match both a and b, so that the search matches usual behaviour and giving more terms narrows down the set of results.
This is very simple:
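One way to do this (an assumption about the fix, since the original detail is missing here) is to set the match query's `operator` to `and`, so every term must match. The field name below is a placeholder.

```python
# A match query requiring *all* terms: "water skin" must match both
# "water" and "skin". The field name "senses" is an assumption.
query = {
    "query": {
        "match": {
            "senses": {
                "query": "water skin",
                "operator": "and",
            }
        }
    }
}
```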
It looks like we are only matching on full words. For example, searching for `Gil` will only return results that have exactly `Gil` in one of the searched fields. Instead, we would like to return entries containing e.g. `Gilgamesh`. There should be an ElasticSearch option or different query type for this.
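One candidate query type is `wildcard` (sketched below against an assumed field name). A leading wildcard is slow on large indices, so an edge-ngram analyzer at ingestion time would be a more efficient long-term alternative.

```python
# A wildcard query so that "Gil" also matches "Gilgamesh".
# The field name "cf" is an assumption.
query = {
    "query": {
        "wildcard": {
            "cf": {"value": "*Gil*"}
        }
    }
}
```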
The old Oracc search let you use ASCII synonyms for some non-ASCII characters, e.g. `Gilgamesz` instead of `Gilgameš`. We should modify the search so that it allows these substitutions. This could be done in at least two ways:

- replace the ASCII sequences in the query string before searching
- configure the search (e.g. through a custom analyzer) to treat `sz` and `š` as equivalent

The second seems more robust and generalisable for future extensions, but it may or may not be simple...
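The first option is simple enough to sketch directly. Only the `sz`/`š` pair comes from this issue; any further Oracc ASCII conventions would need to be added to the table.

```python
# Query-side preprocessing: rewrite ASCII stand-ins to their
# non-ASCII equivalents before sending the query to Elasticsearch.
ASCII_SYNONYMS = {
    "sz": "š",  # from the issue; extend with other conventions as needed
}

def normalise_query(text):
    """Apply each ASCII-to-Unicode substitution to the query string."""
    for ascii_seq, char in ASCII_SYNONYMS.items():
        text = text.replace(ascii_seq, char)
    return text

normalised = normalise_query("Gilgamesz")
```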
Swagger can help make the API design more robust. It lets us try out the API, see whether we are accepting any HTTP requests we shouldn't, and auto-generate documentation.
Useful links:
Period information is already in the glossaries ES DB, but location and genre (and other possibly interesting bits like sub/super genre) are not linked to individual glossary entries. Those are in the `catalogue.json` file Steve sent. We need to:

- ingest `catalogue.json`
- extend the `search_all` endpoint to also return genre and location (and maybe other things) by joining catalogue entries and glossary entries

We need to be able to retrieve results lazily, i.e. a "page" at a time, to avoid long loading times. See also oracc/oracc-search-front-end#11.
ElasticSearch offers several options for getting paged results:

- the `from` and `size` fields; this has a limit (by default 10000, which we may exceed) past which it becomes disallowed, or at least very inefficient
- the `search_after` option, which seems like the best solution. It requires keeping track of the last result (glossary entry), but does not look too complex to implement. Performance is not clear, as it seems that the search is repeated each time, but that might also be true of the other options.

Regardless of the choice, we will also need to extend the search endpoints to accept a field on which to sort.
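A sketch of building `search_after` page requests: sort on a stable field and pass the sort values of the previous page's last hit. Sorting on `cf.sort` is an assumption (the field appears elsewhere in these notes); in practice a unique tie-break field should be added to the sort to keep pages stable.

```python
def next_page_query(last_sort_values=None, size=25):
    """Build the body for one page of results. Pass the `sort` values
    of the last hit from the previous page to get the next page."""
    body = {
        "size": size,
        "query": {"match_all": {}},
        "sort": [{"cf.sort": "asc"}],  # assumed sort field
    }
    if last_sort_values is not None:
        body["search_after"] = last_sort_values
    return body

first = next_page_query()
second = next_page_query(last_sort_values=["water"])
```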
ElasticSearch provides some functionality we could use to present users with suggestions when they mistype words. Have a look at Suggesters and Fuzzy Queries. Note those two link to ES 7.1, and I'm not sure which ES version we are using right now :)
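For the fuzzy-query route, a minimal sketch: `"fuzziness": "AUTO"` scales the allowed edit distance with term length, so a mistyped `Gilgamsh` can still find `Gilgamesh`. The field name is an assumption.

```python
# A fuzzy match query tolerating small typos in the search term.
# The field name "cf" is an assumption.
query = {
    "query": {
        "match": {
            "cf": {
                "query": "Gilgamsh",
                "fuzziness": "AUTO",
            }
        }
    }
}
```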
- run `flask run &` on the server
- test with `curl -k https://localhost:5000/search/water-skin`

Travis should be enough. Hopefully Elasticsearch is not too tedious to configure.