reynoldsnlp / flair
Fork of the FLAIR project at Tuebingen University
License: Other
The view project uses the seco lookup tool. The Russian model can come from my udar project.
Maybe start with this file to get an idea of how it is used: https://gitlab.com/view/core/blob/master/app/src/main/java/werti/util/HFSTAnalyser.java
In general, anything with HFST or CG3 in the name might be relevant. HFST is the tool that does the lookup. A given token could have more than one possible grammatical analysis (reading), so we use a Constraint Grammar (CG3) to eliminate readings based on context.
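The idea of several readings per token, with context used to discard some of them, can be illustrated with a toy sketch. This is not the HFST or CG3 API: the class name, the `lemma+Tag+Tag` reading format, the tag names `Pr` and `V`, and the single rule are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration of Constraint Grammar-style disambiguation (not the CG3 API). */
final class ReadingFilter {

    /**
     * Applies one hypothetical rule: drop verb readings from a token when the
     * previous token is unambiguously a preposition. Each token is a list of
     * readings written as "lemma+Tag+Tag". A token's last reading is never removed.
     */
    static List<List<String>> removeVerbAfterPreposition(List<List<String>> sentence) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < sentence.size(); i++) {
            List<String> readings = new ArrayList<>(sentence.get(i));
            List<String> previous = i > 0 ? sentence.get(i - 1) : List.of();
            boolean afterPreposition = !previous.isEmpty()
                    && previous.stream().allMatch(r -> hasTag(r, "Pr"));
            if (afterPreposition) {
                List<String> kept = new ArrayList<>();
                for (String reading : readings) {
                    if (!hasTag(reading, "V")) {
                        kept.add(reading);
                    }
                }
                // Keep the original readings if the rule would remove all of them.
                if (!kept.isEmpty()) {
                    readings = kept;
                }
            }
            result.add(readings);
        }
        return result;
    }

    /** True if the reading carries the given tag (the lemma itself is skipped). */
    private static boolean hasTag(String reading, String tag) {
        String[] parts = reading.split("\\+");
        for (int i = 1; i < parts.length; i++) {
            if (parts[i].equals(tag)) {
                return true;
            }
        }
        return false;
    }
}
```

For example, in "из стали" ("from steel"), the token "стали" could be analysed either as a form of the noun "сталь" or as a past-tense form of the verb "стать"; after the preposition "из", a rule of this kind would keep only the noun reading.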
After loading flair, the problem can be found by:
We need to decide which grammatical constructions the project should recognize and report.
It seems that the directory src/main/webapp/META-INF is extraneous. If that is correct and removing it causes no problems, it should be removed.
This may be planned maintenance or something, but when I run a search in any language I get the following errors.
Failed to load resource: the server responded with a status of 404 ()
Failed to load resource: the server responded with a status of 500 ()
SEVERE: Couldn't begin web search operation. Exception: com.google.gwt.user.client.rpc.StatusCodeException: 500 The call failed on the server; see server log for details
Obb @ flair-0.js:3041
Hhc @ flair-0.js:2862
Ghc @ flair-0.js:2070
Ihc @ flair-0.js:1836
Tz @ flair-0.js:2844
Pz @ flair-0.js:1639
Tl @ flair-0.js:3040
Wc @ flair-0.js:3035
rib @ flair-0.js:2974
LS @ flair-0.js:2259
YS @ flair-0.js:3041
(anonymous) @ flair-0.js:2110
xJ @ flair-0.js:1445
AJ @ flair-0.js:2750
(anonymous) @ flair-0.js:2100
We need automated testing. Let's use this thread to discuss/plan/prioritize.
This was a localization issue.
The files in src/main/webapp/WEB-INF/deploy/flair/symbolMaps/* are static files, checked in to the repository. They should probably NOT be in the repo, but should be dynamically generated.
Also, mvn clean should probably remove them.
On the branch russian2, in the file src/main/java/com/flair/client/localization/resources/strings-en-constructions.tsv, lines 688 through 741 need user-facing info about conjugation classes.
_gram-name_ need a short name describing the conjugation class
_gram-helpText_ need a short description and/or examples
These values should be placed in the 3rd column (after the 2nd \t character)
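Filling in the 3rd column could be sketched as follows. This is a minimal sketch that assumes each row has at least two tab-separated columns before the value; the class and method names are hypothetical, and nothing here is taken from the FLAIR code base.

```java
/** Hypothetical helper for editing one row of a tab-separated localization file. */
final class TsvColumns {

    /** Returns the row with its 3rd tab-separated column set to the given value. */
    static String setThirdColumn(String line, String value) {
        // A limit of -1 keeps trailing empty columns, so "a\tb\t" has three columns.
        String[] columns = line.split("\t", -1);
        if (columns.length < 2) {
            throw new IllegalArgumentException("expected at least two columns: " + line);
        }
        if (columns.length == 2) {
            // The row ends after the 2nd column: append the 2nd tab and the value.
            return line + "\t" + value;
        }
        columns[2] = value;
        return String.join("\t", columns);
    }
}
```

Any columns after the 3rd are left untouched, so the same helper works whether or not a row has extra trailing fields.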
In other words, remove all absolute paths from code. This includes reading/writing weka models, reading/writing temporary MADAMIRA files, etc.
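One way to avoid absolute paths is to resolve every file against a single configurable base directory. The sketch below assumes a system property named `flair.data.dir` and a fallback to the JVM temporary directory; both are hypothetical choices for illustration, not existing FLAIR settings.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper that resolves relative paths against one configurable directory. */
final class ResourcePaths {

    /** Assumed property name; not taken from the FLAIR code base. */
    static final String BASE_DIR_PROPERTY = "flair.data.dir";

    /** Returns the configured base directory, or the JVM temp directory if unset. */
    static Path baseDir() {
        String configured = System.getProperty(BASE_DIR_PROPERTY);
        String base = (configured != null && !configured.isEmpty())
                ? configured
                : System.getProperty("java.io.tmpdir");
        return Paths.get(base).toAbsolutePath().normalize();
    }

    /** Resolves a relative path under the base directory and rejects anything outside it. */
    static Path resolve(String relative) {
        Path base = baseDir();
        Path resolved = base.resolve(relative).normalize();
        if (!resolved.startsWith(base)) {
            throw new IllegalArgumentException("path escapes base directory: " + relative);
        }
        return resolved;
    }
}
```

With this in place, a model file would be addressed as something like `ResourcePaths.resolve("models/example.model")`, and the deployment decides where the base directory lives.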
webapps/, so this location should be avoided for writing.
...to at least 1.20
One container for MADAMIRA and one for FLAIR.
We need to use a list of verbs that can introduce indirect speech.
For example, "скажу ему, когда ты придешь." ("I will tell him when you arrive.") should be recognized as not being a question.
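A rough sketch of such a check, assuming the sentence has already been lemmatized: a question word that follows a reporting verb is treated as introducing an embedded clause rather than a direct question. The class name and both word lists are illustrative and deliberately incomplete; a real list would need linguistic review.

```java
import java.util.List;
import java.util.Set;

/** Hypothetical heuristic for spotting question words inside indirect speech. */
final class IndirectSpeech {

    // Illustrative lemma lists only; neither is complete.
    static final Set<String> REPORTING_VERBS =
            Set.of("сказать", "спросить", "рассказать", "объяснить", "сообщить");
    static final Set<String> QUESTION_WORDS =
            Set.of("когда", "где", "кто", "почему", "зачем");

    /** True if a question word occurs after a reporting verb in the lemma sequence. */
    static boolean hasEmbeddedQuestionWord(List<String> lemmas) {
        boolean seenReportingVerb = false;
        for (String lemma : lemmas) {
            if (REPORTING_VERBS.contains(lemma)) {
                seenReportingVerb = true;
            } else if (seenReportingVerb && QUESTION_WORDS.contains(lemma)) {
                return true;
            }
        }
        return false;
    }
}
```

For the example sentence, the lemma "сказать" precedes "когда", so the check returns true and the sentence would not be counted as a question. A sentence that simply begins with "когда" has no preceding reporting verb and is left alone.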
We need some way to recognize which conjugation class each verb belongs to, as well as which declension class each noun belongs to.
Google Form or email
The Arabic model has 4 readability levels (1, 2, 3 and 4), and they are not based on CEFR levels (A1-C2), so the labels on the webpage are misleading.
I saw these on corpora-list and thought they could come in handy for us.
I don't understand exactly what you mean by basic vocabulary used for defining words, but if you mean roots from which Arabic words can be generated, here are two dictionaries. In our team we have two Arabic dictionaries available in LMF format:
Contemporary Arabic dictionary with 32300 lexical entries generated from 5778 roots
"Al wassit" Arabic dictionary with 61101 lexical entries generated from 6900 roots
Both can be freely downloaded from http://arabic.emi.ac.ma/alelm/?q=Resources
You can also take a look at the second release of Arabic wordnet from http://globalwordnet.org/arabic-wordnet
best
karim
Russian constructions need names, paths, and help texts for German localization of the UI.
These localizations should be edited in these files:
src/main/java/com/flair/client/localization/resources/strings-de-constructions.tsv
src/main/java/com/flair/client/localization/resources/strings-de-general.tsv
The original English localizations can be seen in these files:
src/main/java/com/flair/client/localization/resources/strings-en-constructions.tsv
src/main/java/com/flair/client/localization/resources/strings-en-general.tsv
Searched for الحور الرجراج with 40 results requested, and they were processed very slowly. See catalina.out starting at about 25/04/2019 16:19:41.
What will it take to daemonize this?
One approach to this sort of thing is Apache UIMA, but this would take a huge refactor.
Currently, data is sent to the MADAMIRA server using a temporary file. We should try to find a way to pass it directly without all of the disk I/O.
When there are more search results than can be seen on the page, it is unclear that FLAIR is still processing. It would be beneficial to have some kind of notice saying that it's in the middle of processing.
Figure out why, and maybe we need to integrate into systemd, i.e. turn it into a service.
Should there be a specific sub-category of conditionals for unreal situations? For example, the English sentence "If you had been here, it would've been better" is an unreal conditional. Should that be a separate construction (with accompanying weight slider) from just "conditionals"?
It would be nice if it were possible to request the next x search results. For example, if I request 10 sites, and none of them are quite what I'm looking for, it would be nice to just look for the next 10 instead of having to request 20 and reprocess the 10 I already know aren't going to work.
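If the underlying search backend accepts a starting offset (an assumption; the backend is not described here), requesting the next x results only requires remembering how many have already been fetched. The `ResultCursor` class below is a hypothetical sketch of that bookkeeping.

```java
/** Hypothetical cursor that hands out consecutive, non-overlapping result ranges. */
final class ResultCursor {

    private int nextOffset = 0;

    /** Returns {offset, count} for the next request and advances the cursor. */
    int[] nextPage(int count) {
        if (count <= 0) {
            throw new IllegalArgumentException("count must be positive");
        }
        int[] page = {nextOffset, count};
        nextOffset += count;
        return page;
    }
}
```

The first call to `nextPage(10)` describes results 0 through 9 and the second describes results 10 through 19, so documents that were already parsed do not have to be processed again.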
When an individual server instance has to use more than three parser models, the server throws an OutOfMemoryError. On the user's end, all they see is a web page that is stuck loading documents. There should be a way for us to gracefully handle such situations. Ideally we would find a way to work around the high memory needs of this project.
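One possible workaround is to cap the number of parser models held in memory and evict the least recently used one when the cap is exceeded. The sketch below uses the access-order mode of `java.util.LinkedHashMap`; the class name, the loader function, and the idea of keying models by language are assumptions, not a description of how FLAIR currently loads models.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache that keeps at most a fixed number of models loaded. */
final class BoundedModelCache<K, M> {

    private final int capacity;
    private final Function<K, M> loader;
    private final LinkedHashMap<K, M> models;

    BoundedModelCache(int capacity, Function<K, M> loader) {
        if (capacity < 1) {
            throw new IllegalArgumentException("capacity must be at least 1");
        }
        this.capacity = capacity;
        this.loader = loader;
        // accessOrder = true makes iteration order least recently used first.
        this.models = new LinkedHashMap<K, M>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, M> eldest) {
                return size() > BoundedModelCache.this.capacity;
            }
        };
    }

    /** Returns the cached model for the key, loading it on first use. */
    synchronized M get(K key) {
        M model = models.get(key);
        if (model == null) {
            model = loader.apply(key);
            models.put(key, model); // may evict the least recently used model
        }
        return model;
    }

    /** True if a model for the key is currently held in the cache. */
    synchronized boolean isLoaded(K key) {
        return models.containsKey(key);
    }
}
```

Eviction only drops the cache's reference; memory is reclaimed once nothing else still holds the evicted model. The trade-off is that an evicted model has to be loaded again the next time its language is requested.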
When the cg-conv utility is run from src/main/java/com/flair/server/utilities/CgConv.java, certain inputs cause it to hang, then time out as programmed.
One such input for Russian is the content of this site, which can be found by searching говорить in Russian with curated domains; it is the 3rd result (as of June 18, 2020).
See above
parser does work from command line.
Is the change of name to "Foreign Language Acquisition Information Retrieval" intentional or a typo? The original project is called "Form-Focused Linguistically Aware Information Retrieval".