montrealcorpustools / polyglotdb
Language data store and linguistic query API
License: MIT License
So, when I tried to import the Spade-ICE-Can corpus, I got an out-of-memory error even though I had about half a gigabyte of RAM left.
ps-worker | [2018-07-09 18:28:20,613: INFO/ForkPoolWorker-1] Finished loading phone relationships!
ps-worker | [2018-07-09 18:28:20,614: INFO/ForkPoolWorker-1] Loading phone relationships...
ps-worker | [2018-07-09 18:30:19,967: ERROR/ForkPoolWorker-1] Task pgdb.tasks.import_corpus_task[5a65e94b-2d24-4bb4-8409-df00755b5b52] raised unexpected: TransientError("There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.",)
ps-worker | Traceback (most recent call last):
ps-worker | File "/site/env/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
ps-worker | R = retval = fun(*args, **kwargs)
ps-worker | File "/site/env/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
ps-worker | return self.run(*args, **kwargs)
ps-worker | File "/site/proj/pgdb/tasks.py", line 9, in import_corpus_task
ps-worker | corpus.import_corpus()
ps-worker | File "/site/proj/pgdb/models.py", line 528, in import_corpus
ps-worker | c.load(parser, self.source_directory)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 129, in load
ps-worker | could_not_parse = self.load_directory(parser, path)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 247, in load_directory
ps-worker | self.finalize_import(data, call_back, parser.stop_check)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 68, in finalize_import
ps-worker | import_csvs(self, data, call_back, stop_check)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/io/importer/from_csv.py", line 196, in import_csvs
ps-worker | corpus_context.execute_cypher(s)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/corpus/base.py", line 98, in execute_cypher
ps-worker | results = session.run(statement, **parameters)
ps-worker | File "/site/env/lib/python3.6/site-packages/neo4j/v1/api.py", line 325, in run
ps-worker | self._connection.fetch()
ps-worker | File "/site/env/lib/python3.6/site-packages/neo4j/bolt/connection.py", line 290, in fetch
ps-worker | return self._fetch()
ps-worker | File "/site/env/lib/python3.6/site-packages/neo4j/bolt/connection.py", line 330, in _fetch
ps-worker | response.on_failure(summary_metadata or {})
ps-worker | File "/site/env/lib/python3.6/site-packages/neo4j/v1/result.py", line 70, in on_failure
ps-worker | raise CypherError.hydrate(**metadata)
ps-worker | neo4j.exceptions.TransientError: There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
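As the error message itself suggests, the usual fix is to raise the Neo4j heap limits and restart the database. A minimal sketch of the relevant settings in conf/neo4j.conf (the sizes here are illustrative; pick values that fit the machine's available RAM):

# conf/neo4j.conf -- illustrative sizes, adjust to available RAM
dbms.memory.heap.initial_size=1g
dbms.memory.heap.max_size=2g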
"Underlying" forms versus "Surface" forms
Deleting an Edge should affect Nodes as well, deleting them if they are not needed
Should be able to generate something like:
// Match a word token between its begin/end anchors, along with its type
// node and the type nodes of the phones it contains
MATCH (word_b0:Anchor:untimed)-[:r_word]->(node_word:word:untimed)-[:r_word]->(word_e0:Anchor:untimed),
(node_word)-[:is_a]->(type_node_word:word_type),
(node_word)<-[:contained_by]-(:phone:untimed)-[:is_a]->(type_node_phone:phone_type)
WHERE type_node_word.label = 'are'
WITH node_word, type_node_word, collect(type_node_phone) as p
// Reverse the collected phone-type list and pull out the labels
RETURN node_word.id AS id, type_node_word.label AS word_label,
extract(n in [x in range(size(p)-1,0,-1) | p[x]]|n.label) as word_phone
# Count 'aa' tokens whose following phone is a word-final stop
q = corpus_context.query_graph(corpus_context.surface_transcription)
q = q.filter(corpus_context.surface_transcription.label == 'aa')
q = q.filter(corpus_context.surface_transcription.following.label.in_(['p','t','k','b','d','g','dx', 'tq']))
# The following phone ends where the word ends, i.e. it is word-final
q = q.filter(corpus_context.surface_transcription.following.end == corpus_context.word.end)
print(q.count())
Should return the count of 'aa' tokens followed by word-final stops.
Should be straightforward
Dictionary-like requests (lexicon['word']) should use a cache and only hit the database the first time a given word is requested.
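A minimal sketch of how such a cache could work, assuming a hypothetical Lexicon wrapper whose _fetch_word method runs the actual database query:

class Lexicon(object):
    # Dict-like lexicon access with a simple in-memory cache (sketch)
    def __init__(self, corpus_context):
        self.corpus_context = corpus_context
        self._cache = {}

    def __getitem__(self, key):
        # Only hit the database the first time a given word is requested
        if key not in self._cache:
            self._cache[key] = self._fetch_word(key)
        return self._cache[key]

    def _fetch_word(self, key):
        # Hypothetical: run the real lookup query against the graph here
        raise NotImplementedError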
Median
Quantile
Inter-quartile range
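Until these exist as server-side aggregates, they can be computed client-side from query results. A minimal sketch, assuming a plain Python list of durations pulled from a query:

def quantile(values, q):
    # Linear-interpolation quantile over a sorted copy of the values
    s = sorted(values)
    pos = q * (len(s) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(s) - 1)
    return s[lower] + (pos - lower) * (s[upper] - s[lower])

def median(values):
    return quantile(values, 0.5)

def interquartile_range(values):
    return quantile(values, 0.75) - quantile(values, 0.25)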
Originally, any function that relied on Cypher's LOAD CSV worked per discourse, because discourses were originally labels on annotations rather than nodes in the graph, and Cypher does not allow labels to be specified by variables (such as csvLine.discourse). For corpora like Buckeye, with relatively large discourses, there wasn't much of a performance hit loading per discourse. However, small corpora suffer, because we can't take advantage of any of the speed benefits of LOAD CSV.
At the moment, there's no reason not to do CSV loading based on speakers rather than discourses, at the very least. When I tried putting all annotations of a single type into a single CSV, Neo4j ran out of memory, but this could be fixed based on recent changelog entries for Neo4j 3.0.
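To illustrate the constraint: since a label cannot come from csvLine.discourse at query time, the import code has to interpolate the label into the statement text before running it, once per speaker or discourse. A rough sketch (the statement and helper are simplified stand-ins, not the actual code in from_csv.py):

# Build one LOAD CSV statement per speaker, interpolating the annotation
# type as a label, since Cypher cannot parameterize labels.
statement_template = '''USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "{path}" AS csvLine
CREATE (:{annotation_type} {{id: csvLine.id, label: csvLine.label}})'''

def import_speaker_csvs(corpus_context, annotation_type, csv_paths_by_speaker):
    for speaker, path in csv_paths_by_speaker.items():
        # path must be a file:/// URL visible to the Neo4j server
        statement = statement_template.format(path=path,
                                              annotation_type=annotation_type)
        corpus_context.execute_cypher(statement)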
Functions that make use of CSV imports: Importing discourses/corpora, encoding utterances, encoding syllables, and enriching words/phones/discourses/speakers with additional properties.
Results are split into speakers/discourses, rather than computed over the whole corpus. A temporary workaround is to set c.config.query_behavior = 'other'.
Double check that everything works under the new version
Can't currently use filters to find things that haven't been encoded
Example: c.phone.type_subset != 'syllabic' won't find consonants if only syllabics have been encoded
Can probably be the same as normal corpora, but with reduced functionality for regenerating frequency information.
Words would be annotation graphs with all the subannotations that we'd expect in a normal corpus, but no attachments to other words or parent annotations.
For instance, "phone" should be easily accessible from "word" if the word annotation encompasses many phone annotations
Log errors/warnings (and tell users that they exist in a log), as well as debug info if flagged (how long processing takes).
These two queries should return the same result:
query_graph(corpus_context.phone).durations().aggregate(corpus_context.phone.duration)
query_graph(corpus_context.phone).aggregate(corpus_context.phone.duration)
Queries like:
q = c.query_graph(c.phone).filter(c.phone.type_subset != 'syllabic')
Should return all elements that are not marked as syllabic.
Current code assumes at least two words per discourse, not always true.
Document use of everything in annograph.classes, with examples and graph plots
This manifests as only having a single speaker, which may cause memory issues when there are in fact many speakers; most, if not all, corpora will have multiple speakers.
Annotations could themselves be annotated (Speakers can have a gender, age, etc; words can have neighborhood densities, syntactic categories, abstract tiers, etc)
Currently doesn't take advantage of parallel processing, but could be highly parallel.
Document all the relational database access for interested users
The inspect function should return a PartiturParser for a given file, that loads (a subset of) the tiers in the file into tier objects that PolyglotDB can use
Allow for importing and exporting through:
CSV
Interlinear glossing
TextGrids
File-delimited tiers (à la Buckeye or TIMIT)
Insertion of edges should create new Nodes only if needed
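In Cypher terms this maps onto MERGE, which reuses a matching node and only creates it when absent. A minimal sketch against execute_cypher (the labels and properties here are illustrative):

# MERGE matches existing nodes or creates them, so inserting an edge
# only creates its endpoint nodes when they do not already exist.
statement = '''MERGE (b:Anchor {id: $begin_id})
MERGE (e:Anchor {id: $end_id})
MERGE (b)-[:r_word]->(e)'''
corpus_context.execute_cypher(statement, begin_id='b1', end_id='e1')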
At the moment it's hardcoded to syllabic; it could be anything, with a default of syllabic.
Speaker name and annotation type splitting for MFA parsers (and other TextGrid-based ones) relies on a dash separating the speaker name from the annotation type (e.g., 'Speaker 1 - phone'). Where speaker names are not codes, they can sometimes include hyphens, which causes an error.
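One possible fix is to split only on the last spaced dash, so hyphens inside the speaker name survive. A minimal sketch, with a hypothetical function name, taking the raw tier name as input:

def split_tier_name(tier_name):
    # Split on the last ' - ' so hyphenated names stay intact, e.g.
    # 'Anne-Marie Smith - phone' -> ('Anne-Marie Smith', 'phone')
    speaker, sep, annotation_type = tier_name.rpartition(' - ')
    if not sep:
        # No separator found; treat the whole name as the annotation type
        return None, tier_name.strip()
    return speaker.strip(), annotation_type.strip()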