montrealcorpustools / polyglotdb
Language data store and linguistic query API
License: MIT License
So, when I tried to import the Spade-ICE-Can corpus, I got an out-of-memory error even though I had about half a gigabyte of RAM left.
ps-worker | [2018-07-09 18:28:20,613: INFO/ForkPoolWorker-1] Finished loading phone relationships!
ps-worker | [2018-07-09 18:28:20,614: INFO/ForkPoolWorker-1] Loading phone relationships...
ps-worker | [2018-07-09 18:30:19,967: ERROR/ForkPoolWorker-1] Task pgdb.tasks.import_corpus_task[5a65e94b-2d24-4bb4-8409-df00755b5b52] raised unexpected: TransientError("There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.",)
ps-worker | Traceback (most recent call last):
ps-worker | File "/site/env/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
ps-worker | R = retval = fun(*args, **kwargs)
ps-worker | File "/site/env/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
ps-worker | return self.run(*args, **kwargs)
ps-worker | File "/site/proj/pgdb/tasks.py", line 9, in import_corpus_task
ps-worker | corpus.import_corpus()
ps-worker | File "/site/proj/pgdb/models.py", line 528, in import_corpus
ps-worker | c.load(parser, self.source_directory)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 129, in load
ps-worker | could_not_parse = self.load_directory(parser, path)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 247, in load_directory
ps-worker | self.finalize_import(data, call_back, parser.stop_check)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 68, in finalize_import
ps-worker | import_csvs(self, data, call_back, stop_check)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/io/importer/from_csv.py", line 196, in import_csvs
ps-worker | corpus_context.execute_cypher(s)
ps-worker | File "/site/proj/PolyglotDB/polyglotdb/corpus/base.py", line 98, in execute_cypher
ps-worker | results = session.run(statement, **parameters)
ps-worker | File "/site/env/lib/python3.6/site-packages/neo4j/v1/api.py", line 325, in run
ps-worker | self._connection.fetch()
ps-worker | File "/site/env/lib/python3.6/site-packages/neo4j/bolt/connection.py", line 290, in fetch
ps-worker | return self._fetch()
ps-worker | File "/site/env/lib/python3.6/site-packages/neo4j/bolt/connection.py", line 330, in _fetch
ps-worker | response.on_failure(summary_metadata or {})
ps-worker | File "/site/env/lib/python3.6/site-packages/neo4j/v1/result.py", line 70, in on_failure
ps-worker | raise CypherError.hydrate(**metadata)
ps-worker | neo4j.exceptions.TransientError: There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
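As the error message itself suggests, the usual fix is to raise the Neo4j heap limits and restart the database. A minimal sketch of the relevant settings in conf/neo4j.conf (the sizes here are illustrative; pick values that fit the machine's available RAM):

# conf/neo4j.conf -- illustrative sizes, adjust to available RAM
dbms.memory.heap.initial_size=1g
dbms.memory.heap.max_size=2g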
"Underlying" forms versus "Surface" forms
Deleting an Edge should affect Nodes as well, deleting them if they are not needed
Should be able to generate something like:
// Match a word token between its begin/end anchors, along with its type
// node and the type nodes of the phones it contains
MATCH (word_b0:Anchor:untimed)-[:r_word]->(node_word:word:untimed)-[:r_word]->(word_e0:Anchor:untimed),
(node_word)-[:is_a]->(type_node_word:word_type),
(node_word)<-[:contained_by]-(:phone:untimed)-[:is_a]->(type_node_phone:phone_type)
WHERE type_node_word.label = 'are'
WITH node_word, type_node_word, collect(type_node_phone) as p
// Reverse the collected phone-type list and pull out the labels
RETURN node_word.id AS id, type_node_word.label AS word_label,
extract(n in [x in range(size(p)-1,0,-1) | p[x]]|n.label) as word_phone
# Count 'aa' tokens whose following phone is a word-final stop
q = corpus_context.query_graph(corpus_context.surface_transcription)
q = q.filter(corpus_context.surface_transcription.label == 'aa')
q = q.filter(corpus_context.surface_transcription.following.label.in_(['p','t','k','b','d','g','dx', 'tq']))
# The following phone ends where the word ends, i.e. it is word-final
q = q.filter(corpus_context.surface_transcription.following.end == corpus_context.word.end)
print(q.count())
Should return the count of 'aa' tokens followed by word-final stops.
Should be straightforward
Dictionary-like requests (lexicon['word']) should use a cache and only hit the database the first time a given word is requested.
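A minimal sketch of how such a cache could work, assuming a hypothetical Lexicon wrapper whose _fetch_word method runs the actual database query:

class Lexicon(object):
    # Dict-like lexicon access with a simple in-memory cache (sketch)
    def __init__(self, corpus_context):
        self.corpus_context = corpus_context
        self._cache = {}

    def __getitem__(self, key):
        # Only hit the database the first time a given word is requested
        if key not in self._cache:
            self._cache[key] = self._fetch_word(key)
        return self._cache[key]

    def _fetch_word(self, key):
        # Hypothetical: run the real lookup query against the graph here
        raise NotImplementedError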
Median
Quantile
Inter-quartile range
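Until these exist as server-side aggregates, they can be computed client-side from query results. A minimal sketch, assuming a plain Python list of durations pulled from a query:

def quantile(values, q):
    # Linear-interpolation quantile over a sorted copy of the values
    s = sorted(values)
    pos = q * (len(s) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(s) - 1)
    return s[lower] + (pos - lower) * (s[upper] - s[lower])

def median(values):
    return quantile(values, 0.5)

def interquartile_range(values):
    return quantile(values, 0.75) - quantile(values, 0.25)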
Originally, any function that relied on Cypher's LOAD CSV worked per discourse, because discourses were originally labels on annotations rather than nodes in the graph, and Cypher does not allow labels to be specified by variables (such as csvLine.discourse). For corpora like Buckeye, with relatively large discourses, there wasn't much of a performance hit loading per discourse. However, small corpora suffer, because we can't take advantage of any of the speed benefits of LOAD CSV.
At the moment, there's no reason not to do CSV loading based on speakers rather than discourses, at the very least. When I tried putting all annotations of a single type into a single CSV, Neo4j ran out of memory, but this could be fixed based on recent changelog entries for Neo4j 3.0.
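To illustrate the constraint: since a label cannot come from csvLine.discourse at query time, the import code has to interpolate the label into the statement text before running it, once per speaker or discourse. A rough sketch (the statement and helper are simplified stand-ins, not the actual code in from_csv.py):

# Build one LOAD CSV statement per speaker, interpolating the annotation
# type as a label, since Cypher cannot parameterize labels.
statement_template = '''USING PERIODIC COMMIT 1000
LOAD CSV WITH HEADERS FROM "{path}" AS csvLine
CREATE (:{annotation_type} {{id: csvLine.id, label: csvLine.label}})'''

def import_speaker_csvs(corpus_context, annotation_type, csv_paths_by_speaker):
    for speaker, path in csv_paths_by_speaker.items():
        # path must be a file:/// URL visible to the Neo4j server
        statement = statement_template.format(path=path,
                                              annotation_type=annotation_type)
        corpus_context.execute_cypher(statement)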
Functions that make use of CSV imports: Importing discourses/corpora, encoding utterances, encoding syllables, and enriching words/phones/discourses/speakers with additional properties.
Results are split into speakers/discourses, rather than computed over the whole corpus. A temporary workaround is to set c.config.query_behavior = 'other'.
Double check that everything works under the new version
Can't currently use filters to find things that haven't been encoded
Example: c.phone.type_subset != 'syllabic' won't find consonants if only syllabics have been encoded
Can probably be the same as normal corpora, but with reduced functionality for regenerating frequency information.
Words would be annotation graphs with all the subannotations that we'd expect in a normal corpus, but no attachments to other words or parent annotations.
For instance, "phone" should be easily accessible from "word" if the word annotation encompasses many phone annotations
Log errors/warnings (and tell users that they exist in a log), as well as debug info if flagged (how long processing takes).
These two queries should return the same result:
query_graph(corpus_context.phone).durations().aggregate(corpus_context.phone.duration)
query_graph(corpus_context.phone).aggregate(corpus_context.phone.duration)
Queries like:
q = c.query_graph(c.phone).filter(c.phone.type_subset != 'syllabic')
Should return all elements that are not marked as syllabic.
Current code assumes at least two words per discourse, not always true.
Document use of everything in annograph.classes, with examples and graph plots
This manifests as only having a single speaker, which may cause memory issues when there are in fact many speakers; most, if not all, corpora will have multiple speakers.
Annotations could themselves be annotated (Speakers can have a gender, age, etc; words can have neighborhood densities, syntactic categories, abstract tiers, etc)
Currently doesn't take advantage of parallel processing, but could be highly parallel.
Document all the relational database access for interested users
The inspect function should return a PartiturParser for a given file, that loads (a subset of) the tiers in the file into tier objects that PolyglotDB can use
Allow for importing and exporting through:
CSV
Interlinear glossing
TextGrids
File-delimited tiers (à la Buckeye or TIMIT)
Insertion of edges should create new Nodes only if needed
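In Cypher terms this maps onto MERGE, which reuses a matching node and only creates it when absent. A minimal sketch against execute_cypher (the labels and properties here are illustrative):

# MERGE matches existing nodes or creates them, so inserting an edge
# only creates its endpoint nodes when they do not already exist.
statement = '''MERGE (b:Anchor {id: $begin_id})
MERGE (e:Anchor {id: $end_id})
MERGE (b)-[:r_word]->(e)'''
corpus_context.execute_cypher(statement, begin_id='b1', end_id='e1')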
At the moment it's hardcoded to syllabic; it could be anything, with a default of syllabic.
Speaker name and annotation type splitting for MFA parsers (and other TextGrid-based ones) relies on a dash separating the speaker name from the annotation type (e.g., 'Speaker 1 - phone'). Where speaker names are not codes, they can sometimes include hyphens, which causes an error.
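One possible fix is to split only on the last spaced dash, so hyphens inside the speaker name survive. A minimal sketch, with a hypothetical function name, taking the raw tier name as input:

def split_tier_name(tier_name):
    # Split on the last ' - ' so hyphenated names stay intact, e.g.
    # 'Anne-Marie Smith - phone' -> ('Anne-Marie Smith', 'phone')
    speaker, sep, annotation_type = tier_name.rpartition(' - ')
    if not sep:
        # No separator found; treat the whole name as the annotation type
        return None, tier_name.strip()
    return speaker.strip(), annotation_type.strip()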