montrealcorpustools / polyglotdb

Language data store and linguistic query API

License: MIT License

Languages: Python 98.45%, Shell 0.21%, Praat 1.33%
Topics: acoustics, database, influxdb, neo4j, rest-api, speech-analysis, speech-processing

polyglotdb's People

Contributors

a-coles, esteng, james-tanner, jeffmielke, michaelgoodale, michaelhaaf, mmcauliffe, msonderegger, samihuc, vannawillerton


polyglotdb's Issues

Running out of memory when importing corpus

So, when I tried to import the Spade-ICE-Can corpus, I got an out-of-memory error even with about half a gigabyte of RAM still free:

ps-worker   | [2018-07-09 18:28:20,613: INFO/ForkPoolWorker-1] Finished loading phone relationships!
ps-worker   | [2018-07-09 18:28:20,614: INFO/ForkPoolWorker-1] Loading phone relationships...
ps-worker   | [2018-07-09 18:30:19,967: ERROR/ForkPoolWorker-1] Task pgdb.tasks.import_corpus_task[5a65e94b-2d24-4bb4-8409-df00755b5b52] raised unexpected: TransientError("There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.",)
ps-worker   | Traceback (most recent call last):
ps-worker   |   File "/site/env/lib/python3.6/site-packages/celery/app/trace.py", line 375, in trace_task
ps-worker   |     R = retval = fun(*args, **kwargs)
ps-worker   |   File "/site/env/lib/python3.6/site-packages/celery/app/trace.py", line 632, in __protected_call__
ps-worker   |     return self.run(*args, **kwargs)
ps-worker   |   File "/site/proj/pgdb/tasks.py", line 9, in import_corpus_task
ps-worker   |     corpus.import_corpus()
ps-worker   |   File "/site/proj/pgdb/models.py", line 528, in import_corpus
ps-worker   |     c.load(parser, self.source_directory)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 129, in load
ps-worker   |     could_not_parse = self.load_directory(parser, path)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 247, in load_directory
ps-worker   |     self.finalize_import(data, call_back, parser.stop_check)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/corpus/importable.py", line 68, in finalize_import
ps-worker   |     import_csvs(self, data, call_back, stop_check)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/io/importer/from_csv.py", line 196, in import_csvs
ps-worker   |     corpus_context.execute_cypher(s)
ps-worker   |   File "/site/proj/PolyglotDB/polyglotdb/corpus/base.py", line 98, in execute_cypher
ps-worker   |     results = session.run(statement, **parameters)
ps-worker   |   File "/site/env/lib/python3.6/site-packages/neo4j/v1/api.py", line 325, in run
ps-worker   |     self._connection.fetch()
ps-worker   |   File "/site/env/lib/python3.6/site-packages/neo4j/bolt/connection.py", line 290, in fetch
ps-worker   |     return self._fetch()
ps-worker   |   File "/site/env/lib/python3.6/site-packages/neo4j/bolt/connection.py", line 330, in _fetch
ps-worker   |     response.on_failure(summary_metadata or {})
ps-worker   |   File "/site/env/lib/python3.6/site-packages/neo4j/v1/result.py", line 70, in on_failure
ps-worker   |     raise CypherError.hydrate(**metadata)
ps-worker   | neo4j.exceptions.TransientError: There is not enough memory to perform the current task. Please try increasing 'dbms.memory.heap.max_size' in the neo4j configuration (normally in 'conf/neo4j.conf' or, if you you are using Neo4j Desktop, found through the user interface) or if you are running an embedded installation increase the heap by using '-Xmx' command line flag, and then restart the database.
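
Per the error message, the immediate mitigation is to raise the Neo4j heap limit in conf/neo4j.conf. A minimal sketch; the sizes here are assumptions and should be scaled to the host's available RAM:

# conf/neo4j.conf -- assumed sizes; scale to the RAM actually available
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=4g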

Allow for outputting and querying of subarcs

Should be able to generate something like:

MATCH (word_b0:Anchor:untimed)-[:r_word]->(node_word:word:untimed)-[:r_word]->(word_e0:Anchor:untimed),
(node_word)-[:is_a]->(type_node_word:word_type),
(node_word)<-[:contained_by]-(:phone:untimed)-[:is_a]->(type_node_phone:phone_type)
WHERE type_node_word.label = 'are'
WITH node_word, type_node_word, collect(type_node_phone) as p
RETURN node_word.id AS id, type_node_word.label AS word_label, 
extract(n in [x in range(size(p)-1,0,-1) | p[x]]|n.label) as word_phone

Aligning of following phone to a word edge should work

q = corpus_context.query_graph(corpus_context.surface_transcription)
q = q.filter(corpus_context.surface_transcription.label == 'aa')
q = q.filter(corpus_context.surface_transcription.following.label.in_(['p','t','k','b','d','g','dx', 'tq']))
q = q.filter(corpus_context.surface_transcription.following.end == corpus_context.word.end)
print(q.count())

Should return the count of 'aa' tokens followed by word-final stops.

Optimize CSV imports

Originally, any function that relied on Cypher's LOAD CSV was run per discourse, because discourses were originally labels on annotations rather than nodes in the graph, and Cypher does not allow labels to be specified by variables (such as csvLine.discourse). For corpora like Buckeye, with relatively large discourses, loading per discourse carried little performance cost; small corpora suffer, however, because they cannot take advantage of the speed benefits of LOAD CSV. A sketch of the pattern follows.
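
For illustration, a hypothetical sketch of the per-discourse pattern (the file naming and node properties are assumptions): since the label cannot come from csvLine.discourse, it has to be interpolated into the query text, one statement per discourse:

def phone_import_statement(discourse):
    # Hypothetical: the discourse label is baked into the query string because
    # Cypher cannot take a node label from a variable like csvLine.discourse
    return '''LOAD CSV WITH HEADERS FROM "file:///{d}_phones.csv" AS csvLine
CREATE (:phone:{d} {{label: csvLine.label,
                     begin: toFloat(csvLine.begin),
                     end: toFloat(csvLine.end)}})'''.format(d=discourse)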

At the moment, there's no reason not to do CSV loading based on speakers rather than discourses, at the very least. When I tried putting all annotations of a single type into a single CSV, Neo4j ran out of memory, but this could be fixed based on recent changelog entries for Neo4j 3.0.

Functions that make use of CSV imports: Importing discourses/corpora, encoding utterances, encoding syllables, and enriching words/phones/discourses/speakers with additional properties.

Filters for non-encoded subsets

Filters currently can't be used to find annotations that haven't been encoded.
Example: c.phone.type_subset != 'syllabic' won't find consonants if only syllabics have been encoded.
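
A hypothetical example of the desired behavior, in the query style used elsewhere in these issues:

q = corpus_context.query_graph(corpus_context.phone)
# Desired: match phones with no 'syllabic' subset membership (i.e., consonants),
# even though only syllabics have been explicitly encoded
q = q.filter(corpus_context.phone.type_subset != 'syllabic')
print(q.count())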

Add support for frequency dictionary style corpora

Can probably be the same as normal corpora, but with reduced functionality for regenerating frequency information.

Words would be annotation graphs with all the subannotations that we'd expect in a normal corpus, but no attachments to other words or parent annotations.

Add logging support to IO

Log errors and warnings (and tell users that a log exists), as well as debug info when flagged (e.g., how long processing takes).
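
A minimal sketch of what this could look like with the standard library (the function names are assumptions, not the actual PolyglotDB API):

import logging
import time

log = logging.getLogger('polyglotdb.io')

def load_discourse(path):
    # Hypothetical wrapper: log errors with a traceback, and timing at debug level
    start = time.time()
    try:
        data = parse_file(path)  # hypothetical parsing function
    except Exception:
        log.exception('Could not parse %s', path)
        raise
    log.debug('Parsed %s in %.2f seconds', path, time.time() - start)
    return data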

Aggregation should ignore non group-by columns

These two queries should return the same result:

query_graph(corpus_context.phone).durations().aggregate(corpus_context.phone.duration)

query_graph(corpus_context.phone).aggregate(corpus_context.phone.duration)

Add support for annotation attributes

Annotations could themselves be annotated (speakers can have a gender, age, etc.; words can have neighborhood densities, syntactic categories, abstract tiers, etc.).
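
A hypothetical example of what querying such attributes might look like, following the query style used elsewhere in these issues (the attribute names are assumptions):

q = corpus_context.query_graph(corpus_context.word)
q = q.filter(corpus_context.word.speaker.gender == 'female')  # speaker attribute
q = q.filter(corpus_context.word.neighborhood_density < 10)   # word attribute
print(q.count())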

Import BAS Partitur format

  • Inspect function for Partitur
  • Parser class for Partitur

The inspect function should return a PartiturParser for a given file that loads (a subset of) the tiers in the file into tier objects that PolyglotDB can use.
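
A minimal sketch of the requested inspect function (PartiturParser does not exist yet, and the tier detection is an assumption about the BAS Partitur header format):

def inspect_partitur(path):
    # Hypothetical: scan the file for known Partitur tier keys
    # (e.g. ORT = orthography, KAN = canonical, MAU = MAUS segments)
    tiers = []
    with open(path, encoding='utf-8') as f:
        for line in f:
            key = line.split(':', 1)[0].strip()
            if key in ('ORT', 'KAN', 'MAU', 'TRN') and key not in tiers:
                tiers.append(key)
    return PartiturParser(tiers)  # hypothetical parser class from this issue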

Add IO functionality

Allow for importing and exporting through:

  • CSV
  • Interlinear glossing
  • TextGrids
  • File-delimited tiers (à la Buckeye or TIMIT)

Parsing speaker tiers with hyphenated names throws error

Speaker name and annotation type splitting for MFA parsers (and other TextGrid-based ones) relies on a dash separating the speaker name from the annotation type (e.g., Speaker 1 - phone). Where speaker names are not codes, they can themselves include hyphens, which causes an error.
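
A minimal sketch of a more robust split, assuming the ' - ' delimiter convention from the example above (the function name is hypothetical):

def split_tier_name(tier_name):
    # Split on the LAST ' - ' so hyphens inside the speaker name
    # (e.g. 'Anne-Marie - phone') don't break parsing
    speaker, annotation_type = tier_name.rsplit(' - ', 1)
    return speaker.strip(), annotation_type.strip()

print(split_tier_name('Anne-Marie - phone'))  # ('Anne-Marie', 'phone')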
