
coquery's Introduction

Coquery - a free corpus query tool

Coquery is a free corpus query tool for linguists, lexicographers, translators, and anybody who wishes to search and analyse text corpora. It is available for Windows, Linux, and Mac OS X computers.

You can either build your own corpus from a collection of text files (PDF, MS Word, OpenDocument, HTML, or plain text) in a directory on your computer, or install a corpus module for one of the supported corpora (the corpus data files themselves are not provided by Coquery).

[Screenshot: the Coquery main interface]

Tutorials and documentation can be found on the Coquery website: http://www.coquery.org

Features

An incomplete list of the things you can do with Coquery:

Corpora

  • Use the corpus manager to install one of the supported corpora
  • Build your own corpus from PDF, HTML, .docx, .odt, or plain text files
  • Filter your query, for example by year, genre, or speaker gender
  • Choose which corpus features will be included in your query results
  • View every token that matches your query within its context

Queries

  • Match tokens by orthography, phonetic transcription, lemma, or gloss, and restrict your query by part-of-speech
  • Use string functions, e.g. to test whether a token contains a given letter sequence
  • Use the same query syntax for all installed corpora
  • Automate queries by reading them from an input file
  • Save query results from speech corpora as Praat TextGrids

Analysis

  • Summarize the query results as frequency tables or contingency tables
  • Calculate entropies and relative frequencies
  • Fetch collocations, and calculate association statistics like mutual information scores or conditional probabilities (see the sketch below)
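
The mutual information score of a collocate pair is conventionally computed from the pair frequency, the two individual frequencies, and the corpus size. A minimal sketch of the standard formula follows; it is illustrative only, and Coquery's exact implementation may differ:

#!python
import math

def mutual_information(f_pair, f_node, f_collocate, corpus_size):
    """Pointwise mutual information, in bits, of a node/collocate pair."""
    return math.log2((f_pair * corpus_size) / (f_node * f_collocate))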

Visualizations

  • Use bar charts, heat maps, or bubble charts to visualize frequency distributions
  • Illustrate diachronic changes by using time series plots
  • Show the distribution of tokens within a corpus in a barcode or a beeswarm plot

Databases

  • Connect either to easy-to-use internal databases or to powerful MySQL servers
  • Access large databases on a MySQL server over the network
  • Create links between tables from different corpora, e.g. to provide phonetic transcriptions for tokens in an unannotated corpus

Supported corpora

Coquery already has installers for a number of linguistic corpora and lexical databases; the current list can be found on the Coquery website.

If the list is missing a corpus that you want to see supported in Coquery, you can either write your own corpus installer in Python using the installer API, or you can contact the Coquery maintainers and ask them for assistance.

License

Copyright (c) 2016 Gero Kunter

Initial development was supported by: English Linguistics, Institut für Anglistik und Amerikanistik, Heinrich-Heine-Universität Düsseldorf.

Coquery is free software released under the terms of the GNU General Public License (version 3).


coquery's Issues

Python 3: Wrong contexts shown

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Wrong contexts are shown when Python 3 is used. Apparently, only the first word of the left context list and the first word of the right context list are used.

EXAMPLE:

#!bash
$ python2 coquery.py -q "residualized" -O -c 5
W1,Context
RESIDUALIZED,"covariance analysis was inappropriate, RESIDUALIZED gain scores were created by"
RESIDUALIZED," As indicated earlier, RESIDUALIZED scores represent posttreatment severity scores"
RESIDUALIZED,variable was the respondents' RESIDUALIZED depressive symptoms' scores.
RESIDUALIZED,"their pretreatment counterparts. These RESIDUALIZED scores, representing net severity"
#!bash
$ python3 coquery.py -q "residualized" -O -c 5
W1,Context
RESIDUALIZED,covariance RESIDUALIZED gain
RESIDUALIZED, RESIDUALIZED scores
RESIDUALIZED,variable RESIDUALIZED depressive
RESIDUALIZED,"their RESIDUALIZED scores,"

SOLUTION:
None yet.


[New feature] Make all contexts available, even in aggregated lists

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


If the user clicks on a row in an aggregated list (e.g. a frequency list), a more or less random context is shown (actually, it is the first matching context in the corpus, but that's not transparent to the user).

It would be very useful to be able to access all available contexts somehow. One idea is that after clicking, a window would open, with a KWIC list for the matches. Another click on one of the rows would open the context viewer.


[New feature] Track number of references to MySQL table record during corpus building

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


It might be interesting to keep track of the number of times an existing record is referenced during the corpus building process.

Currently, BaseCorpusBuilder and its subclasses use table_get(dict) for records. This method is a "select or insert" method: if there is a record that matches the values in dict, it returns the id of that record. If there is no matching record, a new record is inserted, and the new id is returned.

Now, if that function were changed so that a counter is increased by one whenever it is called, we would get, for example, a global corpus frequency of all tokens for free. There might also be a solution to Issue #20 in there somewhere. However, different source features do not necessarily have their own database records, so the issue might be more complicated than that.

For reference, this type of "update or insert" action is sometimes called an "upsert". MySQL provides the INSERT ... ON DUPLICATE KEY UPDATE syntax for it.

http://dev.mysql.com/doc/refman/5.1/en/insert-on-duplicate.html
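
A minimal sketch of such a counting "upsert", assuming a hypothetical coq_word table with an auto-increment WordId primary key, a UNIQUE key on (Word, Pos), and a Freq counter column (these names are illustrative, not Coquery's actual schema):

#!python
import pymysql

def table_get_counting(conn, word, pos):
    """Insert the record if it is new, otherwise increase its reference
    counter; return the record id in either case."""
    # conn = pymysql.connect(host=..., user=..., db=...)
    with conn.cursor() as cur:
        # LAST_INSERT_ID(expr) makes lastrowid report the id of the
        # existing row when the UPDATE branch is taken.
        cur.execute(
            "INSERT INTO coq_word (Word, Pos, Freq) VALUES (%s, %s, 1) "
            "ON DUPLICATE KEY UPDATE Freq = Freq + 1, "
            "WordId = LAST_INSERT_ID(WordId)",
            (word, pos))
        return cur.lastrowid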


Context still available even for dictionaries

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
Dictionary corpora such as CMUdict don't have information on tokens in context, i.e. there is no token_id that can be interpreted sequentially. However, context modes are still enabled for these corpora, which causes the query to fail if a context span is selected.

SOLUTION:
Disable the context box for corpus resources without a sequential token_id.
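
A minimal sketch of that check, assuming that a resource only defines a corpus_id feature when its tokens carry sequential ids (the attribute and widget names are illustrative):

#!python
def has_sequential_tokens(resource):
    """Context spans only make sense for resources whose tokens carry
    a sequential id."""
    return hasattr(resource, "corpus_id")

# In the GUI setup code, something like:
# context_box.setEnabled(has_sequential_tokens(resource))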


corpus modules have too much boilerplate code

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, the modules in corpora/ all contain code that checks against running them directly from the command line, sets up the logger, and tries to disable the query cache if requested. This is clearly boilerplate code that makes the corpus modules more complicated than necessary.

SOLUTION:
Remove this boilerplate code, perhaps by moving it into the initialization of the specific Resource class.
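
A sketch of that refactoring, with hypothetical helper names; each corpus module would then reduce to its declarative part:

#!python
class BaseResource(object):
    """The boilerplate that every corpus module currently repeats is
    performed once, in the shared base initialization."""
    def __init__(self, *args, **kwargs):
        self._setup_logger()
        self._disable_query_cache_if_requested()

    def _setup_logger(self):
        pass  # hypothetical shared helper

    def _disable_query_cache_if_requested(self):
        pass  # hypothetical shared helper

class BNCResource(BaseResource):
    # a corpus module now only declares its data
    corpus_table = "element"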


SQLCorpus.lexicon.get_entry() fetches all fields

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, all fields provided by the lexicon are fetched by get_entry(). This can be very slow, particularly with linked tables (e.g. if cmudict.dict is used for transcription).

SOLUTION:
SQLCorpus.sql_string_get_entry() should only include those tables that are requested, and SQLCorpus.get_entry() should know how to handle these restricted results.
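
A sketch of the restricted query string, with illustrative table and column names; a join of a linked table such as cmudict.dict would only be added when the corresponding feature is actually requested:

#!python
def sql_string_get_entry(word_id, requested_features):
    """Build a SELECT that fetches only the requested lexicon fields,
    e.g. ["Word", "Transcript"], instead of all of them."""
    columns = ", ".join(requested_features)
    return ("SELECT {columns} FROM word WHERE WordId = {word_id}"
            .format(columns=columns, word_id=int(word_id)))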


Unclear treatment of pos tables in SQLCorpus.sql_string_get_table()

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The SELECT string constructed by sql_string_get_table() for each query token in a self join does not handle part-of-speech tables correctly. It selects the part-of-speech identifier from {resource.corpus_table}.{resource.word_pos_id}, but this will only work if the PosId column of word_table is also present in corpus_table.

SOLUTION:
Rewrite the whole function.


Highlight all tokens in context viewer

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
Currently, if you click on a token in the results table, only this token is highlighted in the context viewer, even if other tokens occur in the same context.

SOLUTION:
The context renderer should first create a list of all result tokens that occur in the current source, and then highlight the tokens in the viewer.


QueryResult.get_row() and Session.expand_header() should share their code

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Session.expand_header() is used to create the header line, while QueryResult.get_row() produces the data lines. Currently, both functions are independent of each other, and changes have to be made to both. This may easily introduce bugs.

SOLUTION:
The number and order of fields should be determined once per session, and both functions should rely on that result instead of determining it themselves.
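
A minimal sketch of that arrangement:

#!python
class Session(object):
    """The field list is fixed once per session; the header and the
    data rows are both derived from it."""
    def __init__(self, requested_fields):
        self.output_fields = list(requested_fields)

    def expand_header(self):
        return self.output_fields

    def get_row(self, result):
        # result: a mapping from field name to value for one match
        return [result.get(field, "") for field in self.output_fields]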


Improve the management of queries with tokens that are not in the lexicon

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
If a query token does not match any entry in the lexicon, corpus.sql_string_run_query_where_string() returns an empty string. As a result, the query may run completely unrestricted.

SOLUTION:
Sometimes an empty where_string is desired (e.g. -q "*"). These cases must be distinguished from those where the empty string is caused by a non-existing entry. Therefore, get_wordid_list() should return [-1] in only one of the two cases, and sql_string_run_query_where_string() should react accordingly.
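
A sketch of the proposed distinction, with the lexicon reduced to a plain dict for illustration:

#!python
def get_wordid_list(token, lexicon):
    """None signals 'no restriction wanted' (e.g. the query *), while
    [-1] signals 'no lexicon match', an id that matches no row."""
    if token == "*":
        return None
    ids = lexicon.get(token, [])
    return ids if ids else [-1]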


Plotting of percent barplot failed

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The percentage barplot failed to plot the results of a query, but the normal barplot worked.

BNC Query:

[ajc] and more [aj0] *
more [aj0] and [ajc] *

Output columns: Word (hidden), Query String

Error:

Length mismatch: Expected axis has 1 elements, new values have 2 elements

 visualizer.py, line 539: start_draw_thread
   barplot.py, line 219: draw
     visualizer.py, line 181: map_data
       barplot.py, line 146: plot_facet

Output column numbers not useful if quantifying modifiers are used

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The way the quantifying modifiers are dealt with breaks visualizations. For example, a query string [v_] the{0,} [n_] creates an output like this:

Word1      Word2     Word3
win        the       race
bakes      bread

It would be much more useful to have the output organized like this:

Word1      Word2     Word3
win        the       race
bakes                bread

In this way, a visualizer could use Word1 and Word3 as sources. Most of the time, the part of a query containing a quantifier is probably not as interesting as the fixed parts. It might be possible to disable this behaviour in a Settings dialog.
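
A sketch of the padding logic that would keep the columns aligned:

#!python
def align_match(match, n_slots):
    """Keep each matched word in the column of its query slot, so that
    a {0,} quantifier that matched nothing leaves an empty cell.

    match: list of (slot_index, word) pairs for one hit."""
    row = [""] * n_slots
    for slot, word in match:
        row[slot] = word
    return row

align_match([(0, "win"), (1, "the"), (2, "race")], 3)  # ['win', 'the', 'race']
align_match([(0, "bakes"), (2, "bread")], 3)           # ['bakes', '', 'bread']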


get_frequency() and get_corpus_size() don't acknowledge result filters

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The two methods Corpus.get_frequency() and Corpus.get_corpus_size() don't go through the filter list, so the obtained frequencies always refer to the whole corpus, not the subcorpus that a query may be based on.

This is particularly obvious when calculating collocations: the list of collocates that is returned is delimited by the filter list, but the frequencies are not.

SOLUTION:
Add the filter list constraints to the queries issued by Corpus.get_frequency(). For Corpus.get_corpus_size(), a lookup table should be used (see Issue #20).
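
A sketch of a filter-aware frequency query, with illustrative table and column names; the filter clauses are assumed to arrive as SQL condition strings:

#!python
def filtered_frequency(cursor, word, filter_clauses):
    """Count the tokens of `word` within the filtered sub-corpus."""
    where = " AND ".join(["Word = %s"] + list(filter_clauses))
    cursor.execute(
        "SELECT COUNT(*) FROM corpus "
        "INNER JOIN word USING (WordId) "
        "WHERE " + where,
        (word,))
    return cursor.fetchone()[0]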


Memory leak: hidden reference to data table

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
There seems to be a memory leak: when the same query is executed several times, the old data table (or a derivative of it) does not seem to get freed. This may lead to disproportionate memory consumption, and probably to severe performance hits after a while.

This can be tested by setting the memory dump option and executing the same query several times. For example, the multi-line query

[train].[n*]
[ship].[n*]
[plane|aeroplane].[n*]

queried on COHA with Year and Query String selected yields after four executions:

{'size': 1043568, 'ref': 125917, 'id': 139751767240784, 'class': "<type 'list'>"}
{'size': 1043568, 'ref': 125917, 'id': 139751694539592, 'class': "<type 'list'>"}
{'size': 1043568, 'ref': 125917, 'id': 139751563908736, 'class': "<type 'list'>"}
{'size': 1043568, 'ref': 125917, 'id': 139751433104128, 'class': "<type 'list'>"}

The referents are the 125917 matching tokens.

SOLUTION:
Not obvious. Given that the class is a list, this is probably the list of query results Query.Results that isn't freed correctly, but this needs more testing.


Contexts can come from texts other than the token's

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, Corpus.get_context() ignores whether the words within the context span are from the same text as the query match. For example, a match that is the last token of one text will be printed with the first words of the next text as its right context.

SOLUTION:
Change the logic of get_context() so that the start and end of the span take the beginning and end of the text into consideration.
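
A minimal sketch of that clamping, with illustrative argument names:

#!python
def context_span(token_id, span, text_start, text_end):
    """Return the (start, end) of the context window, clamped so that
    it never crosses into a neighbouring text."""
    start = max(token_id - span, text_start)
    end = min(token_id + span, text_end)
    return start, end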


Token specification ignored if no lexicon feature requested

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The query string does not contain the Word_id restrictions if no lexicon feature is requested.

For example, querying "car" in COHA with only "Year" selected equates to a query of all year columns from sources.

SOLUTION:
Make sure that word_id is in the selected features if a token specification is given.
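
A sketch of the proposed guard, with illustrative names:

#!python
def ensure_word_id(selected_features, has_token_specification):
    """Force word_id into the selected features whenever the query
    token carries a restriction."""
    if has_token_specification and "word_id" not in selected_features:
        selected_features.append("word_id")
    return selected_features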


List of requested fields is stored as a property of CorpusQuery()

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
In CorpusQuery.__init__(), the property self.request_list is constructed based on which arguments are set in options.cfg. This list is then checked against the features provided by the selected corpus.

SOLUTION:
The list of requested fields is a session-wide property, and should therefore be a property of Session(). This list has to be made available to CorpusQuery(), but also to SQLLexicon.get_entry().


Code reduplication in SQLLexicon.sql_string_get_entry() and SQLLexicon.sql_string_get_matching_wordids()

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The functions SQLLexicon.sql_string_get_entry() and SQLLexicon.sql_string_get_matching_wordids() both construct a query string that may join several tables from the database. The two strings seem to differ only in the WHERE clause.

SOLUTION:
Implement a function SQLLexicon.sql_string_table_join() that both functions can use.
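
A sketch of that factoring, with illustrative table names; the shared part is built once, and each caller appends only its own WHERE clause:

#!python
class SQLLexicon(object):
    def sql_string_table_join(self):
        """The SELECT ... JOIN part that both query builders share."""
        return ("SELECT WordId, Word, Lemma FROM word "
                "LEFT JOIN lemma USING (LemmaId)")

    def sql_string_get_entry(self, word_id):
        return "{} WHERE WordId = {}".format(
            self.sql_string_table_join(), int(word_id))

    def sql_string_get_matching_wordids(self, spec):
        return "{} WHERE Word LIKE '{}'".format(
            self.sql_string_table_join(), spec)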


[New feature] Have proportion as an additional output column

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


It may be helpful to have a proportion as an additional output column in a FREQUENCY query. The proportion for each row would simply be the value of the Frequency column divided by the total number of matches.

Eventually, the FREQUENCY query mode might even be removed, and turned into a special kind of optional output column.
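
A minimal pandas sketch of the conventional computation (the frequency values are made up):

#!python
import pandas as pd

df = pd.DataFrame({"Word": ["the", "of", "and"],
                   "Frequency": [605, 350, 45]})
df["Proportion"] = df["Frequency"] / df["Frequency"].sum()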


Import of corpus module should involve testing for completeness

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, corpus modules are imported in Session.__init__() without any testing. In particular, there is no test whether the corpus Resource class provides all fields required for the requested query. Such testing only takes place in some of the sql_string_xxx functions, so the script may, for instance, abort only after the query has run, during output.

SOLUTION:
First of all, a clear catalogue of what is required for a resource description is needed (see also Issue #11). Once that is established, BaseResource should implement a method validate() that is called in Session.__init__() after the Corpus is initialized. This method should abort with an instructive error message if any required variable is missing.


Writing results from a frequency query is unacceptably slow

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, the underlying MySQL query of a frequency query and a context query is identical. This means that the Python code (more specifically, the write_results() method of FrequencyQuery) has to go through a huge table and do the frequency aggregation in Python. This is extremely slow, in particular for frequent words. It is so bad that it forces swapping even on machines with plenty of memory.

SOLUTION:
Construct the MySQL queries that obtain frequencies in such a way that the aggregation is done by the MySQL server, and not by Python.
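
A sketch of such a query: rather than fetching every matching row and counting in write_results(), the match query is wrapped in a GROUP BY so that the server returns the aggregated table directly (illustrative table and column names):

#!python
frequency_query = (
    "SELECT Word, COUNT(*) AS Frequency "
    "FROM corpus "
    "INNER JOIN word USING (WordId) "
    "WHERE Word LIKE %s "
    "GROUP BY Word")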


[New feature] Add stop-word list

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


Users could specify a stop-word list. Query matches would be checked against the stop-words, and only included in the results if they are not listed there.

Technically, the stop-word list should be incorporated as another MySQL table. Adding and using a stop-word list might be done, then, along these lines:

  • Select a file with orthographic representations
  • Create a table coq_stopwords that contains the WordIds matching these representations
  • When calling SQLResource.get_matching_wordids(), either consider this table in the query that returns the word_ids for a query token, or apply it to the returned list (see the sketch below)

Note that the GUI will have to provide the following:

  • a button to add a stop-word list from a file
  • a button to clear this stop-word list
  • a tabular view of the stop-words
  • ways of appending, removing, and editing this table

This could be incorporated as another tab in addition to the "Query results" tab and the "Query log" tab.
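
A sketch of the second option above, applying the hypothetical coq_stopwords table to the returned list of word ids:

#!python
def remove_stopwords(word_ids, conn):
    """Drop all word ids that appear in the stop-word table."""
    with conn.cursor() as cur:
        cur.execute("SELECT WordId FROM coq_stopwords")
        stopped = {row[0] for row in cur.fetchall()}
    return [wid for wid in word_ids if wid not in stopped]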


Installers should only create .py file if the setup is complete

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, the installers using corpusbuilder.py can write a Python corpus module even if the setup for that corpus is incomplete (e.g. no MySQL database of the given name, no tables, no entries). This should be changed so that the module is only written if the MySQL part has been installed.

SOLUTION:
Change corpusbuilder.py so that it checks for the existence of a database with the given name, of the tables in the table description, and of at least one entry in the corpus table.
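
A sketch of that check; it assumes a MySQL connection and treats the module as writable only if the database, all described tables, and at least one corpus entry exist:

#!python
def setup_is_complete(conn, db_name, table_names, corpus_table):
    """Return True only if the MySQL part has been fully installed."""
    with conn.cursor() as cur:
        cur.execute("SHOW DATABASES LIKE %s", (db_name,))
        if cur.fetchone() is None:
            return False
        for table in table_names:
            cur.execute("SHOW TABLES FROM `{}` LIKE %s".format(db_name),
                        (table,))
            if cur.fetchone() is None:
                return False
        cur.execute("SELECT 1 FROM `{}`.`{}` LIMIT 1".format(
            db_name, corpus_table))
        return cur.fetchone() is not None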


Remove side effects from SQLLexicon.sql_string_query

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The method SQLLexicon.sql_string_query() creates the MySQL string required for the current query. However, it also has the side effect of modifying Query.Session.output_order so that the list contains the resource feature names that are contained in the output. This side effect is undesirable, because it is unexpected that a sql_string_xxx method does anything other than return a string.

In addition, a WordNotInLexicon exception may be raised within the method so that output_order is not modified correctly (see Issue #12).

SOLUTION:
Move the code that modifies output_order outside of this function.
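
A sketch of the refactored interface: the method returns the output order alongside the query string, and the caller updates the session (the helper names are hypothetical):

#!python
class SQLLexicon(object):
    def sql_string_query(self, query):
        query_string = self.build_query_string(query)      # hypothetical
        output_order = self.determine_output_order(query)  # hypothetical
        return query_string, output_order

# caller:
# query_string, output_order = lexicon.sql_string_query(query)
# query.Session.output_order = output_order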


BNC corpus: sentence_id is used as source_id for tokens

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The BNC corpus currently uses the sentence_id to keep track of the source of a token. This sentence_id is then linked to the actual source table when the text is accessed. This makes the look-up of some information rather complicated (see Issue #11), and it causes inconsistent behaviour between corpora. For example, in the BNC, context is delimited by sentences, but in COCA, by texts (see also Issue #4).

SOLUTION:
The table bnc.element should store the text_id, not the sentence_id. The table bnc.sentence should store token_id as an additional column. This requires changes to tools/create_bnc.py.


Frequency column cannot be hidden

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
In a data results table, the frequency column can't be hidden correctly. The context menu changes, but the data is still shown. This problem is probably caused by an incorrect interpretation of the column title: instead of 'Frequency', the context menu has the wrong heading 'coquery_invisible_corpus_id'.


[New syntax] Allow partial query token negation

Currently, query tokens can be negated by preceding them with the negation character. It may be useful to include partial negation so that only either the word/lemma is negated or the class specification:

  • ~[fish].[v*] negates the whole token, i.e. matches any word that is not a verbal word-form of FISH
  • [~fish].[v*] negates only the lemma, i.e. it would match any verb that is not a word-form of FISH.
  • [fish].[~v*] negates only the class, i.e. it would match any word-form of FISH that is not a verb.
  • [~fish].[~v*] negates both the lemma and the class, and would probably match any word that is not a word-form of FISH and which is not a verb.

Double negation may become difficult, though. One solution could be this:

  • ~[~fish].[v*] would be equivalent to [fish].[~v*]
  • ~[fish].[~v*] would be equivalent to [~fish].[~v*]
  • ~[~fish].[~v*] would be equivalent to [fish].[v*]

Pre-calculated corpus size lists required

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Detecting the size of sub-corpora (e.g. a sub-corpus that contains only sources from one genre) can be very slow for big corpora, due to the COUNT(*) clause. This is a problem if we want to express relative frequencies (words per million).

SOLUTION:
During corpus creation, produce a data table that stores the corpus size for all combinations of source features. This table can be used as a lookup instead of a SQL query.
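
A sketch of the build-time table, with illustrative names; WITH ROLLUP additionally stores the marginal totals for the source feature combinations:

#!python
create_size_table = (
    "CREATE TABLE coq_corpus_size AS "
    "SELECT Genre, Year, COUNT(*) AS Tokens "
    "FROM corpus "
    "INNER JOIN sources USING (SourceId) "
    "GROUP BY Genre, Year WITH ROLLUP")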


Create a unified layout for resource descriptions

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, each corpus module can define an arbitrary set of labels that are used in the different sql_string_xxx functions to construct valid MySQL query strings. However, there is no mechanism that can be used to unify access to different tables across corpora.

SOLUTION:
Instead of simply using strings, a corpus module could contain a complete table layout description that also represents the links between the different tables. This layout could be part of a configuration file, so that in order to adjust the module to an existing database, only the configuration file would need to be changed.
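
A hypothetical sketch of such a layout description, recording for each table its id column, its data columns, and its links to other tables:

#!python
TABLE_LAYOUT = {
    "corpus": {"id": "TokenId",
               "columns": ["WordId", "SourceId"],
               "links": {"WordId": "word", "SourceId": "source"}},
    "word":   {"id": "WordId",
               "columns": ["Word", "Lemma", "Pos"],
               "links": {}},
    "source": {"id": "SourceId",
               "columns": ["Title", "Year", "Genre"],
               "links": {}},
}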


Token frequency not calculated correctly in Collocate mode

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The token frequencies in Collocations queries are wrong. Using the query string "language" in ICE-NG, with Text as output column and a left and right context span of 10, the frequency shown for at least one item, 'igala', is wrong: it occurs 6 times in the left and 2 times in the right context, but the Collocate frequency column shows a frequency of 5.

This may be a capitalization error, or perhaps an encoding error.

