
coquery's Introduction

Coquery - a free corpus query tool

Coquery is a free corpus query tool for linguists, lexicographers, translators, and anybody who wishes to search and analyse text corpora. It is available for Windows, Linux, and Mac OS X computers.

You can either build your own corpus from a collection of text files (PDF, MS Word, OpenDocument, HTML, or plain text) in a directory on your computer, or install a corpus module for one of the supported corpora (the corpus data files themselves are not provided by Coquery).

[Screenshot: the Coquery main interface]

Tutorials and documentation can be found on the Coquery website: http://www.coquery.org

Features

An incomplete list of the things you can do with Coquery:

Corpora

  • Use the corpus manager to install one of the supported corpora
  • Build your own corpus from PDF, HTML, .docx, .odt, or plain text files
  • Filter your query, for example by year, genre, or speaker gender
  • Choose which corpus features will be included in your query results
  • View every token that matches your query within its context

Queries

  • Match tokens by orthography, phonetic transcription, lemma, or gloss, and restrict your query by part-of-speech
  • Use string functions, e.g. to test whether a token contains a given letter sequence
  • Use the same query syntax for all installed corpora
  • Automate queries by reading them from an input file
  • Save query results from speech corpora as Praat TextGrids

Analysis

  • Summarize the query results as frequency tables or contingency tables
  • Calculate entropies and relative frequencies
  • Fetch collocations, and calculate association statistics like mutual information scores or conditional probabilities (see the sketch below)
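
The mutual information score of a collocate pair is conventionally computed from the pair frequency, the two individual frequencies, and the corpus size. A minimal sketch of the standard formula follows; it is illustrative only, and Coquery's exact implementation may differ:

#!python
import math

def mutual_information(f_pair, f_node, f_collocate, corpus_size):
    """Pointwise mutual information, in bits, of a node/collocate pair."""
    return math.log2((f_pair * corpus_size) / (f_node * f_collocate))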

Visualizations

  • Use bar charts, heat maps, or bubble charts to visualize frequency distributions
  • Illustrate diachronic changes by using time series plots
  • Show the distribution of tokens within a corpus in a barcode or a beeswarm plot

Databases

  • Connect either to easy-to-use internal databases or to powerful MySQL servers
  • Access large databases on a MySQL server over the network
  • Create links between tables from different corpora, e.g. to provide phonetic transcriptions for tokens in an unannotated corpus

Supported corpora

Coquery already has installers for a number of linguistic corpora and lexical databases; the current list can be found on the Coquery website.

If the list is missing a corpus that you want to see supported in Coquery, you can either write your own corpus installer in Python using the installer API, or you can contact the Coquery maintainers and ask them for assistance.

License

Copyright (c) 2016 Gero Kunter

Initial development was supported by: English Linguistics, Institut für Anglistik und Amerikanistik, Heinrich-Heine-Universität Düsseldorf.

Coquery is free software released under the terms of the GNU General Public License (version 3).


coquery's Issues

Python 3: Wrong contexts shown

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Wrong contexts are shown when Python 3 is used. Apparently, only the first word of the left context list and the first word of the right context list are used.

EXAMPLE:

#!bash
$ python2 coquery.py -q "residualized" -O -c 5
W1,Context
RESIDUALIZED,"covariance analysis was inappropriate, RESIDUALIZED gain scores were created by"
RESIDUALIZED," As indicated earlier, RESIDUALIZED scores represent posttreatment severity scores"
RESIDUALIZED,variable was the respondents' RESIDUALIZED depressive symptoms' scores.
RESIDUALIZED,"their pretreatment counterparts. These RESIDUALIZED scores, representing net severity"
#!bash
$ python3 coquery.py -q "residualized" -O -c 5
W1,Context
RESIDUALIZED,covariance RESIDUALIZED gain
RESIDUALIZED, RESIDUALIZED scores
RESIDUALIZED,variable RESIDUALIZED depressive
RESIDUALIZED,"their RESIDUALIZED scores,"

SOLUTION:
None yet.


[New feature] Make all contexts available, even in aggregated lists

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


If the user clicks on a row in an aggregated list (e.g. a frequency list), a more or less random context is shown (actually, it is the first matching context in the corpus, but that's not transparent to the user).

It would be very useful to be able to access all available contexts somehow. One idea is that after clicking, a window would open, with a KWIC list for the matches. Another click on one of the rows would open the context viewer.


[New feature] Track number of references to MySQL table record during corpus building

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


It might be interesting to keep track of the number of times an existing record is referenced during the corpus building process.

Currently, BaseCorpusBuilder and its subclasses use table_get(dict) for records. This method is a "select or insert" method: if there is a record that matches the values in dict, it returns the id of that record. If there is no matching record, a new record is inserted, and the new id is returned.

Now, if that function were changed so that a counter is increased by one whenever it is called, we would get, for example, a global corpus frequency of all tokens for free. There might also be a solution to Issue #20 in there somewhere. However, different source features do not necessarily have their own database records, so the issue might be more complicated than that.

For reference, this type of "update or insert" action is sometimes called an "upsert". MySQL provides the INSERT ... ON DUPLICATE KEY UPDATE syntax for it.

http://dev.mysql.com/doc/refman/5.1/en/insert-on-duplicate.html
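
A minimal sketch of such a counting "upsert", assuming a hypothetical coq_word table with an auto-increment WordId primary key, a UNIQUE key on (Word, Pos), and a Freq counter column (these names are illustrative, not Coquery's actual schema):

#!python
import pymysql

def table_get_counting(conn, word, pos):
    """Insert the record if it is new, otherwise increase its reference
    counter; return the record id in either case."""
    # conn = pymysql.connect(host=..., user=..., db=...)
    with conn.cursor() as cur:
        # LAST_INSERT_ID(expr) makes lastrowid report the id of the
        # existing row when the UPDATE branch is taken.
        cur.execute(
            "INSERT INTO coq_word (Word, Pos, Freq) VALUES (%s, %s, 1) "
            "ON DUPLICATE KEY UPDATE Freq = Freq + 1, "
            "WordId = LAST_INSERT_ID(WordId)",
            (word, pos))
        return cur.lastrowid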


Context still available even for dictionaries

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
Dictionary corpora such as CMUdict don't have information on tokens in context, i.e. there is no token_id that can be interpreted sequentially. However, context modes are still enabled for these corpora, which causes the query to fail if a context span is selected.

SOLUTION:
Disable the context box for corpus resources without a sequential token_id.
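
A minimal sketch of that check, assuming that a resource only defines a corpus_id feature when its tokens carry sequential ids (the attribute and widget names are illustrative):

#!python
def has_sequential_tokens(resource):
    """Context spans only make sense for resources whose tokens carry
    a sequential id."""
    return hasattr(resource, "corpus_id")

# In the GUI setup code, something like:
# context_box.setEnabled(has_sequential_tokens(resource))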


corpus modules have too much boilerplate code

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, the modules in corpora/ all contain code that checks against running them directly from the command line, sets up the logger, and tries to disable the query cache if requested. This is clearly boilerplate code that makes the corpus modules more complicated than necessary.

SOLUTION:
Remove this boilerplate code, perhaps by moving it into the initialization of the specific Resource class.
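
A sketch of that refactoring, with hypothetical helper names; each corpus module would then reduce to its declarative part:

#!python
class BaseResource(object):
    """The boilerplate that every corpus module currently repeats is
    performed once, in the shared base initialization."""
    def __init__(self, *args, **kwargs):
        self._setup_logger()
        self._disable_query_cache_if_requested()

    def _setup_logger(self):
        pass  # hypothetical shared helper

    def _disable_query_cache_if_requested(self):
        pass  # hypothetical shared helper

class BNCResource(BaseResource):
    # a corpus module now only declares its data
    corpus_table = "element"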


SQLCorpus.lexicon.get_entry() fetches all fields

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, all fields provided by the lexicon are fetched by get_entry(). This can be very slow, particularly with linked tables (e.g. if cmudict.dict is used for transcription).

SOLUTION:
SQLCorpus.sql_string_get_entry() should only include those tables that are requested, and SQLCorpus.get_entry() should know how to handle these restricted results.
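
A sketch of the restricted query string, with illustrative table and column names; a join of a linked table such as cmudict.dict would only be added when the corresponding feature is actually requested:

#!python
def sql_string_get_entry(word_id, requested_features):
    """Build a SELECT that fetches only the requested lexicon fields,
    e.g. ["Word", "Transcript"], instead of all of them."""
    columns = ", ".join(requested_features)
    return ("SELECT {columns} FROM word WHERE WordId = {word_id}"
            .format(columns=columns, word_id=int(word_id)))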


Unclear treatment of pos tables in SQLCorpus.sql_string_get_table()

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The SELECT string constructed by sql_string_get_table() for each query token in a self join does not handle part-of-speech tables correctly. It selects the part-of-speech identifier from {resource.corpus_table}.{resource.word_pos_id}, but this will only work if the PosId column of word_table is also present in corpus_table.

SOLUTION:
Rewrite the whole function.


Highlight all tokens in context viewer

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
Currently, if you click on a token in the results table, only this token is highlighted in the context viewer, even if other tokens occur in the same context.

SOLUTION:
The context renderer should first create a list of all result tokens that occur in the current source, and then highlight the tokens in the viewer.


QueryResult.get_row() and Session.expand_header() should share their code

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Session.expand_header() is used to create the header line, while QueryResult.get_row() produces the data lines. Currently, both functions are independent of each other, and changes have to be made to both. This may easily introduce bugs.

SOLUTION:
The number and order of fields should be determined once per session, and both functions should rely on that result instead of determining it themselves.
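
A minimal sketch of that arrangement:

#!python
class Session(object):
    """The field list is fixed once per session; the header and the
    data rows are both derived from it."""
    def __init__(self, requested_fields):
        self.output_fields = list(requested_fields)

    def expand_header(self):
        return self.output_fields

    def get_row(self, result):
        # result: a mapping from field name to value for one match
        return [result.get(field, "") for field in self.output_fields]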


Improve the management of queries with tokens that are not in the lexicon

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
If a query token does not match any entry in the lexicon, corpus.sql_string_run_query_where_string() returns an empty string. As a result, the query may run completely unrestricted.

SOLUTION:
Sometimes an empty where_string is desired (e.g. -q "*"). These cases must be distinguished from those where the empty string is caused by a non-existing entry. Therefore, get_wordid_list() should return [-1] in only one of the two cases, and sql_string_run_query_where_string() should react accordingly.
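
A sketch of the proposed distinction, with the lexicon reduced to a plain dict for illustration:

#!python
def get_wordid_list(token, lexicon):
    """None signals 'no restriction wanted' (e.g. the query *), while
    [-1] signals 'no lexicon match', an id that matches no row."""
    if token == "*":
        return None
    ids = lexicon.get(token, [])
    return ids if ids else [-1]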


Plotting of percent barplot failed

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The percentage barplot failed to plot the results of a query, but the normal barplot worked.

BNC Query:

[ajc] and more [aj0] *
more [aj0] and [ajc] *

Output columns: Word (hidden), Query String

Error:

Length mismatch: Expected axis has 1 elements, new values have 2 elements

 visualizer.py, line 539: start_draw_thread
   barplot.py, line 219: draw
     visualizer.py, line 181: map_data
       barplot.py, line 146: plot_facet

Output column numbers not useful if quantifying modifiers are used

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The way the quantifying modifiers are dealt with breaks visualizations. For example, a query string [v_] the{0,} [n_] creates an output like this:

Word1      Word2     Word3
win        the       race
bakes      bread

It would be much more useful to have the output organized like this:

Word1      Word2     Word3
win        the       race
bakes                bread

In this way, a visualizer could use Word1 and Word3 as sources. Most of the time, the part of a query containing a quantifier is probably not as interesting as the fixed parts. It might be possible to disable this behaviour in a Settings dialog.
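
A sketch of the padding logic that would keep the columns aligned:

#!python
def align_match(match, n_slots):
    """Keep each matched word in the column of its query slot, so that
    a {0,} quantifier that matched nothing leaves an empty cell.

    match: list of (slot_index, word) pairs for one hit."""
    row = [""] * n_slots
    for slot, word in match:
        row[slot] = word
    return row

align_match([(0, "win"), (1, "the"), (2, "race")], 3)  # ['win', 'the', 'race']
align_match([(0, "bakes"), (2, "bread")], 3)           # ['bakes', '', 'bread']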


get_frequency() and get_corpus_size() don't acknowledge result filters

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The two methods Corpus.get_frequency() and Corpus.get_corpus_size() don't go through the filter list, so the obtained frequencies always refer to the whole corpus, not the subcorpus that a query may be based on.

This is particularly obvious when calculating collocations: the list of collocates that is returned is delimited by the filter list, but the frequencies are not.

SOLUTION:
Add the filter list constraints to the queries issued by Corpus.get_frequency(). For Corpus.get_corpus_size(), a lookup table should be used (see Issue #20).
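
A sketch of a filter-aware frequency query, with illustrative table and column names; the filter clauses are assumed to arrive as SQL condition strings:

#!python
def filtered_frequency(cursor, word, filter_clauses):
    """Count the tokens of `word` within the filtered sub-corpus."""
    where = " AND ".join(["Word = %s"] + list(filter_clauses))
    cursor.execute(
        "SELECT COUNT(*) FROM corpus "
        "INNER JOIN word USING (WordId) "
        "WHERE " + where,
        (word,))
    return cursor.fetchone()[0]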


Memory leak: hidden reference to data table

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
There seems to be a memory leak: when the same query is executed several times, the old data table (or a derivative of it) does not seem to get freed. This may lead to disproportionate memory consumption, and probably to severe performance hits after a while.

This can be tested by setting the memory dump option and executing the same query several times. For example, the multi-line query

[train].[n*]
[ship].[n*]
[plane|aeroplane].[n*]

queried on COHA with Year and Query String selected yields after four executions:

{'size': 1043568, 'ref': 125917, 'id': 139751767240784, 'class': "<type 'list'>"}
{'size': 1043568, 'ref': 125917, 'id': 139751694539592, 'class': "<type 'list'>"}
{'size': 1043568, 'ref': 125917, 'id': 139751563908736, 'class': "<type 'list'>"}
{'size': 1043568, 'ref': 125917, 'id': 139751433104128, 'class': "<type 'list'>"}

The referents are the 125917 matching tokens.

SOLUTION:
Not obvious. Given that the class is a list, this is probably the list of query results Query.Results that isn't freed correctly, but this needs more testing.


Contexts can come from texts other than the token's

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, Corpus.get_context() ignores whether the words within the context span are from the same text as the query match. For example, a match that is the last token of one text will be printed with the first words of the next text as its right context.

SOLUTION:
Change the logic of get_context() so that the start and end of the span take the beginning and end of the text into consideration.
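
A minimal sketch of that clamping, with illustrative argument names:

#!python
def context_span(token_id, span, text_start, text_end):
    """Return the (start, end) of the context window, clamped so that
    it never crosses into a neighbouring text."""
    start = max(token_id - span, text_start)
    end = min(token_id + span, text_end)
    return start, end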


Token specification ignored if no lexicon feature requested

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The query string does not contain the Word_id restrictions if no lexicon feature is requested.

For example, querying "car" in COHA with only "Year" selected equates to a query of all year columns from sources.

SOLUTION:
Make sure that word_id is in the selected features if a token specification is given.
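
A sketch of the proposed guard, with illustrative names:

#!python
def ensure_word_id(selected_features, has_token_specification):
    """Force word_id into the selected features whenever the query
    token carries a restriction."""
    if has_token_specification and "word_id" not in selected_features:
        selected_features.append("word_id")
    return selected_features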


List of requested fields is stored as a property of CorpusQuery()

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
In CorpusQuery.__init__(), the property self.request_list is constructed based on which arguments are set in options.cfg. This list is then checked against the features provided by the selected corpus.

SOLUTION:
The list of requested fields is a session-wide property, and should therefore be a property of Session(). This list has to be made available to CorpusQuery(), but also to SQLLexicon.get_entry().


Code reduplication in SQLLexicon.sql_string_get_entry() and SQLLexicon.sql_string_get_matching_wordids()

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The functions SQLLexicon.sql_string_get_entry() and SQLLexicon.sql_string_get_matching_wordids() both construct a query string that may join several tables from the database. The two strings seem to differ only in the WHERE clause.

SOLUTION:
Implement a function SQLLexicon.sql_string_table_join() that both functions can use.
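
A sketch of that factoring, with illustrative table names; the shared part is built once, and each caller appends only its own WHERE clause:

#!python
class SQLLexicon(object):
    def sql_string_table_join(self):
        """The SELECT ... JOIN part that both query builders share."""
        return ("SELECT WordId, Word, Lemma FROM word "
                "LEFT JOIN lemma USING (LemmaId)")

    def sql_string_get_entry(self, word_id):
        return "{} WHERE WordId = {}".format(
            self.sql_string_table_join(), int(word_id))

    def sql_string_get_matching_wordids(self, spec):
        return "{} WHERE Word LIKE '{}'".format(
            self.sql_string_table_join(), spec)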


[New feature] Have proportion as an additional output column

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


It may be helpful to have a proportion as an additional output column in a FREQUENCY query. The proportion for each row would simply be the value of the Frequency column divided by the total number of matches.

Eventually, the FREQUENCY query mode might even be removed, and turned into a special kind of optional output column.
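
A minimal pandas sketch of the conventional computation (the frequency values are made up):

#!python
import pandas as pd

df = pd.DataFrame({"Word": ["the", "of", "and"],
                   "Frequency": [605, 350, 45]})
df["Proportion"] = df["Frequency"] / df["Frequency"].sum()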


Import of corpus module should involve testing for completeness

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, corpus modules are imported in Session.__init__() without any testing. In particular, there is no test whether the corpus Resource class provides all fields required for the requested query. Such testing only takes place in some of the sql_string_xxx functions, so the script may, for instance, abort only after the query has run, during output.

SOLUTION:
First of all, a clear catalogue of what is required for a resource description is needed (see also Issue #11). Once that is established, BaseResource should implement a method validate() that is called in Session.__init__() after the Corpus is initialized. This method should abort with an instructive error message if any required variable is missing.


Writing results from a frequency query is unacceptably slow

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, the underlying MySQL query of a frequency query and a context query is identical. This means that the Python code (more specifically, the write_results() method of FrequencyQuery) has to go through a huge table and do the frequency aggregation in Python. This is extremely slow, in particular for frequent words. It is so bad that it forces swapping even on machines with plenty of memory.

SOLUTION:
Construct the MySQL queries that obtain frequencies in such a way that the aggregation is done by the MySQL server, and not by Python.
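
A sketch of such a query: rather than fetching every matching row and counting in write_results(), the match query is wrapped in a GROUP BY so that the server returns the aggregated table directly (illustrative table and column names):

#!python
frequency_query = (
    "SELECT Word, COUNT(*) AS Frequency "
    "FROM corpus "
    "INNER JOIN word USING (WordId) "
    "WHERE Word LIKE %s "
    "GROUP BY Word")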


[New feature] Add stop-word list

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


Users could specify a stop-word list. Query matches would be checked against the stop-words, and only included in the results if they are not listed there.

Technically, the stop-word list should be incorporated as another MySQL table. Adding and using a stop-word list might be done, then, along these lines:

  • Select a file with orthographic representations
  • Create a table coq_stopwords that contains the WordIds matching these representations
  • When calling SQLResource.get_matching_wordids(), either consider this table in the query that returns the word_ids for a query token, or apply it to the returned list (see the sketch below)

Note that the GUI will have to provide the following:

  • a button to add a stop-word list from a file
  • a button to clear this stop-word list
  • a tabular view of the stop-words
  • ways of appending, removing, and editing this table

This could be incorporated as another tab in addition to the "Query results" tab and the "Query log" tab.
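
A sketch of the second option above, applying the hypothetical coq_stopwords table to the returned list of word ids:

#!python
def remove_stopwords(word_ids, conn):
    """Drop all word ids that appear in the stop-word table."""
    with conn.cursor() as cur:
        cur.execute("SELECT WordId FROM coq_stopwords")
        stopped = {row[0] for row in cur.fetchall()}
    return [wid for wid in word_ids if wid not in stopped]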


Installers should only create .py file if the setup is complete

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, the installers using corpusbuilder.py can write a Python corpus module even if the setup for that corpus is incomplete (e.g. no MySQL database of the given name, no tables, no entries). This should be changed so that the module is only written if the MySQL part has been installed.

SOLUTION:
Change corpusbuilder.py so that it checks for the existence of a database with the given name, of the tables in the table description, and of at least one entry in the corpus table.
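
A sketch of that check; it assumes a MySQL connection and treats the module as writable only if the database, all described tables, and at least one corpus entry exist:

#!python
def setup_is_complete(conn, db_name, table_names, corpus_table):
    """Return True only if the MySQL part has been fully installed."""
    with conn.cursor() as cur:
        cur.execute("SHOW DATABASES LIKE %s", (db_name,))
        if cur.fetchone() is None:
            return False
        for table in table_names:
            cur.execute("SHOW TABLES FROM `{}` LIKE %s".format(db_name),
                        (table,))
            if cur.fetchone() is None:
                return False
        cur.execute("SELECT 1 FROM `{}`.`{}` LIMIT 1".format(
            db_name, corpus_table))
        return cur.fetchone() is not None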


Remove side effects from SQLLexicon.sql_string_query

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The method SQLLexicon.sql_string_query() creates the MySQL string required for the current query. However, it also has the side effect of modifying Query.Session.output_order so that the list contains the resource feature names that are contained in the output. This side effect is undesirable, because it is unexpected that a sql_string_xxx method does anything other than return a string.

In addition, a WordNotInLexicon exception may be raised within the method so that output_order is not modified correctly (see Issue #12).

SOLUTION:
Move the code that modifies output_order outside of this function.
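
A sketch of the refactored interface: the method returns the output order alongside the query string, and the caller updates the session (the helper names are hypothetical):

#!python
class SQLLexicon(object):
    def sql_string_query(self, query):
        query_string = self.build_query_string(query)      # hypothetical
        output_order = self.determine_output_order(query)  # hypothetical
        return query_string, output_order

# caller:
# query_string, output_order = lexicon.sql_string_query(query)
# query.Session.output_order = output_order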


BNC corpus: sentence_id is used as source_id for tokens

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
The BNC corpus currently uses the sentence_id to keep track of the source of a token. This sentence_id is then linked to the actual source table when the text is accessed. This makes the look-up of some information rather complicated (see Issue #11), and it causes inconsistent behaviour between corpora. For example, in the BNC, context is delimited by sentences, but in COCA, by texts (see also Issue #4).

SOLUTION:
The table bnc.element should store the text_id, not the sentence_id. The table bnc.sentence should store token_id as an additional column. This requires changes to tools/create_bnc.py.


Frequency column cannot be hidden

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
In a data results table, the frequency column can't be hidden correctly. The context menu changes, but the data is still shown. This problem is probably caused by an incorrect interpretation of the column title: instead of 'Frequency', the context menu has the wrong heading 'coquery_invisible_corpus_id'.


[New syntax] Allow partial query token negation

Currently, query tokens can be negated by preceding them with the negation character. It may be useful to include partial negation so that only either the word/lemma is negated or the class specification:

  • ~[fish].[v*] negates the whole token, i.e. matches any word that is not a verbal word-form of FISH
  • [~fish].[v*] negates only the lemma, i.e. it would match any verb that is not a word-form of FISH.
  • [fish].[~v*] negates only the class, i.e. it would match any word-form of FISH that is not a verb.
  • [~fish].[~v*] negates both the lemma and the class, and would probably match any word that is not a word-form of FISH and which is not a verb.

Double negation may become difficult, though. One solution could be this:

  • ~[~fish].[v*] would be equivalent to [fish].[~v*]
  • ~[fish].[~v*] would be equivalent to [~fish].[~v*]
  • ~[~fish].[~v*] would be equivalent to [fish].[v*]

Pre-calculated corpus size lists required

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Detecting the size of sub-corpora (e.g. a sub-corpus that contains only sources from one genre) can be very slow for big corpora, due to the COUNT(*) clause. This is a problem if we want to express relative frequencies (words per million).

SOLUTION:
During corpus creation, produce a data table that stores the corpus size for all combinations of source features. This table can be used as a lookup instead of a SQL query.
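
A sketch of the build-time table, with illustrative names; WITH ROLLUP additionally stores the marginal totals for the source feature combinations:

#!python
create_size_table = (
    "CREATE TABLE coq_corpus_size AS "
    "SELECT Genre, Year, COUNT(*) AS Tokens "
    "FROM corpus "
    "INNER JOIN sources USING (SourceId) "
    "GROUP BY Genre, Year WITH ROLLUP")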


Create a unified layout for resource descriptions

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


ISSUE:
Currently, each corpus module can define an arbitrary set of labels that are used in the different sql_string_xxx functions to construct valid MySQL query strings. However, there is no mechanism that can be used to unify access to different tables across corpora.

SOLUTION:
Instead of simply using strings, a corpus module could contain a complete table layout description that also represents the links between the different tables. This layout could be part of a configuration file, so that in order to adjust the module to an existing database, only the configuration file would need to be changed.
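
A hypothetical sketch of such a layout description, recording for each table its id column, its data columns, and its links to other tables:

#!python
TABLE_LAYOUT = {
    "corpus": {"id": "TokenId",
               "columns": ["WordId", "SourceId"],
               "links": {"WordId": "word", "SourceId": "source"}},
    "word":   {"id": "WordId",
               "columns": ["Word", "Lemma", "Pos"],
               "links": {}},
    "source": {"id": "SourceId",
               "columns": ["Title", "Year", "Genre"],
               "links": {}},
}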


Token frequency not calculated correctly in Collocate mode

Originally reported by: gkunter (Bitbucket: gkunter, GitHub: gkunter)


PROBLEM:
The token frequencies in Collocations queries are wrong. Using the query string "language" in ICE-NG, with Text as output column and a left and right context span of 10, the frequency shown for at least one item, 'igala', is wrong: it occurs 6 times in the left and 2 times in the right context, but the Collocate frequency column shows a frequency of 5.

This may be a capitalization error, or perhaps an encoding error.

