thomjur / pycollocation
Python module to do simple collocation analysis of a corpus.
License: GNU General Public License v3.0
We need to create a better CLI, maybe with argparse? I have no idea what is actually best.
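Since argparse came up, here is a minimal sketch of what such a CLI could look like. The argument names (search_term, l_window, etc.) are assumptions mirroring the parameters visible in the traceback further down, not the final interface:

```python
# Hypothetical CLI sketch for pycollocation using argparse.
# All argument names and defaults are assumptions, not an existing API.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="pycollocation",
        description="Simple collocation analysis of a corpus.",
    )
    parser.add_argument("collection", help="path to the corpus folder or file")
    parser.add_argument("search_term", help="node word to analyze")
    parser.add_argument("--l-window", type=int, default=5,
                        help="words to the left of the node (default: 5)")
    parser.add_argument("--r-window", type=int, default=5,
                        help="words to the right of the node (default: 5)")
    parser.add_argument("--statistic", choices=["freq", "mu"], default="freq",
                        help="association measure to report")
    parser.add_argument("--doc-type", default="folder",
                        help='input type, e.g. "folder" or "iterable"')
    return parser

# demo parse with sample arguments
parser = build_parser()
args = parser.parse_args(["corpus/", "dolor", "--statistic", "mu"])
```

argparse gives us --help output and type checking for free, which might already be enough of a "better CLI" for now.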
analysis.py with a more flexible interface. Yes, this sounds more reasonable than checking the whole stop word list for every single word. Regarding the jsonl file: since this is our program, you can also implement a special "jsonl" option for doc_type, if you want. This would be very specific (since it would be for Twitter JSON only, I guess)... but why not. Otherwise, it should theoretically work by passing an iterator (class) with doc_type="iterable" that iterates over the jsonl files. An example can be found here in the gensim documentation (section "Training your own model"): https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
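A minimal sketch of such an iterator class, in the style of the gensim tutorial linked above. It assumes each line of a .jsonl file is a JSON object with a "text" field, which is a guess at the Twitter/twarc export format:

```python
# Hedged sketch of an iterable corpus class for doc_type="iterable".
# Assumes each jsonl line holds an object with a "text" field (an
# assumption about the twarc export, not a confirmed format).
import json
from pathlib import Path

class JsonlCorpus:
    """Yield the text of each document in the .jsonl files of a folder."""

    def __init__(self, folder: str):
        self.folder = Path(folder)

    def __iter__(self):
        for path in sorted(self.folder.glob("*.jsonl")):
            with path.open(encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:          # skip blank lines
                        continue
                    yield json.loads(line)["text"]
```

Because it is a class (not a generator function), it can be iterated over multiple times, which is what gensim-style consumers expect.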
/EDIT: Ah, I am unsure how the counting with/without stop words works. Are stop words also excluded from the total word count? This would be important to know, since it would mean that deleting the corresponding rows from the final results table happens too late.
Originally posted by @thomjur in #15 (comment)
I have worked a bit more on the Twitter adapter and have now gotten it running reasonably well.
We still need to decide how it gets integrated into the rest, though. Should it go under tools, for example? Or into analysis.py?
I am currently working on a more structured output of the results.
Writing the unit tests, I saw that punctuation marks are considered part of the words in our first, very rough separation of words with word_list = document.split(" ").
So implementing a tokenizer is an important next step. You already mentioned the NLTK tokenizer in the script.
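The difference is easy to see on a toy sentence. A proper fix would use nltk.tokenize.word_tokenize as mentioned; the regex below is only a stand-in sketch to illustrate the problem:

```python
# split(" ") keeps punctuation attached to words; a tokenizer separates
# it. The \w+ regex here is a stand-in sketch, not the final tokenizer.
import re

document = "Lorem ipsum, dolor sit amet."

# current rough approach: punctuation sticks to the words
word_list = document.split(" ")
# -> ['Lorem', 'ipsum,', 'dolor', 'sit', 'amet.']

# regex sketch: word characters only, punctuation dropped
tokens = re.findall(r"\w+", document)
# -> ['Lorem', 'ipsum', 'dolor', 'sit', 'amet']
```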
@trutzig89182 I plan to work on a combined result later today or tomorrow (to avoid that we are both working on the same feature again^^).
My idea is the following: I think it is generally common to combine the results from the left and right collocation analysis and to only indicate on which side the term appears most frequently.
I therefore plan to combine the left_counter and right_counter dicts into a single dict. While doing this, I also plan to add some more details to the pandas output. It should look like this:
| idx | word  | coll_frequency | orient | total_freq |
|-----|-------|----------------|--------|------------|
| 1   | dolor | 4              | left   | 8          |
| 2   | etiam | 2              | -      | 10         |
| 3   | lares | 9              | right  | 9          |
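As a rough sketch of the merging step (the function name and column labels are my assumptions, and the sample counts are made-up illustration data, not actual corpus results):

```python
# Hedged sketch: merge left_counter and right_counter into one table
# with an "orient" column indicating the dominant side.
from collections import Counter
import pandas as pd

def combine_counters(left: Counter, right: Counter) -> pd.DataFrame:
    rows = []
    for word in sorted(set(left) | set(right)):
        l, r = left[word], right[word]            # Counter returns 0 if absent
        orient = "left" if l > r else "right" if r > l else "-"
        rows.append({"word": word, "coll_frequency": l + r, "orient": orient})
    return pd.DataFrame(rows)

left = Counter({"dolor": 3, "etiam": 1})
right = Counter({"dolor": 1, "etiam": 1, "lares": 9})
df = combine_counters(left, right)
```

Ties get "-", matching the middle row of the table above; total_freq would come from the full corpus counts and is omitted here.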
I started to conceive some basic unit tests, if you want to have a look: trutzig89182@84fbfe2
I'm not really good at testing, so if you want to go a completely different way, that's fine with me, too.
Maybe it would be good to collect several smaller TODOs here:
word_tokenizer also lists punctuation. I am not sure if we want that. For the moment, I have added a simple list comprehension to filter for \w+ only... but we might need to think of better solutions here (or, if we stick to this, we can also use NLTK's RegexpTokenizer).
Perhaps we could create an HTML interface using Flask. But we would first have to discuss whether that is of value for this project.
Should we include the possibility to search with wildcards? This should be possible by defining the search_word as a regex object and checking whether it matches the word x in the current sentence.
I had the impression that you don't see regex as a core functionality. Perhaps it is problematic for the collocation search?
In my case I look for words associated with „Datenschützer" and „Datenschützer\in(en)", so searching for „Datenschützer*" could make things easier.
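A small sketch of how the wildcard idea could work: translate a trailing "*" into a regex fragment and match it against each word. The function name and the translation rule are my assumptions, not anything existing in the module:

```python
# Hedged sketch of wildcard matching: '*' in the search word is treated
# as "any word characters". Function name and rule are assumptions.
import re

def matches(search_word: str, word: str) -> bool:
    """Match exactly, except that '*' stands for zero or more \\w chars."""
    pattern = re.escape(search_word).replace(r"\*", r"\w*")
    return re.fullmatch(pattern, word) is not None

assert matches("Datenschützer*", "Datenschützer")     # base form
assert matches("Datenschützer*", "Datenschützerin")   # inflected form
assert not matches("Datenschützer*", "Daten")         # no prefix matching
```

Escaping first means other regex metacharacters in the search word stay literal, so users who don't know regex can't accidentally trigger it.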
I have started thinking about the general structure based on the comments you made. Here is a first draft:
So, there are basically two ways one could interact with this program:
start_collocation_analysis() function. What do you think?
Here is the draw.io file that you can also change. We can also use a different program; this is only a first draft.
Hey, I have been fiddling a bit with the twitter/twarc adapter and came across an error when using analysis.py that seems to stem from the definition of axes in display.py. Given the hour, my cognitive capacity is quite limited right now, so I will just drop this here for the moment. I will try to figure out later whether I just did something wrong or whether there is a bug somewhere.
Traceback (most recent call last):
File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/analysis.py", line 77, in <module>
start_collocation_analysis(collection, search_term, int(l_window), int(r_window), statistic, doc_type="folder", output_type = output_type)
File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/analysis.py", line 61, in start_collocation_analysis
display.get_results_collocates(left_counter, right_counter, full_counter, search_term_count, l_window, r_window, statistic, output_type)
File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/tools/display.py", line 22, in get_results_collocates
df_top_collocates.columns = ["collocate", "coll_freq"]
File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/generic.py", line 5491, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/generic.py", line 763, in _set_axis
self._mgr.set_axis(axis, labels)
File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
self._validate_set_axis(axis, new_labels)
File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
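The error can be reproduced in isolation: it occurs when two column names are assigned to a DataFrame that only has one column. This is only my guess at the cause; the fix would be to make sure df_top_collocates really has the two expected columns before renaming them:

```python
# Minimal reproduction of the ValueError from the traceback above:
# assigning two column names to a one-column DataFrame.
import pandas as pd

df = pd.DataFrame({"collocate": ["dolor", "etiam"]})  # one column only
try:
    df.columns = ["collocate", "coll_freq"]           # two names -> mismatch
except ValueError as err:
    print(err)  # "Length mismatch: Expected axis has 1 elements, ..."
```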
I have started adding some basic statistics. So far, I have only implemented the most basic one (MU), which calculates collocation_freq divided by expected_collocation_frequency. The implementations are mainly based on Brezina et al.
So, basically, you now have an option to indicate whether you want to use a statistic (the default is "freq", which just counts collocation frequencies as we did before). For testing, I compared the results with those I got from LancsBox when analyzing our test corpus. Although the numbers are different (they seem to have changed the function parameters), the "clustering" seems to be correct. For the moment, you can see the results when running test.py.
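For reference, a sketch of the MU calculation as I understand it from Brezina et al.: MU is the observed collocation frequency divided by the expected one, with the expected frequency derived from the node and collocate frequencies, the window size, and the corpus size. The parameter names are assumptions, not the module's actual signatures:

```python
# Hedged sketch of the MU statistic following Brezina et al.:
# MU = observed / expected, where
# expected = node_freq * collocate_freq * window_size / corpus_size.
# Parameter names are assumptions about pycollocation's internals.

def mu(collocation_freq: int, node_freq: int, collocate_freq: int,
       window_size: int, corpus_size: int) -> float:
    expected = node_freq * collocate_freq * window_size / corpus_size
    return collocation_freq / expected

# toy numbers: node 10x, collocate 20x, 5-word window, 10,000 tokens
# expected = 10 * 20 * 5 / 10000 = 0.1; observed 4 -> MU ≈ 40
```

Values above 1 mean the collocate occurs near the node more often than chance would predict, which matches the "clustering" interpretation above.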
I notice that our tiny project is becoming more and more elaborate, so we should soon start a round of restructuring and documenting; otherwise, things might become very messy at some point. But first, we should maybe define the interfaces and/or the general workflow for how we would like to operate our program. What do you think?
I have created a first sub-package (tools). This is by no means fixed, but we might start thinking about the overall structure of our program.