thomjur / pycollocation
Python module to do simple collocation analysis of a corpus.
License: GNU General Public License v3.0
We need to create a better CLI, maybe with argparse? I have no idea what is actually best.
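Since argparse came up, here is a minimal sketch of what such a CLI could look like. The argument names (search_term, l_window, etc.) are assumptions mirroring the parameters visible in the traceback further down, not the final interface:

```python
# Hypothetical CLI sketch for pycollocation using argparse.
# All argument names and defaults are assumptions, not an existing API.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="pycollocation",
        description="Simple collocation analysis of a corpus.",
    )
    parser.add_argument("collection", help="path to the corpus folder or file")
    parser.add_argument("search_term", help="node word to analyze")
    parser.add_argument("--l-window", type=int, default=5,
                        help="words to the left of the node (default: 5)")
    parser.add_argument("--r-window", type=int, default=5,
                        help="words to the right of the node (default: 5)")
    parser.add_argument("--statistic", choices=["freq", "mu"], default="freq",
                        help="association measure to report")
    parser.add_argument("--doc-type", default="folder",
                        help='input type, e.g. "folder" or "iterable"')
    return parser

# demo parse with sample arguments
parser = build_parser()
args = parser.parse_args(["corpus/", "dolor", "--statistic", "mu"])
```

argparse gives us --help output and type checking for free, which might already be enough of a "better CLI" for now.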
analysis.py with a more flexible interface. Yes, this sounds more reasonable than checking the whole stop word list for every single word. Regarding the jsonl file: since this is our program, you can also implement a special "jsonl" option for doc_type, if you want. This would be very specific (since it would be for Twitter JSON only, I guess)... but why not. Otherwise, it should theoretically work by passing an iterator (class) with doc_type="iterable" that iterates over the jsonl files. An example can be found here in the gensim documentation (section "Training your own model"): https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
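A minimal sketch of such an iterator class, in the style of the gensim tutorial linked above. It assumes each line of a .jsonl file is a JSON object with a "text" field, which is a guess at the Twitter/twarc export format:

```python
# Hedged sketch of an iterable corpus class for doc_type="iterable".
# Assumes each jsonl line holds an object with a "text" field (an
# assumption about the twarc export, not a confirmed format).
import json
from pathlib import Path

class JsonlCorpus:
    """Yield the text of each document in the .jsonl files of a folder."""

    def __init__(self, folder: str):
        self.folder = Path(folder)

    def __iter__(self):
        for path in sorted(self.folder.glob("*.jsonl")):
            with path.open(encoding="utf-8") as fh:
                for line in fh:
                    line = line.strip()
                    if not line:          # skip blank lines
                        continue
                    yield json.loads(line)["text"]
```

Because it is a class (not a generator function), it can be iterated over multiple times, which is what gensim-style consumers expect.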
/EDIT: Ah, I am unsure how the counting with/without stop words works. Are stop words also excluded from the total word count? This would be important to know, since it would mean that deleting the corresponding rows from the final results table happens too late.
Originally posted by @thomjur in #15 (comment)
I have worked a bit more on the Twitter adapter and have now gotten it running reasonably well.
We still need to decide how it gets integrated into the rest, though. Should it go under tools, for example? Or into analysis.py?
I am currently working on a more structured output of the results.
Writing the unit tests, I saw that punctuation marks are considered part of the words in our first, very rough separation of words with word_list = document.split(" ").
So implementing a tokenizer is an important next step. You already mentioned the NLTK tokenizer in the script.
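The difference is easy to see on a toy sentence. A proper fix would use nltk.tokenize.word_tokenize as mentioned; the regex below is only a stand-in sketch to illustrate the problem:

```python
# split(" ") keeps punctuation attached to words; a tokenizer separates
# it. The \w+ regex here is a stand-in sketch, not the final tokenizer.
import re

document = "Lorem ipsum, dolor sit amet."

# current rough approach: punctuation sticks to the words
word_list = document.split(" ")
# -> ['Lorem', 'ipsum,', 'dolor', 'sit', 'amet.']

# regex sketch: word characters only, punctuation dropped
tokens = re.findall(r"\w+", document)
# -> ['Lorem', 'ipsum', 'dolor', 'sit', 'amet']
```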
@trutzig89182 I plan to work on a combined result later today or tomorrow (to avoid that we are both working on the same feature again^^).
My idea is the following: I think it is generally common to combine the results from the left and right collocation analysis and to only indicate on which side the term appears most frequently.
I therefore plan to combine the left_counter and right_counter dicts into a single dict. While doing this, I also plan to add some more details to the pandas output. It should look like this:
| idx | word  | coll_frequency | orient | total_freq |
|-----|-------|----------------|--------|------------|
| 1   | dolor | 4              | left   | 8          |
| 2   | etiam | 2              | -      | 10         |
| 3   | lares | 9              | right  | 9          |
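As a rough sketch of the merging step (the function name and column labels are my assumptions, and the sample counts are made-up illustration data, not actual corpus results):

```python
# Hedged sketch: merge left_counter and right_counter into one table
# with an "orient" column indicating the dominant side.
from collections import Counter
import pandas as pd

def combine_counters(left: Counter, right: Counter) -> pd.DataFrame:
    rows = []
    for word in sorted(set(left) | set(right)):
        l, r = left[word], right[word]            # Counter returns 0 if absent
        orient = "left" if l > r else "right" if r > l else "-"
        rows.append({"word": word, "coll_frequency": l + r, "orient": orient})
    return pd.DataFrame(rows)

left = Counter({"dolor": 3, "etiam": 1})
right = Counter({"dolor": 1, "etiam": 1, "lares": 9})
df = combine_counters(left, right)
```

Ties get "-", matching the middle row of the table above; total_freq would come from the full corpus counts and is omitted here.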
I started to conceive some basic unit tests, if you want to have a look: trutzig89182@84fbfe2
I'm not really good at testing, so if you want to go a completely different way, that's fine with me, too.
Maybe it would be good to collect several smaller TODOs here:
word_tokenizer also lists punctuation. I am not sure if we want that. For the moment, I have added a simple list comprehension to filter for \w+ only... but we might need to think of better solutions here (or, if we stick to this, we can also use NLTK's RegexpTokenizer).
Perhaps we could create an HTML interface using Flask. But we would first have to discuss whether that is of value for this project.
Should we include the possibility to search with wildcards? This should be possible by defining the search_word as a regex object and checking whether it matches the word x in the current sentence.
I had the impression that you don't see regex as a core functionality. Perhaps it is problematic for the collocation search?
In my case I look for words associated with „Datenschützer" and „Datenschützer\in(en)", so searching for „Datenschützer*" could make things easier.
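A small sketch of how the wildcard idea could work: translate a trailing "*" into a regex fragment and match it against each word. The function name and the translation rule are my assumptions, not anything existing in the module:

```python
# Hedged sketch of wildcard matching: '*' in the search word is treated
# as "any word characters". Function name and rule are assumptions.
import re

def matches(search_word: str, word: str) -> bool:
    """Match exactly, except that '*' stands for zero or more \\w chars."""
    pattern = re.escape(search_word).replace(r"\*", r"\w*")
    return re.fullmatch(pattern, word) is not None

assert matches("Datenschützer*", "Datenschützer")     # base form
assert matches("Datenschützer*", "Datenschützerin")   # inflected form
assert not matches("Datenschützer*", "Daten")         # no prefix matching
```

Escaping first means other regex metacharacters in the search word stay literal, so users who don't know regex can't accidentally trigger it.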
I have started thinking about the general structure based on the comments you made. Here is a first draft:
So, there are basically two ways one could interact with this program:
start_collocation_analysis() function. What do you think?
Here is the draw.io file that you can also change. We can also use a different program; this is only a first draft.
Hey, I have been fiddling a bit with the twitter/twarc adapter and came across an error when using analysis.py that seems to stem from the definition of axes in display.py. Given the hour, my cognitive capacity is quite limited right now, so I will just drop this here for the moment. I will try to figure out later whether I just did something wrong or whether there is a bug somewhere.
Traceback (most recent call last):
File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/analysis.py", line 77, in <module>
start_collocation_analysis(collection, search_term, int(l_window), int(r_window), statistic, doc_type="folder", output_type = output_type)
File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/analysis.py", line 61, in start_collocation_analysis
display.get_results_collocates(left_counter, right_counter, full_counter, search_term_count, l_window, r_window, statistic, output_type)
File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/tools/display.py", line 22, in get_results_collocates
df_top_collocates.columns = ["collocate", "coll_freq"]
File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/generic.py", line 5491, in __setattr__
return object.__setattr__(self, name, value)
File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/generic.py", line 763, in _set_axis
self._mgr.set_axis(axis, labels)
File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
self._validate_set_axis(axis, new_labels)
File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
raise ValueError(
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
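The error can be reproduced in isolation: it occurs when two column names are assigned to a DataFrame that only has one column. This is only my guess at the cause; the fix would be to make sure df_top_collocates really has the two expected columns before renaming them:

```python
# Minimal reproduction of the ValueError from the traceback above:
# assigning two column names to a one-column DataFrame.
import pandas as pd

df = pd.DataFrame({"collocate": ["dolor", "etiam"]})  # one column only
try:
    df.columns = ["collocate", "coll_freq"]           # two names -> mismatch
except ValueError as err:
    print(err)  # "Length mismatch: Expected axis has 1 elements, ..."
```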
I have started adding some basic statistics. So far, I have only implemented the most basic one (MU), which calculates collocation_freq divided by expected_collocation_frequency. The implementations are mainly based on Brezina et al.
So, basically, you now have an option to indicate whether you want to use a statistic (the default is "freq", which just counts collocation frequencies as we did before). For testing, I compared the results with those I got from LancsBox when analyzing our test corpus. Although the numbers are different (they seem to have changed the function parameters), the "clustering" seems to be correct. For the moment, you can see the results when running test.py.
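For reference, a sketch of the MU calculation as I understand it from Brezina et al.: MU is the observed collocation frequency divided by the expected one, with the expected frequency derived from the node and collocate frequencies, the window size, and the corpus size. The parameter names are assumptions, not the module's actual signatures:

```python
# Hedged sketch of the MU statistic following Brezina et al.:
# MU = observed / expected, where
# expected = node_freq * collocate_freq * window_size / corpus_size.
# Parameter names are assumptions about pycollocation's internals.

def mu(collocation_freq: int, node_freq: int, collocate_freq: int,
       window_size: int, corpus_size: int) -> float:
    expected = node_freq * collocate_freq * window_size / corpus_size
    return collocation_freq / expected

# toy numbers: node 10x, collocate 20x, 5-word window, 10,000 tokens
# expected = 10 * 20 * 5 / 10000 = 0.1; observed 4 -> MU ≈ 40
```

Values above 1 mean the collocate occurs near the node more often than chance would predict, which matches the "clustering" interpretation above.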
I notice that our tiny project is becoming more and more elaborate, so we should soon start a round of restructuring and documenting; otherwise, things might become very messy at some point. But first, we should maybe define the interfaces and/or the general workflow for how we would like to operate our program. What do you think?
I have created a first sub-package (tools). This is by no means fixed, but we might start thinking about the overall structure of our program.