
pycollocation's People

Contributors

thomasjur, thomjur, trutzig89182


Forkers

trutzig89182

pycollocation's Issues

CLI

We need to create a better CLI, maybe with argparse? I am actually not sure what works best.

  • create a CLI in analysis.py with a more flexible interface (see the sketch below)
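
A minimal sketch of what an argparse-based entry point in analysis.py could look like. The positional arguments mirror the existing start_collocation_analysis() parameters; the flag names and defaults are assumptions:

    import argparse

    def main():
        # hypothetical CLI for analysis.py; flag names and defaults are assumptions
        parser = argparse.ArgumentParser(description="PyCollocation analysis")
        parser.add_argument("collection", help="path to the corpus")
        parser.add_argument("search_term", help="node word to analyze")
        parser.add_argument("--l-window", type=int, default=5)
        parser.add_argument("--r-window", type=int, default=5)
        parser.add_argument("--statistic", default="freq")
        parser.add_argument("--doc-type", default="folder")
        parser.add_argument("--output-type", default="df")
        args = parser.parse_args()
        start_collocation_analysis(args.collection, args.search_term,
                                   args.l_window, args.r_window, args.statistic,
                                   doc_type=args.doc_type, output_type=args.output_type)

    if __name__ == "__main__":
        main()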

implementing stop words

Yes, this sounds more reasonable than checking the whole stop word list for every single word. Regarding the jsonl file: since this is our program, you can also implement a special "jsonl" option for doc_type, if you want. This would be very specific (since it would be for Twitter JSON only, I guess)... but why not. Otherwise, it should theoretically work by passing an iterator (class) with doc_type="iterable" that iterates over the jsonl files. An example can be found here in the gensim documentation (section "Training your own model"): https://radimrehurek.com/gensim/auto_examples/tutorials/run_word2vec.html#sphx-glr-auto-examples-tutorials-run-word2vec-py
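
For the iterable route, something like the class below could be passed with doc_type="iterable". It is only a sketch: the assumption is one tweet object per line, with the tweet text in a top-level "text" field as in Twitter's JSON payloads.

    import json

    class JsonlTweetCorpus:
        """Yields the text of each tweet in a .jsonl file (one JSON object per line)."""

        def __init__(self, path):
            self.path = path

        def __iter__(self):
            with open(self.path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    # "text" is assumed to be the tweet-text field
                    yield json.loads(line).get("text", "")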

/EDIT: Ah, I am unsure how the counting with/without stop words works. Are stop words also excluded from the total word count? This would be important to know, since in that case deleting the corresponding rows from the final results table would come too late.

Originally posted by @thomjur in #15 (comment)

Creating twitter adapter

I have worked a bit more on the twitter adapter and have now gotten it running reasonably well.
It still needs to be decided how it gets integrated into the rest, though. Should it, for example, rather go under tools? Or into analysis.py?

Add tokenizer

Writing the unit tests, I saw that punctuation marks are treated as part of the words in our first, very rough word separation with word_list = document.split(" ").

So implementing a proper tokenizer is an important next step. You already mentioned the NLTK tokenizer in the script.
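
For comparison, a quick sketch of the difference (NLTK's tokenizer needs the "punkt" model downloaded once):

    from nltk import word_tokenize  # run nltk.download("punkt") once beforehand

    document = "Lorem ipsum, dolor sit amet."
    print(document.split(" "))
    # ['Lorem', 'ipsum,', 'dolor', 'sit', 'amet.']   <- punctuation glued to the words
    print(word_tokenize(document))
    # ['Lorem', 'ipsum', ',', 'dolor', 'sit', 'amet', '.']   <- punctuation split off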

Feature: Merged Results

@trutzig89182 I plan to work on a combined result later today or tomorrow (to avoid us both working on the same feature again^^).

My idea is the following: I think it is generally common to combine the results from the left and right collocation analysis and to only indicate on which side the term appears most frequently.

I therefore plan to combine left_counter and right_counter dicts to generate a single dict. While doing this, I also plan to add some more details to the pandas output. It should look like this:

    idx  word   coll_frequency  orient  total_freq
    1    dolor  4               left    8
    2    etiam  2               -       10
    3    lares  9               right   9
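
A sketch of how the merge could work, assuming left_counter and right_counter are collections.Counter objects and full_counter holds the corpus-wide frequencies (the names follow the variables used in analysis.py; the tie handling with "-" is an assumption):

    import pandas as pd
    from collections import Counter

    def merge_results(left_counter: Counter, right_counter: Counter,
                      full_counter: Counter) -> pd.DataFrame:
        merged = left_counter + right_counter  # combined collocation frequency
        rows = []
        for word, coll_freq in merged.most_common():
            left, right = left_counter[word], right_counter[word]
            # "-" when a word occurs equally often on both sides
            orient = "left" if left > right else "right" if right > left else "-"
            rows.append((word, coll_freq, orient, full_counter[word]))
        return pd.DataFrame(rows, columns=["word", "coll_frequency", "orient", "total_freq"])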

Unit tests

I have started drafting some basic unit tests, if you want to have a look: trutzig89182@84fbfe2

I'm not really good at testing, so if you want to go a completely different way, that's fine with me, too.
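
For reference, a minimal example of what such a test could look like (the tokenize helper and its module path are hypothetical):

    import unittest

    from tools.tokenizer import tokenize  # hypothetical module path

    class TestTokenizer(unittest.TestCase):
        def test_punctuation_is_separated(self):
            tokens = tokenize("Lorem ipsum, dolor sit amet.")
            self.assertIn("ipsum", tokens)
            self.assertNotIn("ipsum,", tokens)

    if __name__ == "__main__":
        unittest.main()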

TODO (General List)

Maybe it would be good to collect several smaller TODOs here:

  • the nltk word_tokenize also lists punctuation. I am not sure if we want that. For the moment, I have added a simple list comprehension to filter \w+ only... but we might need to think of better solutions here (or, if we stick to this, we can also use NLTK's RegexpTokenizer; see the sketch after this list).
  • Add functions to directly work with twitter data from jsonl files (low priority)
  • implementing stop words #17
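
The two filtering options from the first item, side by side:

    import re
    from nltk import word_tokenize
    from nltk.tokenize import RegexpTokenizer

    text = "Lorem ipsum, dolor!"

    # current approach: tokenize first, then keep word-like tokens only
    tokens = [t for t in word_tokenize(text) if re.fullmatch(r"\w+", t)]

    # alternative: let the tokenizer match \w+ directly
    tokens_alt = RegexpTokenizer(r"\w+").tokenize(text)

    # both yield ['Lorem', 'ipsum', 'dolor']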

Create a html GUI

Perhaps we could create an HTML interface using Flask. But first we would have to discuss whether that is something of value for this project.
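
If we decide it is worth it, a minimal Flask sketch could look like the following; the form field and the run_analysis() helper are hypothetical placeholders for the actual analysis call:

    from flask import Flask, request

    app = Flask(__name__)

    FORM = '<form method="post"><input name="search_term"><button>Run</button></form>'

    @app.route("/", methods=["GET", "POST"])
    def index():
        if request.method == "POST":
            search_term = request.form["search_term"]
            df = run_analysis(search_term)  # hypothetical wrapper around the analysis code
            return df.to_html()
        return FORM

    if __name__ == "__main__":
        app.run(debug=True)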

Searching with wildcards?

Should we include the possibility to search with wildcards? This should be possible by defining the search_word as a regex object and checking whether it matches the word x in the current sentence.

I had the impression that you don't see regex as a core functionality. Or is it perhaps problematic for the search of collocations?

In my case I am looking for words associated with „Datenschützer“ and „Datenschützer\in(en)“, so searching for „Datenschützer*“ could make things easier.
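
A sketch of the matching; translating a trailing * into \w* is an assumption about how the wildcard syntax should behave:

    import re

    def compile_search_word(search_word: str) -> re.Pattern:
        # a trailing "*" becomes \w*; everything else is matched literally
        if search_word.endswith("*"):
            return re.compile(re.escape(search_word[:-1]) + r"\w*")
        return re.compile(re.escape(search_word))

    pattern = compile_search_word("Datenschützer*")
    sentence = ["Die", "Datenschützerinnen", "warnen"]
    print([x for x in sentence if pattern.fullmatch(x)])
    # ['Datenschützerinnen']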

Program Structure

@trutzig89182

I have started thinking about the general structure based on the comments you made. Here is a first draft:

[Diagram: PyCollocation structure (draw.io)]

So, there are basically two ways one could interact with this program:

  1. Via CLI with the help of arguments.
  2. By importing the module into Python and using the start_collocation_analysis() function (usage sketch below).
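
For option 2, a usage sketch; the positional arguments follow the call that appears in analysis.py (see the traceback in the issue below), while the concrete values are placeholders:

    from analysis import start_collocation_analysis

    # corpus path, node word, left/right window sizes, statistic
    start_collocation_analysis("corpus/", "dolor", 5, 5, "freq",
                               doc_type="folder", output_type="df")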

What do you think?

Here is the draw.io file, which you can also edit. We could also use a different program; this is only a first draft.

https://1drv.ms/u/s!AjzmoTNnf_mknqAUBlIqx-OB5jdLAg?e=5WtRyu

Error when Counters are empty

Hey, I have been fiddling a bit with the twitter/twarc adapter and came across an error when using analysis.py which seems to stem from the definition of axes in display.py. Given the hour, my cognitive capacity is quite limited right now, so I will just drop this here for the moment. I will try to figure out later whether I just did something wrong or whether there is a bug somewhere.

Traceback (most recent call last):
  File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/analysis.py", line 77, in <module>
    start_collocation_analysis(collection, search_term, int(l_window), int(r_window), statistic, doc_type="folder", output_type = output_type)
  File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/analysis.py", line 61, in start_collocation_analysis
    display.get_results_collocates(left_counter, right_counter, full_counter, search_term_count, l_window, r_window, statistic, output_type)
  File "/Users/maxmustermann/Documents/GitHub/PyCollocation_test/tools/display.py", line 22, in get_results_collocates
    df_top_collocates.columns = ["collocate", "coll_freq"]
  File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/generic.py", line 5491, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/_libs/properties.pyx", line 70, in pandas._libs.properties.AxisProperty.__set__
  File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/generic.py", line 763, in _set_axis
    self._mgr.set_axis(axis, labels)
  File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/internals/managers.py", line 216, in set_axis
    self._validate_set_axis(axis, new_labels)
  File "/opt/anaconda3/envs/twarc-venv/lib/python3.9/site-packages/pandas/core/internals/base.py", line 57, in _validate_set_axis
    raise ValueError(
ValueError: Length mismatch: Expected axis has 1 elements, new values have 2 elements
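
A plausible cause: when a Counter is empty, the DataFrame built from it does not have the two columns that display.py then tries to assign, hence the length mismatch. A defensive sketch (counter_to_df is a hypothetical helper):

    import pandas as pd
    from collections import Counter

    def counter_to_df(counter: Counter) -> pd.DataFrame:
        if not counter:
            # build the empty two-column frame explicitly instead of
            # letting pd.DataFrame infer a different shape
            return pd.DataFrame(columns=["collocate", "coll_freq"])
        return pd.DataFrame(counter.most_common(),
                            columns=["collocate", "coll_freq"])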

Statistics

@trutzig89182

I have started adding some basic statistics. So far, I have only implemented the most basic one (MU), which divides collocation_freq by expected_collocation_frequency. The implementations are mainly based on Brezina et al.

So, basically you now have an option to indicate whether you want to use a statistic (the default is "freq", which just counts collocation frequencies as we did before). For testing, I compared the output with the results I got from LancsBox on our test corpus. Although the numbers differ (they seem to have changed the function parameters), the "clustering" seems to be correct. For the moment, you can see the results when running test.py.

I notice that our tiny project is becoming more and more elaborate, so we should soon start a round of restructuring and documenting, otherwise things might become very messy at some point. But first, we should maybe define the interfaces and/or the general workflow for how we would like to operate the program. What do you think?

Statistics to implement

  • MU
  • Z-Score
  • Log-Lik
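
A sketch of the three measures, assuming the expected frequency is computed Brezina-style (window span × node frequency × collocate frequency / corpus size); the log-likelihood shown is the simplified one-cell form, so it may not match LancsBox exactly:

    import math

    def expected_freq(node_freq, collocate_freq, window_size, total_tokens):
        # E = (node_freq * window_size) * collocate_freq / N
        return node_freq * window_size * collocate_freq / total_tokens

    def mu(observed, expected):
        return observed / expected

    def z_score(observed, expected):
        return (observed - expected) / math.sqrt(expected)

    def log_likelihood(observed, expected):
        # simplified one-cell form: 2 * O * ln(O / E); the full version
        # sums over the complete contingency table
        return 2 * observed * math.log(observed / expected)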

package structure

I have created a first sub package (tools). This is by no means fixed, but we might start thinking about the overall structure of our program.
