
Comments (4)

xeniaqian94 commented on May 27, 2024
| Setting | MAP | P@30 |
|---|---|---|
| Ad-hoc experiments | 0.3631 | 0.4109 |
| Remove stemming | 0.2237 | 0.2633 |
| With stemming, index only English users' tweets | 0.3614 | 0.4054 |
| With stemming, index with EnglishAnalyzer | 0.3611 | 0.3878 |
| With stemming, use EnglishAnalyzer and index only English users' tweets | 0.3518 | 0.3898 |

Some draft conclusions:

  1. The Porter stemming algorithm appears to be very useful for indexing tweets: removing it lowers both MAP and P@30 (row 2 vs. row 1).
  2. Although we expected English-only indexing to increase MAP and P@30, both actually decrease slightly (rows 3, 4, 5). A possible explanation:
    It is hard to tell whether a tweet is really non-English, because the language field belongs to the user profile. For example, a "fr" user may write some tweets in English. After filtering, some relevant English tweets by non-English users are therefore dropped, and some non-relevant tweets are pushed into the top-30 list.
  3. There is a small defect at line 197 of io.anserini.document.twitter.Status.java:
    original: status.lang = obj.get("lang").getAsString();
    modified: status.lang = obj.getAsJsonObject("user").get("lang").getAsString();
    Is the "lang" field here inside the "user" sub-JSON object?

from anserini.

lintool commented on May 27, 2024

@xeniaqian94 Please rerun the experiments above per our discussion and see if the numbers change...


xeniaqian94 commented on May 27, 2024

1. Effect of PorterStemFilter

| Measure | Mean | Standard Error | Test | P-value |
|---|---|---|---|---|
| MAP_ad_hoc | 0.363110 | 0.031869 | | |
| MAP_no_stemming | 0.332631 | 0.032804 | | |
| ΔMAP | 0.030480 | (0.004318, 0.056641) | paired t | 0.0234 |

Since 0.0234 < 0.05, we reject the null hypothesis that using or not using PorterStemFilter makes no difference in MAP; the difference is statistically significant.

| Measure | Mean | Standard Error | Test | P-value |
|---|---|---|---|---|
| P@30_ad_hoc | 0.410888 | 0.039596 | | |
| P@30_no_stemming | 0.399998 | 0.040605 | | |
| ΔP@30 | 0.010890 | (-0.021777, 0.043557) | paired t | 0.5059 |

Since 0.5059 > 0.05, we cannot reject the null hypothesis that using or not using PorterStemFilter makes no difference in P@30; the difference is not statistically significant.
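For readers who want to reproduce this kind of comparison, here is a minimal sketch of the paired t statistic computed over per-topic scores. The class name `PairedTTestSketch` and the sample values are made up for illustration and are not part of Anserini. The sketch stops at the t statistic, since turning it into a p-value needs a t-distribution CDF, which the Java standard library does not provide.

```java
import java.util.Locale;

/** Paired t statistic over per-topic scores (illustrative helper, not Anserini code). */
final class PairedTTestSketch {

    /** Mean of the per-topic differences a[i] - b[i]. */
    static double meanDifference(double[] a, double[] b) {
        checkPaired(a, b);
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] - b[i];
        }
        return sum / a.length;
    }

    /** Standard error of the mean difference: sample std. dev. of the differences / sqrt(n). */
    static double standardError(double[] a, double[] b) {
        double mean = meanDifference(a, b);
        double squaredDeviations = 0.0;
        for (int i = 0; i < a.length; i++) {
            double deviation = (a[i] - b[i]) - mean;
            squaredDeviations += deviation * deviation;
        }
        double sampleVariance = squaredDeviations / (a.length - 1);
        return Math.sqrt(sampleVariance / a.length);
    }

    /** t = mean difference / standard error, with n - 1 degrees of freedom. */
    static double tStatistic(double[] a, double[] b) {
        // If every difference is identical the standard error is 0 and t is not finite.
        return meanDifference(a, b) / standardError(a, b);
    }

    private static void checkPaired(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two equally long arrays with at least 2 topics");
        }
    }

    public static void main(String[] args) {
        // Made-up per-topic AP values, for illustration only.
        double[] withStemming = {0.50, 0.40, 0.60, 0.30};
        double[] withoutStemming = {0.40, 0.35, 0.45, 0.30};
        System.out.printf(Locale.ROOT, "mean diff = %.4f, SE = %.4f, t = %.4f, df = %d%n",
                meanDifference(withStemming, withoutStemming),
                standardError(withStemming, withoutStemming),
                tStatistic(withStemming, withoutStemming),
                withStemming.length - 1);
    }
}
```

The resulting t value would then be compared against a t distribution with n - 1 degrees of freedom, using a statistics library or a table, to obtain a p-value like the ones reported above.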

The experiment branch with the raw per-topic MAP and P@30 results is here: https://github.com/xeniaqian94/Anserini/tree/no_stemming

2. Effect of indexing only English Tweets

I should have finished this part within hours... but I was expecting to listen to and store some 2015 tweet streams locally... in which case the qrels and topics files are ...

In the tweets2011 collection, the JSON structure has only the user's lang field, whereas the 2015 JSON structure has a tweet-level "lang" field. When testing on the tweets2011 collection, we can therefore mostly only index English users' tweets.


Because of this difference, the earlier results cannot be justified, and they even trend in the opposite direction.
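One way to handle both JSON layouts is to prefer the tweet-level "lang" field and fall back to the user's "lang" field when it is missing. The sketch below is only an illustration under assumptions: it models a parsed status as a nested `Map` rather than the Gson `JsonObject` used in Status.java, and the names `TweetLangSketch`, `resolveLang`, and `isEnglish` are hypothetical.

```java
import java.util.Map;
import java.util.Optional;

/** Language lookup with fallback (illustrative; not the Anserini implementation). */
final class TweetLangSketch {

    /** Returns the tweet-level "lang" if present, otherwise user.lang, otherwise empty. */
    static Optional<String> resolveLang(Map<String, ?> status) {
        Object tweetLang = status.get("lang");
        if (tweetLang instanceof String) {
            return Optional.of((String) tweetLang);
        }
        Object user = status.get("user");
        if (user instanceof Map) {
            Object userLang = ((Map<?, ?>) user).get("lang");
            if (userLang instanceof String) {
                return Optional.of((String) userLang);
            }
        }
        return Optional.empty();
    }

    /** True only when the resolved language code is exactly "en". */
    static boolean isEnglish(Map<String, ?> status) {
        return resolveLang(status).map("en"::equals).orElse(false);
    }

    public static void main(String[] args) {
        // Assumed 2011-style status: only the user profile carries a language code.
        Map<String, Object> oldStyle = Map.of("user", Map.of("lang", "fr"));
        // Assumed 2015-style status: the tweet itself carries a language code.
        Map<String, Object> newStyle = Map.of("lang", "en", "user", Map.of("lang", "fr"));
        System.out.println(isEnglish(oldStyle));
        System.out.println(isEnglish(newStyle));
    }
}
```

With the fallback, a status that has only user.lang is still filtered by the profile language, so the caveat about "fr" users writing English tweets continues to apply to the 2011 data.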

| Measure | Mean | Standard Error | Test | P-value |
|---|---|---|---|---|
| MAP_ad_hoc | 0.363110 | 0.031869 | | |
| MAP_english_users | 0.361447 | 0.031470 | | |
| ΔMAP | 0.001663 | (-0.006326, 0.009653) | paired t | 0.6774 |

Since 0.6774 > 0.05, we cannot reject the null hypothesis that indexing tweets from users of all languages and indexing only English users' tweets make no difference in MAP; the difference is not statistically significant.

| Measure | Mean | Standard Error | Test | P-value |
|---|---|---|---|---|
| P@30_ad_hoc | 0.410888 | 0.039596 | | |
| P@30_english_users | 0.405445 | 0.038909 | | |
| ΔP@30 | 0.005443 | (-0.002213, 0.013099) | paired t | 0.1593 |

Since 0.1593 > 0.05, we cannot reject the null hypothesis that indexing tweets from users of all languages and indexing only English users' tweets make no difference in P@30; the difference is not statistically significant.

Intuitively, comparing an English-only index with an all-language index is somewhat similar to comparing a per-topic index with a complete index (https://cs.uwaterloo.ca/~jimmylin/publications/Wang_Lin_ECIR2014.pdf). An English-only index will increase the relative occurrence of terms. But since this experiment uses the query likelihood model and tf-idf, the changes in idf values may still be too small to matter because of the log scale: the scores of retrieved hits would change, but not their ranks. On the other hand, relevant tweets may be diluted or partially lost because of the English filter, which decreases P@30 and MAP.
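The log-scale argument can be illustrated with a toy calculation. The sketch below assumes the textbook form idf = ln(N / df) and a single-term tf * idf score, which is a simplification and not the scoring actually used in these runs. The collection sizes, document frequencies, and the class name `IdfRankSketch` are invented for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Toy illustration of how index filtering shifts idf but not single-term ranks. */
final class IdfRankSketch {

    /** Textbook idf = ln(N / df); real engines typically use smoothed variants. */
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    /** Document indices ordered by descending tf * idf for one query term (ties keep index order). */
    static Integer[] rankByTfIdf(int[] termFreqs, double idf) {
        Integer[] order = new Integer[termFreqs.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Arrays.sort on objects is stable, so equal scores stay in index order.
        Arrays.sort(order, Comparator.comparingDouble((Integer i) -> -termFreqs[i] * idf));
        return order;
    }

    public static void main(String[] args) {
        int[] termFreqs = {1, 3, 2};                    // made-up term frequencies in three tweets
        double fullIdf = idf(1_000_000L, 1_000L);       // hypothetical all-language index
        double englishIdf = idf(600_000L, 900L);        // hypothetical English-only index
        System.out.println("idf (all languages) = " + fullIdf);
        System.out.println("idf (English only)  = " + englishIdf);
        System.out.println(Arrays.toString(rankByTfIdf(termFreqs, fullIdf)));
        System.out.println(Arrays.toString(rankByTfIdf(termFreqs, englishIdf)));
    }
}
```

For a single term and a positive idf, every score is multiplied by the same factor, so the order is unchanged. With multi-term queries the relative weights of the terms can shift, so rank changes are possible but likely small when the idf values move only slightly.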


lintool commented on May 27, 2024

We might revisit tweet tokenization at some later point in time, but closing issue for now.

