Try out different analyzers on Tweets collection (anserini issue, CLOSED)

lintool commented on May 27, 2024

Try out different analyzers on Tweets collection

xeniaqian94 commented on May 27, 2024

| Configuration | MAP | P@30 |
| --- | --- | --- |
| Ad-hoc experiments | 0.3631 | 0.4109 |
| Remove stemming | 0.2237 | 0.2633 |
| With stemming, index only English users' tweets | 0.3614 | 0.4054 |
| With stemming, index with EnglishAnalyzer | 0.3611 | 0.3878 |
| With stemming, use EnglishAnalyzer and index only English users' tweets | 0.3518 | 0.3898 |

Some draft conclusions:

Removing Porter stemming lowers both MAP and P@30 (row 2 vs. row 1), so the Porter stemming algorithm looks clearly useful for the tweet index.
We expected English-only indexing to increase MAP and P@30, but both actually decreased slightly (rows 3, 4, 5). There is a potential explanation:
It may be difficult to tell whether a tweet is really non-English, since the language field is associated with the user profile. For example, an "fr" user may actually write some tweets in "en". After filtering, some relevant English tweets by non-English users are therefore removed, and some non-relevant tweets are pushed into the top-30 list.
There is a small defect at line 197 of io.anserini.document.twitter.Status.java:
original: status.lang = obj.get("lang").getAsString();
modified: status.lang = obj.getAsJsonObject("user").get("lang").getAsString();
Is the "lang" field here inside the "user" sub-object of the JSON?
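
One way to make the language lookup tolerant of both layouts (a tweet-level "lang" field, or only a "lang" field inside "user") is to try the tweet-level field first and fall back to the user profile. The sketch below is only an illustration of that fallback: it uses nested `Map`s as a stand-in for the parsed JSON object so that it runs without any library, and the class name `LangLookup` and method `resolveLang` are hypothetical, not part of Anserini.

```java
import java.util.Map;

public class LangLookup {

    /**
     * Returns the tweet-level "lang" if present, otherwise the "lang" of the
     * nested "user" object, otherwise null.
     * Assumption: nested Maps stand in for the parsed JSON status object.
     */
    public static String resolveLang(Map<String, Object> status) {
        Object tweetLang = status.get("lang");
        if (tweetLang instanceof String) {
            return (String) tweetLang;
        }
        Object user = status.get("user");
        if (user instanceof Map) {
            Object userLang = ((Map<?, ?>) user).get("lang");
            if (userLang instanceof String) {
                return (String) userLang;
            }
        }
        return null; // no language information available
    }

    public static void main(String[] args) {
        // Hypothetical status with only a user-level language.
        Map<String, Object> userOnly =
            Map.of("text", "bonjour", "user", Map.of("lang", "fr"));
        // Hypothetical status with a tweet-level language as well.
        Map<String, Object> tweetLevel =
            Map.of("text", "hello", "lang", "en", "user", Map.of("lang", "fr"));

        System.out.println(resolveLang(userOnly));
        System.out.println(resolveLang(tweetLevel));
    }
}
```

With this ordering, a status that carries its own "lang" is classified by the tweet itself, and the profile language is only used when nothing better is available.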

lintool commented on May 27, 2024

@xeniaqian94 Please rerun the experiments above per our discussion and see if the numbers change...

xeniaqian94 commented on May 27, 2024

1. Effect of PorterStemFilter

| Measure | Mean | Standard Error | Test | P-value |
| --- | --- | --- | --- | --- |
| MAP_ad_hoc | 0.363110 | 0.031869 | | |
| MAP_no_stemming | 0.332631 | 0.032804 | | |
| ΔMAP | 0.030480 (0.004318, 0.056641) | | paired t | 0.0234 |

Since 0.0234 < 0.05, we reject the null hypothesis that using PorterStemFilter makes no difference to MAP: the difference is statistically significant at the 0.05 level.

| Measure | Mean | Standard Error | Test | P-value |
| --- | --- | --- | --- | --- |
| P@30_ad_hoc | 0.410888 | 0.039596 | | |
| P@30_no_stemming | 0.399998 | 0.040605 | | |
| ΔP@30 | 0.010890 (-0.021777, 0.043557) | | paired t | 0.5059 |

Since 0.5059 > 0.05, we cannot reject the null hypothesis that using PorterStemFilter makes no difference to P@30: the difference is not statistically significant at the 0.05 level.

The experiment branch, with raw per-topic MAP and P@30 results: https://github.com/xeniaqian94/Anserini/tree/no_stemming
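
For readers who want to recompute a paired t-test from per-topic scores, here is a minimal, dependency-free sketch. The thread does not say which tool produced the tables above, so this is not the original procedure. The class name `PairedTTest` and the sample arrays in `main` are made up for illustration. The p-value is obtained by numerically integrating the Student's t density (after the substitution x = sqrt(df) * tan(theta)) rather than by calling a statistics library.

```java
public class PairedTTest {

    /** Paired t statistic: mean(d) / (sd(d) / sqrt(n)), with d[i] = a[i] - b[i]. */
    public static double tStatistic(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two equal-length samples with n >= 2");
        }
        int n = a.length;
        double mean = 0.0;
        for (int i = 0; i < n; i++) {
            mean += a[i] - b[i];
        }
        mean /= n;
        double sumSq = 0.0;
        for (int i = 0; i < n; i++) {
            double dev = (a[i] - b[i]) - mean;
            sumSq += dev * dev;
        }
        // Standard error of the mean difference, using the sample variance (n - 1).
        double stdErr = Math.sqrt(sumSq / (n - 1) / n);
        return mean / stdErr;
    }

    /**
     * Two-sided p-value for Student's t with df degrees of freedom.
     * With x = sqrt(df) * tan(theta), P(|T| <= t) equals the integral of
     * cos(theta)^(df - 1) over [0, atan(|t| / sqrt(df))] divided by the same
     * integral over [0, pi/2].
     */
    public static double twoSidedP(double t, int df) {
        double theta = Math.atan(Math.abs(t) / Math.sqrt(df));
        return 1.0 - cosPowerIntegral(theta, df - 1) / cosPowerIntegral(Math.PI / 2, df - 1);
    }

    /** Simpson's rule for the integral of cos(theta)^power over [0, upper]. */
    private static double cosPowerIntegral(double upper, int power) {
        int steps = 2000; // must be even for Simpson's rule
        double h = upper / steps;
        double sum = Math.pow(Math.cos(0.0), power) + Math.pow(Math.cos(upper), power);
        for (int i = 1; i < steps; i++) {
            sum += (i % 2 == 1 ? 4 : 2) * Math.pow(Math.cos(i * h), power);
        }
        return sum * h / 3.0;
    }

    public static void main(String[] args) {
        // Made-up per-topic scores, only to show the call pattern.
        double[] withStemming = {0.41, 0.25, 0.38, 0.52, 0.30};
        double[] withoutStemming = {0.35, 0.27, 0.31, 0.50, 0.24};
        double t = tStatistic(withStemming, withoutStemming);
        System.out.printf("t = %.4f, p = %.4f%n", t, twoSidedP(t, withStemming.length - 1));
    }
}
```

The two arrays must be aligned by topic, since the test works on the per-topic differences rather than on the two means separately.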

2. Effect of indexing only English Tweets

I should have finished this part within hours... but I was expecting to listen to and store some 2015 tweet streams locally... in which case the qrels and topics files are ...

In the Tweets2011 collection, the JSON structure has only the user's lang field, while the 2015 JSON structure has a tweet-level "lang" field. So when testing on the Tweets2011 collection, we can essentially only index English users' tweets.

This difference makes the previous results hard to justify, including the reversed trend they showed.

| Measure | Mean | Standard Error | Test | P-value |
| --- | --- | --- | --- | --- |
| MAP_ad_hoc | 0.363110 | 0.031869 | | |
| MAP_english_users | 0.361447 | 0.031470 | | |
| ΔMAP | 0.001663 (-0.006326, 0.009653) | | paired t | 0.6774 |

Since 0.6774 > 0.05, we cannot reject the null hypothesis that indexing all users' tweets and indexing only English users' tweets give the same MAP: the difference is not statistically significant at the 0.05 level.

| Measure | Mean | Standard Error | Test | P-value |
| --- | --- | --- | --- | --- |
| P@30_ad_hoc | 0.410888 | 0.039596 | | |
| P@30_english_users | 0.405445 | 0.038909 | | |
| ΔP@30 | 0.005443 (-0.002213, 0.013099) | | paired t | 0.1593 |

Since 0.1593 > 0.05, we cannot reject the null hypothesis that indexing all users' tweets and indexing only English users' tweets give the same P@30: the difference is not statistically significant at the 0.05 level.

Intuitively, an English-only index versus an all-language index is somewhat similar to a per-topic index versus a complete index (https://cs.uwaterloo.ca/~jimmylin/publications/Wang_Lin_ECIR2014.pdf). An English-only index will increase the relative occurrence of terms. But since this experiment uses the query likelihood model and tf-idf, the changes in idf values may still be too small to matter because of the log scale: the scores of the retrieved hits would change, but not their ranks. On the other hand, relevant tweets may be diluted or partially lost because of the English filter, which decreases P@30 and MAP.
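
To get a feel for the log-scale argument, the sketch below compares a textbook idf, ln(N / df), before and after shrinking the collection. Everything in it is an assumption for illustration: the collection sizes and document frequencies are invented, the formula is the plain textbook one rather than whatever scoring Anserini actually applies, and each term's document frequency is assumed to be unaffected by the language filter.

```java
public class IdfShift {

    /** Textbook inverse document frequency: ln(numDocs / docFreq). */
    public static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        long allDocs = 16_000_000L;      // hypothetical size of the all-language index
        long englishDocs = 12_000_000L;  // hypothetical size of the English-only index
        long[] docFreqs = {1_000L, 100_000L}; // hypothetical df values, assumed unchanged by the filter

        for (long df : docFreqs) {
            double before = idf(allDocs, df);
            double after = idf(englishDocs, df);
            System.out.printf("df=%d  idf_all=%.4f  idf_english=%.4f  shift=%.4f%n",
                df, before, after, after - before);
        }
    }
}
```

Under these assumptions the shift is the same constant, ln(12,000,000 / 16,000,000) = ln(0.75), for every term, regardless of its document frequency. That constant is small next to the idf values themselves, which is one way to read the claim that scores move while ranks mostly do not.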

lintool commented on May 27, 2024

We might revisit tweet tokenization at some later point in time, but closing issue for now.
