Comments (4)
Setting | MAP | P@30 |
---|---|---|
Ad-hoc experiments | 0.3631 | 0.4109 |
Remove stemming | 0.2237 | 0.2633 |
With stemming, index only English users’ tweets | 0.3614 | 0.4054 |
With stemming, index with EnglishAnalyzer | 0.3611 | 0.3878 |
With stemming, use EnglishAnalyzer and index only English users’ tweets | 0.3518 | 0.3898 |
Some draft conclusions:
- The Porter stemming algorithm is significantly useful for indexing tweets (without it, MAP drops from 0.3631 to 0.2237 and P@30 from 0.4109 to 0.2633).
- Although we expected MAP and P@30 to increase with English-only indexing, there is a plausible explanation for the slight decrease actually observed (rows 3, 4, 5):
It is hard to tell whether a tweet is really non-English, because the language field belongs to the user profile. For example, a “fr” user may write some tweets in English. After filtering, some relevant English tweets by non-English users are dropped, and some non-relevant tweets are pushed into the top 30.
- There is a small defect at line 197 of `io.anserini.document.twitter.Status.java`:
original: `status.lang = obj.get("lang").getAsString();`
modified: `status.lang = obj.getAsJsonObject("user").get("lang").getAsString();`
Is the “lang” field inside the “user” sub-JSON object here?
from anserini.
@xeniaqian94 Please rerun the experiments above per our discussion and see if the numbers change...
from anserini.
1. Effect of PorterStemFilter
Measure | Mean | Standard Error | Test | P-value |
---|---|---|---|---|
MAP_ad_hoc | 0.363110 | 0.031869 | ||
MAP_no_stemming | 0.332631 | 0.032804 | ||
ΔMAP | 0.030480 (0.004318, 0.056641) | | paired t | 0.0234 |
Since 0.0234 < 0.05, we reject the null hypothesis that using PorterStemFilter makes no difference to MAP: the difference is statistically significant.
Measure | Mean | Standard Error | Test | P-value |
---|---|---|---|---|
P@30_ad_hoc | 0.410888 | 0.039596 | ||
P@30_no_stemming | 0.399998 | 0.040605 | ||
ΔP@30 | 0.010890 (-0.021777, 0.043557) | | paired t | 0.5059 |
Since 0.5059 > 0.05, we cannot reject the null hypothesis that using PorterStemFilter makes no difference to P@30: the difference is not statistically significant.
This is the experiment branch with the raw per-topic MAP and P@30 results: https://github.com/xeniaqian94/Anserini/tree/no_stemming
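The paired comparisons above can be reproduced from per-topic scores along the following lines. This is a minimal sketch using only the Java standard library; the class name, method names, and sample values are illustrative assumptions, not Anserini code and not the actual per-topic results. It computes the mean per-topic difference, its standard error, and the t statistic. The p-value additionally requires the Student t distribution with n - 1 degrees of freedom, which the standard library does not provide, so the confidence interval here takes a critical value supplied by the caller.

```java
// Paired t statistic over per-topic scores of two runs.
// Names and sample data are hypothetical, for illustration only.
final class PairedTTestSketch {

    /** Mean per-topic difference, its standard error, the t statistic, and n - 1. */
    static final class Result {
        final double meanDiff;
        final double stdErr;
        final double t;
        final int degreesOfFreedom;

        Result(double meanDiff, double stdErr, double t, int degreesOfFreedom) {
            this.meanDiff = meanDiff;
            this.stdErr = stdErr;
            this.t = t;
            this.degreesOfFreedom = degreesOfFreedom;
        }
    }

    /** a[i] and b[i] must be the scores of the two runs on the same topic i. */
    static Result pairedT(double[] a, double[] b) {
        if (a.length != b.length || a.length < 2) {
            throw new IllegalArgumentException("need two equal-length arrays with at least 2 topics");
        }
        int n = a.length;
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            sum += a[i] - b[i];
        }
        double mean = sum / n;

        // Sample variance of the differences (n - 1 in the denominator).
        double squares = 0.0;
        for (int i = 0; i < n; i++) {
            double d = (a[i] - b[i]) - mean;
            squares += d * d;
        }
        double stdErr = Math.sqrt(squares / (n - 1)) / Math.sqrt(n);

        // If all differences are identical, stdErr is 0 and t is infinite or NaN.
        return new Result(mean, stdErr, mean / stdErr, n - 1);
    }

    /** Interval meanDiff +/- tCritical * stdErr; tCritical comes from a t table. */
    static double[] confidenceInterval(Result r, double tCritical) {
        return new double[] {r.meanDiff - tCritical * r.stdErr, r.meanDiff + tCritical * r.stdErr};
    }

    public static void main(String[] args) {
        // Hypothetical per-topic AP values, not taken from the runs above.
        double[] withStemming = {0.40, 0.35, 0.50, 0.20};
        double[] noStemming = {0.30, 0.30, 0.45, 0.10};
        Result r = pairedT(withStemming, noStemming);
        System.out.printf("mean diff=%.4f se=%.4f t=%.4f df=%d%n",
                r.meanDiff, r.stdErr, r.t, r.degreesOfFreedom);
    }
}
```

If an external dependency is acceptable, Apache Commons Math's `TTest.pairedTTest(double[], double[])` returns the two-sided p-value directly.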
2. Effect of indexing only English Tweets
I should have finished this part within hours... but I was expecting to listen to and store some 2015 tweet streams locally... in which case the qrels and topics files are ...
In the Tweets2011 collection, the JSON structure has only the user's lang field, whereas the 2015 JSON structure has a tweet-level "lang" field. So when testing on the Tweets2011 collection, for the most part we can only index English users' tweets rather than English tweets.
This difference means the previous results are not justifiable, and the trend is even reversed.
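The two JSON layouts described above suggest one possible way to pick a language code: prefer the tweet-level "lang" when it exists and fall back to the user's "lang" otherwise. The sketch below is an assumption for illustration, not the behaviour of Anserini's `Status` class. A nested `Map` stands in for the parsed JSON object so that the example needs no JSON library, and the class and method names are hypothetical.

```java
import java.util.Map;

// Resolves a language code from either JSON layout.
// A nested Map stands in for the parsed status; all names are hypothetical.
final class TweetLangSketch {

    /** Returns the tweet-level "lang" if present, else user.lang, else null. */
    static String resolveLang(Map<String, Object> status) {
        Object tweetLang = status.get("lang");
        if (tweetLang instanceof String) {
            return (String) tweetLang;
        }
        Object user = status.get("user");
        if (user instanceof Map) {
            Object userLang = ((Map<?, ?>) user).get("lang");
            if (userLang instanceof String) {
                return (String) userLang;
            }
        }
        return null;
    }

    /** With only user.lang available, this reflects the profile, not the tweet. */
    static boolean isEnglish(Map<String, Object> status) {
        return "en".equals(resolveLang(status));
    }

    public static void main(String[] args) {
        // Older layout: only the user profile carries a language.
        Map<String, Object> userOnly = Map.of("user", Map.of("lang", "fr"));
        // Newer layout: the tweet itself carries a language as well.
        Map<String, Object> tweetLevel = Map.of("lang", "en", "user", Map.of("lang", "fr"));
        System.out.println(isEnglish(userOnly));
        System.out.println(isEnglish(tweetLevel));
    }
}
```

With the older layout the filter can only act on the profile language, which is exactly the limitation noted above: an English tweet by a “fr” user is still treated as non-English.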
Measure | Mean | Standard Error | Test | P-value |
---|---|---|---|---|
MAP_ad_hoc | 0.363110 | 0.031869 | ||
MAP_english_users | 0.361447 | 0.031470 | ||
ΔMAP | 0.001663 (-0.006326, 0.009653) | | paired t | 0.6774 |
Since 0.6774 > 0.05, we cannot reject the null hypothesis that indexing only English users' tweets, rather than all users' tweets, makes no difference to MAP: the difference is not statistically significant.
Measure | Mean | Standard Error | Test | P-value |
---|---|---|---|---|
P@30_ad_hoc | 0.410888 | 0.039596 | ||
P@30_english_users | 0.405445 | 0.038909 | ||
ΔP@30 | 0.005443 (-0.002213, 0.013099) | | paired t | 0.1593 |
Since 0.1593 > 0.05, we cannot reject the null hypothesis that indexing only English users' tweets, rather than all users' tweets, makes no difference to P@30: the difference is not statistically significant.
Intuitively, an English-only index versus an all-language index is somewhat similar to a per-topic index versus a complete index (https://cs.uwaterloo.ca/~jimmylin/publications/Wang_Lin_ECIR2014.pdf). An English-only index will increase term occurrences. But since this experiment uses the query likelihood model and tf-idf, the changes in idf values may still be too small because of the log scale: the scores of retrieved hits would change, but not their ranks. On the other hand, relevant tweets may be diluted or partially lost because of the English filter, which decreases P@30 and MAP.
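The log-scale argument can be illustrated with a small calculation. The sketch below uses the textbook form idf = ln(N / df), which is an assumption made for simplicity (Lucene's similarities use smoothed variants), and every count in it is hypothetical rather than measured on Tweets2011. The class and method names are also illustrative only.

```java
// Shows how little a textbook idf moves when the collection shrinks.
// All counts are hypothetical; idf = ln(N / df) is a simplification.
final class IdfShiftSketch {

    /** Textbook inverse document frequency: ln(numDocs / docFreq). */
    static double idf(long numDocs, long docFreq) {
        return Math.log((double) numDocs / docFreq);
    }

    public static void main(String[] args) {
        // Hypothetical collection sizes before and after an English-only filter.
        long allDocs = 16_000_000L;
        long englishDocs = 10_000_000L;
        // Hypothetical document frequencies of one English query term.
        long dfAll = 20_000L;
        long dfEnglish = 19_000L;

        double before = idf(allDocs, dfAll);
        double after = idf(englishDocs, dfEnglish);
        System.out.printf("idf all=%.3f english-only=%.3f ratio=%.3f%n",
                before, after, after / before);
    }
}
```

In this made-up case the collection shrinks by more than a third while the idf of the term changes by a much smaller fraction, which is consistent with scores shifting slightly without the ranking changing much.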
from anserini.
We might revisit tweet tokenization at some later point in time, but closing issue for now.
from anserini.