taolei87 / askubuntu
AskUbuntu Question Dataset
The file vectors_pruned.200.txt contains 100,406 tokens.
I would like to know whether some tokens from the AskUbuntu dataset were removed to arrive at this number.
If not, which tokenizer was used on the data?
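For anyone who wants to reproduce the count, here is a minimal sketch. It assumes the common text embedding layout of one token per line followed by its vector components. That layout is an assumption, not something confirmed in this thread, and `count_vocabulary` is a hypothetical helper name.

```python
def count_vocabulary(lines):
    """Count non-blank rows and distinct tokens in an embedding file.

    Assumes each row looks like "<token> <v1> <v2> ...", which is an
    assumption about vectors_pruned.200.txt rather than a documented fact.
    """
    tokens = set()
    rows = 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # ignore blank lines
        rows += 1
        tokens.add(parts[0])  # the first field is taken to be the token
    return rows, len(tokens)


# Tiny in-memory example; for the real file, pass an open file handle instead.
sample = ["ubuntu 0.1 0.2\n", "apt 0.3 0.4\n", "\n", "ubuntu 0.5 0.6\n"]
print(count_vocabulary(sample))
```

If the two numbers differ on the real file, some tokens appear on more than one row, which would be worth mentioning alongside the pruning question.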
Hi @taolei87,
Thanks for sharing this data! I would like to ask which text the BM25 scores were calculated on.
The test.txt and dev.txt files contain BM25 scores for the questions, computed by the Lucene search engine. However, I could not find any indication of whether the scores are based on the titles, the bodies of the questions, or both, or of whether stopwords were removed before scoring. Could you please clarify?
Thanks in advance:)
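To illustrate why the choice of field matters, here is a small sketch using textbook Okapi BM25 with the idf form log(1 + (N - n + 0.5) / (n + 0.5)). It is not Lucene's exact implementation, and the documents, the query, and the `k1` and `b` values are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy questions, split into title and body tokens.
titles = [["install", "nvidia", "driver"], ["wifi", "not", "working"]]
bodies = [["i", "cannot", "boot", "after", "update"], ["nvidia", "driver", "breaks", "wifi"]]
query = ["nvidia", "driver"]

title_only = bm25_scores(query, titles)
title_and_body = bm25_scores(query, [t + body for t, body in zip(titles, bodies)])
print(title_only)
print(title_and_body)
```

In this toy case the second question scores zero on titles alone but gets a positive score once the body is indexed, so the ranking depends on which text was used.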
Dear @taolei87, first of all, thank you for compiling this interesting corpus.
While exploring it, I noticed that the negative instances in the train_random.txt set sometimes include duplicates. For instance:
query : negative (random) instances appearing more than once
163842 : 217493, 185573
393230 : 197044
...
This occurs in 1,136 of the 12,723 instances. I fully understand that the issue is not crucial and that fixing it would change nothing, since it simply means that a negative instance is counted twice. Still, you might want to check the random instance selector in case there are other issues or side effects.
Best regards
albarron
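A check of this kind can be sketched as follows. It assumes each line of train_random.txt has three tab-separated fields (query ID, similar question IDs, random negative IDs), with the IDs inside a field separated by spaces. That layout is an assumption here, and `find_duplicate_negatives` is a hypothetical helper name.

```python
from collections import Counter


def find_duplicate_negatives(lines):
    """Return (query_id, [negative IDs listed more than once]) per affected line.

    Assumes tab-separated fields: query ID, similar IDs, random negative IDs.
    """
    report = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query_id, negatives = fields[0], fields[2].split()
        counts = Counter(negatives)
        repeated = [neg_id for neg_id, n in counts.items() if n > 1]
        if repeated:
            report.append((query_id, repeated))
    return report


# Made-up IDs for illustration; for the real data, pass an open file handle.
sample = [
    "1\t2 3\t10 11 10 12\n",
    "4\t5\t20 21 22\n",
]
print(find_duplicate_negatives(sample))  # [('1', ['10'])]
```

The length of the returned list gives the number of affected instances, which can be compared against the figure reported above.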