kbalog / uis-dat640-fall2020


Information Retrieval and Text Mining course at the University of Stavanger (DAT640), 2020 fall

Jupyter Notebook 100.00%

uis-dat640-fall2020's People


asahicantu thek123 tlinjordet spliichx01 febriantiw eivindst hsouporto sth4k cbobed

uis-dat640-fall2020's Issues

A1.3 Collection contains virus?

On Windows machines, you might receive a false virus alarm for some of the email messages. The collection is a set of txt files, which cannot execute a virus, so there is no danger. However, to pass the tests, you need to make sure that all documents are present. You can simply replace the contents of the flagged documents with an empty string.

Specifically, the following files may be affected:

data/train/030/032
data/train/030/033
data/train/030/034
data/train/030/035
data/train/030/103

A4 Issues

DBpedia is regularly updated, and changes to certain pages may affect this assignment. Some people now get more than 2743 indexed items. Does this mean that we will all fail the tests for this assignment?

Also, the field mapping probabilities test seems to be bugged. Has anyone actually passed it so far?

Query_processing_DaaT score function

What is the correct method for calculating the score in the Query_processing_DaaT exercise? Using the score function from the notebook on document 0, I get score = 1/3, but the test expects 0.0526.

Doc0 = "duck duck duck"  
Query = "beijing duck recipe"  
  c_t,d / |d| * c_t,q / |q|
  0 / 3 * 1 / 3 # beijing in doc0  
+ 3 / 3 * 1 / 3 # duck in doc0  
+ 0 / 3 * 1 / 3 # recipe in doc0
= 1/3
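
For reference, the arithmetic above can be reproduced with a short sketch. The function name `overlap_score` is hypothetical, and the formula is the one written out in this issue, not necessarily the scoring function the notebook or the test actually uses. It confirms the 1/3 result but does not explain the expected 0.0526.

```python
from collections import Counter


def overlap_score(doc, query):
    """Sum over query terms of (c_t,d / |d|) * (c_t,q / |q|).

    Hypothetical helper: this mirrors the calculation in the issue text,
    not the notebook's own scoring function.
    """
    doc_terms = doc.split()
    query_terms = query.split()
    doc_counts = Counter(doc_terms)      # c_t,d for every term in the document
    query_counts = Counter(query_terms)  # c_t,q for every term in the query
    return sum(
        (doc_counts[term] / len(doc_terms)) * (c_tq / len(query_terms))
        for term, c_tq in query_counts.items()
    )


# Only "duck" contributes: (3 / 3) * (1 / 3) = 1/3.
score = overlap_score("duck duck duck", "beijing duck recipe")
print(score)
```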

A3 MLM smoothing parameter

The MLM problem description lists lambda_i as the field-specific smoothing parameter. In the constructor and the tests, the constant 0.1 is used. Should we expect the hidden tests to use a common constant, or do we need to check whether the supplied object is indexable? If so, will it be indexed by field name or by position?
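
Until that is clarified, one defensive option is to accept all three shapes. The sketch below is only an illustration: `get_field_lambda` and its parameters are hypothetical names that do not come from the assignment, and nothing here says which form the hidden tests use.

```python
def get_field_lambda(smoothing_param, field, field_position=None):
    """Resolve the smoothing parameter for one field.

    Hypothetical helper, not part of the assignment code. Accepts:
      - a single number shared by all fields,
      - a dict keyed by field name,
      - a list or tuple indexed by field position.
    """
    if isinstance(smoothing_param, (int, float)):
        return float(smoothing_param)
    if isinstance(smoothing_param, dict):
        return float(smoothing_param[field])
    if field_position is None:
        raise ValueError("field_position is required for positional parameters")
    return float(smoothing_param[field_position])


print(get_field_lambda(0.1, "title"))
print(get_field_lambda({"title": 0.2, "body": 0.1}, "title"))
print(get_field_lambda([0.2, 0.1], "body", field_position=1))
```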

A5 Performance Improvement + Confusion

Hey, in get_doc_term_freqs, could we please change es.termvectors(index=index, id=doc_id, fields=field, term_statistics=True) to es.termvectors(index=index, id=doc_id, fields=field, term_statistics=False) since we only need 'term_freq' here. Retrieving term statistics across the entire field in the corpus here is unnecessary and degrades the performance a lot when dealing with the large dataset...
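
A minimal sketch of the proposed change is shown here. The signature (passing the `es` client in as an argument) and the dict-like response handling are assumptions, since the assignment's actual `get_doc_term_freqs` is not reproduced in this issue; only the `termvectors` call itself is taken from the text above.

```python
def get_doc_term_freqs(es, doc_id, field, index):
    """Return {term: term_freq} for one field of one document.

    Sketch only: the signature is assumed, not copied from the assignment.
    """
    tv = es.termvectors(
        index=index,
        id=doc_id,
        fields=field,
        term_statistics=False,  # per-document term_freq is still returned
    )
    # Documents without the field have no entry under "term_vectors".
    terms = tv.get("term_vectors", {}).get(field, {}).get("terms", {})
    return {term: info["term_freq"] for term, info in terms.items()}
```

With `term_statistics=False`, the response omits the corpus-level `doc_freq` and `ttf` values for each term, so any code that needs those would have to request them separately.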

Additionally, it is not clear whether we should add the title or title plus description in load_queries.

Project - Group Number?

The mail we received about team allocations contains only information about who we are teamed up with and which project we have been given, but not the actual group number.
The table on GitHub showing each group's timeslot only shows the group number and time period, not the names of the group members.

Could you please update the table to also contain the names of the students, or send each group a mail informing us of our group number?

Currently we do not have any way to know our group number (as far as I can see, at least).

@kbalog @tlinjordet

Assignment 3 LM Bug?

I cannot seem to pass the last test for the LM scoring in assignment 3, even though every other scoring function and test passes fine.

A5 extract_query_doc_features

The extract_features test using query = 't6 t6' and document = 'd3' expects unique_query_terms_in_title and unique_query_terms_in_body to be 2. This doesn't seem correct when there is only one unique query term 't6'.
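
The discrepancy can be illustrated with a small sketch. The field contents below are hypothetical (the real terms of document 'd3' are not shown in this issue), and both function names are made up for illustration. Counting over the set of query terms gives 1, while counting every query token gives 2, which might be what the test does, though that is only a guess.

```python
def count_unique_query_terms(query_terms, field_terms):
    """Number of distinct query terms that occur in the field."""
    return len(set(query_terms) & set(field_terms))


def count_query_tokens(query_terms, field_terms):
    """Number of query tokens (duplicates included) that occur in the field."""
    field_vocab = set(field_terms)
    return sum(1 for term in query_terms if term in field_vocab)


query_terms = "t6 t6".split()
title_terms = ["t6", "t1"]  # hypothetical field content, assumed to contain t6

print(count_unique_query_terms(query_terms, title_terms))  # distinct terms
print(count_query_tokens(query_terms, title_terms))        # tokens with duplicates
```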

A5 prepare_ltr_training_data

Has anyone been able to pass the X_train[0] test?

I'm unable to get the correct IDF values (the tests on the toy index are OK). The first query in TRAIN_QUERY_IDS is "Death, sudden", which gives me IDF values [3.2432702900360164, 5.0103371466737].

Another issue I found is that the analyze_query function doesn't always find the correct document id. For example, the term hallucinations is supposedly in document 87125370, but when I look at the term vectors, the closest term is hallucin. What is the proper way of handling this? Ignore the terms? Assume a document frequency of 1? Other terms with the same issue: ray, densitometry, osteophytosis, elastomers, silicone, ...
