uis-dat640-fall2020's People

Contributors

kbalog, tlinjordet

uis-dat640-fall2020's Issues

A1.3 Collection contains virus?

On Windows machines, you might receive a false alarm that some of the email messages contain a virus. The collection is a set of txt files, which cannot execute a virus, so there is no danger. However, to pass the tests, you need to make sure that all documents are present. You can simply replace the contents of the affected documents with an empty string.

Specifically, the following files may be affected:

data/train/030/032
data/train/030/033
data/train/030/034
data/train/030/035
data/train/030/103

A4 Issues

DBpedia is regularly updated, and changes to certain pages may affect this assignment. Some people now get more than 2743 indexed items. Does this mean that we will all fail the tests for this assignment?

Also, field mapping probabilities seem to be bugged. Has anyone actually passed this test so far?

Query_processing_DaaT score function

What is the correct method for calculating the score in the Query_processing_DaaT exercise? Using the score function from the notebook on document 0 I get score = 1/3. The test expects 0.0526.

Doc0 = "duck duck duck"  
Query = "beijing duck recipe"  
  c_t,d / |d| * c_t,q / |q|
  0 / 3 * 1 / 3 # beijing in doc0  
+ 3 / 3 * 1 / 3 # duck in doc0  
+ 0 / 3 * 1 / 3 # recipe in doc0
= 1/3
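
The calculation above can be reproduced with a minimal sketch like the following. It only implements the formula as written in this issue (c_t,d / |d| * c_t,q / |q|, summed over query terms); the function name `score` and the whitespace tokenization are assumptions for illustration, not the notebook's actual code, and it does not explain the 0.0526 expected by the test.

```python
from collections import Counter


def score(doc_terms, query_terms):
    """Sum over unique query terms of (c_t,d / |d|) * (c_t,q / |q|)."""
    doc_counts = Counter(doc_terms)
    query_counts = Counter(query_terms)
    total = 0.0
    for term, c_tq in query_counts.items():
        # Counter returns 0 for terms that do not occur in the document.
        total += (doc_counts[term] / len(doc_terms)) * (c_tq / len(query_terms))
    return total


doc0 = "duck duck duck".split()
query = "beijing duck recipe".split()
doc0_score = score(doc0, query)
print(doc0_score)
```

Only the term duck contributes, so the sum reduces to 3/3 * 1/3.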

A3 MLM smoothing parameter

The MLM problem description lists lambda_i as the field-specific smoothing parameter, but in the constructor and the tests the constant 0.1 is used. Should we expect the hidden tests to use a common constant, or do we need to check whether the supplied object is indexable? If so, will the index be the field name or the position?

A5 Performance Improvement + Confusion

Hey, in get_doc_term_freqs, could we please change es.termvectors(index=index, id=doc_id, fields=field, term_statistics=True) to es.termvectors(index=index, id=doc_id, fields=field, term_statistics=False) since we only need 'term_freq' here. Retrieving term statistics across the entire field in the corpus here is unnecessary and degrades the performance a lot when dealing with the large dataset...

Additionally, it is not clear whether we should add the title or title plus description in load_queries.
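
The termvectors change proposed above could look roughly like this. It is a sketch under assumptions: the signature of `get_doc_term_freqs` (taking the client `es` and `index` as arguments) is hypothetical and may differ from the assignment's, and the response is assumed to have the usual `term_vectors` / `terms` / `term_freq` layout returned by the Elasticsearch term vectors API.

```python
def get_doc_term_freqs(es, doc_id, field, index):
    """Return {term: term_freq} for one field of one document.

    Hypothetical signature: `es` is an Elasticsearch client and `index`
    is the index name; the assignment's own function may differ.
    """
    tv = es.termvectors(
        index=index,
        id=doc_id,
        fields=field,
        term_statistics=False,  # only per-document term_freq is needed here
    )
    # The field is absent from the response if the document has no terms in it.
    if field not in tv.get("term_vectors", {}):
        return None
    terms = tv["term_vectors"][field]["terms"]
    return {term: info["term_freq"] for term, info in terms.items()}
```

With `term_statistics=False`, each term entry still carries `term_freq`, so the returned dictionary is unchanged; only the collection-level statistics are left out of the response.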

Project - Group Number?

The mail we received about team allocations contains only information about who we are teamed up with and which project we have been given, but not the actual group number.
The table on GitHub showing each group's time slot lists only the group number and time period, not the names of the group members.

Could you please update the table to also contain the names of the students, or send each group a mail informing us of our group number?

Currently we have no way of knowing our group number (as far as I can see, at least).

@kbalog @tlinjordet

Assignment 3 LM Bug?

I cannot seem to pass the last test for the LM scoring in assignment 3, even though every other scoring function and test passes fine.

A5 extract_query_doc_features

The extract_features test using query = 't6 t6' and document = 'd3' expects unique_query_terms_in_title and unique_query_terms_in_body to be 2. This doesn't seem correct when there is only one unique query term 't6'.
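
The reading described above, where duplicate query terms are counted once, can be sketched as follows. The helper name `unique_query_terms_in_field` and the example field contents are hypothetical (the actual content of document 'd3' is not shown in this issue), so this only illustrates the set-based interpretation, not what the test implements.

```python
def unique_query_terms_in_field(query_terms, field_terms):
    """Count distinct query terms that occur at least once in the field."""
    return len(set(query_terms) & set(field_terms))


query_terms = "t6 t6".split()
# Hypothetical field content; the real terms of document 'd3' are not given here.
title_terms = ["t6", "t1", "t6"]
count = unique_query_terms_in_field(query_terms, title_terms)
print(count)
```

Under this interpretation, the query 't6 t6' can contribute at most 1 to either feature.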

A5 prepare_ltr_training_data

Has anyone been able to pass the X_train[0] test?

I'm unable to get the correct IDF values (the tests on the toy index are OK). The first query in TRAIN_QUERY_IDS is "Death, sudden", which gives me the IDF values [3.2432702900360164, 5.0103371466737].

Another issue I found is that the analyze_query function doesn't always find the correct document id. For example the term hallucinations is supposedly in document 87125370. When I look at the term vectors the closest term is hallucin. What is the proper way of handling this? Ignore the terms? Assume document frequency of 1? Other terms with the same issue: ray, densitometry, osteophytosis, elastomers, silicone, ...
