
A4 Issues · uis-dat640-fall2020 (25 comments, open)

kbalog commented on August 14, 2024
A4 Issues

from uis-dat640-fall2020.

Comments (25)

ChristofferHolmesland commented on August 14, 2024

@Chrystallic:

> Out of curiosity with the mapping probs: have people used both the Elasticsearch instance and the clm, or just one of them? I've tried several combinations but nothing seems to work, which leads me to believe that the test is somehow bugged.

You only need the clm object if you implement it according to the supplied formula.

kbalog commented on August 14, 2024

Please refer to this page for clarification and guidance. And yes, the deadline has been extended by 48 hours.

asahicantu commented on August 14, 2024

I have not; actually, it keeps giving me a different number of records and a different classification for 'gospel'.

I think the mapping probability is using different probabilities for p(f), since according to the rule, with uniform p(f), the sum of all probabilities should be 1. I have not been able to make it work or pass with the given equation.
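
The normalization point above can be sanity-checked with a toy example. The per-field probabilities below are made-up numbers for illustration, not values from the assignment index:

```python
# Made-up per-field collection probabilities P(t|f) for a single term.
p_t_given_f = {'names': 0.001, 'types': 0.020, 'description': 0.015}

# With a uniform field prior P(f), the prior cancels in the normalization,
# so P(f|t) = P(t|f) / sum over all fields f' of P(t|f').
total = sum(p_t_given_f.values())
p_f_given_t = {f: p / total for f, p in p_t_given_f.items()}

print(sum(p_f_given_t.values()))  # ~1.0, up to floating-point rounding
```

If the values returned by get_term_mapping_probs do not sum to (approximately) 1 over all fields, the normalization step is the first thing to re-check.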

asahicantu commented on August 14, 2024

We would like to suggest a deadline extension once the cells are updated, or that we be allowed to modify the asserts, since we lost a lot of time trying to figure out these issues...

FebriantiW commented on August 14, 2024

The deadline has passed... bonus points, then?

BerntA commented on August 14, 2024

Yep, I get 2748 now; previously I had 2743 indexed artists. (I still pass the tests I passed previously, though.)

FebriantiW commented on August 14, 2024

Me too, 2748, and I think it is a bug. I've been trying for days and reading papers... they all say the same thing, but I still get a different result...

BerntA commented on August 14, 2024

Yeah, same here. The formula is very clear, but applying it still does not give the result the test expects. The prof. or TA should run through the entire assignment, re-index, etc., and confirm.

Wizdore commented on August 14, 2024

2748 as well... it doesn't pass the assertions. Stuck.

ChristofferHolmesland commented on August 14, 2024

I'm getting 2748 entities and I'm passing the tests, except get_term_mapping_probs, which seems to be bugged.

Chrystallic commented on August 14, 2024

Out of curiosity with the mapping probs: have people used both the Elasticsearch instance and the clm, or just one of them?

I've tried several combinations but nothing seems to work... which leads me to believe that the test is somehow bugged.

Chrystallic commented on August 14, 2024

> You only need the clm object if you implement it according to the supplied formula.

Ok, my current implementation utilizes only the clm object, but I was curious whether someone had gotten plausible answers using both the clm and es objects.

kbalog commented on August 14, 2024

Indeed, there is a bug in the field mapping probabilities tests. This part of the exercise will be corrected manually. Also, getting fewer than 2500 items indexed will not be penalized.

a-nesse commented on August 14, 2024

Do any of you who have not passed the 'Tests for field mapping probabilities' cell still pass the 'Tests for PRMS retrieval' cell (the second-to-last cell of the assignment notebook)?

I assumed there was an issue due to get_term_mapping_probs() being miscalculated.

Chrystallic commented on August 14, 2024

> Indeed, there is a bug in the field mapping probabilities tests. This part of the exercise will be corrected manually. Also, getting fewer than 2500 items indexed will not be penalized.

@kbalog: If you were aware of this "bug", why were the students not informed of the issue at an earlier date? Would it be possible to at least provide us with a correct value for one of the fields? For instance, what is the expected output value for the description field in Pt_f_3_1?

clm_3 = CollectionLM(es, analyze_query(es, 'gospel soul'))
Pt_f_3_1 = get_term_mapping_probs(es, clm_3, 'gospel')
assert Pt_f_3_1['description'] == pytest.approx(?????)

tlinjordet commented on August 14, 2024

Hi,

There are two issues raised here, the number of entities to be indexed and the field mapping probabilities.

(1) Changing numbers of entities to be indexed

Regarding the number of entities to be indexed, it appears the updating of the subpages of
'Lists_of_musicians' has been happening more frequently than anticipated when creating the assignment,
and thus the tests will need to be updated.

After running through the code in the reference solution this afternoon (13:30), there were 2748 unique entities indexed.

The affected test cells have been updated and are provided below. You can insert and run these as
code cells in your notebook, but remove them before submission. Grading will use whichever tests are current at that time.

This is the current, updated test cell following bulk_index:

# Test index
tv_1 = es.termvectors(index=INDEX_NAME, id='Al Green', fields='catch_all')
assert tv_1['term_vectors']['catch_all']['terms']['1946']['term_freq'] == 23

tv_2 = es.termvectors(index=INDEX_NAME, id='Runhild Gammelsæter', fields='attributes')
assert tv_2['term_vectors']['attributes']['terms']['khlyst']['term_freq'] == 1

tv_3 = es.termvectors(index=INDEX_NAME, id='Music of Belarus', fields='description')
assert tv_3['term_vectors']['description']['terms']['kazakhstan']['term_freq'] == 1

tv_4 = es.termvectors(index=INDEX_NAME, id='MC HotDog', fields=['types', 'description'])
assert tv_4['term_vectors']['description']['terms']['microphon']['term_freq'] == 1

tv_5 = es.termvectors(index=INDEX_NAME, id='Deadmau5', fields=['types', 'description', 'related_entities'])
assert tv_5['term_vectors']['description']['terms']['italiano']['term_freq'] == 1

The updated test cell for field mapping probabilities is:

# Tests for field mapping probabilities
clm_3 = CollectionLM(es, analyze_query(es, 'gospel soul'))
Pf_t_3_1 = get_term_mapping_probs(es, clm_3, 'gospel')
assert Pf_t_3_1['description'] == pytest.approx(0.19345, abs=1e-5)
assert Pf_t_3_1['attributes'] == pytest.approx(0.13442, abs=1e-5)
assert Pf_t_3_1['related_entities'] == pytest.approx(0.30600, abs=1e-5)

Pf_t_3_2 = get_term_mapping_probs(es, clm_3, 'soul')
assert Pf_t_3_2['names'] == pytest.approx(0.01502, abs=1e-5)
assert Pf_t_3_2['types'] == pytest.approx(0.10632, abs=1e-5)
assert Pf_t_3_2['catch_all'] == pytest.approx(0.16552, abs=1e-5)

The updated test cell for prms_retrieval is:

# Tests for PRMS retrieval

prms_query_1 = 'winter snow'
prms_retrieval_1 = prms_retrieval(es, prms_query_1)
assert prms_retrieval_1[:5] == ['Snow (musician)', 'Kurt Winter', 'Hank Snow', 'Derek Miller', 'Richard Manuel']

prms_query_2 = 'summer sun'
prms_retrieval_2 = prms_retrieval(es, prms_query_2)
assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']

prms_query_3 = 'freedom jazz'
prms_retrieval_3 = prms_retrieval(es, prms_query_3)
assert prms_retrieval_3[:5] == ['Paul Bley', 'Stéphane Galland', 'Saifo', 'Christy Sutherland', 'K.d. lang']

and the final test cell comparing the above with baseline retrieval needs no update.

None of the other test cells in A4 need updating at this time.

(2) Field mapping probabilities

There was a small notational inconsistency in the description and variable
names of the code for get_term_mapping_probs.

The correct description should be
"
Implement the function get_term_mapping_probs to return $P(f|t)$ of all fields $f$
for a given term $t$, according to the description above of how PRMS extends the MLM retrieval model.
"
and the dictionary of mapping probabilities returned by get_term_mapping_probs
should properly be named Pf_t rather than Pt_f, reflecting that the mapping probability is the probability
(normalized over the probabilities of all fields) that a given field $f$ is the
field to which a term $t$ is mapped.

You may consider this your starting point for implementing that solution, if you find it helpful:

def get_term_mapping_probs(es, clm, term):
    """PRMS: For a single term, find their mapping probabilities for all fields.
    
    Arguments:
        es: An active Elasticsearch instance.
        clm: Collection language model instance.
        term: A single-term string. 
        
    Returns:
        Dictionary of mapping probabilities for the fields.
    """
    Pf_t = {}
    # YOUR CODE HERE
    raise NotImplementedError()
    return Pf_t

Use the longer mathematical description under the heading "Retrieval method" in the
A4 notebook and note that the mapping probability is $P(f|t)$.
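
To make the normalization concrete, here is a minimal, self-contained sketch. Note that ToyCLM, its prob(field, term) method, and the probability values are hypothetical stand-ins for illustration only; they are not the assignment's actual CollectionLM API, and the sketch drops the es argument of the real function:

```python
FIELDS = ['names', 'types', 'attributes', 'related_entities', 'description', 'catch_all']

class ToyCLM:
    """Stand-in collection language model with fixed, made-up P(t|f) values."""
    def __init__(self, probs):
        self._probs = probs

    def prob(self, field, term):
        # Hypothetical interface: return the collection probability P(t|f).
        return self._probs.get(field, 0.0)

def get_term_mapping_probs(clm, term, fields=FIELDS):
    """Return P(f|t) for every field, assuming a uniform field prior P(f).

    The uniform prior cancels, so P(f|t) is P(t|f) normalized over fields.
    """
    p_t_f = {f: clm.prob(f, term) for f in fields}
    total = sum(p_t_f.values())
    if total == 0:
        return {f: 0.0 for f in fields}
    return {f: p / total for f, p in p_t_f.items()}

clm = ToyCLM({'names': 0.001, 'types': 0.004, 'description': 0.005})
Pf_t = get_term_mapping_probs(clm, 'gospel')
print(Pf_t['description'])  # 0.005 / (0.001 + 0.004 + 0.005)
```

Whatever the real implementation looks like, the returned values should sum to 1 over all fields, which is a quick self-check before running the assertion cells.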


Wizdore commented on August 14, 2024

Are we going to get an extension? It's just one hour until the deadline.

BerntA commented on August 14, 2024

> [quoting tlinjordet's full update above]

The new tests still seem to be wrong; we all get the same wrong answer. Either the field mapping probabilities were implemented incorrectly, or we are being heavily misled by the description in the assignment into implementing it wrong (pardon me if I'm being harsh here).

Or are we just indexing different content, even though we have the same filtering procedure?

BerntA commented on August 14, 2024

The second test seems more promising, but it still fails, while the first test is way off on the description field. [screenshots omitted]

FebriantiW commented on August 14, 2024

I get the same mapping result as @BerntA.

FebriantiW commented on August 14, 2024

Has anyone passed this one? This is the corrected test; I'm wondering whether this is still a bug or not:

prms_query_2 = 'summer sun'
prms_retrieval_2 = prms_retrieval(es, prms_query_2)
assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']

BerntA commented on August 14, 2024

> Has anyone passed this one? This is the corrected test; I'm wondering whether this is still a bug or not:
>
> prms_query_2 = 'summer sun'
> prms_retrieval_2 = prms_retrieval(es, prms_query_2)
> assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']

Yeah, the results here still depend on the indexing. I get:

'Joseph Summers (musician)',
'Sun Nan',
'Stefanie Sun',
'Sylvia (singer)',
'Virgil Sturgill'

FebriantiW commented on August 14, 2024

Okay, same same :). Thanks!

tlinjordet commented on August 14, 2024

For anyone seeing this thread later, see the A4 errata.

kbalog commented on August 14, 2024

> In this case, should we paste and keep the new test cell, or remove it after our test?

Replace the original test cell with the new one.
