
A4 Issues · uis-dat640-fall2020 (25 comments, open)

kbalog commented on August 14, 2024
A4 Issues

from uis-dat640-fall2020.

Comments (25)

ChristofferHolmesland commented on August 14, 2024

@Chrystallic:

> Out of curiosity with the mapping probs: have people used both the Elasticsearch instance and the clm, or just one of them? I've tried several combinations but nothing seems to work, which leads me to believe that the test is somehow bugged.

You only need the clm object if you implement it according to the supplied formula.

kbalog commented on August 14, 2024

Please refer to this page for clarification and guidance. And yes, the deadline has been extended by 48 hours.

asahicantu commented on August 14, 2024

I have not; actually, it keeps giving me a different number of records and a different classification for 'gospel'.

I think the mapping probability is using different probabilities for p(f), since according to the rule, with uniform p(f), the sum of all probabilities should be 1. I have not been able to make it work or pass with the given equation.
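
The normalization point above can be sanity-checked with a toy example. The per-field probabilities below are made-up numbers for illustration, not values from the assignment index:

```python
# Made-up per-field collection probabilities P(t|f) for a single term.
p_t_given_f = {'names': 0.001, 'types': 0.020, 'description': 0.015}

# With a uniform field prior P(f), the prior cancels in the normalization,
# so P(f|t) = P(t|f) / sum over all fields f' of P(t|f').
total = sum(p_t_given_f.values())
p_f_given_t = {f: p / total for f, p in p_t_given_f.items()}

print(sum(p_f_given_t.values()))  # ~1.0, up to floating-point rounding
```

If the values returned by get_term_mapping_probs do not sum to (approximately) 1 over all fields, the normalization step is the first thing to re-check.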

asahicantu commented on August 14, 2024

We would like to suggest a deadline extension once the cells are updated, or that we be allowed to modify the asserts, since we lost a lot of time trying to figure out these issues...

FebriantiW commented on August 14, 2024

The deadline has passed... bonus points, then?

BerntA commented on August 14, 2024

Yep, I get 2748 now; previously I had 2743 indexed artists. (I still pass the tests I passed previously, though.)

FebriantiW commented on August 14, 2024

Me too, 2748, and I think it is a bug. I've been trying for days and reading papers... they all say the same thing, but I still get a different result...

BerntA commented on August 14, 2024

Yeah, same here. The formula is very clear, but applying it still does not give the result the test expects. The prof. or TA should run through the entire assignment, re-index, etc., and confirm.

Wizdore commented on August 14, 2024

2748 as well... it doesn't pass the assertions. Stuck.

ChristofferHolmesland commented on August 14, 2024

I'm getting 2748 entities and I'm passing the tests, except get_term_mapping_probs, which seems to be bugged.

Chrystallic commented on August 14, 2024

Out of curiosity with the mapping probs: have people used both the Elasticsearch instance and the clm, or just one of them?

I've tried several combinations but nothing seems to work... which leads me to believe that the test is somehow bugged.

Chrystallic commented on August 14, 2024

> You only need the clm object if you implement it according to the supplied formula.

Ok, my current implementation utilizes only the clm object, but I was curious whether someone had gotten plausible answers using both the clm and es objects.

kbalog commented on August 14, 2024

Indeed, there is a bug in the field mapping probabilities tests. This part of the exercise will be corrected manually. Also, getting fewer than 2500 items indexed will not be penalized.

a-nesse commented on August 14, 2024

Do any of you who have not passed the 'Tests for field mapping probabilities' cell still pass the 'Tests for PRMS retrieval' cell (the second-to-last cell of the assignment notebook)?

I assumed there was an issue due to get_term_mapping_probs() being miscalculated.

Chrystallic commented on August 14, 2024

> Indeed, there is a bug in the field mapping probabilities tests. This part of the exercise will be corrected manually. Also, getting fewer than 2500 items indexed will not be penalized.

@kbalog: If you were aware of this "bug", why were the students not informed of the issue at an earlier date? Would it be possible to at least provide us with a correct value for one of the fields? For instance, what is the expected output value for the description field in Pt_f_3_1?

clm_3 = CollectionLM(es, analyze_query(es, 'gospel soul'))
Pt_f_3_1 = get_term_mapping_probs(es, clm_3, 'gospel')
assert Pt_f_3_1['description'] == pytest.approx(?????)

tlinjordet commented on August 14, 2024

Hi,

There are two issues raised here, the number of entities to be indexed and the field mapping probabilities.

(1) Changing numbers of entities to be indexed

Regarding the number of entities to be indexed, it appears the updating of the subpages of
'Lists_of_musicians' has been happening more frequently than anticipated when creating the assignment,
and thus the tests will need to be updated.

After running through the code in the reference solution this afternoon (13:30), there were 2748 unique entities indexed.

The affected test cells have been updated and are provided below. You can insert and run these as
code cells in your notebook, but remove them before submission. Grading will use whichever tests are current at that time.

This is the current, updated test cell following bulk_index:

# Test index
tv_1 = es.termvectors(index=INDEX_NAME, id='Al Green', fields='catch_all')
assert tv_1['term_vectors']['catch_all']['terms']['1946']['term_freq'] == 23

tv_2 = es.termvectors(index=INDEX_NAME, id='Runhild Gammelsæter', fields='attributes')
assert tv_2['term_vectors']['attributes']['terms']['khlyst']['term_freq'] == 1

tv_3 = es.termvectors(index=INDEX_NAME, id='Music of Belarus', fields='description')
assert tv_3['term_vectors']['description']['terms']['kazakhstan']['term_freq'] == 1

tv_4 = es.termvectors(index=INDEX_NAME, id='MC HotDog', fields=['types', 'description'])
assert tv_4['term_vectors']['description']['terms']['microphon']['term_freq'] == 1

tv_5 = es.termvectors(index=INDEX_NAME, id='Deadmau5', fields=['types', 'description', 'related_entities'])
assert tv_5['term_vectors']['description']['terms']['italiano']['term_freq'] == 1

The updated test cell for field mapping probabilities is:

# Tests for field mapping probabilities
clm_3 = CollectionLM(es, analyze_query(es, 'gospel soul'))
Pf_t_3_1 = get_term_mapping_probs(es, clm_3, 'gospel')
assert Pf_t_3_1['description'] == pytest.approx(0.19345, abs=1e-5)
assert Pf_t_3_1['attributes'] == pytest.approx(0.13442, abs=1e-5)
assert Pf_t_3_1['related_entities'] == pytest.approx(0.30600, abs=1e-5)

Pf_t_3_2 = get_term_mapping_probs(es, clm_3, 'soul')
assert Pf_t_3_2['names'] == pytest.approx(0.01502, abs=1e-5)
assert Pf_t_3_2['types'] == pytest.approx(0.10632, abs=1e-5)
assert Pf_t_3_2['catch_all'] == pytest.approx(0.16552, abs=1e-5)

The updated test cell for prms_retrieval is:

# Tests for PRMS retrieval

prms_query_1 = 'winter snow'
prms_retrieval_1 = prms_retrieval(es, prms_query_1)
assert prms_retrieval_1[:5] == ['Snow (musician)', 'Kurt Winter', 'Hank Snow', 'Derek Miller', 'Richard Manuel']

prms_query_2 = 'summer sun'
prms_retrieval_2 = prms_retrieval(es, prms_query_2)
assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']

prms_query_3 = 'freedom jazz'
prms_retrieval_3 = prms_retrieval(es, prms_query_3)
assert prms_retrieval_3[:5] == ['Paul Bley', 'Stéphane Galland', 'Saifo', 'Christy Sutherland', 'K.d. lang']

and the final test cell comparing the above with baseline retrieval needs no update.

None of the other test cells in A4 need updating at this time.

(2) Field mapping probabilities

There was a small notational inconsistency in the description and variable
names of the code for get_term_mapping_probs.

The correct description should be
"
Implement the function get_term_mapping_probs to return $P(f|t)$ of all fields $f$
for a given term $t$, according to the description above of how PRMS extends the MLM retrieval model.
"
and the dictionary of mapping probabilities returned by get_term_mapping_probs
should properly be named Pf_t rather than Pt_f, reflecting that the mapping probability is the probability
(normalized over the probabilities of all fields) that a given field $f$ is the
field to which a term $t$ is mapped.

You may consider this your starting point for implementing that solution, if you find it helpful:

def get_term_mapping_probs(es, clm, term):
    """PRMS: For a single term, find their mapping probabilities for all fields.
    
    Arguments:
        es: An active Elasticsearch instance.
        clm: Collection language model instance.
        term: A single-term string. 
        
    Returns:
        Dictionary of mapping probabilities for the fields.
    """
    Pf_t = {}
    # YOUR CODE HERE
    raise NotImplementedError()
    return Pf_t

Use the longer mathematical description under the heading "Retrieval method" in the
A4 notebook and note that the mapping probability is $P(f|t)$.
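
To make the normalization concrete, here is a minimal, self-contained sketch. Note that ToyCLM, its prob(field, term) method, and the probability values are hypothetical stand-ins for illustration only; they are not the assignment's actual CollectionLM API, and the sketch drops the es argument of the real function:

```python
FIELDS = ['names', 'types', 'attributes', 'related_entities', 'description', 'catch_all']

class ToyCLM:
    """Stand-in collection language model with fixed, made-up P(t|f) values."""
    def __init__(self, probs):
        self._probs = probs

    def prob(self, field, term):
        # Hypothetical interface: return the collection probability P(t|f).
        return self._probs.get(field, 0.0)

def get_term_mapping_probs(clm, term, fields=FIELDS):
    """Return P(f|t) for every field, assuming a uniform field prior P(f).

    The uniform prior cancels, so P(f|t) is P(t|f) normalized over fields.
    """
    p_t_f = {f: clm.prob(f, term) for f in fields}
    total = sum(p_t_f.values())
    if total == 0:
        return {f: 0.0 for f in fields}
    return {f: p / total for f, p in p_t_f.items()}

clm = ToyCLM({'names': 0.001, 'types': 0.004, 'description': 0.005})
Pf_t = get_term_mapping_probs(clm, 'gospel')
print(Pf_t['description'])  # 0.005 / (0.001 + 0.004 + 0.005)
```

Whatever the real implementation looks like, the returned values should sum to 1 over all fields, which is a quick self-check before running the assertion cells.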


Wizdore commented on August 14, 2024

Are we going to get an extension? It's just one hour until the deadline.

BerntA commented on August 14, 2024

> [quoting tlinjordet's full update above]

The new tests still seem to be wrong; we all get the same wrong answer. Either the field mapping probabilities were implemented incorrectly, or we are being heavily misled by the description in the assignment into implementing it wrong (pardon me if I'm being harsh here).

Or are we just indexing different content, even though we have the same filtering procedure?

BerntA commented on August 14, 2024

The second test seems more promising, but it still fails, while the first test is way off on the description field. [screenshots omitted]

FebriantiW commented on August 14, 2024

I get the same mapping result as @BerntA.

FebriantiW commented on August 14, 2024

Has anyone passed this one? This is the corrected test; I'm wondering whether this is still a bug or not:

prms_query_2 = 'summer sun'
prms_retrieval_2 = prms_retrieval(es, prms_query_2)
assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']

BerntA commented on August 14, 2024

> Has anyone passed this one? This is the corrected test; I'm wondering whether this is still a bug or not:
>
> prms_query_2 = 'summer sun'
> prms_retrieval_2 = prms_retrieval(es, prms_query_2)
> assert prms_retrieval_2[:5] == ['Joseph Summers (musician)', 'Sun Nan', 'Stefanie Sun', 'Virgil Sturgill', 'Robert Turner (composer)']

Yeah, the results here still depend on the indexing. I get:

'Joseph Summers (musician)',
'Sun Nan',
'Stefanie Sun',
'Sylvia (singer)',
'Virgil Sturgill'

FebriantiW commented on August 14, 2024

Okay, same same :). Thanks!

tlinjordet commented on August 14, 2024

For anyone seeing this thread later, see the A4 errata.

kbalog commented on August 14, 2024

> In this case, should we paste and keep the new test cell, or remove it after our test?

Replace the original test cell with the new one.
