Hello, I'm currently interested in testing Baleen using the HotpotQA

Training Script and Model Checkpoint for HotpotQA (baleen, closed)

stanford-futuredata commented on September 24, 2024

Training Script and Model Checkpoint for HotpotQA

from baleen.

Comments (12)

okhat commented on September 24, 2024

Of course! Which would you prefer? I can probably find the checkpoint more quickly, but I'm happy to provide either.

hyukkyukang commented on September 24, 2024

The checkpoint will be the best!
Nevertheless, I am quite interested in understanding how latent hop ordering is implemented in code. It would be awesome if the training script could be shared as well :)

Thanks for the quick answer!

okhat commented on September 24, 2024

In a little bit (after the upload is completed), you should be able to download the HotPotQA checkpoints from:

wget https://downloads.cs.stanford.edu/nlp/data/colbert/baleen/unchecked.hotpotqa.checkpoints-v1.0.tar.gz

Notice I kept "unchecked" in the name. I didn't try to test these 25-month-old files, but I'm >90% sure these are the right HotPotQA checkpoints.

The HotPotQA corpus is the same as HoVer's. But you'll need the HotPotQA dev queries, which I assume you have.

If you run the indexing and retrieval pipelines (with the compression-based ColBERTv2) and reproduce the results from the paper (modulo the effect of compression), I'm happy to put this release next to the official HoVer one that's in the README (and remove the "unchecked" label).

Let me know if you can confirm this. If so, I'll be happy to gather the training scripts that produced these checkpoints.

okhat commented on September 24, 2024

By the way, I'd be happy to check it myself if you prefer. But I didn't want to block you until I get a chance to do this.

okhat commented on September 24, 2024

Oh, btw, when running the pipeline, don't forget to set the number of hops for HotPotQA to 2, not 4.

hyukkyukang commented on September 24, 2024

Thank you so much!
I'll check it today and share the result when it's ready!

hyukkyukang commented on September 24, 2024

I've run an evaluation using the provided checkpoint; however, the accuracy appears to be quite low. I'd appreciate any help in determining whether I made an error in my process.

Here's a detailed outline of the steps I've taken:

Indexing and Inference:

I used the following scripts for indexing and inference:

hotpotqa_indexing.py

hotpotqa_inference.py

Evaluation:

I proceeded with the evaluation as follows:

python evaluation/eval.py --pred_file ./experiments/default/hotpotqa_inference/2023-05/18/11.28.51/hotpotqa_output.json --dev_file ./data/hotpotqa/dev/qas.json --eval_type doc

The script and data used for evaluation include:

eval.py

hotpotqa_output.json

qas.json

Results:

The results of the evaluation are as follows:

{
    "total": 7405,
    "exact": 50.182309250506414,
    "f1": 82.4313687662804,
    "hit5": 87.508440243079,
    "hit8": 87.73801485482782,
    "hit10": 87.88656313301823,
    "hit20": 88.6698176907495
}

While I am re-evaluating my steps for potential errors, I would greatly appreciate it if you could also examine it.

okhat commented on September 24, 2024

So the scripts in the repo are more directly useful for evaluating Psg-EM from Table 2, which for Baleen is 86.7% in the paper. Getting the correct top-20 for hit20 is a bit different, because you don't just want a "bag" of passages, but you want the right split of passages from the first hop and the second hop.

Let's first check Psg-EM. Here's my evaluation logic:


import ujson


def f7(seq):
    # Remove duplicates while preserving first-seen order.
    seen = set()
    seen_add = seen.add
    return [x for x in seq if not (x in seen or seen_add(x))]


# K is not defined in the original snippet; 5 is inferred from the
# "at most 5 sentences" comment below.
K = 5

# Dev is not defined in this snippet either. It is assumed to be loaded
# beforehand and to map each qid to its gold dev example, which carries a
# 'support_facts' list of (pid, sid) pairs.

path = '/future/u/okhattab/experiments.Jan26/HotPotQA.Baleen/Condensers.L2/C2.Hn.cv1/inference/b10000.HotPotQA.Baleen.C2.Hn.dev.H2.new10k.cv0/condensed.json'

PsgEM = []
with open(path) as f:
    for line in f:
        example = ujson.loads(line)

        preds = example['prediction']

        # NOTE: your files have probably done some version of this filtering but it may be geared toward HoVer
        preds = sorted(preds, reverse=True)  # sort by score
        preds = [x for score, x in preds]  # remove the scores
        preds = list(map(tuple, preds[:K]))  # at most 5 sentences

        # at most (or rather exactly) two PIDs
        if len(set([pid for pid, _ in preds])) > 2:
            first_two_pids = f7([pid for pid, _ in preds])[:2]
            preds = [(pid, sid) for pid, sid in preds if pid in first_two_pids]

        gold = Dev[example['qid']]['support_facts']
        gold = list(map(tuple, gold))

        # Passage-level exact match: predicted PIDs equal the gold PIDs.
        psg_em = set([pid for pid, _ in preds]) == set([pid for pid, _ in gold])
        PsgEM.append(psg_em)


sum(PsgEM) / len(PsgEM)
# should be about 0.867 (86.7%)
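
For readers without access to the files referenced above, here is a minimal, self-contained sketch of the same Psg-EM logic on toy data. The passage IDs, scores, and the `dev` mapping are made up for illustration, and `passage_em` and `dedupe_keep_order` are hypothetical helper names, not functions from the Baleen repo. It uses the standard-library `json` module in place of `ujson`.

```python
import json

K = 5  # keep at most 5 sentences, mirroring the snippet above


def dedupe_keep_order(seq):
    """Remove duplicates while preserving first-seen order (same role as f7)."""
    seen = set()
    return [x for x in seq if not (x in seen or seen.add(x))]


def passage_em(prediction, gold, k=K):
    """True when the retained passage IDs exactly match the gold passage IDs."""
    ranked = sorted(prediction, reverse=True)         # highest score first
    sents = [tuple(sent) for _, sent in ranked][:k]   # drop scores, keep top k
    pids = dedupe_keep_order([pid for pid, _ in sents])[:2]  # at most two passages
    return set(pids) == {pid for pid, _ in gold}


# Toy gold data: qid -> supporting facts as (pid, sid) pairs (made-up IDs).
dev = {
    0: {"support_facts": [[11, 0], [22, 1]]},
    1: {"support_facts": [[33, 0], [44, 2]]},
}

# Toy predictions, one JSON record per line: [score, [pid, sid]] entries.
lines = [
    '{"qid": 0, "prediction": [[5.0, [22, 1]], [8.0, [11, 0]], [-9.0, [99, 0]]]}',
    '{"qid": 1, "prediction": [[7.5, [33, 0]], [6.0, [55, 0]], [1.0, [44, 2]]]}',
]

results = []
for line in lines:
    example = json.loads(line)
    gold = dev[example["qid"]]["support_facts"]
    results.append(passage_em(example["prediction"], gold))

psg_em = sum(results) / len(results)
print(psg_em)
```

In this toy run the first question matches both gold passages, while the second ranks a wrong passage above one of the gold ones, so only half of the examples count as exact matches.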

I also checked my file above, and overall it seems to match yours in its highest-scoring sentences.

Here is the top of my file (notice it's a different format):

{"qid":0,"question":"Were Scott Derrickson and Ed Wood of the same nationality?","prediction":[[8.234375,[536450,0]],[5.8203125,[967320,0]],[-3.794921875,[2373782,0]],[-9.2421875,[2398774,0]],[-9.3359375,[5032022,1]],[-9.3671875,[1883474,0]],[-9.234375,[1480374,0]],[-9.2265625,[1153807,0]],[-9.375,[430108,0]]]}
{"qid":1,"question":"What government position was held by the woman who portrayed Corliss Archer in the film Kiss and Tell?","prediction":[[8.09375,[2120546,0]],[4.66796875,[4683086,1]],[-9.1875,[967281,0]],[-0.2236328125,[4683086,0]],[-9.2734375,[3195441,1]],[-9.2734375,[1678729,1]],[-9.1171875,[3190546,1]],[-9.046875,[968157,0]],[-9.1171875,[260050,0]]]}
{"qid":2,"question":"What science fantasy young adult series, told in first person, has a set of companion books narrating the stories of enslaved worlds and alien species?","prediction":[[3.802734375,[1249454,1]],[2.923828125,[1249454,0]],[0.56298828125,[126385,0]],[-2.505859375,[157352,0]],[1.6318359375,[176789,2]],[-0.10064697265625,[157352,2]],[-2.232421875,[157352,3]],[-0.270263671875,[176789,0]]]}
{"qid":3,"question":"Are the Laleli Mosque and Esma Sultan Mansion located in the same neighborhood?","prediction":[[7.93359375,[1353673,0]],[8.296875,[2687076,0]],[-6.89453125,[607206,0]],[-8.3671875,[504261,0]],[-9.1015625,[3991928,0]],[-8.625,[3051166,0]],[-9.359375,[1723248,0]],[-8.6875,[4127983,0]]]}
{"qid":4,"question":"The director of the romantic comedy \"Big Stone Gap\" is based in what New York city?","prediction":[[8.1171875,[986521,0]],[8.296875,[5114725,0]],[-9.390625,[4705503,0]],[-9.3359375,[693237,0]],[-9.421875,[3530945,0]],[-8.15625,[5114725,1]],[-9.328125,[4768609,0]],[-9.3359375,[4767627,0]],[-9.3125,[1578479,0]]]}
{"qid":5,"question":"2014 S\/S is the debut album of a South Korean boy group that was formed by who?","prediction":[[6.421875,[3184586,0]],[7.48828125,[3333904,0]],[-6.64453125,[4651129,1]],[-8.671875,[4933676,0]],[-9.03125,[2325846,1]],[-8.6640625,[298793,1]],[-7.8359375,[298793,0]],[-8.2734375,[439851,0]],[-8.7890625,[4933676,3]]]}
{"qid":6,"question":"Who was known by his stage name Aladin and helped organizations improve their performance as a consultant?","prediction":[[7.41796875,[2374219,0]],[6.375,[261087,0]],[-8.9375,[4807944,0]],[-7.8671875,[1919128,0]],[-6.671875,[3192216,0]],[-8.875,[4289908,1]],[-8.203125,[1919128,3]],[-8.2421875,[3191341,0]],[-7.93359375,[3192216,2]]]}
{"qid":7,"question":"The arena where the Lewiston Maineiacs played their home games can seat how many people?","prediction":[[7.30859375,[3533981,0]],[3.328125,[3533979,1]],[0.1922607421875,[3533979,0]],[-9.2265625,[3601971,3]],[-9.2734375,[3532376,0]],[-9.3125,[3532378,0]],[-9.265625,[3532846,0]],[-9.28125,[5214971,0]],[-9.1953125,[608235,1]]]}
{"qid":8,"question":"Who is older, Annie Morton or Terry Richardson?","prediction":[[8.4140625,[3509753,0]],[8.546875,[330695,0]],[-9.0390625,[2204182,0]],[-9.2578125,[4436742,0]],[-8.875,[1829529,2]],[-9.2109375,[3530174,0]],[-9.2578125,[3266617,0]],[-9.265625,[1165865,0]],[-9.0859375,[4336666,0]]]}
{"qid":9,"question":"Are Local H and For Against both from the United States?","prediction":[[8.3984375,[4161666,0]],[8.359375,[4699440,0]],[-9.3125,[537904,0]],[-8.96875,[2744539,0]],[-9.2109375,[4444221,0]],[-9.140625,[4707143,0]],[-9.3671875,[1771660,0]],[-9.1953125,[3407789,0]],[-9.1015625,[2205580,0]]]}
{"qid":10,"question":"What is the name of the fight song of the university whose main campus is in Lawrence, Kansas and whose branch campuses are in the Kansas City metropolitan area?","prediction":[[3.8515625,[2202433,2]],[4.02734375,[2202433,1]],[0.83056640625,[2202433,0]],[-0.72998046875,[1277641,0]],[-4.56640625,[955584,0]],[-5.640625,[955584,3]],[-3.55078125,[4869057,0]],[-7.32421875,[839762,0]],[-6.15625,[1412468,0]]]}
{"qid":11,"question":"What screenwriter with credits for \"Evolution\" co-wrote a film starring Nicolas Cage and Te\u0301a Leoni?","prediction":[[7.45703125,[957780,0]],[5.078125,[1360245,1]],[0.55126953125,[1360245,0]],[-8.46875,[1360239,0]],[-8.84375,[3191607,0]],[-9.046875,[2209894,1]],[-9.2734375,[3556249,0]],[-8.5546875,[2957999,0]],[-8.4765625,[2209894,0]]]}
{"qid":12,"question":"What year did Guns N Roses perform a promo for a movie starring Arnold Schwarzenegger as a former New York Police detective?","prediction":[[4.3359375,[1418521,1]],[4.98828125,[541179,1]],[3.794921875,[1418521,0]],[2.240234375,[541179,0]],[-9.0078125,[3264084,1]],[-8.328125,[4314384,1]],[-8.15625,[2198244,0]],[-7.328125,[537904,0]],[-7.14453125,[537904,1]]]}
{"qid":13,"question":"Are Random House Tower and 888 7th Avenue both used for real estate?","prediction":[[8.0703125,[1453219,0]],[2.814453125,[5197046,0]],[0.4072265625,[5197046,2]],[-7.98046875,[937414,0]],[-7.04296875,[4378635,1]],[-4.37109375,[4378635,0]],[-6.39453125,[937414,1]],[-9.2265625,[3407328,1]],[-9.1484375,[3516718,1]]]}
{"qid":14,"question":"The football manager who recruited David Beckham managed Manchester United during what timeframe?","prediction":[[3.947265625,[363674,0]],[2.0546875,[2795318,3]],[-0.419677734375,[2795318,2]],[-3.294921875,[5196977,0]],[-0.296142578125,[2200838,0]],[-8.375,[341602,2]],[-7.34375,[1857379,0]],[-8.7734375,[4159347,2]],[-8.78125,[4693408,3]]]}
{"qid":15,"question":"Brown State Fishing Lake is in a country that has a population of how many inhabitants ?","prediction":[[6.125,[2627373,0]],[-5.7265625,[2198809,0]],[-7.97265625,[967436,1]],[-4.62109375,[2093873,0]],[-7.81640625,[1684547,2]],[-7.0078125,[3267142,6]],[-7.75390625,[968217,2]],[-8.0859375,[2200173,1]],[-9.203125,[5182786,2]]]}
{"qid":16,"question":"The Vermont Catamounts men's soccer team currently competes in a conference that was formerly known as what from 1988 to 1996?","prediction":[[3.873046875,[2039110,1]],[2.529296875,[2039110,0]],[6.3984375,[2377118,1]],[2.361328125,[2377118,0]],[-9.1484375,[2476154,0]],[-9.09375,[1513000,4]],[-9.078125,[375798,1]],[-9.171875,[4766903,1]],[-9.21875,[3917026,2]]]}
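
As a rough illustration of how one record in this format can be read, the sketch below parses the first three entries of the qid 0 line shown above (the remaining entries are omitted for brevity). Each entry is `[score, [pid, sid]]`; the helper names `ranked_sentences` and `distinct_pids` are hypothetical and chosen only for this example.

```python
import json

# First three prediction entries of the qid 0 record above; the rest are omitted.
line = (
    '{"qid":0,"question":"Were Scott Derrickson and Ed Wood of the same nationality?",'
    '"prediction":[[8.234375,[536450,0]],[5.8203125,[967320,0]],[-3.794921875,[2373782,0]]]}'
)


def ranked_sentences(record, k=5):
    """Return up to k (pid, sid) pairs, ordered from highest to lowest score."""
    ranked = sorted(record["prediction"], reverse=True)
    return [(pid, sid) for _, (pid, sid) in ranked][:k]


def distinct_pids(sentences):
    """Return passage IDs in ranked order without duplicates."""
    return list(dict.fromkeys(pid for pid, _ in sentences))


record = json.loads(line)
sentences = ranked_sentences(record)
pids = distinct_pids(sentences)

print(sentences)
print(pids[:2])  # the two passages that the Psg-EM logic would retain
```

Because the entries are not stored in score order in every record (see qid 3 or qid 16), sorting before truncating matters.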

Let's get the Psg-EM evaluation to be the same and then I'll share how to evaluate hit20.

okhat commented on September 24, 2024

Btw, I quickly simplified the eval logic when copying it here. Just a note to self that the full notebook is at:

/dfs/scratch0/okhattab/Jupyter/Work/2020-Dec/HotPotQA/2021-Apr-Eval.ipynb

okhat commented on September 24, 2024

Btw Psg-EM may be equivalent to hit2 in your evaluation logic. I didn't check but it seems so.

Based on that, I think your hit2 will be very close to 86.7% since most of the examples in your outputs have 2 PIDs only. But it's worth checking that explicitly.
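
One way to check the suspected equivalence explicitly is sketched below. It assumes a definition of hit@k that is not confirmed by the thread (the `eval.py` script is not shown here): an example counts as a hit when every gold passage ID appears among the first k distinct predicted passage IDs. Under that assumption, and with exactly two gold passages per question, hit2 and Psg-EM agree. All IDs are made up and the function names are hypothetical.

```python
def first_pids(ranked_pids, k):
    """First k distinct passage IDs, in ranked order."""
    return list(dict.fromkeys(ranked_pids))[:k]


def hit_at_k(ranked_pids, gold_pids, k):
    """Assumed definition: all gold passages are among the first k distinct predictions."""
    return set(gold_pids) <= set(first_pids(ranked_pids, k))


def passage_em(ranked_pids, gold_pids):
    """Psg-EM as in the logic above: the first two distinct PIDs equal the gold PIDs."""
    return set(first_pids(ranked_pids, 2)) == set(gold_pids)


# (ranked predicted PIDs, gold PIDs), using made-up IDs.
examples = [
    ([11, 22, 99], [11, 22]),  # both gold passages ranked first
    ([33, 55, 44], [33, 44]),  # second gold passage only at rank 3
    ([77, 77, 88], [88, 77]),  # repeated PID from two sentences of one passage
]

agree = all(hit_at_k(p, g, 2) == passage_em(p, g) for p, g in examples)
hit2 = [hit_at_k(p, g, 2) for p, g in examples]
hit3 = [hit_at_k(p, g, 3) for p, g in examples]
print(agree, hit2, hit3)
```

The second example shows why larger cutoffs diverge from Psg-EM: it is a miss at k=2 but a hit at k=3.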

hyukkyukang commented on September 24, 2024

I apologize for the late reply. I've been away due to a health issue.

I've just had the chance to re-evaluate the Psg-EM using the evaluation logic you kindly provided.
I'm pleased to report that I achieved a score of 86.2%, which aligns closely with the 86.7% reported in the original paper.

Thank you for your help!

hyukkyukang commented on September 24, 2024

Btw, I would greatly appreciate it if you could share the training script when you have some time!
