iinemo / lm-polygraph Goto Github PK

License: MIT License

Jupyter Notebook 42.73% Python 53.77% CSS 0.76% JavaScript 1.98% HTML 0.57% Shell 0.11% Dockerfile 0.08%

lm-polygraph's Issues

Semantic entropy is using probabilites greater than 1

For semantic entropy they were using the classwise probability as defined by

Here is an example of that calculation from the same paper.

However, the way this is calculating it, they are adding up all the sample texts outputted, but not taking into account that often sample texts repeat which gets you probabilites greater than 1.
For example, lets say the model outputs 5 outputs. ['Paris','Paris','Paris','Its Paris','London'] with the following likelihoods [0.6,0.6,0.6,0.3,0.1]. Based on the way this library was calculating it theyd get the probability of the first class as 0.6+0.6+0.6+0.3=2.1 and the second class as 0.1. But how can that first class be a probability greater than 1? It shouldn't be because it should be only adding non-repeating classes together. So since those first three outputs are the same then the class probabilites should be 0.9 and 0.1.

In the code you can see it in semantic_entropy.py inside the estimators folder.
for i in range(len(hyps_list)): class_likelihoods = [ np.array(loglikelihoods_list[i])[np.array(class_idx)] for class_idx in self._class_to_sample[i] ] class_lp = [ np.logaddexp.reduce(likelihoods) for likelihoods in class_likelihoods ] if log_weights[i] is None: log_weights[i] = [0 for _ in hyps_list[i]] semantic_logits[i] = -np.mean( [ class_lp[self._sample_to_class[i][j]] * np.exp(log_weights[i][j]) for j in range(len(hyps_list[i])) ] )
class_lp portion is summing all outputs in each class instead of all unique outputs in each class.
This means that the more outputs you generate the larger the uncertainty will get.

Demo doesn't work.

Thank you for the amazing framework! Today when I was trying the following codes (simplest demo), I got the error massage saying that:

return UncertaintyOutput(ue[0], input_text, texts[0], model.model_path, estimator.level)
TypeError: UncertaintyOutput.__init__() takes 5 positional arguments but 6 were given

I am wondering whether the framework is ready to use or you are still implementing them?

from lm_polygraph.utils.model import WhiteboxModel
from lm_polygraph.estimators import *
from lm_polygraph.utils.manager import estimate_uncertainty

ue_method = MeanPointwiseMutualInformation()
estimator = SemanticEntropy()

model = WhiteboxModel.from_pretrained(
    "bigscience/bloom-560m",
    device="cuda:0",
)

input_text = "Who is George Bush?"
estimate_uncertainty(model, ue_method, input_text=input_text)

Error loading larger models - You shouldn't move a model when it is dispatched on multiple devices

The code

model = WhiteboxModel.from_pretrained(
    "tiiuae/falcon-40b-instruct",
    cache_dir="~/cache/",
    device_map='auto', 
    offload_folder="offload_folder"

Throws the error You shouldn't move a model when it is dispatched on multiple devices.

While

model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-40b-instruct", 
                                             trust_remote_code=True, 
                                             cache_dir="~/cache/",
                                             device_map="auto",
                                             offload_folder="offload_folder")

seems to work fine :/

using the openai API for a BlackBox model for non OPENAI hosted platforms

Hi,

thanks for providing the community with this library. I believe uncertainties of LLM queries are an important topic. I tried to play around with this library and am a bit stuck. So I'd like to use a remote model that is accessible through the openai library. For this, I have to provide a custom OPENAI_API_BASE and my OPENAI_API_KEY. However, the library tells me that it doesn't know how to query the remote model?

Here is the code that I drafted given your example:

def main():
    print(f":: black box test, using Mistral-7B-Instruct-v0.2 from {os.environ["OPENAI_API_BASE"]}")
    model = BlackboxModel(openai_api_key=os.environ["OPENAI_API_KEY"], model_path="Mistral-7B-Instruct-v0.2", parameters={"openai_api_base": os.environ["OPENAI_API_BASE"]})

    print(model.parameters)

    print(":: using estimator EigValLaplacian")
    estimator = EigValLaplacian(verbose=True)
    answer = estimate_uncertainty(
        model, estimator, input_text="When did Albert Einstein die?"
    )
    print(">>",answer)

So I get the following error:

:: using estimator EigValLaplacian
Traceback (most recent call last):
  File "/home/steinb95/development/lm-polygraph/lm-polygraph/examples/./black_box.py", line 23, in <module>
    main()
  File "/home/steinb95/development/lm-polygraph/lm-polygraph/examples/./black_box.py", line 16, in main
    answer = estimate_uncertainty(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/home/steinb95/development/lm-polygraph/lm-polygraph/src/lm_polygraph/utils/manager.py", line 166, in estimate_uncertainty
    man()
  File "/home/steinb95/development/lm-polygraph/lm-polygraph/src/lm_polygraph/utils/manager.py", line 400, in __call__
    batch_stats = self.calculate(batch_stats, self.stat_calculators, inp_texts)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/steinb95/development/lm-polygraph/lm-polygraph/src/lm_polygraph/utils/manager.py", line 534, in calculate
    raise e
  File "/home/steinb95/development/lm-polygraph/lm-polygraph/src/lm_polygraph/utils/manager.py", line 518, in calculate
    new_stats = stat_calculator(
                ^^^^^^^^^^^^^^^^
  File "/home/steinb95/development/lm-polygraph/lm-polygraph/src/lm_polygraph/stat_calculators/sample.py", line 46, in __call__
    temperature=model.parameters.temperature,
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'dict' object has no attribute 'temperature'

I tried a couple of things, but I am simply unclear where to supply the temperature?

Best
P

Example for normalizaiton

Hi, Team! Thank you very much for the library construction.
I see there are many normalizers to transform uncertainty score into a probability, can we have a notebook example of how to use it with different estimators? Like the next step after "estimate_uncertainty()".

Thank you very much!
Best,
Johnny

Entropy calculation maybe wrong?

Hello,

I am not 100% sure, but I believe the entropy calculation is wrong here: https://github.com/IINemo/lm-polygraph/blob/main/src/lm_polygraph/stat_calculators/entropy.py

On line 43, shouldn't you compute the sum instead of the mean?

Also, the entropy should be calculated with base 2. The log probabilities (logprobs) returned by the HuggingFace language models typically use the natural logarithm (base e).

What is the different between pe_uncertainties and ep_uncertainties?

Possible mismatching max_length and max_new_tokens in example eval script

I was running the polygraph_eval with this example config https://github.com/IINemo/lm-polygraph/blob/main/examples/configs/polygraph_eval_wmt14_ende.yaml

I got a warning about too long string

It didn't failed but i almost sure it would mess up with the results

My wild guess is that stat calculators max_lengths in not connected to max_generated_tokens in ue manager itself. But i didn't really look into it for now

Dockerfile adjustments

The path to requirements.txt is incorrect in the Dockerfile -- it is located in the main directory of the project.
The app directory and the CMD ["polygraph_server"] command are only required if one desires to run frontend. I did not use the frontend app so I skipped those. It might make sense to make a separate Dockerfile for those who only intend to use the methods of the framework to avoid installing extra packages.
There is no jupyter in the requirements which is needed to run the demo notebooks.
CUDA drivers are not pulled by default in the Dockerfile but I assume this depends on the specific hardware configurations. I have used nvcr.io/nvidia/pytorch:24.05-py3 image to use CUDA on my cluster.

Not 100% sure if sampling parameters is correct

https://github.com/IINemo/lm-polygraph/blame/00851972311db837de6358c90dac65616495d4de/src/lm_polygraph/stat_calculators/sample.py#L27

It's now setting both top_k=50 and top_p if we just set top_p, if i understand code correctly. Having both top_k and top_p is kind of weird to me.
Also there seems there is no do_sample; from the hf docs, we could accidentaly get contrastive_search() instead of sample()

Would be nice to test this

Get the uncertainty scores without rerun the models

Thanks again for your work!

I noticed that in your framework, we need to first run the model then get the uncertainty scores. While it's perfectly fine when using free models, it could be expensive when working with charging APIs like ChatGPT.

Specifically, I'm curious if there's a way to obtain uncertainty measures for previously generated texts without having to rerun the model.

Any information or suggestions you can offer in this regard would be greatly appreciated. I look forward to hearing from you and learning more about this possibility.

Get the uncertainty scores without rerun the models (for NumSets, Deg, Ecc)

Thank you for providing the codes for the previously generated text! They have been very helpful, and I've successfully used them for Lexical Similarity analysis. I'm planning to test them for other measurements, including NumSets, Degree matrix (Deg), and Eccentricity.

I noticed that these measurements require two additional statistics: semantic_matrix_entail and semantic_matrix_contra. According to the original paper, I know that these are calculated using DeBERTa over generated samples. I'm wondering if there are any short code snippets available to compute these matrices and feed them into the estimator function.

Thanks!

May I know how to run an ensemble-based uncertainty estimation method?

Hi, I have successfully run the information-based and density-based methods.

May I know how to run the ensemble-based uncertainty estimation method?

About the generation metric for questions with multiple correct answers

I am wondering for questions with multiple correct answers (or those with many alternative answers), can the current generation metrics handle this?

Thank you for the great codebase!

[Question] Pipeline integration (Langchain)

Good day! Is there any way to use the LM-polygraph in an LLM pipeline, created with the Langchain framework?

Load custom estimators and stat_calculators in the evaluation script

The evaluation script should be able to load custom estimators and stat_calculators (see https://github.com/IINemo/lm-polygraph/tree/proposal)
The evaluation script should not be aware how stat_calculators or estimators are created. This should be encapsulated in the correspondent factories.
The factory for stat_calculators should have an access to the "environment" object, so each factory is aware what objects were created by factories for other stat_calculators.

The proposal:

Implements loading stat_calculators using a custom python module. The factory module should implement the function: load_stat_calculator
Implements loading estimators using a custom python module.
Manager accepts the builder_stat_calculators (environmental object that allows factories for stat_calculators to communicate with each other). For now the construction of stat_calculators and estimators is implemented inside the Manager.
defaults -- implements default factories for implemented stat_calculators and estimators.

generate_texts on wbmodel ignores generation parameters and stopping critera.

What title says.

This causes foundation models to generate lots of unnecessary text, introduces potential discrepancy between sampling and greedy generation with generate, and possible other less obvious problems. Problematic behavior can be reproduced by only having blackbox_sample_texts in required stats with no sample_texts on any foundation model with few-shot continuation prompt.

This calls for some streamlining of generation when using white box. Do we really need a separate generation method to pretend we are black-box when calculating things like semantic matrix on WB model? Can we call self.generate instead of self.model.generate in generate texts?

@ArtemVazh @cant-access-rediska0123 your thoughts?

AutoModelForCausalLM max_length

Is there a reason why the max length is equal to 256?

In models.py:

AutoModelForCausalLM.from_pretrained(max_length=256)

iinemo / lm-polygraph Goto Github PK

lm-polygraph's Issues

Recommend Projects

Recommend Topics

Recommend Org