njucckevin / seeclick Goto Github PK

View Code? Open in Web Editor NEW

161.0 161.0 9.0 6.98 MB

The model, data and code for the visual GUI Agent SeeClick

License: Apache License 2.0

Python 20.82% Shell 0.19% JavaScript 21.22% CSS 2.76% HTML 55.00%

seeclick's People

Contributors

Stargazers

Watchers

Forkers

qiushisun natyren hyun-jin-park epinnock drroad kishanchan chuyg1005 xufangzhi yi-fy

seeclick's Issues

How to inference in 4bit?

When I load the model in 4bit:

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("cckevinn/SeeClick", device_map="auto", trust_remote_code=True, load_in_4bit=True, do_sample=True, temperature=1e-3).eval()
model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)

I get the following error during inference:

RuntimeError: Input type (torch.cuda.ByteTensor) and weight type (torch.cuda.HalfTensor) should be the same

Any plan to open a demo page.

Hello author, thanks for the great work. Is there any plan to open a hf/modelscope demo page? I don't have GPUs to run Qwen, but I am very interested in experiencing the capabilities of SeeClick.

What kind of computing resources does seeclick need when sft is performed on downstream tasks (i.e. Mind2Web)? I tried to sft seeclick on 1*A100, even though batch_size is set to 1, cuda out of memory error is still reported. Thanks!

multi-step operations

Hi, thank you very much for your interesting work!
I have a question, how does LLM perform a series of web page operations after judging the location of the element that needs to be clicked?
I mean, how is LLM implemented for multi-step operations that require multiple pages?
Thank you for your reply.

Can not reproduce the finetuning results of Qwen and Seeclick on Mind2web.

The data format is like:

'Picture 1: /root/data/Mind2Web_related/qwen_image/013781df-4391-4533-bcb1-15f6819064f6-79c4a963-4aa9-49c1-9257-6b0d5069c551.jpg\n
Please generate the next move according to the ui screenshot, instruction and previous actions.
Instruction: What are the romantic reggae musics from BCD Studio that can be used in tik tok series in andorra. Previous actions:'

For images in Mind2web, i tried using the raw size and cropped size(the raw sizes are very large).

I didn't modify the code of finetuning, but the final results are not good. Can you provide me some advices for solving this problem? Thx

Model on huggingface

Hello,

very interesting fine tune!

Would you please upload the model to huggingface?

Question about Vision-Language Adapter

Hello, thanks for sharing your interesting research!

May I ask about the the structure of the Vision-Language Adapter mentioned in this paper and how it works during fine-tuning.
Is this adapter introduced to address a semantic gap between the embeddings of images obtained through ViT and instructions, which leads to poor performance when directly concatenated?

Are there some explanations about items in the mind2web_data_train.json

Hello, when I saw the mind2web_data_train.json, I found the following keys and values. However, it is difficult for me to understand the meaning of each key. Are there some explanations about these keys, such as the data_pw_testid_buckeye_candidate. Thank you.

{"website": "espn", "domain": "Entertainment", "subdomain": "Sports", "annotation_id": "e7e1616e-dd5f-4eb4-a7f1-b757c7880877", "confirmed_task": "Look up the scores for the previous day's NBA games", "action_reprs": ["[link] NBA -> HOVER", "[link] Scores -> CLICK", "[span] Mon -> CLICK"], "actions": [{"action_uid": "fbfa94eb-b0f2-40b4-a0ec-c95ea564d036", "operation": {"original_op": "HOVER", "value": "", "op": "CLICK"}, "pos_candidates": [{"tag": "a", "attributes": "{"backend_node_id": "6959", "bounding_box_rect": "188.90625,66,38.796875,43", "name": "&lpos=sitenavcustom+sitenav_nba", "is_clickable": "true", "data_pw_testid_buckeye_candidate": "1"}", "is_original_target": false, "is_top_level_target": true, "backend_node_id": "6959", "score": 0.9763301610946655, "rank": 0, "choice": "(a id=16 (span (span NBA ) (span NBA ) )"}], "bbox": {"x": 188.90625, "y": 66.0, "width": 38.796875, "height": 43.0}}, {"action_uid": "8450177b-97fb-4355-8b95-ac90354952fa", "operation": {"original_op": "CLICK", "value": "", "op": "CLICK"}, "pos_candidates": [{"tag": "a", "attributes": "{"backend_node_id": "46982", "bounding_box_rect": "198.90625,157.796875,200,40.796875", "name": "&lpos=sitenavcustom+nba_nbascoreboard", "is_clickable": "true", "data_pw_testid_buckeye_candidate": "1"}", "is_original_target": false, "is_top_level_target": true, "backend_node_id": "46982", "score": 0.9753608703613281, "rank": 0, "choice": "(a id=15 (span (span Scores ) (span Scores ) )"}], "bbox": {"x": 198.90625, "y": 157.796875, "width": 200.0, "height": 40.796875}}, {"action_uid": "c14eedcb-631e-4226-a163-698ee94a5047", "operation": {"original_op": "CLICK", "value": "", "op": "CLICK"}, "pos_candidates": [{"tag": "span", "attributes": "{"backend_node_id": "62559", "bounding_box_rect": "331,338.53125,28,14", "class": "Day__Name", "data_pw_testid_buckeye_candidate": "1"}", "is_original_target": true, "is_top_level_target": true, "backend_node_id": "62559", "score": 0.1542425900697708, "rank": 67, "choice": "(span id=12 Mon )"}], "bbox": {"x": 331.0, "y": 338.53125, "width": 28.0, "height": 14.0}}]}

seeclick-aitw checkpoint publish?

有计划发布seeclick在aitw微调后的模型权重嘛？🙏

processing the finetuning(qlora)

In processing the finetuning(qlora), I found the problem below. How do I fix this error? I applied trust_remote_code=True to finetune.py and it still doesn't work like that.

[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/transformers/dynamic_module_utils.py", line 598, in resolve_trust_remote_code
[rank0]: answer = input(
[rank0]: ^^^^^^
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/transformers/dynamic_module_utils.py", line 580, in _raise_timeout_error
[rank0]: raise ValueError(
[rank0]: ValueError: Loading this model requires you to execute custom code contained in the model repository on your local machine. Please set the option trust_remote_code=True to permit loading of this model.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/optimum/gptq/quantizer.py", line 369, in quantize_model
[rank0]: tokenizer = AutoTokenizer.from_pretrained(tokenizer)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/transformers/models/auto/tokenization_auto.py", line 752, in from_pretrained
[rank0]: config = AutoConfig.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/transformers/models/auto/configuration_auto.py", line 1085, in from_pretrained
[rank0]: trust_remote_code = resolve_trust_remote_code(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/transformers/dynamic_module_utils.py", line 611, in resolve_trust_remote_code
[rank0]: raise ValueError(
[rank0]: ValueError: The repository for cckevinn/SeeClick contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/cckevinn/SeeClick.
[rank0]: Please pass the argument trust_remote_code=True to allow custom code to be run.

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]: File "/home/jovyan/work/SeeClick/finetune/finetune.py", line 408, in
[rank0]: train()
[rank0]: File "/home/jovyan/work/SeeClick/finetune/finetune.py", line 312, in train
[rank0]: model = transformers.AutoModelForCausalLM.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
[rank0]: return model_class.from_pretrained(
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3780, in from_pretrained
[rank0]: quantizer.quantize_model(model, quantization_config.tokenizer)
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/jovyan/.local/lib/python3.11/site-packages/optimum/gptq/quantizer.py", line 371, in quantize_model
[rank0]: raise ValueError(
[rank0]: ValueError: We were not able to get the tokenizer using AutoTokenizer.from_pretrained
[rank0]: with the string that you have passed cckevinn/SeeClick. If you have a custom tokenizer, you can pass it as input.
[rank0]: For now, we only support quantization for text model. Support for vision, speech and multimodel will come later.

作者您好！关于finetune后模型测试时读取模型报错

您好！我用finetune下的finetune_lora_ds.sh脚本微调了我自己的aitw数据集，模型是原seeclick的base模型，然后用agent_tasks下的aitw_test.py测试时，在模型读取时发生了如下报错
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00, 2.38it/s]
Traceback (most recent call last):
File "/home/u2020010349/.conda/envs/FLAGENT/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/u2020010349/.conda/envs/FLAGENT/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/u2020010349/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/main.py", line 39, in
cli.main()
File "/home/u2020010349/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
run()
File "/home/u2020010349/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
runpy.run_path(target, run_name="main")
File "/home/u2020010349/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
return _run_module_code(code, init_globals, run_name,
File "/home/u2020010349/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "/home/u2020010349/.vscode-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
exec(code, run_globals)
File "/home/u2020010349/share/yc/FL_agent/SeeClick-main/agent_tasks/lh_make/test_my.py", line 74, in
model = AutoPeftModelForCausalLM.from_pretrained(model_path, device_map="cuda", trust_remote_code=True).eval() # load with lora checkpoint
File "/home/u2020010349/.conda/envs/FLAGENT/lib/python3.8/site-packages/peft/auto.py", line 123, in from_pretrained
tokenizer = AutoTokenizer.from_pretrained(
File "/home/u2020010349/.conda/envs/FLAGENT/lib/python3.8/site-packages/transformers/models/auto/tokenization_auto.py", line 847, in from_pretrained
return tokenizer_class.from_pretrained(
File "/home/u2020010349/.conda/envs/FLAGENT/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2089, in from_pretrained
return cls._from_pretrained(
File "/home/u2020010349/.conda/envs/FLAGENT/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2311, in _from_pretrained
tokenizer = cls(*init_inputs, **init_kwargs)
File "/home/u2020010349/.cache/huggingface/modules/transformers_modules/checkpoint-1589/tokenization_qwen.py", line 120, in init
super().init(**kwargs)
File "/home/u2020010349/.conda/envs/FLAGENT/lib/python3.8/site-packages/transformers/tokenization_utils.py", line 367, in init
self._add_tokens(
File "/home/u2020010349/.cache/huggingface/modules/transformers_modules/checkpoint-1589/tokenization_qwen.py", line 229, in _add_tokens
if surface_form not in SPECIAL_TOKENS + self.IMAGE_ST:
AttributeError: 'QWenTokenizer' object has no attribute 'IMAGE_ST'
我有去qwen下查询相关issue，发现了一个类似的问题，他这里说的是tokenization_qwen.py在136行加入super().init(**kwargs)，奇怪的是我的文件中这行代码位于120行（整个QWenTokenizer类的init的第一句），我也照猫画虎的把这行放到了IMAGE_ST的下面，结果会发生新的报错
Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.transformer.wte.modules_to_save.default.weight: copying a param with shape torch.Size([151936, 4096]) from checkpoint, the shape in current model is torch.Size([151860, 4096]).
size mismatch for base_model.model.lm_head.modules_to_save.default.weight: copying a param with shape torch.Size([151936, 4096]) from checkpoint, the shape in current model is torch.Size([151860, 4096]).
File "/home/u2020010349/share/yc/FL_agent/SeeClick-main/agent_tasks/lh_make/test_my.py", line 74, in
model = AutoPeftModelForCausalLM.from_pretrained(model_path, device_map="cuda", trust_remote_code=True).eval() # load with lora checkpoint
RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
size mismatch for base_model.model.transformer.wte.modules_to_save.default.weight: copying a param with shape torch.Size([151936, 4096]) from checkpoint, the shape in current model is torch.Size([151860, 4096]).
size mismatch for base_model.model.lm_head.modules_to_save.default.weight: copying a param with shape torch.Size([151936, 4096]) from checkpoint, the shape in current model is torch.Size([151860, 4096]).
您在当时有遇到类似的情况吗？有什么思路可以帮助我解决吗？感谢！

数据集貌似是英文的，请问有中文数据集吗？

Provide a finetuning script

You have used the Qwen-VL Model, thus I assume that the finetuning procedure is the same as described here?

https://github.com/QwenLM/Qwen-VL/tree/master/finetune

作者您好！关于aitw的数据集的验证我有个问题想请教！

感谢您的出色工作，对我很有启发，现在我想在您的基础上做一下延展工作。关于aitw的test结果我有个疑问，我观察到test是在aitw_test.py文件的第79行load了一个json文件，名为aitw_data_test.json，我看了这个数据的构成仅包含aitw子数据集下很少的部分数据，请问是以这个结果作为文章里的榜单结果吗？没有在更大一点的测试集上测试吗？

Any plan to release pretraining code?

Thank you for your great work. It seems that the pretraining code is not yet open. Do you plan to open-source it?

Wrong annotations in finetune dataset(widget caption)

The bbox annotation in your finetune dataset (widget caption) may be wrong. Every element in the same screen get the same bbox annotation, which is impossible.
For example:
{ "img_filename": "57800.jpg", "instruction": "delete", "bbox": [ 0.8395833333333333, 0.79296875, 0.9854166666666667, 0.828515625 ] }, { "img_filename": "57800.jpg", "instruction": "dust pin and water drink reminder", "bbox": [ 0.8395833333333333, 0.79296875, 0.9854166666666667, 0.828515625 ] }, { "img_filename": "57800.jpg", "instruction": "move to trash", "bbox": [ 0.8395833333333333, 0.79296875, 0.9854166666666667, 0.828515625 ] },

Broken link to ScreenSpot

I wanted to check out the annotated images for ScreenSpot. Unfortunately the link in the readme.md is broken ScreenSpot. Can you provide the benchmark in another way?

Thank you 😄

Pretraining and finetuning code

Would you please provided the pretraining and finetuning code?

Also, why did you pretrained this model and not just finetuned the qwen vl model?

No LICENSE specified in repo

I've noticed that unfortunately there is no license specified for either the code or the dataset represented in this repository.

Could you please add a LICENSE file in the repo to clarify under what conditions the code and data are released? Otherwise it is not possible for many people to try, reproduce, or help build on your work.

I'd like to suggest a permissive open source license, in case you don't already have a specific one in mind.

Thanks for your attention. Hope this issue can be sorted out soon.

Data cannot be downloaded from box.nju.edu.cn

Dear authors,

I just found the data you provided cannot be found. For example https://box.nju.edu.cn/d/5b8892c1901c4dbeb715/ and https://box.nju.edu.cn/f/6a804cf190dd490a808f/

I was able to access those data yesterday.

Another issue is that the web data in pretrain is really big (130GB) and I always got a downloading error due to long downloading time and the connection will be stopped after about 3 hours. Could you please split the data and reupload? Thank you very much.

many images not found when I process AITW dataset

I download AITW anno and process AITW dataset like this.

cd agent_tasks
python aitw_process.py --imgs_dir aitw_imgs

Is this normal.

Great work! I wonder if seeclick-MiniWob checkpoint and Miniwob 2.8k data are released :)

PC screens in ScreenSpot data are cropped

Dear authors, I am very interested in the ScreenSpot benchmark you created. I wish to ask if you have the original screens for PCs and would like to share them. The reason is that I found those in the ScreenSpot benchmark are mostly cropped. The cropped screens lost a lot of information.

Thanks a lot.

Which weights are updated while pretraining and finetuning?

Thanks for sharing good works:)

When dividing Qwen-VL into ViT, adapter, and LM,
can you clarify which weights are updated while pretraining and finetuning?

Also, I have a question for confirmation. In Figure 1 (a) of the paper, ViT and VL Adapter are not included in LVLM(yellow box).
I think the yellow box is LM, but is it LVLM?

How long did you train for pretraining?

I'm in pretraining Qwen-VL-chat model.

I processed the pretrain data (Table6) by running the code as is.
If you look at gui-grounding-pre-training, it says 3 epochs of learning. But how much learning is correct?

In the paper, Section 3.3, it says around 1 epoch. (... We train Qwen-VL on the dataset we constructed (as described in Section 3.2) for about 10k steps (around 1 epoch) to obtain our GUI base model SeeClick. ...)
Also if I use the options in the code as is, it seems to last much longer than 24 hours, unlike the paper.

I'll wait for your reply:)

Questions about specific APPs in ScreenSpot benchmark.

Which mobile apps, desktop apps, and web pages were used in your ScreenSpot benchmark? Can you list it in detail?

The gpu cost when running finetune/finetune_lora_ds.sh

Hello, may I ask about the gpu cost when running finetune_lora_ds.sh. I have a 49GB*4 machine, however, when I ran the shell, it appeared to be out of memory.

作者您好，在微调使用seeclick ckpt时loss为0

作者您好，
我在使用finetune/finetune_lora_ds.sh做微调，并使用huggingface上的seeclick模型作为base加载后，loss从一开始就为0，请问有遇到过这样的情况吗。
PS：使用QWEN VL Chat int4作为base时，没有上述问题

How to continue fine-tuning the SeeClick with Lora on Mind2WEb/AITW data?

Thanks for the great work! @njucckevin

I tried reproducing SeeClick's performances on AITW and Mind2WEb but encountered a problem.

After finetuing Qwen-VL with the 1M data mentioned in your paper, I got a LoRA checkpoint. Now I want to finetune this model with the LoRA checkpoint on the downstream Mind2Web training data.

When I set --model_name_or_path as the LoRA checkpoint folder named "checkpoint-5200", the finetuning program raised:

OSError: /data/reproduce_seeclick/checkpoint-5200 does not appear to have a file named config.json. Checkout 'https://huggingface.co//data/reproduce_seeclick/checkpoint-5200/tree/None' for available files.

I also tried to merge the LoRA with the Qwen-VL model and used this model as --model_name_or_path, but the finetuning program raised warnings:
`Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:11<00:00, 1.16s/it]
transformer.h.0.attn.c_attn not satisfy lora

transformer.h.0.attn.c_attn not satisfy lora

transformer.h.0.attn.c_attn not satisfy lora`

Could you please clarify further what this mean

pretrain-ckpt: base model for fine-tuning, e.g. SeeClick-pretrain or Qwen-VL

in here?

How is Qwen finetuned for evaluating?

Qwen and Seeclick both finetune the VIT and LLM, or does Qwen only finetune the LLM?

How to handle sliders and swipes?

So, from the provided paper only various click actions accross different domains have been shown?

What about sliders and swipes? These do need a from (x, y) and to (x, y) coordinates. How to obtain them?

What are those rationales for collecting the training datasets?

@njucckevin Hello! Thank you for open-sourcing this great work.

I'm a bit curious about how you collected the training data samples from public data sources.

The paper mentions "We collect approximately 300k web pages from the latest Common Crawl repository to serve as our training data for web UI.". Could you please provide some hints about how you selected the ~300k pages from Common Crawl? Did you consider the popularity or complexity of the selected web pages?

I would appreciate it if you could provide some hints. Thanks!

Question about data license

Hi, Thanks for sharing interesting research:)

What are the licenses for the data used for pretraining (GUI Grounding Pre-training Data) and evaluation (UI Grounding Benchmark: ScreenSpot)?

ValueError: Can't find 'adapter_config.json' at 'cckevinn/SeeClick'

solved