
Comments (13)

Dhaizei commented on June 12, 2024

{
  "total": 50,
  "validation": {
    "running": 0.0,
    "completed": 0.1,
    "agent context limit": 0.0,
    "agent validation failed": 0.0,
    "agent invalid action": 0.62,
    "task limit reached": 0.28,
    "unknown": 0.0,
    "task error": 0.0,
    "average_history_length": 62.22,
    "max_history_length": 91,
    "min_history_length": 20
  },
  "custom": {
    "overall": {
      "total": 50,
      "pass": 5,
      "wrong": 45,
      "success_rate": 0.1
    }
  }
}
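
A quick sanity check on a summary like this can show where the run is failing. The sketch below is only an illustration: it assumes the non-length keys under "validation" are outcome fractions that should sum to 1, and the `summarize` helper is a hypothetical name, not part of AgentBench.

```python
import json

# Result summary copied from the comment above.
RESULT_JSON = """
{
  "total": 50,
  "validation": {
    "running": 0.0,
    "completed": 0.1,
    "agent context limit": 0.0,
    "agent validation failed": 0.0,
    "agent invalid action": 0.62,
    "task limit reached": 0.28,
    "unknown": 0.0,
    "task error": 0.0,
    "average_history_length": 62.22,
    "max_history_length": 91,
    "min_history_length": 20
  },
  "custom": {
    "overall": {"total": 50, "pass": 5, "wrong": 45, "success_rate": 0.1}
  }
}
"""


def summarize(result: dict) -> dict:
    """Hypothetical helper: check internal consistency of one result file."""
    validation = result["validation"]
    # Keys ending in "history_length" are statistics, not outcome fractions.
    outcomes = {
        key: value
        for key, value in validation.items()
        if not key.endswith("history_length")
    }
    overall = result["custom"]["overall"]
    recomputed_rate = overall["pass"] / overall["total"]
    return {
        "outcome_sum": round(sum(outcomes.values()), 6),
        "dominant_outcome": max(outcomes, key=outcomes.get),
        "success_rate_matches": abs(recomputed_rate - overall["success_rate"]) < 1e-9,
    }


summary = summarize(json.loads(RESULT_JSON))
print(summary)
```

For this run the largest outcome fraction is "agent invalid action", so the model's output format is worth inspecting before anything else.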

from agenttuning.

lr-tsinghua11 commented on June 12, 2024

Your output suggests a mismatch in the evaluation setup. Please make sure you are using the evaluation code from ./AgentBench.old, as mentioned in the README, not the latest THUDM/AgentBench repo. Could you share your trajectories for a thorough review?

Dhaizei commented on June 12, 2024

Yes, I was using the latest version. Where should I send the trajectory information?

Dhaizei commented on June 12, 2024

But I can reach 0.84 with gpt-4:

{
  "total": 50,
  "validation": {
    "running": 0.0,
    "completed": 0.84,
    "agent context limit": 0.0,
    "agent validation failed": 0.0,
    "agent invalid action": 0.04,
    "task limit reached": 0.12,
    "unknown": 0.0,
    "task error": 0.0,
    "average_history_length": 50.56,
    "max_history_length": 91,
    "min_history_length": 21
  },
  "custom": {
    "overall": {
      "total": 50,
      "pass": 42,
      "wrong": 8,
      "success_rate": 0.84
    }
  }
}

Dhaizei commented on June 12, 2024

Here are my trajectories for the HH task, for a thorough review.
Link: https://pan.baidu.com/s/1Np291cysxDQDozzr4RiJDQ?pwd=1ijk
Extraction code: 1ijk

lr-tsinghua11 commented on June 12, 2024

As mentioned in https://github.com/THUDM/AgentTuning#held-in-tasks

The 6 held-in tasks are selected from AgentBench. However, since AgentBench is still under active development, the results from the latest branch might not fully reproduce the results reported in the paper. The evaluation code of this project is located in ./AgentBench.old.

Please use the ./AgentBench.old directory for agent task evaluation.

Dhaizei commented on June 12, 2024

But the score is far below what I would expect on the latest AgentBench, which is a bit unexpected. Could you check that the uploaded model is okay?

Dhaizei commented on June 12, 2024

How many epochs did you train for?

Btlmd commented on June 12, 2024

> How many epochs did you train for?

The models are trained for 2k steps, batch size 64, sequence length 4096 with packing.
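
As a rough worked example, those numbers give an upper bound on the tokens seen during training. This is only a sketch: it assumes every packed sequence is filled to the full 4096 tokens, and the dataset size passed to `approx_epochs` is a hypothetical placeholder, not a figure from the paper.

```python
# Figures quoted in the reply above.
steps = 2_000
batch_size = 64
seq_len = 4_096

sequences_seen = steps * batch_size      # packed sequences consumed in total
tokens_per_step = batch_size * seq_len   # tokens in one optimizer step
total_tokens = steps * tokens_per_step   # upper bound, assuming full packing


def approx_epochs(dataset_tokens: int) -> float:
    """Approximate passes over a dataset of `dataset_tokens` tokens."""
    return total_tokens / dataset_tokens


print(f"{sequences_seen:,} sequences, {total_tokens:,} tokens")
# Hypothetical dataset size, for illustration only.
print(f"about {approx_epochs(100_000_000):.2f} epochs over a 100M-token dataset")
```

Because the run is defined in steps rather than epochs, the epoch count depends on the token count of the training mix, which the reply does not state.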

Dhaizei commented on June 12, 2024

I used FastChat to fine-tune Llama 2, but the results were not ideal. Were you able to reproduce the paper's results with FastChat fine-tuning? My batch size is small (2), and the improvement in task completion after fine-tuning is not significant. Do you have any suggestions?
In addition, chatglm3-6B reaches 64% on the HH task, which also supports the effectiveness of AgentTuning.
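
One common way to close the gap between a batch size of 2 and the batch size of 64 mentioned earlier in this thread is gradient accumulation. The sketch below is an assumption-laden illustration: it treats the batch size of 2 as a per-device value (the comment does not say), and `grad_accum_steps` is a hypothetical helper, not a FastChat function.

```python
def grad_accum_steps(target_batch: int, per_device_batch: int, num_devices: int) -> int:
    """Accumulation steps needed so the effective batch equals target_batch."""
    per_step = per_device_batch * num_devices
    if per_step <= 0 or target_batch % per_step != 0:
        raise ValueError(
            f"target batch {target_batch} is not a multiple of {per_step} samples per step"
        )
    return target_batch // per_step


# Effective batch = per_device_batch * num_devices * accumulation steps.
print(grad_accum_steps(64, 2, 1))  # single GPU
print(grad_accum_steps(64, 2, 8))  # eight GPUs
```

If the training script is built on the Hugging Face Trainer, the corresponding `TrainingArguments` fields are `per_device_train_batch_size` and `gradient_accumulation_steps`. Matching the effective batch size alone may not reproduce the paper's numbers, since the step count, sequence length, and packing also differ.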

Dhaizei commented on June 12, 2024

In addition, one of the AgentInstruct samples is invalid (the gpt turn has an empty value):
{
"conversations": [
{
"from": "human",
"loss": false,
"value": "'''\n Menu

Model S Model 3 Model X Model Y
Email Address
Zip Code
Contact
\n'''\n\nBased on the HTML webpage above, try to complete the following task:\nTask: Schedule a demo drive for Model Y for Roy Adams with phone number 123-999-0000, email address [email protected] and zip code 90001 in the United States.\nPrevious actions:\n[link] Demo Drive -> CLICK\n[button] Model Y -> CLICK\n[textbox] Last Name -> TYPE: Adams\n[textbox] First Name -> TYPE: Roy\n[textbox] Phone Number -> TYPE: 123-999-0000\nWhat should be the next action? Please select from the following choices (If the correct action is not in the page above, please select A. 'None of the above'):\n\nA. None of the above\nB. Menu \nC. Model Y \nD. \nE. \nF. Contact "
},
{
"from": "gpt",
"loss": true,
"value": ""
}
],
"id": "mind2web_60"
}
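
Samples like this can be found with a short scan before training. The sketch below assumes the data is a list of records shaped like the one above ("conversations" with "from", "loss", and "value" fields, plus an "id"); `find_empty_targets` and the second sample record are hypothetical and only serve the illustration.

```python
from typing import Iterable


def find_empty_targets(records: Iterable[dict]) -> list[str]:
    """Return ids of records that contain a trained gpt turn with no text."""
    bad_ids = []
    for record in records:
        for turn in record.get("conversations", []):
            is_target = turn.get("from") == "gpt" and turn.get("loss")
            if is_target and not (turn.get("value") or "").strip():
                bad_ids.append(record.get("id", "<no id>"))
                break  # one empty target is enough to flag the record
    return bad_ids


samples = [
    {
        # Shortened copy of the record reported above.
        "conversations": [
            {"from": "human", "loss": False, "value": "Based on the HTML webpage above, ..."},
            {"from": "gpt", "loss": True, "value": ""},
        ],
        "id": "mind2web_60",
    },
    {
        # Hypothetical well-formed record for contrast.
        "conversations": [
            {"from": "human", "loss": False, "value": "What should be the next action?"},
            {"from": "gpt", "loss": True, "value": "Answer: A"},
        ],
        "id": "example_ok",
    },
]

print(find_empty_targets(samples))
```

To scan a real file, load it with `json.load` and pass the resulting list to the function. Flagged records can then be dropped or repaired before fine-tuning.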

Dhaizei commented on June 12, 2024

Since I got poor results after fine-tuning with FastChat, I intend to improve the model further by increasing the dataset size.
My plan is to expand the dataset with the training data from the ALFWorld dataset and then evaluate the result.
Could this approach be effective? Could you provide some advice?

Dhaizei commented on June 12, 2024

Is the ALFWorld prompt "alfworld_multiturn_new.json" better than "alfworld_multiturn_react.json"?
