Comments (13)
{
"total": 50,
"validation": {
"running": 0.0,
"completed": 0.1,
"agent context limit": 0.0,
"agent validation failed": 0.0,
"agent invalid action": 0.62,
"task limit reached": 0.28,
"unknown": 0.0,
"task error": 0.0,
"average_history_length": 62.22,
"max_history_length": 91,
"min_history_length": 20
},
"custom": {
"overall": {
"total": 50,
"pass": 5,
"wrong": 45,
"success_rate": 0.1
}
}
}
from agenttuning.
Your output seems like there may be a mismatch in the evaluation setup you've used. Please ensure that you're using the evaluation code from ./AgentBench.old
as mentioned in README, not the latest repo THUDM/AgentBench
. Could you kindly provide your trajectories for a thorough review?
from agenttuning.
Yes, when I use the latest version of them, where do I send the trajectory information?
from agenttuning.
But I can get to 0.84 with gpt-4
{
"total": 50,
"validation": {
"running": 0.0,
"completed": 0.84,
"agent context limit": 0.0,
"agent validation failed": 0.0,
"agent invalid action": 0.04,
"task limit reached": 0.12,
"unknown": 0.0,
"task error": 0.0,
"average_history_length": 50.56,
"max_history_length": 91,
"min_history_length": 21
},
"custom": {
"overall": {
"total": 50,
"pass": 42,
"wrong": 8,
"success_rate": 0.84
}
}
}
from agenttuning.
here is my trajectories for a thorough review in HH.
链接:https://pan.baidu.com/s/1Np291cysxDQDozzr4RiJDQ?pwd=1ijk
提取码:1ijk
from agenttuning.
As mentioned in https://github.com/THUDM/AgentTuning#held-in-tasks
The 6 held-in tasks are selected from AgentBench. However, since AgentBench is still under active development, the results from the latest branch might not fully reproduce the results reported in the paper. The evaluation code of this project is located in ./AgentBench.old.
Please use the AgentBench.old directory at AgentBench.old for Agent task evaluation.
from agenttuning.
But it's just a lot below the latest Agentbench test. a bit unexpected. Make sure that the uploaded model is okay.
from agenttuning.
How much epoch have you trained?
from agenttuning.
How much epoch have you trained?
The models are trained for 2k steps, batch size 64, sequence length 4096 with packing.
from agenttuning.
I use fastchat to fine tune llama2, but the effect was not very ideal. Can you use fastchat to achieve the effect of the paper after fine tuning? Although the batch size I set is not very large at 2, the improvement in completing tasks after fine-tuning is not significant. Do you have any good suggestions?
In addition, chatglm3-6B can reach 64% in HH tasks, which also proves the effectiveness of AgentTuning.
from agenttuning.
in addtion , one of AgentInstruct data is invalid :
{
"conversations": [
{
"from": "human",
"loss": false,
"value": "'''\n Menu
},
{
"from": "gpt",
"loss": true,
"value": ""
}
],
"id": "mind2web_60"
}
from agenttuning.
Since I achieved poor results after fine-tuning with FastChat, I intend to further improve its capabilities by increasing the dataset size.
The approach of expanding the dataset size by using the training data from the AlfWorld dataset , and then evaluating it.
Can this approach be effective? Could you provide some advice?
from agenttuning.
Is alfworld's prompt "alfworld_multiturn_new.json" better than "alfworld_multiturn_react.json"?
from agenttuning.
Related Issues (20)
- Dataset details 中找不到reward的计算方式 HOT 5
- 通用数据如何筛选 HOT 7
- 除了用docker运行,还有其他方式可以运行AgentLM吗? HOT 6
- Finetuning with Mistral or Yi? HOT 1
- 关于TRAJECTORY FILTERING问题 HOT 3
- 请问下agentlm-7b最少需要多少显存可以推理 HOT 5
- 基于fastchat部署,推理异常 HOT 3
- 期待用 Qwen72B 训练的模型。 HOT 1
- 可以给个简单点的工具调用示例吗 HOT 1
- Can I run AgentInstruct data on the AgentBench? HOT 1
- Can you point to the ShareGPT filtered/cleaned data used? HOT 1
- if it is possible to conduct RLHF from env HOT 1
- 训练数据是如何采样的? HOT 3
- 貌似hotpotqa测试脚本跑不起来? HOT 1
- weight decay确定是0.1吗? HOT 1
- 魔塔上的 AgentInstruct 数据集的 conversation 都是空值
- 请问哪里可以找到工作里对于数据库方面的训练数据 HOT 1
- 本地模型
- 训练数据中指令与模型行为不匹配
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from agenttuning.