Comments (2)
Regarding the FAISS search option: how are the adapters being selected? In my implementation I performed k-means on the data, generating a set of clusters. The centroids would be saved inside the FAISS vector store, and the embedding of the query would select the K closest centroids, which are then combined to build a new adapter. Is your mechanism similar? How did you divvy up the training set among the experts?
The airoboros dataset generation tool inherently generates many separate types of training data via "instructors". Each instructor has its own prompt/config and can be used to generate task-specific training data.
https://github.com/jondurbin/airoboros/tree/main/airoboros/instructors
During the dataset generation process, the output data is labeled with a "category" field corresponding to the instructor that generated it. I then just split the data into experts using this category; an example here:
https://github.com/jondurbin/airoboros/blob/main/scripts/segment_experts.py
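The category-based split can be sketched roughly like this. This is a hedged illustration, not the actual script: the expert groupings in `EXPERT_CATEGORIES` and the file names are invented; see `segment_experts.py` for the real mapping.

```python
import json
from collections import defaultdict

# Hypothetical expert groupings for illustration only.
EXPERT_CATEGORIES = {
    "qa": {"quiz", "multiple_choice", "trivia"},
    "creative": {"writing", "roleplay", "song"},
    "coding": {"coding"},
}

def segment(instructions_path="instructions.jsonl"):
    # Bucket each generated sample by the "category" field its
    # instructor wrote, then map categories onto experts.
    buckets = defaultdict(list)
    with open(instructions_path) as infile:
        for line in infile:
            item = json.loads(line)
            category = item.get("category", "general")
            expert = next(
                (name for name, cats in EXPERT_CATEGORIES.items() if category in cats),
                "general",
            )
            buckets[expert].append(item)
    # One JSONL file per expert becomes that expert's fine-tuning set.
    for expert, items in buckets.items():
        with open(f"expert_{expert}.jsonl", "w") as outfile:
            for item in items:
                outfile.write(json.dumps(item) + "\n")
    return {expert: len(items) for expert, items in buckets.items()}
```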
One of the routing options is faiss index search, which requires packaging up the fine-tuning data used to train each expert with the adapter. The lmoe package has routing data, training data, and adapters. Routing data here is just the system prompt + instruction, without response.
https://huggingface.co/jondurbin/airoboros-lmoe-70b-2.1/tree/main
To use the faiss index, you specify --router-max-samples (how many random samples from the routing data to include in the index for each expert; higher values produce better results but are slower to load) and --router-k, which is the k in the approximate kNN search. The input system prompt + instruction is used to search against the faiss indices, and the average distance from the kNN search is the selection mechanism: lowest score (most similar) wins.
One thing I struggled with was the size of the embeddings. Many embedding models only support a context length of 512, which means large training samples would be truncated. The only embedding models I know of with a respectable context length are the OpenAI embeddings.
This isn't a perfect solution by any means - an embedding model with larger context window would be better - but as it turns out you can actually average the embeddings for multiple chunks and still get reasonable performance, so long as the input document doesn't have a huge variety of topics:
https://github.com/jondurbin/airoboros/blob/main/airoboros/embeddings.py
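The chunk-averaging workaround amounts to something like this sketch: split the document into chunks that fit the model's window, embed each chunk, and average the vectors. The `embed` callable and the character-based chunking are placeholders, not the actual airoboros implementation.

```python
import numpy as np

def chunk_text(text: str, max_chars: int = 2000):
    # Naive fixed-width chunking; a real implementation would chunk by
    # tokens to respect the embedding model's context limit.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def average_embedding(text: str, embed) -> np.ndarray:
    # Embed each chunk, then average; works reasonably as long as the
    # document doesn't cover a huge variety of topics.
    vectors = np.stack([embed(chunk) for chunk in chunk_text(text)])
    mean = vectors.mean(axis=0)
    norm = np.linalg.norm(mean)
    # Re-normalize so the result is still usable for cosine similarity.
    return mean / norm if norm > 0 else mean
```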
Regarding the agent-based routing: if I understand correctly, the "function" agent is a separate LoRA trained on "executive-level" function calling of the experts. You must be dynamically swapping between the function agent and whatever expert was selected. Furthermore, what dataset did you use to train your function agent?
Yes, there is a "function" adapter trained on data generated by the "rewoo"-style execution planning and basic function calling instructors. The function router is loaded by default as the first adapter and is used as the routing mechanism for each request if you specify --agent-router. The dataset is custom synthetic data generated by this tool.
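Conceptually, the agent-routing step looks something like the sketch below. The prompt wording and the `generate_fn` stand-in (running the model with the "function" adapter active) are invented for illustration; after selection, a PEFT-style `model.set_adapter(chosen)` would hot-swap to the chosen expert's LoRA without reloading the base model.

```python
def agent_route(generate_fn, instruction, experts, default="general"):
    # Ask the always-first-loaded "function" adapter to name an expert.
    # generate_fn stands in for a generation call with that adapter active.
    prompt = (
        "Select the single best expert for this request from: "
        + ", ".join(sorted(experts))
        + "\nRequest: " + instruction
    )
    chosen = generate_fn(prompt).strip().lower()
    # Fall back if the model answers with something outside the expert list.
    return chosen if chosen in experts else default
```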
Regarding the inference server: how were you able to get your dynamic system to work with inference servers such as vLLM? Do you need to restart inference every time a new LoRA is selected, or does the swapping work dynamically? In addition, where are the LoRAs stored? Are all of the experts preloaded into video memory, or can you pull them from disk whenever necessary? If the experts are stored only on disk and can be loaded as needed, you could theoretically store thousands of specialized adapters on the hard disk, giving you an inconceivable knowledge base.
I have a vllm inference option, but the output quality is quite low, so something must be off; I haven't had a chance to really dig into it.
https://github.com/jondurbin/airoboros/blob/main/airoboros/lmoe/vllm.py
With vllm, you need to adjust the weights each time an adapter is selected, but again something is off here so I wouldn't use this option yet.
https://github.com/jondurbin/airoboros/blob/main/airoboros/lmoe/lora.py
The other/default API server uses BetterTransformer and flash attention to improve inference speed, but it's still fairly slow in comparison (with much higher quality output, though).
The last routing option is just manually adding "expert": "{expert}" in the JSON payload, if you know ahead of time which adapter you'd like to use.
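A request using manual routing might look like the sketch below; the model name and the chat-style message shape are assumptions, with only the "expert" key taken from the description above.

```python
import json

def build_payload(instruction: str, expert: str) -> str:
    # Hypothetical request body; the "expert" key tells the server to
    # use that adapter directly instead of running faiss/agent routing.
    body = {
        "model": "airoboros-lmoe-7b-2.1",  # assumed model name
        "messages": [{"role": "user", "content": instruction}],
        "expert": expert,
    }
    return json.dumps(body)
```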
For this proof of concept, the LoRAs are all loaded into vram before the API server starts up. For the 7b model, for example, the base model load consumes ~13.5GB of vram, and with all adapters loaded it consumes ~17.5GB. In practice, this doesn't scale all that well on a single machine; you'd want multiple backend servers with the routing placed in front of them, each server hosting a handful of adapters. You could also dynamically load the adapters from a very fast memory store instead of caching them in vram, but that would be significantly slower.
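The multi-backend layout described above could be dispatched with something as simple as this sketch; the hostnames and expert assignments are invented.

```python
# Each backend keeps a handful of adapters resident in vram; a thin
# router in front forwards each request to whichever backend hosts
# the selected expert.
BACKENDS = {
    "http://backend-1:8000": {"coding", "function"},
    "http://backend-2:8000": {"creative", "general"},
}

def backend_for(expert: str) -> str:
    for url, hosted in BACKENDS.items():
        if expert in hosted:
            return url
    raise KeyError(f"no backend hosts expert {expert!r}")
```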
I'm hoping to create a hosted airoboros service where people can contribute to creating many of these adapters so we can indeed have thousands. I can optimize for that use case if and when we get to that point.
Thank you for taking the time to respond to my question, as well as providing the links to the code.
You answered all the questions that I had. I am excited to see the future of this project!