
Comments (2)

jondurbin commented on July 30, 2024

Regarding the FAISS search option: how are the adapters being selected? In my implementation I performed k-means on the data, generating a set of clusters. The centroids were saved in the FAISS vector store, and the embedding of the query would select the K closest centroids, which were then combined to build a new adapter. Is your mechanism similar? How did you divvy up the training set among the experts?

The airoboros dataset generation tool inherently generates many separate types of training data via "instructors". Each instructor has its own prompt/config and can be used to generate task-specific training data.
https://github.com/jondurbin/airoboros/tree/main/airoboros/instructors

During the dataset generation process, the output data is labeled with a "category" field corresponding to the instructor that generated it. I then just split the data into experts by this category; an example here:
https://github.com/jondurbin/airoboros/blob/main/scripts/segment_experts.py
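A minimal sketch of that category-based split (the expert names and category mapping below are hypothetical illustrations, not the actual contents of segment_experts.py):

```python
from collections import defaultdict

# Hypothetical mapping from instructor categories to expert names.
EXPERT_CATEGORIES = {
    "qa": ["quiz", "multiple_choice", "trivia"],
    "coding": ["coding"],
    "creative": ["writing", "roleplay"],
}

def segment_by_category(samples):
    """Group training samples into expert buckets via their category field."""
    category_to_expert = {
        cat: expert
        for expert, cats in EXPERT_CATEGORIES.items()
        for cat in cats
    }
    buckets = defaultdict(list)
    for sample in samples:
        # Anything without a dedicated expert falls back to "general".
        expert = category_to_expert.get(sample["category"], "general")
        buckets[expert].append(sample)
    return dict(buckets)

samples = [
    {"category": "coding", "instruction": "Write a sort function."},
    {"category": "trivia", "instruction": "Who wrote Dune?"},
    {"category": "haiku", "instruction": "Hello."},
]
buckets = segment_by_category(samples)
```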

One of the routing options is faiss index search, which requires packaging up the fine-tuning data used to train each expert with the adapter. The lmoe package has routing data, training data, and adapters. Routing data here is just the system prompt + instruction, without response.
https://huggingface.co/jondurbin/airoboros-lmoe-70b-2.1/tree/main

To use faiss index search, you specify --router-max-samples (how many random samples from the routing data to include in the index for each expert; higher values produce better results but are slower to load) and --router-k, which is the k in the approximate knn search. The input system prompt + instruction is used to search against the faiss indices, and the average distance from the knn search is the selection mechanism: lowest score (most similar) wins.
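The selection mechanism can be sketched roughly as follows, with a brute-force numpy search standing in for the FAISS index (the function names and the use of L2 distance are illustrative assumptions, not the actual airoboros code):

```python
import numpy as np

def route(query_emb, expert_indices, k=10):
    """Pick the expert whose k nearest routing samples have the
    lowest mean distance to the query embedding."""
    best_expert, best_score = None, float("inf")
    for expert, embeddings in expert_indices.items():
        # Brute-force L2 distances; FAISS does this approximately.
        dists = np.linalg.norm(embeddings - query_emb, axis=1)
        score = np.sort(dists)[: min(k, len(dists))].mean()
        if score < best_score:
            best_expert, best_score = expert, score
    return best_expert

# Toy routing data: two well-separated clusters of sample embeddings.
rng = np.random.default_rng(0)
experts = {
    "coding": rng.normal(loc=0.0, size=(50, 8)),
    "creative": rng.normal(loc=5.0, size=(50, 8)),
}
query = np.zeros(8)  # closer to the "coding" cluster
chosen = route(query, experts)
```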

One thing I struggled with was the context size of the embedding models. Many of them only support a context length of 512 tokens, which means large training samples would be truncated. The only embedding models I know of with a respectable context length are the OpenAI embeddings.

This isn't a perfect solution by any means - an embedding model with a larger context window would be better - but as it turns out you can actually average the embeddings of multiple chunks and still get reasonable performance, as long as the input document doesn't cover a huge variety of topics:
https://github.com/jondurbin/airoboros/blob/main/airoboros/embeddings.py
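A rough sketch of that chunk-averaging trick. Here `embed` is a deterministic fake placeholder rather than a real embedding model, and the character-based chunking is a simplifying assumption:

```python
import numpy as np

def embed(text):
    """Placeholder embedding: a seeded random unit vector per input."""
    rng = np.random.default_rng(len(text) * 2654435761 % (2**32))
    v = rng.normal(size=16)
    return v / np.linalg.norm(v)

def embed_long(text, max_chars=512):
    """Embed each chunk separately, average, and re-normalize.
    Works reasonably when the document stays on one topic."""
    chunks = [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
    mean = np.mean([embed(c) for c in chunks], axis=0)
    return mean / np.linalg.norm(mean)

vec = embed_long("airoboros routing data " * 100)
```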

Regarding the agent-based routing: if I understand correctly, the "function" agent is a separate LoRA trained on "executive level" function calling of the experts. You must be dynamically swapping between the function agent and whatever expert was selected. Furthermore, what dataset did you use to train your function agent?

Yes, there is a "function" adapter trained on data generated by the "rewoo"-style execution planning and basic function calling instructors. The function router is loaded by default as the first adapter and is used as the routing mechanism for each request if you specify --agent-router. The dataset is custom synthetic data generated by this tool.

Regarding the inference server: how were you able to get your dynamic system to work with inference servers such as vLLM? Do you need to restart inference every time a new LoRA is selected, or does the swapping work dynamically? In addition, where are the LoRAs stored? Are all of the experts preloaded into video memory, or can you pull them from disk whenever necessary? If the experts are stored only on disk and can be loaded as needed, you could theoretically store thousands of specialized adapters on the hard disk, giving you an inconceivable knowledge base.

I have a vllm inference option, but the output quality is quite low, so something must be off; I haven't had a chance to really dig into it.
https://github.com/jondurbin/airoboros/blob/main/airoboros/lmoe/vllm.py

With vllm, you need to adjust the weights each time an adapter is selected, but again, something is off here, so I wouldn't use this option yet.
https://github.com/jondurbin/airoboros/blob/main/airoboros/lmoe/lora.py
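For intuition, "adjusting the weights" amounts to merging the low-rank LoRA delta (scaling * B @ A) into each base weight matrix, and subtracting it again before the next adapter is applied. A minimal numpy sketch of that idea, not the actual airoboros/vllm implementation:

```python
import numpy as np

def apply_lora(W, A, B, scaling):
    """Merge the low-rank LoRA delta into the base weight."""
    return W + scaling * (B @ A)

def remove_lora(W_merged, A, B, scaling):
    """Undo the merge so a different adapter can be applied next."""
    return W_merged - scaling * (B @ A)

rng = np.random.default_rng(1)
d, r = 8, 2                      # hidden size, LoRA rank (toy values)
W = rng.normal(size=(d, d))      # base weight matrix
A = rng.normal(size=(r, d))      # LoRA down-projection
B = rng.normal(size=(d, r))      # LoRA up-projection
scaling = 16 / r                 # alpha / rank

W_merged = apply_lora(W, A, B, scaling)
W_restored = remove_lora(W_merged, A, B, scaling)
```

Merging like this keeps inference at the base model's speed (no extra matmuls per token), at the cost of a weight-update pass on every adapter swap.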

The other (default) API server uses bettertransformers and flash attention to improve inference speed; it's still fairly slow in comparison, but produces much higher quality output.

The last routing option is to manually add "expert": "{expert}" to the JSON payload, if you know ahead of time which adapter you'd like to use.
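For example, a request body with manual expert selection might look like this (the field names other than "expert" and the expert name itself are illustrative assumptions, not the server's exact schema):

```json
{
  "instruction": "Write a haiku about routing.",
  "expert": "creative"
}
```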

For this proof-of-concept, the LoRAs are all loaded into vram before the API server starts up. For the 7b model, for example, the base model load consumes ~13.5GB of vram, and with all adapters loaded it consumes ~17.5GB. In practice this doesn't scale all that well on a single machine; you'd want multiple backend servers with the routing placed in front of them, each server hosting a handful of adapters. You could also dynamically load the adapters from a very fast memory store instead of caching them in vram, but that would be significantly slower.

I'm hoping to create a hosted airoboros service where people can contribute to creating many of these adapters so we can indeed have thousands. I can optimize for that use case if and when we get to that point.


psych0v0yager commented on July 30, 2024

Thank you for taking the time to respond to my question, as well as providing the links to the code.

You answered all the questions that I had. I am excited to see the future of this project!

