Fine-tuning large language models (LLMs) has become an important tool for businesses seeking to tailor AI capabilities to niche tasks and personalized user experiences. But fine-tuning usually comes with steep computational and financial overhead, which puts it out of reach for enterprises with limited resources.

To address these challenges, researchers have developed algorithms and techniques that cut the cost of fine-tuning LLMs and running fine-tuned models. The latest of these techniques is S-LoRA, a collaborative effort between researchers at Stanford University and the University of California, Berkeley (UC Berkeley).

S-LoRA dramatically reduces the cost of deploying fine-tuned LLMs, enabling companies to run hundreds or even thousands of models on a single graphics processing unit (GPU). This can help unlock many new LLM applications that were previously too costly or required huge investments in compute resources.

The classic approach to fine-tuning LLMs involves retraining a pre-trained model on new examples tailored to a specific downstream task and adjusting all of the model’s parameters. Given that LLMs typically have billions of parameters, this method demands substantial computational resources.

Parameter-efficient fine-tuning (PEFT) techniques circumvent these costs by updating only a small fraction of the weights during fine-tuning. A notable PEFT method is low-rank adaptation (LoRA), a technique developed by Microsoft, which freezes the weights of the foundation LLM and trains a small set of additional low-rank matrices that are sufficient to adapt the model to the new task.
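
To make the low-rank idea concrete, here is a minimal sketch in plain Python. It assumes the standard LoRA formulation, in which a pre-trained weight matrix W stays frozen and only two small matrices, A and B, are trained, so a layer computes W x + (alpha / r) * B (A x). The helper names (`matvec`, `lora_forward`, `param_counts`), the toy sizes, and the 4096 x 4096 layer with rank 8 are illustrative assumptions, not figures from the S-LoRA or LoRA research.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha, r):
    """Compute W x + (alpha / r) * B (A x). W is frozen; A and B are trainable."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def param_counts(d_out, d_in, r):
    """Return (full fine-tuning params, LoRA params) for one weight matrix."""
    return d_out * d_in, r * (d_in + d_out)


random.seed(0)
d_out, d_in, r, alpha = 6, 5, 2, 4  # toy sizes, chosen only for illustration

W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d_out)]  # trainable, zero init: adapter starts as a no-op
x = [random.gauss(0, 1) for _ in range(d_in)]

y = lora_forward(W, A, B, x, alpha, r)

# Hypothetical layer size: a 4096 x 4096 matrix adapted with rank 8.
full, lora = param_counts(4096, 4096, 8)
print(full, lora)  # 16777216 65536
```

Under these assumed sizes, the adapter trains 65,536 values instead of roughly 16.8 million for that one matrix, about 0.4% of the original. Because each fine-tuned variant is only a small set of extra matrices on top of a shared base model, many variants can in principle be kept alongside a single copy of the base weights.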

Full research: Running thousands of LLMs on one GPU is now possible with S-LoRA.