Training large language models better. Combining models trained on different datasets can reduce computational costs by up to 91%, and tuning a pretrained model to different data distributions increases task performance.
Large language models (LLMs) undergo pretraining, instruction tuning, and reinforcement learning from human feedback on heterogeneous datasets with different distributions.
In a paper presented at the recent Conference on Empirical Methods in Natural Language Processing (EMNLP), researchers describe a technique that can reduce the training costs of LLMs and other neural-network-based models by up to 91%, even with mixed data distributions. The technique also improves model quality.
Training on such heterogeneous data requires a great deal of time and money, and it offers little flexibility: once trained, the model cannot be adapted to new data without incurring further expense.
The method subtracts the original model's parameter values from each fine-tuned model's, yielding one distribution vector per dataset. To generate a composite model, a weighted sum of the distribution vectors is added to the original model's parameters.
The final model is called a distribution-edited model (DEM), to emphasize the use of basic vector arithmetic in model editing. The weights of the sum are based on each fine-tuned model's perplexity, a measure of how well it predicts data from its distribution.
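The vector arithmetic described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function name `dem_combine` is invented here, parameters are flattened into plain arrays, and the weights are arbitrary example values rather than perplexity-derived ones.

```python
import numpy as np

def dem_combine(base_params, finetuned_params_list, weights):
    """Build a composite (distribution-edited) model by adding a weighted
    sum of distribution vectors to the base model's parameters.

    base_params: flat array of the pretrained model's parameters.
    finetuned_params_list: one flat array per fine-tuned checkpoint.
    weights: one scalar coefficient per distribution.
    """
    combined = base_params.copy()
    for theta_d, w in zip(finetuned_params_list, weights):
        # Distribution vector: fine-tuned parameters minus base parameters.
        combined += w * (theta_d - base_params)
    return combined

# Toy example with four "parameters" and two data distributions.
base = np.array([1.0, 0.0, -1.0, 2.0])
ft_a = np.array([1.5, 0.5, -1.0, 2.0])   # fine-tuned on distribution A
ft_b = np.array([1.0, 0.0, -0.5, 3.0])   # fine-tuned on distribution B
dem = dem_combine(base, [ft_a, ft_b], weights=[0.5, 0.5])
# dem is now [1.25, 0.25, -0.75, 2.5]
```

Because the expensive step (fine-tuning on each distribution) happens once per dataset, recombining with different weights afterward costs only this cheap arithmetic.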
First, the initial model is fine-tuned on each data distribution individually, using traditional methods. Checkpoints of the model state after training on each dataset are retained for later use.
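The per-distribution training and checkpointing step might look like the following sketch. The names `train_per_distribution` and `toy_train` are hypothetical, and the dict-based "model" stands in for real network weights; the point is only that each fine-tune starts from the same base state and its result is retained separately.

```python
import copy

def train_per_distribution(base_model, datasets, train):
    """Fine-tune a copy of the base model on each data distribution
    separately and retain a checkpoint of each result."""
    checkpoints = {}
    for name, dataset in datasets.items():
        model = copy.deepcopy(base_model)  # start from the same base weights
        train(model, dataset)              # standard fine-tuning on one dataset
        checkpoints[name] = model          # retained for later combination
    return checkpoints

# Toy demonstration: a "model" is a dict of parameters, and "training"
# just accumulates the dataset sum into one weight (purely illustrative).
base_model = {"w": 0.0}
datasets = {"a": [1, 2], "b": [5]}

def toy_train(model, dataset):
    model["w"] += sum(dataset)

ckpts = train_per_distribution(base_model, datasets, toy_train)
# base_model is untouched; each checkpoint reflects its own dataset only.
```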
To evaluate the approach, the researchers instruction-tuned LLMs of increasing size, from 3 billion to 13 billion parameters.
Tests on datasets such as MathQA, Super-Natural Instructions (SNI), and Chain-of-Thought (CoT) show that DEM is effective across a range of domains.
Future studies should test the framework under additional training conditions and with different model architectures, such as mixture-of-experts or encoder-decoder models.