Amazon’s SeRA and Direct Preference Optimization Explained

DPO is an alignment technique that trains an LLM directly on pairs of outputs, increasing the likelihood of the response human annotators preferred relative to the one they rejected, which eliminates the need for a separate reward model
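
To make the objective concrete, here is a minimal sketch of the standard DPO loss, the β-scaled log-ratio of the trainable policy against a frozen reference model. The function name and the assumption that sequence-level log-likelihoods have already been computed are illustrative, not Amazon's code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss for a batch of preference pairs.

    Each tensor holds the log-likelihood of the chosen / rejected response
    under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratio of policy vs. reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Every pair contributes through the same sigmoid term, regardless of
    # how strong the underlying human preference actually was.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```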

DPO treats all training pairs equally, regardless of how strongly one response is preferred over the other, which can lead the model to learn spurious correlations, such as associating response length with quality

Amazon introduced SeRA (Self-Reviewing and Alignment) to address DPO's limitations, reducing spurious correlations and improving alignment with human preferences

SeRA uses an initial DPO-trained model to generate new training examples and assigns preference scores to responses, retaining only pairs with significant preference differences
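
A rough sketch of this filtering step, assuming the preference score is the implicit reward β·log(π_θ/π_ref) of the DPO-trained model and that candidate responses with their log-likelihoods are already available. The helper names and the threshold value are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def implicit_reward(policy_logp: torch.Tensor,
                    ref_logp: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """Implicit reward of a response: beta * log(pi_theta / pi_ref)."""
    return beta * (policy_logp - ref_logp)

def select_confident_pairs(candidates, beta=0.1, margin_threshold=1.0):
    """candidates: one dict per prompt, holding several sampled responses and
    their log-likelihoods under the DPO-trained policy and the reference model."""
    pairs = []
    for c in candidates:
        rewards = implicit_reward(c["policy_logps"], c["ref_logps"], beta)
        best, worst = int(rewards.argmax()), int(rewards.argmin())
        # Keep the pair only if the model's preference is decisive.
        if rewards[best] - rewards[worst] > margin_threshold:
            pairs.append((c["responses"][best], c["responses"][worst]))
    return pairs
```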

SeRA combines filtered samples from the original human-annotated dataset with newly generated samples, repeating the process until model performance converges
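
The overall loop might look like the outline below. The callables passed in stand for the steps described above and are placeholders, not Amazon's published API.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (preferred response, rejected response)

def sera_loop(human_pairs: List[Pair],
              prompts: List[str],
              train_dpo: Callable[[List[Pair]], object],
              filter_by_margin: Callable[[object, List[Pair]], List[Pair]],
              generate_confident_pairs: Callable[[object, List[str]], List[Pair]],
              converged: Callable[[object, object], bool],
              max_rounds: int = 5):
    """Alternate DPO training with self-generated, margin-filtered data."""
    model = train_dpo(human_pairs)                        # bootstrap on annotated pairs
    for _ in range(max_rounds):
        kept = filter_by_margin(model, human_pairs)       # prune low-margin human pairs
        fresh = generate_confident_pairs(model, prompts)  # confidently ranked self-generated pairs
        new_model = train_dpo(kept + fresh)               # retrain on the mixture
        if converged(new_model, model):
            break
        model = new_model
    return model
```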

SeRA emphasizes the intended contrasts in datasets (e.g., toxic vs. non-toxic responses) while minimizing unintended contrasts

Amazon tested SeRA on four benchmark datasets, showing consistent performance improvements of 20% to 40% over baseline models

SeRA can be generalized to other direct-alignment techniques beyond DPO, making it a versatile approach for improving LLM training

SeRA mitigates the risk of feedback loops by computing rewards with both the current model and the model from the previous iteration, keeping the properties of the training data consistent across rounds
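
One way to express this, as a hedged sketch rather than the paper's exact formulation, is to average the implicit rewards implied by the current and previous checkpoints when scoring new pairs; the equal weighting is an assumption.

```python
import torch

def ensembled_reward(logp_current: torch.Tensor,
                     logp_previous: torch.Tensor,
                     logp_ref: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Average the implicit rewards implied by two model checkpoints.

    Scoring with both the current and the previous iteration's model keeps
    newly added pairs consistent with the data used in earlier rounds.
    """
    reward_current = beta * (logp_current - logp_ref)
    reward_previous = beta * (logp_previous - logp_ref)
    return 0.5 * (reward_current + reward_previous)  # equal weighting is an assumption
```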

SeRA reduces the need for extensive human annotations by leveraging model-generated data, making the training process more scalable