Bias in Training Data
How historical inequities get baked into AI models, and what we can do about it
What it is
LLMs learn from internet-scale data that reflects historical human biases: racial, gender, cultural, and socioeconomic. Models trained on this data inherit these biases, generating more positive associations for some demographic groups than others, underrepresenting minority perspectives, and performing worse on tasks involving underrepresented languages or cultures.
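One way to see how such biases become measurable is a simple co-occurrence audit over a corpus. The sketch below is illustrative only: the tiny corpus, the occupation words, and the pronoun-ratio metric are all stand-ins for the far larger datasets and statistical tests (e.g., WEAT-style association scores) used in practice.

```python
from collections import Counter

# Toy stand-in for internet-scale training text. Real audits run the
# same kind of statistic over billions of tokens.
corpus = [
    "the doctor said he would call back",
    "the doctor said he was busy",
    "the nurse said she would help",
    "the nurse said she was kind",
]

def pronoun_counts(corpus, occupation):
    """Count gendered pronouns in sentences that mention an occupation."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        if occupation in tokens:
            counts["he"] += tokens.count("he")
            counts["she"] += tokens.count("she")
    return counts

def association_skew(corpus, occupation):
    """Fraction of gendered pronouns that are 'he' near an occupation.
    0.5 is balanced; values near 0.0 or 1.0 indicate strong skew."""
    c = pronoun_counts(corpus, occupation)
    total = c["he"] + c["she"]
    return c["he"] / total if total else 0.5

print(association_skew(corpus, "doctor"))  # 1.0: fully male-associated
print(association_skew(corpus, "nurse"))   # 0.0: fully female-associated
```

A model trained on text with these skewed co-occurrence statistics tends to reproduce them, which is the mechanism the paragraph above describes.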
Post-training (e.g., RLHF) can partially correct these biases, but the rater pools that produce preference data are themselves non-representative, often skewed toward English-speaking, Western, educated populations.
Mitigations include diverse training-data curation, bias-auditing benchmarks, adversarial testing, and RLHF objectives that explicitly target bias reduction; none of these fully solves the problem.
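As one concrete example of data-curation mitigation, counterfactual data augmentation pairs each training sentence with a gender-swapped copy so both variants appear equally often. The sketch below uses a deliberately small swap lexicon and ignores case and grammatical ambiguity (English "her" maps to both "him" and "his"), which real implementations must handle.

```python
import re

# Illustrative subset of a swap lexicon; a production list is much
# larger and disambiguates words like "her" by part of speech.
SWAPS = {"he": "she", "she": "he", "him": "her",
         "his": "her", "man": "woman", "woman": "man"}

def counterfactual(sentence):
    """Return the sentence with each gendered term swapped for its pair."""
    def swap(match):
        word = match.group(0)
        return SWAPS.get(word.lower(), word)
    return re.sub(r"\b\w+\b", swap, sentence)

def augment(corpus):
    """Pair every sentence with its gender-swapped counterfactual."""
    return [variant for s in corpus for variant in (s, counterfactual(s))]

print(augment(["the doctor said he was busy"]))
# ['the doctor said he was busy', 'the doctor said she was busy']
```

Augmentation like this rebalances surface co-occurrence statistics, but it cannot fix biases that live in what the data talks about rather than how, which is one reason no single mitigation fully solves the problem.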