An AI-Powered Audit: Do Chatbots Reproduce Political Pluralism?

Large language models (LLMs) have emerged as one of the most prominent and widely used AI technologies of our era. They are already reshaping our information landscape, and as major companies continue to integrate LLM-based products into the market – such as through internet search engines – their influence is poised to grow even further. Yet bias in these models has been acknowledged as a risk by academics, civil society, and AI providers alike. Designing them to be fair and unbiased is an ongoing challenge, especially when dealing with contentious topics such as political, economic, and social issues.  

Although significant research has examined bias and discrimination in LLMs, there has been relatively little focus on their ability to provide diverse responses to political questions. One potential solution involves training leading models to generate responses that reflect a range of diverse perspectives and opinions on a topic, rather than adopting a singular stance. In this context, we believe pluralism – representing a range of different perspectives and viewpoints – offers a valuable approach for LLMs to handle sensitive and controversial topics. Specifically, by adopting a human rights-based approach to fostering pluralism, we can establish an effective framework to reduce political bias in model responses. We explored this solution in our study, examining the responses of five major models (ChatGPT4, Claude, Falcon 7b, Gemini, and Llama) across three languages (English, Colombian Spanish, and Serbian) to contentious political, social, and economic questions. This report was done using semi-automated methods, with LLMs being employed to evaluate the degree of pluralism in model responses at scale. These results were then reviewed by human evaluators to assess the efficacy of our methods. 

AI Powered Audit Cover

Download Report

Insights, Challenges, and Findings 

Our goal was to determine how well these models align with our human rights-based definition of pluralism, and how their responses represent different political perspectives. Our findings suggest that:  

  • Several of the leading models evaluated demonstrate high levels of pluralism as a default when responding to prompts on political, social, and economic topics. 
  • We did not find any significant differences in the pluralism scores across languages, but Serbian received the highest average scores by the models.
  • LLMs examined were generally effective at conveying multiple competing perspectives, ideologies, and values. 
  • Although models generally produce clear and neutral responses, they struggle with cultural sensitivity and ensuring a balanced representation of viewpoints.
  • LLMs scored lowest on topics about the state’s role, especially its use of force or involvement in the economy – they scored higher on gender-related topics, showing they can present perspectives beyond the male/female binary. 

Using LLMs as evaluators of LLM output proved a useful tool for making the audit more scalable through automation. While our methods were generally valid, we recognise their limitations, as we found that model responses occasionally didn’t align with human evaluations regarding diversity. We nonetheless believe that our methodology offers a strong and easily replicable framework for civil society organisations interested in auditing LLMs for political bias.  

You can read the full report below.

Co-organised by Democracy Reporting International, Forum Transregionale Studien, 
Berliner Landeszentrale für politische Bildung and Verfassungsblog.

Thursday 20 February 2025
Revaler Str. 29, 10245 Berlin

18:30 – 20:00

Supported by

The German Federal Foreign Office

Related posts