Using GPT to Assess the Political Bias of News Outlets

May 21, 2024

This blog was written by Raphael Hernandes. It was initially presented as an essay for the MPhil in Ethics of AI, Data and Algorithms at the Leverhulme Centre for the Future of Intelligence, University of Cambridge.

OpenAI’s GPT-4 model, famous for being the engine behind the chatbot ChatGPT, can classify news outlets in terms of their political inclinations based only on their web addresses. That means researchers can use it to respond to something like “rate the website ‘’ on a scale including far-left, left, center-left, center, center-right, right, far-right.” There are, however, some caveats to consider, which I’ll explain in a bit.

To test this ability, my research used GPT to classify 5,877 news sources already rated by humans at Media Bias/Fact Check (MBFC), a service specializing in these evaluations, and then compared the results. It found a high correlation (Spearman’s ρ=.89, p<0.001). This shows that when humans rated an outlet more to the left or right, the system followed. Correlation was deemed the best measurement as the primary goal was to spot this movement rather than seeing if GPT could nail the exact classification used by MBFC.

The correlation was strong across multiple categories, such as countries or media types (newspaper, magazine, TV station, etc.). GPT returned only a small set of values on the opposite side of the expected spectrum. The most extreme scenario (far-right being tagged as far-left or vice-versa) only happened once.

However, it showed some significant limitations. GPT’s performance was worse on less-known websites. News sources were grouped into four sets of similar sizes based on their popularity rating (measured by their Open PageRank scores). Correlation peaks in the medium-high category, dropping with the most popular websites, and is the weakest in the lowest category. It is still strong across all buckets.

Popular websites are more likely to appear in diverse contexts within the training data, relating to “common-token bias”. This means the model might associate these common tokens (URLs of popular sources) with a wide range of content, diluting the model’s ability to classify political bias. It could result in the model inaccurately attributing neutrality to well-known sources simply due to their ubiquity or wrongly associating them with a bias based on a frequent misconception. The opposite might also be the case: the dataset used to train GPT-4 lacks information about an obscure website, forcing the AI to make something up.

Performance was aided by allowing the model to not assign a rating if uncertain. This led to 3,902 websites (66.4%) being left without a label. However, given that LLMs are often criticized for their hallucinations (wrong or made-up outputs), allowing the model to say “I do not know” instead of providing nonsensical results makes it more trustworthy.

And then another issue arises. There was a tendency to abstain from rating less popular and more central sources, showing a bias towards mainstream media and polarized classifications. Furthermore, GPT-4’s ratings leaned slightly more to the left than MBFC’s, which relates to previous research that also found a left-skewness in the model.

Taking those limitations into account, GPT-4 shows promise in enhancing the scope and efficiency of political bias classification of news websites with proper human oversight.

Traditionally, these sorts of labeling are carried out by humans using techniques that rely on perception or behavior. That means either analyzing how consumers see those news outlets or judging patterns in, for example, the language they use (see Articles in Science,  EconometricaPew Research Centre and AllSides). These processes can produce databases with the political orientation of these news sources (imagine a table that goes “Outlet 1 – Left”, “Outlet 2 – Far-right”, and so on).

This carries a degree of subjectivity. The classification might depend on who you ask, the criteria used, or how each political classification is defined. If you ask the same person multiple times, they might make different calls. Moreover, this data tagging is expensive and hard to scale, as these labeling processes are very complex.

Yet, these classifications are useful to understand what kind of news people consume or identify bias in what is offered to readers. This is relevant to check if algorithms that deliver news are skewed one way or another and for regulating the industry when analyzing whether a political slant is dominant in an area.

The idea of using a Large Language Model (LLM), GPT-4, to do this kind of analysis is to make it scalable and more reproducible, as the artificial intelligence system has constraints that help yield consistent results. Others using the same configuration I used are likely to get similar outputs. For instance, the temperature setting, which controls how random the results are, was set to 0 to facilitate this reproducibility. In these tests, GPT did not have internet access, and it did not have additional training to perform these classifications.

Previous research has shown that LLMs can be useful in similar data annotation tasks, assessing the quality of textsdetermining if a tweet relates to a political topic, or classifying the political affiliation of a Twitter user based on the content of a single post. In a context closer to this research, GPT assigned ratings for news outlets’ credibility in a way that correlated to human-assigned labels.

GPT has also shown the capacity to classify US senators in terms of their liberal-conservative ideology, support of gun control, and support of abortion rights, based only on their names and parties. The rankings are not simply mimicking other scales, contrasting with the idea that these systems might be repeating patterns from their training data, a behavior that has been criticized in the past. Instead, the research shows that the ratings came from a mix of senators’ behaviors and how these politicians are perceived.

This indicates that LLMs’ capabilities in political classification tasks warrant further investigation, as they might offer not only a cheaper and faster way of labeling the data but also a new scale altogether. It is important to note that this phenomenon and its caveats are not entirely understood. This research represents an initial exploration of a potential capability in that realm.

Author BIO