Distant Political News Classification: Facilitating Machine Learning Identification of News Topics Across Multilingual Text Corpora


The increasing volume of online news has made it considerably more difficult for scholars to identify political news. Ongoing advances in computational methods and natural language processing help to tackle this challenge. Yet scholars who take a comparative approach to studying political news content face the additional challenge of handling multiple languages: training an individual supervised machine learning classifier for each language is costly and time-consuming. Instead of relying on labels generated by manual coding, we explore the use of ‘distant’ labels derived from cues in article URLs. Specifically, we explore how the sections reflected in URLs (e.g., nytimes.com/politics/) can help create training material for supervised machine learning classifiers. Because it uses cues provided by the news media organizations themselves, such an approach allows for efficient political news identification at scale and is easy to implement across languages. We rely on an existing data set of approximately 870,000 URLs of news-related content from four countries (Italy, Germany, the Netherlands, Poland), with a large sample of hand-labelled articles for each country. We evaluate the method by comparing it to ‘classical’ supervised machine learning and a multilingual BERT model. We also extend topic identification to sports, entertainment, and economic news. Our results suggest that using URL section cues to distantly annotate texts provides a cheap and easy-to-implement way of classifying large volumes of news texts, one that can save researchers substantial resources without necessarily sacrificing quality.
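
To make the idea of URL-based distant labelling concrete, the following is a minimal sketch of how a section cue in a URL path might be mapped to a topic label. The cue lists in `SECTION_CUES` and the function name `distant_label` are illustrative assumptions and are not taken from the study; a real application would need cue lists curated per outlet and language.

```python
from urllib.parse import urlparse

# Hypothetical section cues per topic. These keyword sets are illustrative
# assumptions, not the cue lists used in the study.
SECTION_CUES = {
    "politics": {"politics", "politik", "politica", "politiek", "polityka"},
    "sports": {"sport", "sports"},
    "entertainment": {"entertainment", "unterhaltung", "spettacoli"},
    "economy": {"economy", "wirtschaft", "economia", "economie", "gospodarka"},
}


def distant_label(url):
    """Return a topic label if any URL path segment matches a cue, else None."""
    path = urlparse(url).path.lower()
    # Split the path into its non-empty segments, e.g. "/politics/a.html"
    # becomes ["politics", "a.html"].
    segments = [segment for segment in path.split("/") if segment]
    for segment in segments:
        for topic, cues in SECTION_CUES.items():
            if segment in cues:
                return topic
    # No section cue found: the article stays unlabelled and would be
    # left out of the distantly labelled training set.
    return None
```

Under these assumptions, articles whose URLs yield a label could serve as training material for a supervised classifier, while URLs without a recognisable section cue would simply be excluded from training.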