Picking Up the Scraps—Creating a Specialized Corpus Using Web-Scraping Tools

Authors

Frane Malenica
Department of English Studies, University of Zadar, Croatia
https://orcid.org/0000-0002-1926-1353

Synopsis

Methods for creating corpora from websites have been in use for almost two decades (Baroni and Ueyama 2006; Baroni et al. 2009), and numerous tools for extracting textual data and metadata from websites have since been developed, whether as standalone programs, browser extensions, or packages and libraries in programming languages such as Python and R (cf. Bradley and James 2019; Diouf et al. 2019; Kumar and Roy 2023). The widespread availability of these tools has allowed scholars to create custom corpora on a wide array of very specific topics, such as song lyrics (Kreyer and Mukherjee 2009; Werner 2012; Motschenbacher 2016), comics (Dunst et al. 2017; Unser-Schutz 2011), video games (Heritage 2020), and video game reviews (Guzsvinecz 2022; Arik 2022; HaCohen Kerner et al. 2020). Previous research in this domain, conducted by Cho et al. (2020), has also demonstrated the effectiveness of NLP methods in extracting and identifying the main themes of video games. In this paper, I present the results of research conducted on a corpus of video game reviews collected from the GameSpot website (www.gamespot.com) using the rvest package (Wickham 2021) for web scraping in R, and analysed using a combination of traditional corpus linguistic (CL) methods and Natural Language Processing (NLP) methods available in the quanteda package (Benoit et al. 2018). The main aims of this paper are to: i) identify words and phrases typical of different genres of video game reviews; and ii) test the applicability of web scraping and NLP methods for linguistic research. While frequency-based analysis is useful for a cursory glance at words and phrases typical of this register, keyword analysis offers more useful results. The results of the sentiment analysis show a statistically significant correlation between polarity and ratings, further highlighting the usefulness of these methods.

Published

January 9, 2025

How to Cite

Malenica, F. (2025). Picking Up the Scraps—Creating a Specialized Corpus Using Web-Scraping Tools. In L. Grčić & M. Brkić Bakarić (Eds.), Corpora in Language Learning, Translation and Research: Proceedings of the International Conference Corpora in Language Learning, Translation and Research held at the University of Zadar (August 23–24, 2023) (pp. 49–71). Morepress Books. https://doi.org/10.15291/9789533315355.05