The Effects of Stopwords, Stemming, and Lemmatization on Pre-trained Language Models for Text Classification: A Technical Study

Chaerul Haviana S.F., Mulyono S., Badie'ah

Abstract

Pre-trained language models such as BERT and its variants differ in how they build embeddings and in the vocabularies they use, so preprocessing the input data affects how the input vectors are formed. Few studies have examined the effect of preprocessing on pre-trained models in natural language processing tasks such as text classification, leaving room for further research. This study addresses that gap by fine-tuning eight different pre-trained models for text classification and investigating the effect of preprocessing techniques across this wider variety of models. Combinations of stopword removal, stemming, and lemmatization were applied to measure their impact on classification performance, and each model's performance was evaluated under every defined configuration. The results show that these preprocessing techniques do not significantly improve classification performance, and that stemming in particular tends to degrade it. This clarifies whether such preprocessing techniques are still required when building text classification applications on top of pre-trained models.
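
To illustrate the kind of configurations compared in the study, the Python sketch below (an illustration, not the authors' code) toggles stopword removal, stemming, and lemmatization before a pre-trained model's tokenizer builds the input IDs. It assumes the NLTK and Hugging Face Transformers libraries and uses the bert-base-uncased checkpoint as a stand-in for the eight models; the preprocess helper and the sample sentence are hypothetical. It also makes visible one reason stemming can hurt: stemmed forms often fall outside the model's vocabulary and are split into more sub-word pieces.

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from transformers import AutoTokenizer

    # NLTK resources needed for stopword removal and lemmatization
    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    STOPWORDS = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    # Assumed checkpoint; any of the study's pre-trained models could be used here
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def preprocess(text, remove_stopwords=False, stem=False, lemmatize=False):
        """Hypothetical helper: apply the selected preprocessing steps to raw text."""
        tokens = re.findall(r"[a-z']+", text.lower())
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        if stem:
            tokens = [stemmer.stem(t) for t in tokens]
        if lemmatize:
            tokens = [lemmatizer.lemmatize(t) for t in tokens]
        return " ".join(tokens)

    sample = "The movies were surprisingly entertaining despite the weak plot."
    configs = [{}, {"remove_stopwords": True}, {"stem": True}, {"lemmatize": True}]
    for cfg in configs:
        cleaned = preprocess(sample, **cfg)
        # The tokenizer maps the (optionally preprocessed) text to sub-word IDs;
        # out-of-vocabulary stems typically produce longer sub-word sequences.
        ids = tokenizer(cleaned)["input_ids"]
        print(cfg, len(ids), cleaned)

Running the sketch prints, for each configuration, the number of sub-word tokens produced and the cleaned text, which is enough to see how each preprocessing choice changes the input the pre-trained model actually receives.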

Journal
International Conference on Electrical Engineering, Computer Science and Informatics (EECSI)
Page Range
521-527
Publication date
2023