The Effects of Stopwords, Stemming, and Lemmatization on Pre-trained Language Models for Text Classification: A Technical Study

Chaerul Haviana S.F., Mulyono S., Badie'ah

Abstract

Pre-trained language models such as BERT and its variants differ in how they build embeddings and in the vocabularies they use, so preprocessing the input data affects how the input vectors are formed. Few studies have examined the effect of preprocessing on pre-trained models in natural language processing tasks such as text classification, leaving room for further research. This study addresses that gap by fine-tuning eight different pre-trained models for text classification and investigating the effect of preprocessing techniques across this wider variety of models. Combinations of stopword removal, stemming, and lemmatization were applied to measure their impact on classification performance, and each model's performance was evaluated under every defined configuration. The results show that these preprocessing techniques do not significantly improve classification performance, and that stemming in particular tends to degrade it. This clarifies whether such preprocessing techniques are still required when building text classification applications on top of pre-trained models.
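
To illustrate the kind of configurations compared in the study, the Python sketch below (an illustration, not the authors' code) toggles stopword removal, stemming, and lemmatization before a pre-trained model's tokenizer builds the input IDs. It assumes the NLTK and Hugging Face Transformers libraries and uses the bert-base-uncased checkpoint as a stand-in for the eight models; the preprocess helper and the sample sentence are hypothetical. It also makes visible one reason stemming can hurt: stemmed forms often fall outside the model's vocabulary and are split into more sub-word pieces.

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer
    from transformers import AutoTokenizer

    # NLTK resources needed for stopword removal and lemmatization
    nltk.download("stopwords", quiet=True)
    nltk.download("wordnet", quiet=True)

    STOPWORDS = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    # Assumed checkpoint; any of the study's pre-trained models could be used here
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

    def preprocess(text, remove_stopwords=False, stem=False, lemmatize=False):
        """Hypothetical helper: apply the selected preprocessing steps to raw text."""
        tokens = re.findall(r"[a-z']+", text.lower())
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        if stem:
            tokens = [stemmer.stem(t) for t in tokens]
        if lemmatize:
            tokens = [lemmatizer.lemmatize(t) for t in tokens]
        return " ".join(tokens)

    sample = "The movies were surprisingly entertaining despite the weak plot."
    configs = [{}, {"remove_stopwords": True}, {"stem": True}, {"lemmatize": True}]
    for cfg in configs:
        cleaned = preprocess(sample, **cfg)
        # The tokenizer maps the (optionally preprocessed) text to sub-word IDs;
        # out-of-vocabulary stems typically produce longer sub-word sequences.
        ids = tokenizer(cleaned)["input_ids"]
        print(cfg, len(ids), cleaned)

Running the sketch prints, for each configuration, the number of sub-word tokens produced and the cleaned text, which is enough to see how each preprocessing choice changes the input the pre-trained model actually receives.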

Journal
International Conference on Electrical Engineering, Computer Science and Informatics (EECSI)
Page Range
521-527
Publication date
2023