DESIGNING A PYTHON-BASED TEXT PRE-PROCESSING APPLICATION FOR TEXT CLASSIFICATION

Authors

  • Hermawan Arief Purwanto
  • Taufiq Rizaldi

Abstract

Text pre-processing is the first step that every document passes through in natural language processing. It converts text from human language into a machine-readable format for further processing. However, few dedicated applications for text pre-processing are available, so researchers in natural language processing usually have to write their own program code for this phase. The main focus of this research is to create an integrated text pre-processing application that any researcher who needs it can access. The issues discussed in this study include the design, implementation, testing, and integration of each text pre-processing feature. The pre-processing steps integrated in this research are case folding, tokenizing, and feature selection. The tools used are the NLTK library for Python and the Django framework. The application can be designed following the waterfall method. At the application stage, the NLTK library can be applied precisely and systematically. The library also eases the implementation phase because it provides a large number of NLP classes that can be used directly.
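
As an illustration of the three steps named in the abstract, the following is a minimal sketch and not the authors' implementation. It uses only the Python standard library as a stand-in for NLTK. The function names, the tiny stop-word list, and the frequency threshold are all assumptions made for this example, and "feature selection" is read here as stop-word removal followed by a minimum-count filter, which the abstract does not specify.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = frozenset({"a", "an", "and", "is", "of", "the", "to"})


def case_fold(text):
    """Case folding: map the text to a single, caseless form."""
    return text.casefold()


def tokenize(text):
    """Tokenizing: split the text into runs of word characters."""
    return re.findall(r"\w+", text)


def select_features(tokens, min_count=1):
    """Feature selection (assumed here): drop stop words, then keep
    only the terms that occur at least `min_count` times."""
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return {term: n for term, n in counts.items() if n >= min_count}


def preprocess(text, min_count=1):
    """Run the three steps in order and return term frequencies."""
    return select_features(tokenize(case_fold(text)), min_count)


if __name__ == "__main__":
    print(preprocess("The cat and the Cat sat."))
```

In a setup like the one the abstract describes, each function would presumably be exposed as a separate feature of the web application, so that a researcher can apply the steps individually or chain them.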

References

M. Allahyari, S. Pouriyeh, S. Assefi, S. Safaei, E. Trippe, J. Gutierrez and K. Kochut, "A Brief Survey of Text Mining: Classification, Clustering and Extraction Techniques," arXiv preprint, 2017.

M. Ghiassi, M. Olschimke, B. Moon and P. Arnaudo, "Automated text classification using a dynamic artificial neural network model," Expert Systems with Applications, pp. 10967-10976, 2012.

S. Gunal, S. Ergin, M. Gulmezoglu and O. N. Gerek, "On feature extraction for spam email detection," Lecture Notes in Computer Science, pp. 635-642, 2006.

A. K. Uysal and S. Gunal, "The Impact of Preprocessing on Text Classification," Information Processing and Management, pp. 104-112, 2013.

H. A. Putranto, O. Setyawati and Wijono, "Pengaruh Phrase Detection dengan POSTagger terhadap Akurasi Klasifikasi Sentimen menggunakan SVM," Jurnal Nasional Teknik Elektro dan Teknologi Informasi, vol. 5, no. 4, pp. 252-259, 2016.

Published

27-12-2019