Using News Crawling to Build a News Corpus with Scrapy and XPath


  • Taufiq Rizaldi
  • Hermawan Arief Putranto


In linguistics, a language corpus is a collection of written (textual) or spoken material used to test hypotheses about language structure. Language corpora, and Indonesian corpora in particular, are still scarce. This is because language corpora are rarely used in Natural Language Processing research, and most studies still rely on the same corpus used in previous work. In addition, constructing a corpus takes a long time and is costly. To address this problem, this research proposes building a language corpus, specifically an Indonesian one, using the Scrapy web crawling engine guided by XPath expressions. Guided web crawling is expected to produce a corpus that matches the needs of the research and is free of unwanted code and links, without consuming much time and effort. The results show that building a news corpus with Scrapy and XPath met the expected target: the resulting corpus is divided into three news categories, namely entertainment, community, and culinary news. The parameters tested also indicate that resource usage on the server computer is directly proportional to the number of items obtained and to the file size. In other words, the more items are obtained and stored, the larger the file and the more memory is used. Memory usage on the server can therefore be limited by restricting what is collected during scraping, either by limiting the number of links the spider crawls or by limiting the number of items to be retrieved.
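
The idea of guided extraction with an item cap can be illustrated with a minimal sketch. It is not the authors' spider: it uses the limited XPath subset in Python's standard xml.etree.ElementTree instead of Scrapy, and the sample markup, the class name "article", the data-category attribute, and the function extract_items are all hypothetical. In Scrapy itself, comparable caps are available through the CLOSESPIDER_ITEMCOUNT and CLOSESPIDER_PAGECOUNT settings.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page; real news pages will differ.
SAMPLE_PAGE = """
<html><body>
  <div class="nav"><a href="/home">Home</a></div>
  <div class="article" data-category="entertainment">
    <h1>Film festival opens</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </div>
  <div class="article" data-category="culinary">
    <h1>Street food guide</h1>
    <p>Only paragraph.</p>
  </div>
  <div class="article" data-category="community">
    <h1>Village clean-up</h1>
    <p>Body text.</p>
  </div>
</body></html>
"""


def extract_items(page, max_items):
    """Collect at most max_items articles, keeping only the targeted nodes."""
    root = ET.fromstring(page.strip())
    items = []
    # The path expression selects article containers only, so navigation
    # links and other page furniture never reach the corpus.
    for node in root.findall(".//div[@class='article']"):
        if len(items) >= max_items:
            break  # the cap bounds both memory use and output size
        title = node.findtext("h1", default="").strip()
        body = " ".join((p.text or "").strip() for p in node.findall("p"))
        items.append({
            "category": node.get("data-category"),
            "title": title,
            "body": body,
        })
    return items


corpus = extract_items(SAMPLE_PAGE, max_items=2)
for item in corpus:
    print(item["category"], "|", item["title"])
```

Stopping once the cap is reached keeps the number of stored items, and with it the output size, bounded regardless of how many articles the page contains.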


Keywords— Language Corpus, Natural Language Processing, Scrapy, Web Crawling, XPath


