Automatic Web News Content Extraction

Gusti Lanang Putra Eka Prismana

doi:10.59141/jrssem.v1i7.107

Authors

Gusti Lanang Putra Eka Prismana, Information System Department, Faculty of Engineering, Universitas Negeri Surabaya

DOI:

https://doi.org/10.59141/jrssem.v1i7.107

Keywords:

DOM tree; web; news; extraction; firefly.

Abstract

Main-content extraction from web pages is widely used in search engines, but web pages also contain a great deal of irrelevant information, such as advertisements, navigation, and junk content. This irrelevant information reduces the efficiency of web content processing in content-based applications. This study aimed to extract the main content of web pages using the DOM tree, focusing on the rationality of the segmentation results and on efficiency, based on the information entropy of DOM tree nodes. The first step was to classify web page tags and process only the tags that affect the structure of the page. The second step was to consider the content features and structural features of the DOM tree nodes together. The third step was to perform node fusion to obtain the segmentation results. Segmentation was tested on several web pages with different structures, and the results showed that the proposed method segmented web page content and removed noise accurately and quickly. After the DOM tree was formed, it was matched against the database to eliminate information noise using the Firefly Optimization algorithm. Finally, the Firefly Optimization method was tested and evaluated for its effectiveness in detecting and eliminating web page noise and producing clean documents.
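
The first steps described in the abstract (classifying tags, keeping only structural ones, and scoring DOM tree nodes by information entropy) can be illustrated with a minimal sketch. This is not the paper's implementation: the tag sets, the `Node` and `TreeBuilder` classes, the word-level Shannon entropy score, and the rule of picking the single highest-scoring node are all assumptions made for illustration. Node fusion and the Firefly Optimization step are not shown.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Assumed tag classes; the abstract does not list the actual classification.
STRUCTURAL_TAGS = {"html", "body", "div", "section", "article", "main",
                   "table", "tr", "td", "ul", "ol", "li"}
NOISE_TAGS = {"script", "style", "nav", "footer", "header", "aside", "form"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A simplified DOM node that keeps only structural tags and text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text_parts = []

    def own_text(self):
        # Text directly inside this node (inline tags such as <p> or <a>
        # are not structural here, so their text lands in the parent block).
        return " ".join(self.text_parts)


class TreeBuilder(HTMLParser):
    """Builds a reduced tree: structural tags become nodes, noise is skipped."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a noise subtree

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or tag in NOISE_TAGS:
            self.skip_depth += 1
            return
        if tag in STRUCTURAL_TAGS:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1
            return
        if (tag in STRUCTURAL_TAGS and self.current.tag == tag
                and self.current.parent is not None):
            self.current = self.current.parent

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.current.text_parts.append(data.strip())


def entropy(text):
    """Shannon entropy (in bits) of the word distribution of `text`."""
    words = text.lower().split()
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def extract_main_text(html):
    """Return the text of the node with the highest entropy score."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_nodes(builder.root), key=lambda n: entropy(n.own_text()))
    return best.own_text()


# Hypothetical sample page, used only to exercise the sketch.
SAMPLE = """
<html><body>
<nav><ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul></nav>
<div id="ad">Buy now</div>
<div id="story"><p>The city council approved a new budget on Monday.</p>
<p>Officials said the plan funds roads, schools and public transport.</p></div>
<footer>Copyright notice</footer>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

In this sketch, short blocks such as the advertisement score low because they contain few distinct words, while the story block scores high. A real page would likely need the content and structural features to be combined, as the abstract describes, rather than a single entropy score.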

Published

2022-02-18

How to Cite

Putra Eka Prismana, G. L. (2022). Automatic Web News Content Extraction. Journal Research of Social Science, Economics, and Management, 1(7), 785–794. https://doi.org/10.59141/jrssem.v1i7.107

Issue

Vol. 1 No. 7 (2022): Journal Research of Social Science, Economics, and Management

Section

Articles

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Authors who publish with this journal agree to the following terms:

Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.