Automatic Web News Content Extraction
DOI:
https://doi.org/10.59141/jrssem.v1i7.107
Keywords:
DOM tree; web; news; extraction; firefly.
Abstract
Extracting the main content of web pages is widely used in search engines, but web pages also contain a great deal of irrelevant information, such as advertisements, navigation, and junk content. Such irrelevant information reduces the efficiency of web content processing in content-based applications. This study aimed to extract web page content using the DOM tree, improving both the rationality of the segmentation results and the efficiency of segmentation, based on the information entropy of DOM tree nodes. The first step was to classify web page tags and process only the tags that affect the structure of the page. The second step was to consider the content features and structural features of each DOM tree node together. The third step was to perform node fusion to obtain the segmentation results. Segmentation was tested on several web pages with different structures, and the results showed that the proposed method segmented web page content and removed noise from it accurately and quickly. After the DOM tree was formed, it was matched against the database to eliminate information noise using the Firefly Optimization algorithm. Finally, the Firefly Optimization method was tested and evaluated for its effectiveness in detecting and eliminating web page noise and producing clean documents.
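The first two steps described in the abstract (keeping only structure-affecting tags, then scoring DOM tree nodes by information entropy) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract gives neither the tag classification nor the entropy formula, so the sketch uses a hand-picked `STRUCTURAL_TAGS` set and Shannon entropy over word frequencies. The names `DomBuilder`, `entropy`, and `rank_blocks` are hypothetical, the input is assumed to be well-formed HTML, and node fusion and the Firefly Optimization stage are not covered.

```python
import math
from collections import Counter
from html.parser import HTMLParser

# Assumed list of tags that shape the page layout (not taken from the paper).
STRUCTURAL_TAGS = {"html", "body", "div", "section", "article", "nav", "aside",
                   "header", "footer", "main", "p", "ul", "ol", "li", "table"}
# Tags whose content is never visible text.
SKIPPED_TAGS = {"script", "style"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []  # text found directly inside this node


class DomBuilder(HTMLParser):
    """Builds a reduced DOM tree that contains structural tags only."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag in STRUCTURAL_TAGS and self.skip_depth == 0:
            node = Node(tag, self.current)
            self.current.children.append(node)
            self.current = node

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag == self.current.tag and self.current.parent is not None:
            self.current = self.current.parent

    def handle_data(self, data):
        # Text inside non-structural tags (e.g. <a>) is credited to the
        # nearest structural ancestor.
        if self.skip_depth == 0 and data.strip():
            self.current.text.append(data.strip())


def node_words(node):
    """Collects the lowercased words of a node and all its descendants."""
    words = " ".join(node.text).lower().split()
    for child in node.children:
        words.extend(node_words(child))
    return words


def entropy(words):
    """Shannon entropy (in bits) of the word-frequency distribution."""
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def rank_blocks(html):
    """Returns (tag, entropy) pairs for every structural node, highest first."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    ranked, stack = [], list(builder.root.children)
    while stack:
        node = stack.pop()
        ranked.append((node.tag, entropy(node_words(node))))
        stack.extend(node.children)
    return sorted(ranked, key=lambda item: item[1], reverse=True)
```

Under these assumptions, a navigation block that repeats the same few link labels scores low, while a block of article text with varied vocabulary scores high. This is one plausible way an entropy score could help separate content from noise before any fusion or optimization step.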
License
Copyright (c) 2022 Gusti Lanang Putra Eka Prismana
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain copyright and grant the journal right of first publication, with the work simultaneously licensed under a Creative Commons Attribution-ShareAlike 4.0 International License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) prior to and during the submission process, as it can lead to productive exchanges, as well as earlier and greater citation of published work.