AUTOMATIC WEB NEWS CONTENT EXTRACTION

The extraction of the main content of web pages is widely used in search engines, but web pages also contain a great deal of irrelevant information, such as advertisements, navigation, and junk information. Such irrelevant information reduces the efficiency of web content processing in content-based applications. This study aimed to extract the main content of web pages by segmenting the DOM Tree based on the information entropy of its nodes, considering both the rationality of the segmentation results and the efficiency of the process. The first step of this research was to classify web page tags and process only the tags that affected the structure of the page. The second step was to comprehensively consider the content features and structural features of the DOM Tree nodes. The next step was to perform node fusion to obtain the segmentation results. Segmentation testing was carried out on several web pages with different structures and showed that the proposed method accurately and quickly segmented web page content and removed noise from it. After the DOM Tree was formed, it was matched against the database to eliminate information noise using the Firefly Optimization algorithm. Then, the Firefly Optimization method was tested and evaluated for its effectiveness in detecting and eliminating web page noise and producing clean documents.


INTRODUCTION
Online news is one of the big data sources. Information in the form of news articles is published every minute (Allen, Howland, Mobius, Rothschild, & Watts, 2020). There is far more information than any analyst can process, which creates a potential problem: large amounts of data may simply be ignored (Newman & Cain, 2014). Search engines are often used to obtain information. They use web spiders to crawl the web, retrieve links that may contain the information sought, and present that information as a collection of hyperlinks. Search engines can retrieve information from the web but not from the unseen or hidden web, which makes data extraction a very impractical task. The challenges faced by extractors include heterogeneous formats, changes in the structure of web pages, the introduction of ever more advanced technologies to improve UX, and others.
Extracting information from multiple sources involves many problems, such as finding useful information, extracting knowledge from large data sets, and studying individual users, and various methods and techniques have been developed to address them (Abburu & Golla, 2015). As the amount of information available on the web increases radically, the amount of redundant web content grows at the same time.
Therefore, to keep incoming data up to date and to retrieve useful information from the web quickly and efficiently without duplicating data, the web mining research community has paid close attention to this activity (Dey & Jain, 2020).

A. System Architecture
The system architecture of the information extraction is an adaptation of a general information extraction architecture.

B. Preprocessing
The initial stage of information extraction is to perform preprocessing on the text input, which aims to prepare the text as data that can be processed as input to the information extraction system. The parsing process was carried out on all text documents to identify all nouns and noun phrases (NP). The next step is initializing the firefly population. Firefly i is attracted to firefly j, which is brighter (more attractive), and its movement is evaluated using the following formula:

$x_i = x_i + \beta_0 e^{-\gamma r_{ij}^2}(x_j - x_i) + \alpha \epsilon_i$

where $\beta_0$ is the attractiveness at distance zero, $\gamma$ is the light absorption coefficient, $r_{ij}$ is the distance between fireflies i and j, and $\epsilon_i$ is a random vector scaled by the randomization parameter $\alpha$.
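As an illustration, the following is a minimal sketch of one firefly movement step in Python, assuming a list-of-floats position representation; the function name and default parameter values are ours, not the paper's.

```python
import math
import random

def move_firefly(x_i, x_j, beta0=1.0, gamma=1.0, alpha=0.2):
    """Move firefly i toward a brighter firefly j (standard update rule)."""
    r2 = sum((a - b) ** 2 for a, b in zip(x_i, x_j))  # squared distance r_ij^2
    beta = beta0 * math.exp(-gamma * r2)              # attractiveness decays with distance
    # New position: old position + attraction term + scaled random step.
    return [a + beta * (b - a) + alpha * (random.random() - 0.5)
            for a, b in zip(x_i, x_j)]
```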

RESULTS AND DISCUSSION
To verify the effect of the proposed method, the algorithm was implemented using the DOM method to analyze HTML tags, which were then processed with the firefly optimization algorithm to remove page noise.

Description of Datasets
To carry out the experiment, datasets were collected from different web pages. These web pages contain meaningful content and also have noise such as advertising banners, copyright links, page icons, irrelevant navigation, and junk information included in the web pages (Kaur, 2014).

Performance Measure
In this study, valid blocks and invalid blocks were used as the basis for the performance measures; precision, recall, and F-measure were computed from the numbers of correctly and incorrectly extracted blocks.
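As a reference point, the following is a minimal sketch of these measures in Python; the block-level counts (tp, fp, fn) are our naming, assuming extraction is judged per page block.

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F-measure from block-level counts:
    tp = valid blocks correctly extracted, fp = noise blocks extracted,
    fn = valid blocks missed by the extractor."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```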

Scraping the Web Using DOM
The first step in this research was to classify the web page tags and process only the tags that affected the page structure. The second step was to comprehensively consider the content features and structural features of the DOM tree nodes; this step calculated the node information entropy and the maximum sub-node text density to determine whether a node was an independent page block (a minimal sketch of these node features appears after Figure 2). Next, node fusion was carried out to obtain the segmentation results. This research was applied to several pages from different websites to scrape news content from web pages. The following is the DOM standard used for initial website identification.

Figure 2. DOM Standard
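The following is a minimal sketch of the two node features described above, using BeautifulSoup; the library choice and the exact definitions of entropy and text density here are our assumptions, since the paper does not give its formulas.

```python
import math
from bs4 import BeautifulSoup
from bs4.element import Tag

def text_density(node: Tag) -> float:
    """Text length per descendant tag: high for content blocks, low for navigation."""
    n_tags = max(len(node.find_all()), 1)
    return len(node.get_text(strip=True)) / n_tags

def node_entropy(node: Tag) -> float:
    """Entropy of the text distribution over a node's child elements."""
    sizes = [len(c.get_text(strip=True))
             for c in node.children if isinstance(c, Tag)]
    total = sum(sizes)
    if total == 0:
        return 0.0
    probs = [s / total for s in sizes if s > 0]
    return -sum(p * math.log2(p) for p in probs)

# Example: a content paragraph next to a navigation list.
soup = BeautifulSoup("<div><p>News body text...</p><ul><li>Nav</li></ul></div>",
                     "html.parser")
root = soup.div
print(node_entropy(root), text_density(root))
```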
The input in this process was the URL address or the HTML code of the web page to be extracted. The initial step was to convert the HTML document of the web page into text, and the resulting data were then processed as a string variable (a minimal sketch of this input step follows the list below).
c. The use of raw and valid datasets also has varying effects on precision, recall, and F-measure performance, depending on the validation process applied to the tag structure of the website page.
d. The researcher suggests that extraction of web page content could be attempted with other methods that may yield a better approach to data extraction.
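The input step described above can be illustrated with a minimal sketch; the use of requests and BeautifulSoup is an assumption for illustration, since the study does not name its libraries.

```python
import requests
from bs4 import BeautifulSoup

def page_to_text(source: str) -> str:
    """Accept either a URL or raw HTML and return the page text as a string."""
    if source.startswith(("http://", "https://")):
        html = requests.get(source, timeout=10).text  # fetch the page by URL
    else:
        html = source                                 # already-supplied HTML code
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):             # drop non-content tags
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)
```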