JRSSEM 2022, Vol. 01, No. 7, 785–794
E-ISSN: 2807 - 6311, P-ISSN: 2807 - 6494
DOI : 10.36418/jrssem.v1i7.107
AUTOMATIC WEB NEWS CONTENT EXTRACTION
Gusti Lanang Putra Eka Prismana*
Universitas Negeri Surabaya
e-mail: lanangp[email protected]
*Correspondence: l[email protected]
Submitted: 27 January 2022, Revised: 06 February 2022, Accepted: 18 February 2022
Abstract. The extraction of the main content of web pages is widely used in search engines, but a
lot of irrelevant information, such as advertisements, navigation, and junk information, is included
in web pages. Such irrelevant information reduces the efficiency of web content processing in
content-based applications. This study aimed to extract web page content using the DOM Tree, assessing the rationality and efficiency of the segmentation results based on the information entropy of DOM Tree nodes. The first step of this research was to classify web page tags and process only the tags
that affected the structure of the page. The second step was to consider the content features and
structural features of the DOM Tree node comprehensively. The next step was to perform node fusion to obtain segmentation results. Segmentation testing was carried out on several web pages with different structures, showing that the proposed method accurately and quickly segmented web pages and removed noise from their content. After the DOM Tree was formed, it was matched against the database to eliminate information noise using the Firefly Optimization algorithm. Then, the Firefly Optimization method was tested and evaluated for its effectiveness in detecting and eliminating web page noise and producing clean documents.
Keywords: DOM tree; web; news; extraction; firefly.
INTRODUCTION
Online news is one of the big data
sources. Information in the form of news
articles is published every minute (Allen,
Howland, Mobius, Rothschild, & Watts,
2020). There is far more information than any single analyst can process, and this creates a potential problem in which large amounts of data are ignored (Newman & Cain, 2014). Search
engines are often used to obtain
information. Search engines use web
spiders to surf the web and retrieve links
that may contain the information sought,
and present the information in the form of
a collection of hyperlinks. Search engines
are capable of retrieving information from
the web but not from the unseen or hidden
web, so this makes data extraction a very
impractical task. The challenges faced by
extractors include heterogeneous formats,
changes in the structure of web pages, the
introduction of more and more advanced
technologies to improve UX, and others.
Extracting information from multiple
sources has many problems such as finding
useful information, extracting knowledge
from large data sets, and studying
individual users. Various methods and
techniques have been developed (Abburu
& Golla, 2015). Because the amount of information available on the web is increasing radically, the amount of redundant web content grows at the same time. Therefore, to keep incoming data up to date and retrieve useful information without duplicating data from the web, the web mining research community pays close attention to retrieving information from the web quickly and efficiently (Dey & Jain, 2020).
The articles published on a website are
mostly in the form of unstructured
information because they usually contain
main information or main content,
advertisements, navigation, and other
additional information. This mix of information makes it difficult to obtain the core information and to find relevant values and knowledge in structured form, such as a database. The mechanism for
extracting a collection of texts to obtain
facts in the form of events, entities, and
relationships in the form of structured
information as input to a database or
ontology is called information extraction
(Kara et al., 2012).
This study aimed to extract web pages based on the DOM Tree, assessing the rationality and efficiency of the segmentation results. This
study used a method based on the
information entropy of nodes from the
DOM Tree. The first step carried out in this
research was classifying web page tags and
only processing tags that affected the
structure of the page. The second step was
considering the content features and
structural features of the DOM Tree node
comprehensively, calculating the
information entropy of the nodes and the
maximum text density of subnodes, and
determining whether a node was a block
page or independent. The third step was
performing node fusion to obtain
segmentation results. After the segmentation results were obtained, the web page noise was removed by matching the constructed DOM Tree with the database (Velloso & Dorneles, 2013). Once the DOM Tree was formed, it was matched with the database to eliminate information noise using the Firefly Optimization algorithm.
Furthermore, the Firefly Optimization method was tested and evaluated for its effectiveness in detecting and eliminating web page noise and producing clean documents (Yu & Jin, 2017). This
research is expected to get a better
approach for extracting data from semi-
structured documents, both based on
structure and data using several
optimization methods, techniques, and
algorithms as well as finding a method for
removing web page noise that often arises
from web content extraction.
METHODS
A. System Architecture
The system architecture of the information extraction in this study is an adaptation of the general architecture of an information extraction system. The
input of the information extraction
system is in the form of unstructured
natural language text from the source
text on the web page (HTML text). The
information extraction process
according to the OBIE concept will
involve an ontology as an extraction
guide and produce output in the form
of extracted information which is
represented in the form of XML and
annotated text. The ontology-guided
extraction process will extract things
such as classes, properties, and
instances (Wimalasuriya & Dou, 2010).
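As a rough illustration of the kind of XML output described above, the following snippet builds a small annotated document containing classes, properties, and instances using Python's xml.etree.ElementTree. The element names and example values are hypothetical and do not come from the paper's actual schema.

import xml.etree.ElementTree as ET

def to_xml(extractions):
    # Wrap each extracted instance in an <instance> element whose "class"
    # attribute names the ontology class, with one <property> per field.
    root = ET.Element("extracted_information")
    for item in extractions:
        instance = ET.SubElement(root, "instance", attrib={"class": item["class"]})
        for name, value in item["properties"].items():
            prop = ET.SubElement(instance, "property", attrib={"name": name})
            prop.text = value
    return ET.tostring(root, encoding="unicode")

# Hypothetical usage with one extracted news-article instance.
print(to_xml([{"class": "NewsArticle",
               "properties": {"title": "Example headline", "date": "2022-01-27"}}]))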
The information extraction system
architecture in this study was divided
into three phases, namely the training
phase, the development phase, and the
evaluation phase. In the training phase, the system identified patterns and dictionary lists (called a semantic lexicon), which were learned using a bootstrapping approach. Previously,
the corpus had to go through the
preprocessing stage before the training
process. The purpose of the training
phase was to generate patterns and
semantic lexicon. The development
phase was a phase to identify and
classify relevant information in a new
collection of texts. The text used was
not included in the corpus in the
training process. The pattern was
generated to get the extraction rules. In
the development phase, the input text
was passed to the OBIE system to
produce an output. The last phase was
the evaluation or testing phase. The
architecture of the OBIE system in this
study can be seen in Figure 1 below:
Figure 1. System Architecture
B. Preprocessing
The initial stage of information extraction is preprocessing the text input, which aims to prepare the text as data that can be processed by the information extraction system. The parsing process was carried
out on all text documents to identify all
nouns or noun phrases (NP) and their
contexts. The parsing process in
preprocessing consists of sentence
detection, cutting sentences into
tokens or words (tokenization),
providing syntactic information (POS
tagging), and cutting phrases (NP
chunker). After parsing the text was
complete, then the indexing and
filtering process was carried out. The
sentence indexing process was done by
detecting words/ NP phrases in
sentences, tokens on the left of NP, and
tokens on the right of NP. After the
indexing process was complete, the
output (called a document-set) was
stored in the database for processing.
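A minimal sketch of this preprocessing pipeline, assuming NLTK (with its sentence tokenizer and POS-tagger models installed) as the toolkit, is shown below. The chunking grammar, function names, and document-set fields are illustrative rather than taken from the paper.

import nltk

def preprocess(text):
    # Sentence detection, tokenization, POS tagging, and NP chunking,
    # followed by indexing each NP with its left and right context tokens.
    chunker = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
    document_set = []
    for sentence in nltk.sent_tokenize(text):             # sentence detection
        tokens = nltk.word_tokenize(sentence)             # tokenization
        tagged = nltk.pos_tag(tokens)                      # POS tagging
        tree = chunker.parse(tagged)                       # NP chunking
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            np_tokens = [word for word, _ in subtree.leaves()]
            # Simplified indexing: locate the first occurrence of the NP head.
            start = tokens.index(np_tokens[0])
            end = start + len(np_tokens)
            document_set.append({
                "np": " ".join(np_tokens),
                "left": tokens[max(0, start - 1):start],   # token left of the NP
                "right": tokens[end:end + 1],              # token right of the NP
            })
    return document_set                                    # stored for later processing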
C. DOM Tree
The Document Object Model
(DOM) specification is an object-based
interface developed by the World Wide
Web Consortium (W3C) that constructs
XML and HTML documents as tree
structures in memory. Applications
access XML data through an in-memory
tree, which is a replication of how the
data is structured. The DOM also allows
users to dynamically traverse and
update XML documents. It provides a
model for the entire document, not just
for a single HTML tag. The Document
Object Model represents a web
document as a tree. It is highly adaptable and can be used to restructure entire web pages. This is an explicit model of the HTML document. Some HTML tags do not include closing tags. For such tags, the closing tag is inferred from the following tag; for example, an <LI> element is implicitly closed by the next <LI> tag. To analyze a web page, the HTML document syntax is checked first, because most HTML web pages are not well-formed. After that, the web page is passed through an HTML parser, which cleans up the markup and generates a DOM tree.
Then the system breaks it down into
several sub-trees according to the
threshold value. Different websites
have different layouts and serving
styles, therefore the depth of the Web
page tree varies according to the
presentation style (Kim & Lee, 2017).
The system must know the maximum depth of the DOM tree to select the best option. The DOM tree-based method, a segmentation method of high interest, is built on the observation that, after parsing, an HTML document forms a tree structure that accurately describes the hierarchical relationships between the elements of a web page and is convenient for computer processing.
After the web pages were parsed into a
DOM tree, the algorithm grouped the
web pages mainly by content features
and DOM tree structure features. DOM
is a common tool for representing web
pages. In the DOM, the web page will
be represented as a set of tags and a
hierarchical relationship between the
tags with the function of each tag,
which allows the user to classify a
message and an HTML tag (Sun, Song,
& Liao, 2011).
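As a rough sketch of the DOM-based processing described above, the Python code below parses an HTML page with BeautifulSoup (an assumed parser; the paper does not name one), computes the maximum DOM depth, and ranks structural nodes by a simple text-density measure as candidate content blocks.

from bs4 import BeautifulSoup

STRUCTURAL_TAGS = ["div", "table", "ul", "ol", "section", "article", "p"]

def max_depth(node):
    # Depth of the DOM subtree rooted at this node (tags only).
    children = [c for c in getattr(node, "contents", []) if getattr(c, "name", None)]
    return 1 + max((max_depth(c) for c in children), default=0)

def text_density(node):
    # Illustrative density: characters of text per tag in the subtree.
    text_length = len(node.get_text(strip=True))
    tag_count = len(node.find_all(True)) + 1
    return text_length / tag_count

def segment(html):
    soup = BeautifulSoup(html, "html.parser")
    body = soup.body or soup
    depth = max_depth(body)                      # maximum level of the DOM tree
    # High-density structural nodes are candidate content blocks;
    # low-density nodes are candidate noise (navigation, ads, and so on).
    blocks = sorted(((text_density(n), n.name) for n in body.find_all(STRUCTURAL_TAGS)),
                    reverse=True)
    return depth, blocks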
D. Firefly Algorithm
This nature-inspired algorithm is based on the flashing light of fireflies and mimics how they interact with each other. In the firefly algorithm,
some web documents or web pages are
taken as input. The following are the steps in implementing the firefly algorithm, as shown in the figure below. After reading the web page, the HTML tags are checked in step 3, and the web document with its various tags is considered in step 4. In step 5, the objective function is calculated, and the initial population of fireflies is generated in step 6. The light intensity is formulated in step 7, and the absorption coefficient is determined in step 8. From step 9, the maximum number of generations is evaluated based on the new solutions and updated light intensities. In steps 18 and 19, noisy information is identified and eliminated (Bumbaca et al., 2011).
Finally, the main content was
extracted. All information related to
web pages was stored for efficient
pattern retrieval using the Firefly
technique. A database was created
using an artificial neural network to
store related data from web pages.
Matching the constructed DOM tree
with the database was to eliminate
noisy information (Mangat, 2014). In
the end, we can get the main content.
The initialization of the objective
function f (wi) is calculated using the
light intensity I (o) which varies
according to the inverse square law
with the following formula:

I(r) = I(o) / r² ……………… (1)
I(o) is the intensity at the source and r is the distance from the observer. The light intensity I varies with the square of the distance r. With the absorption coefficient γ, the light intensity is calculated using the following formula:

I(r) = I(o) · e^(−γr) ……………… (2)
The attraction of fireflies is proportional to the intensity of light perceived by other fireflies. The brightness observed by an adjacent firefly is calculated using the formula below:

β(r) = β₀ · e^(−γr²) …………….. (3)
The next step is initializing the firefly population. Firefly i is attracted to firefly j, which is brighter, and its movement is evaluated using the following formula:

x_i = x_i + β₀ · e^(−γ r_ij²) (x_j − x_i) + α ε_i ……………… (4)
The webpage noise removal function is calculated using the following formula:

f(w_i) = β × (T_tot − T_neg) / T_tot ………………. (5)
T_tot represents the total number of tags on a web page, while T_neg is the number of negative (noise) tags on the page. F denotes the fitness value and β is the firefly attraction.
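To make the formulas above concrete, the following minimal Python functions implement equations (1) through (4) as reconstructed here; the parameter names beta0, gamma, and alpha follow the standard firefly algorithm rather than the paper's notation.

import math
import random

def intensity_inverse_square(i0, r):
    # Equation (1): I(r) = I(o) / r^2
    return i0 / (r ** 2)

def intensity_with_absorption(i0, gamma, r):
    # Equation (2): I(r) = I(o) * exp(-gamma * r)
    return i0 * math.exp(-gamma * r)

def attractiveness(beta0, gamma, r):
    # Equation (3): beta(r) = beta0 * exp(-gamma * r^2)
    return beta0 * math.exp(-gamma * r ** 2)

def move(xi, xj, beta0, gamma, alpha):
    # Equation (4): firefly i moves toward the brighter firefly j,
    # plus a small random perturbation scaled by alpha.
    r = abs(xj - xi)
    return xi + attractiveness(beta0, gamma, r) * (xj - xi) + alpha * (random.random() - 0.5)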
RESULTS AND DISCUSSION
To verify the effect of the proposed
method, the algorithm was implemented
using the DOM method to analyze HTML
tags which were then processed using an
optimization algorithm to remove page
noise using the firefly method. Trials were conducted on several pages from different websites, such as Baidu Encyclopedia, Sina Blog, Tencent News, Blog Park, and others. These pages have clear distinctions in content and structure and illustrate the implementation of the algorithm well.
This study proposed a DOM tree-based
web page segmentation method that
comprehensively considered the structural
features and content features of web
pages, used node information entropy to
group nodes from the parsed DOM tree,
and obtained the final segmentation results
with node fusion in the form of news text
content.
1. Description of Datasets
To carry out the experiment,
datasets were collected from different
web pages. These web pages contain
meaningful content and also have noise
such as advertising banners, copyright
links, page icons, irrelevant navigation,
and junk information included in the
web pages (Kaur, 2014).
2. Performance Measure
In this study, valid blocks, invalid
blocks, execution time, precision, recall,
and F-Measure were considered as
performance factors that were
evaluated based on the number of
negative tags and the total number of
tags. Experiments were carried out to test and evaluate the proposed method's effectiveness in detecting and removing noise blocks from web pages and producing clean documents. The
validity and accuracy of the proposed
algorithm were checked using recall,
precision, and F-measures from the
information retrieval field. The datasets
used in the experiment consisted of
several pages from different websites.
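As a hedged illustration of how these metrics can be computed, the short Python function below derives precision, recall, and F-measure from counts of correctly extracted blocks; the argument names are illustrative, not taken from the paper.

def evaluate(relevant_extracted, total_extracted, total_relevant):
    # Precision: share of extracted blocks that are relevant.
    precision = relevant_extracted / total_extracted if total_extracted else 0.0
    # Recall: share of relevant blocks that were extracted.
    recall = relevant_extracted / total_relevant if total_relevant else 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    return precision, recall, f_measure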
3. Scraping the Web Using DOM
The first step done in this research
was to classify the web page tags and
only process the tags that affected the
page structure. The second step was to
comprehensively consider the content
features and structural features of the
DOM tree node. This calculated the
node information entropy and the
maximum sub-node text density to
determine whether or not a node was
an independent page block. Next, the
node fusion was carried out to obtain
segmentation results. This research was
applied to several pages from different
websites for scraping news content on
web pages. The following was a DOM
standard used for initial website
identification.
Figure 2. DOM Standard
The input in this process was the
URL address or the HTML code of the
web page to be extracted. The initial
step was to convert HTML documents
from web pages into text. The next data
process used variable processing in the
form of a string.
Figure 3. DOM Process
There are several sections on a web
page such as texts, images, lists, or
tables. Therefore, the first step was to
determine which part was the text of
the HTML document. The texts were
detected by making use of the string
match function. Texts in HTML documents can be identified by the tag <p> at the beginning and </p> at the end of a paragraph. The tag <p> was used to detect text in the form of a paragraph. If there are two or more paragraphs on a web page, the application stores each subsequent paragraph at the next index of an array variable. The next step was to create a DOM tree from the detected paragraphs. The DOM tree is composed of the text-forming tags, namely <p> tags.
This tag is detected to determine the
news content portion of the paragraph.
The third step was to extract or retrieve the data portion of the DOM tree. After the data section of the tree was obtained, the next step was to save the data in the form of a CSV file. The
selection of the form of the storage file
is intended so that the extraction results
can be used for subsequent needs, such
as integration with data from other
tables or to be stored in a database.
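A minimal sketch of the paragraph-extraction and CSV-saving steps described above might look as follows; the use of the requests and BeautifulSoup libraries and the single "paragraph" column are assumptions for illustration, since the paper does not name specific tools.

import csv
import requests
from bs4 import BeautifulSoup

def extract_paragraphs(url, out_path="news_content.csv"):
    html = requests.get(url, timeout=10).text             # HTML document as a string
    soup = BeautifulSoup(html, "html.parser")
    # Detect paragraph text via <p> tags and store each paragraph in an array.
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    paragraphs = [p for p in paragraphs if p]             # drop empty paragraphs
    # Save the extracted data as a CSV file for later integration or storage.
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["paragraph"])
        writer.writerows([p] for p in paragraphs)
    return paragraphs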
Different websites have different
layouts and serving styles, therefore,
the depth of the Web page tree varies
according to the presentation style. The
system must know the maximum level
of the DOM tree to select a good choice
of threshold levels. That is why the
system traverses the entire DOM tree to
obtain the maximum DOM depth.
Figure 4. Scraping 1 (flow: input web page → determine text location → transformation to DOM tree → texts)
Figure 5. Scraping 2
Figure 6. Scraping 3
Figure 7. Scraping 4
4. Optimization of Firefly Algorithm
Method
All information related to the web pages was stored for efficient pattern retrieval using the Firefly technique. A database was created using an artificial neural network to store related data from the web pages. The constructed DOM tree was matched with the database to remove the noise information and obtain the main content (Liu, 2017). The
initialization of the objective function f
(wi) is calculated using the light
intensity I(o) which varies according to
the inverse square law. The formula used is as follows:

I(r) = I(o) / r²
I(o) is the intensity at the source and
r is the observer's distance. The light intensity I varies with the square of the distance r. With the absorption coefficient γ, the light intensity is calculated using the following formula:

I(r) = I(o) · e^(−γr)
The steps of the firefly algorithm
can be described as follows:
a) Step 1: Access several web pages
b) Step 2: Read every web page, one
by one
c) Step 3: Check web HTML tags
d) Step 4: Consider documents with
multiple tags
e) Step 5: Objective function f(wi), where w = (w1, w2, w3, …)
f) Step 6: Produce an initial
population of fireflies
g) Step 7: Formulate light intensity
h) Step 8: Determine the absorption
coefficient γ
i) Step 9: While (t < Max_Generation)
j) Step 10: For i = 1: n
k) Step 11: For j = 1: n (n of fireflies)
l) Step 12: If (Ij> Ii)
m) Step 13: Move the fireflies to j
n) Step 14: Calculate the new solutions
and renewing light intensity
o) Step 15: End if
p) Step 16: End for j
q) Step 17: End for i
r) Step 18: Identify noisy information
s) Step 19: Eliminate noise
t) Step 20: End while
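A self-contained Python sketch of steps 1 through 20 above, applied to tag filtering, is given below. The candidate-node representation, the NOISE_TAGS list, and the objective function (the share of non-noise tags under a node) are assumptions made for illustration; they stand in for the paper's database-matching step.

import math
import random
from bs4 import BeautifulSoup

NOISE_TAGS = ["script", "style", "iframe", "nav", "aside", "footer"]

def objective(node):
    # f(wi): fraction of non-noise tags below this node (higher is better).
    tags = node.find_all(True)
    negative = sum(1 for t in tags if t.name in NOISE_TAGS)
    return (len(tags) + 1 - negative) / (len(tags) + 1)

def firefly_filter(html, n=20, max_generation=50, beta0=1.0, gamma=1.0, alpha=0.1):
    soup = BeautifulSoup(html, "html.parser")                        # steps 1-4
    candidates = soup.find_all(["div", "section", "article", "td"]) or [soup]
    x = [random.uniform(0, len(candidates) - 1) for _ in range(n)]   # step 6
    light = [objective(candidates[int(round(p))]) for p in x]        # steps 5, 7
    for _ in range(max_generation):                                  # step 9
        for i in range(n):                                           # step 10
            for j in range(n):                                       # step 11
                if light[j] > light[i]:                              # step 12
                    r = abs(x[j] - x[i])
                    beta = beta0 * math.exp(-gamma * r ** 2)         # step 8
                    x[i] += beta * (x[j] - x[i]) + alpha * (random.random() - 0.5)
                    x[i] = min(max(x[i], 0), len(candidates) - 1)    # step 13
                    light[i] = objective(candidates[int(round(x[i]))])  # step 14
    best = candidates[int(round(x[light.index(max(light))]))]
    for tag in best.find_all(NOISE_TAGS):                            # steps 18-19
        tag.decompose()
    return best.get_text(separator="\n", strip=True)                 # main content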
CONCLUSIONS
Based on the study conducted, entitled Automatic Web News Content Extraction, several things can be concluded as follows:
a. The Automatic Web News Content Extraction algorithm can be used for optimal extraction of the main web page content (news) and yields a better approach.
b. The use of different datasets has varying effects on the precision, recall, and F-measure performance. This depends on the degree of similarity between the reference page and the extracted page; the more similar they are, the more stable the observed performance.
c. The use of raw and valid datasets also
has varying effects on precision, recall,
and f-measure performance. It depends
on the validation process of the tag
structure of the website page.
d. The researcher suggests that extracting
web page content can be tried using
other methods that will produce a
better approach for data extraction.
REFERENCES
Abburu, Sunitha, & Golla, Suresh Babu.
(2015). Satellite image classification
methods and techniques: A review.
International Journal of Computer
Applications, 9(8).
Allen, Jennifer, Howland, Baird, Mobius,
Markus, Rothschild, David, & Watts,
Duncan J. (2020). Evaluating the fake
news problem at the scale of the
information ecosystem. Science
Advances, 6(14), eaay3539.
https://doi.org/10.1126/sciadv.aay3539
Bumbaca, Daniela, Wong, Anne, Drake,
Elizabeth, Reyes II, Arthur E., Lin,
Benjamin C., Stephan, Jean Philippe,
Desnoyers, Luc, Shen, Ben Quan, &
Dennis, Mark S. (2011). Highly specific
off-target binding identified and
eliminated during the humanization of
an antibody against FGF receptor 4.
MAbs, 3(4), 376–386. Taylor & Francis.
https://doi.org/10.4161/mabs.3.4.15786
Dey, Arnab, & Jain, Sudhanshu. (2020).
Automatic skimming of web pages on a
single click efficiently. 2020 4th
International Conference on Trends in
Electronics and Informatics
(ICOEI)(48184), 596–602. IEEE.
https://doi.org/10.1109/ICOEI48184.2020.9143003
Kara, Soner, Alan, Özgür, Sabuncu, Orkunt,
Akpınar, Samet, Cicekli, Nihan K., &
Alpaslan, Ferda N. (2012). An ontology-
based retrieval system using semantic
indexing. Information Systems, 3(4),
294–305.
https://doi.org/10.1016/j.is.2011.09.004
Kim, Yeongsu, & Lee, Seungwoo. (2017).
SVM-based web content mining with
leaf classification unit from DOM-tree.
2017 9th International Conference on
Knowledge and Smart Technology (KST),
359–364. IEEE.
https://doi.org/10.1109/KST.2017.7886134
Newman, George E., & Cain, Daylian M.
(2014). Tainted altruism: When doing
some good is evaluated as worse than
doing no good at all. Psychological
Science, 25(3), 648–655.
https://doi.org/10.1177/0956797613504785
Sun, Fei, Song, Dandan, & Liao, Lejian.
(2011). Dom-based content extraction
via text density. Proceedings of the 34th
International ACM SIGIR Conference on
Research and Development in
Information Retrieval, 245–254.
https://doi.org/10.1145/2009916.2009952
Velloso, Roberto Panerai, & Dorneles,
Carina F. (2013). Automatic web page
segmentation and noise removal for
structured extraction using tag path
sequences. Journal of Information and
Data Management, 4(3), 173.
Wimalasuriya, Daya C., & Dou, Dejing.
(2010). Ontology-based information
extraction: An introduction and a survey
of current approaches. Journal of
Information Science, 36(3), 306–323.
Sage Publications Sage UK: London,
England.
https://doi.org/10.1177/0165551509360123
Yu, Xin, & Jin, Zhengping. (2017). Web
content information extraction based
on DOM tree and statistical
information. 2017 IEEE 17th
International Conference on
Communication Technology (ICCT),
1308–1311. IEEE.
https://doi.org/10.1109/ICCT.2017.8359846
© 2022 by the authors. Submitted
for possible open access publication
under the terms and conditions of the Creative
Commons Attribution (CC BY SA) license
(https://creativecommons.org/licenses/by-sa/4.0/).