Text Extraction Algorithm for Web Text Classification

Mustafa Muwafak Alobaedy

doi:10.13140/rg.2.2.29447.66729

Year: 2010

Venue: Master Project, Universiti Utara Malaysia

Type: thesis

DOI: 10.13140/rg.2.2.29447.66729

External link: https://doi.org/10.13140/rg.2.2.29447.66729

Abstract

Explosive expand of web pages in the World Wide Web makes it difficult for search engine and web directory to give relevant results to the user requirements. Web pages need automatic classification techniques with high classification accuracy. This study provides a text extraction algorithm for web text classification. The extraction algorithm consists of three phases namely web page extraction, rule formulation, and algorithm validation. A text extraction prototype is built using Visual C# 2008 to validate the algorithm. It is a windows application mixed with web connection protocol. The prototype offers the creation of Binary data set as well as term frequency inverse document frequency (tf-idf) data set. In this study, the experiment was conducted on five English educational websites. The created data sets are then classified using Naïve-Bayes and C4.5 algorithms provided in WEKA application. The experimental results show that Naïve-Bayes classifier with web text extraction algorithm proves to be the best method for web text classification.

Keywords

Text Extraction AlgorithmWeb Text Classification

📄 Full text (109,199 characters)extracted from the PDF · click to expand

TEXT EXTRACTION ALGORITHM FOR WEB TEXT 
CLASSIFICATION 
 
 
 
 
 
MUSTAFA MUWAFAK THEAB 
 
 
 
 
 
 
 
UNIVERSITI UTARA MALAYSIA 
2010 
 

2 
 
 
 
TEXT EXTRACTION ALGORITHM FOR WEB TEXT 
CLASSIFICATION 
 
 
 
A thesis submitted to the Faculty of Information Technology 
in partial fulfillment of the requirements for the degree 
Master of Science (Intelligent Systems) 
Universiti Utara Malaysia 
 
 
 
By 
Mustafa Muwafak Theab 
 
 
 
 
 
 
 
© Mustafa Muwafak, 2010. All rights reserved.

i 
 
 
PERMISSION TO USE 
 
In presenting this thesis in partial fulfillment of the requirements for a 
postgraduate degree from University Utara Malaysia, I agree that the University 
Library may make it freely available for inspection. I further agree that permission for 
copying of this thesis in any manner, in whole or in part, for scholarly purpose may be 
granted by my supervisor(s) or, in their absence by the Dean of the Graduate School. 
 It is understood that any copying or publication or use of this thesis or parts thereof 
for financial gain shall not be allowed without my written permission. It is also 
understood that due recognition shall be given to me and to University Utara Malaysia 
for any scholarly use which may be made of any material from my thesis. 
Requests for permission to copy or to make other use of materials in this thesis, in 
whole or in part should be addressed to: 
 
 
Dean of Faculty of Information Technology 
Universiti Utara Malaysia 
06010 UUM Sintok 
Kedah Darul Aman 

ii 
 
 
ABSTRACT 
 
Explosive expand of web pages in the World Wide Web makes it difficult for 
search engine and web directory to give relevant results to the user 
requirements. Web pages need automatic classification techniques with high 
classification accuracy. This study provides a text extraction algorithm for 
web text classification. The extraction algorithm consists of three phases 
namely web page extraction, rule formulation, and algorithm validation. A 
text extraction prototype is built using Visual C# 2008 to validate the 
algorithm. It is a windows application mixed with web connection protocol. 
The prototype offers the creation of Binary data set as well as term frequency 
inverse document frequency (tf-idf) data set. In this study, the experiment 
was conducted on five English educational websites. The created data sets 
are then classified using Naïve-Bayes and C4.5 algorithms provided in 
WEKA application. The experimental results show that Naïve-Bayes 
classifier with web text extraction algorithm proves to be the best method for 
web text classification. 
 
 

iii 
 
 
ACKNOWLEDGEMENTS 
 
 I would like to express my gratitude sincere appreciation to both of my supervisor, 
Prof. Dr. Ku Ruhana Ku Mahamud and Dr. Faudziah Ahmad for guiding the research 
presented in this project. I appreciate and thank them for their continuous support, 
advice, and comment. The knowledge I have learned from them is the most valuable 
things I had learn in this world of research. 
 
A special thanks to all my lecturers in Universiti Utara Malaysia for their support. 
My deepest appreciation goes to my beloved family who has been very supportive, 
patient and understanding all the time. 
 
Lastly, this wish of thanks is dedicated to beloved friends and others who involve 
direct or indirect with this project. 
Thank you everyone. 
 
 

iv 
 
Table of Contents 
 
PERMISSION TO USE.......................................................................i 
ABSTRACT....................................................................................ii 
ACKNOWLEDGEMENTS..................................................................iii 
TABLE OF CONTENT......................................................................iv 
LIST OF TABLES.............................................................................vii 
LIST OF FIGURES...........................................................................viii 
LIST OF APPENDICES.....................................................................ix 
 
CHAPTER ONE: INTRODUCTION 
1.0 Introduction........................................................................................................1 
1.1 Problem Statement.............................................................................................2 
1.2 Objective........................................................................................3 
1.3 Significance of the study......................................................................3 
1.4 Scope of the study...........................................................................4 
1.5 Limitations of the study.....................................................................4 
1.6 Organization of the report...................................................................5 
 
CHAPTER TWO: LITERATURE REVIEW 
2.1 Website classification.......................................................................6 
2.2 Text Mining in Web........................................................................11 
2.3 HTML Parser for text extraction.........................................................13 
2.4 Summary.....................................................................................15 
 
 
 

v 
 
CHAPTER THREE: RESEARCH METHODOLOGY 
3.1 Web Text extraction algorithm .........................................................16 
3.2 Web Page Extraction ......................................................................17 
3.3 Rule Formulation..........................................................................18 
3.3.1 Data Processing...............................................................19 
I. Remove HTML Tags................................................19 
II. Remove Digits........................................................19 
III. Split lines to words..................................................20 
IV. Remove duplicate words from the list............................20 
V. Remove the words have less than specific 
 number of characters........................................................21 
VI. Remove stop-words...................................................21 
VII. Stemming the words................................................22 
VIII. Remove duplicate words after stemming........................23 
3.3.2 Data Set Creation............................................................23 
I. Create Binary Data Set.............................................23 
II. Create tf-idf Data Set...............................................24 
III. Build Physical Data Set............................................27 
3.3.3 Text Classification...........................................................29 
3.4 Algorithm Validation.....................................................................30 
3.5 Developing of the prototype ...............................................................30 
3.6 Summary....................................................................................31 
 
 

vi 
 
CHAPTER FOUR: PROTOTYPE DEVELOPMENT 
4.1 Implementation Details.................................................................32 
4.2 Classes Implementation.................................................................34 
4.3 Graphical User Interface...............................................................36 
4.4 Summary....................................................................................37 
 
CHAPTER FIVE: RESULTS AND FINDINGS 
5.1 Experiment Design .........................................................................38 
5.2 Extraction processing.......................................................................38 
5.3 Classification Activities.....................................................................41 
5.4 Summary....................................................................................45 
 
CHAPTER SIX: CONCLUSION 
6.1 Project Conclusion .........................................................................46 
6.2 Recommendation for Future Research ...................................................47 
 
REFERENCES.................................................................................48 
 
 
 
 
 

vii 
 
LIST OF TABLES 
 
Table 5.1: Classification results (60%-40%)..............................................41 
Table 5.2: Classification results (65%-35%)..............................................41 
Table 5.3: Classification results (70%-30%)..............................................42 
Table 5.4: Classification results (60%-40%)..............................................42 
Table 5.5: Classification results (65%-35%)..............................................43 
Table 5.6: Classification results (70%-30%)..............................................43 
Table 5.7: Classification accuracy Details (DragPushing)..............................44 
 
 
 
 

viii 
 
LIST OF FIGURES 
 
Figure 3.1: Text Extraction Algorithm.......................................................17 
Figure 4.1: Classes Diagram...................................................................33 
Figure 4.2: HTML_Parser Interface...........................................................37 
Figure 5.1: tf-idf Data set.......................................................................40 
Figure 5.2: Binary data set.....................................................................40 
 
 

ix 
 
LIST OF APPENDICES 
 
 
Appendix A....................................................................................51 
Appendix B....................................................................................67 
 
 
 

1 
 
CHAPTER ONE 
 
INTRODUCTION 
 
The rapid growth of information on the internet has made it necessary for users to 
identify relevant information that were obtained from the process of information 
retrieval (Suanmali, Salim, & Binwahlan, 2009). One of the most important aspects in 
any information retrieval (IR) process, text query, search engine, and web 
classification is to distinguish each word to give better results to the query. 
Cooley, Mobasher and Srivastava (1997) define Web mining as “the discovery and 
analysis of useful information from the World Wide Web. This describes the 
automatic search of information resources available on-line, i.e. Web content mining, 
and the discovery of user access patterns from Web servers, i.e., Web usage mining”. 
According to Wang, Huang, Wu, and Zhang (1999) “web mining is an integrated 
technology in which several research fields are involved, such as data mining, 
computational linguistics, statistics, information and so on” (p. 137). 
From the previous definitions, Web mining has different methods from data mining 
because web pages are semi-structure or non-structure documents. Therefore some 
traditional data mining methods such as Linear Regression (Aaron, 2004) are not 
applicable to Web mining. 
Web classification is an important component in search engines and is use to classify 
the type of web pages. Methods like clustering and support vector machine can be 
used in Web classification (Yu, Han, & Chang, 2002). Web classification research 

2 
 
focuses on web text classification, web image classification, web hyperlinks 
classification and web usage classification (Morales, Fandino, & Rodrivguez, 2009). 
Text Mining is another important component in Web mining. According to Ronen and 
James (2007) “Text Mining can be broadly defined as a knowledge-intensive process 
in which a user interacts with a document collection over time by using a suite of 
analysis tools”. 
Text Mining deals with extracting useful information from text document through the 
identification and exploration of patterns using different formulas. Text Mining uses 
machine learning algorithms such as Naïve Bayse (NB) classifier, Bayesian logistic 
regression, decision tree classifiers, Rocchio method and neural network (Ronen & 
James, 2007) to extract useful information. Text mining also includes clustering 
methods for unsupervised process through which objects are classified into groups 
called clusters. Algorithms such as K-means, EM-based probabilistic clustering, 
hierarchical agglomerative clustering and minimal spanning tree are used for 
clustering activities (Ronen & James, 2007). 
Each method and algorithm has advantages and disadvantages. Some applications 
combine two or more algorithms in order to get better classification accuracy because 
web mining involves many processing steps which require different algorithms for 
each step (Feng & Cherkassky, 2009). 
 
1.1 PROBLEM STATEMENT 
Inefficient web classification algorithms will affect the results of IR. Web 
classification needs efficient algorithm to produce high accuracy. The highest 

3 
 
accuracy from existing text extraction algorithm is around 80% (Fresno & Ribeiro, 
2004). The problem in existing algorithm is the deficiency to extract meaningful 
words from web pages (Liu, 2009). In Web searching, due to the growth of 
information on the Web, matched web pages for a given query could be millions. 
Ranking web pages in terms of normalized frequency of keywords does not work 
properly, because short pages with only the query in them would likely been ranked 
on the top. 
 
1.2 OBJECTIVE 
The main objective of this study is to develop a text extraction algorithm for web text 
classification. Specific objectives are: 
1. To develop feature extraction techniques to extract features from the web 
pages. 
2. To develop a HTML parser that can remove HTML tags. 
3. To develop a prototype for web text extraction. 
 
1.3 SIGNIFICANCE OF THE STUDY 
An algorithm for web classification and text extraction will be formulated from this 
project. This algorithm will be the contribution to the body of knowledge in the area 
of web text classification. 
The algorithm has the facility to create Binary data set as well as tf-idf data set from 
the extracted text which are necessary for classification activity. 

4 
 
The developed prototype can automate the process of text extracting and date set 
creation will be a contribution to the society particularly for users to perform web text 
classification. This prototype can be easily used by any user with little experience for 
example the search engine developer for web classification and web directory 
developer. 
 
1.4 SCOPE OF THE STUDY 
The study focused on English educational websites. Five different Universities 
Websites used in this study. The classification applied only on web text page. 
 
1.5 LIMITATIONS OF THE STUDY 
By using Core 2 Duo, 2.00 GHz processor with 3.00 GB RAM, the prototype took 
around 3 minutes to process six hundred web pages. The maximum number of web 
pages could be processed by the prototype without system hanging was around 
thousand pages. 
The rows and columns in the created data set should be within 1048576 and 16384 
respectively otherwise; Microsoft Excel 2007 will not be able to view the data set. 
 

5 
 
1.6 ORGANIZATION OF THE REPORT 
This report divided into six chapters. The chapters that are presented in this report are: 
Chapter one gives an overview and introduction about Text Extraction algorithms and 
techniques used in web classification. This chapter also discussed about problem 
statement, objective, significance of the study, and scope of the study. Limitations of 
the study are also presented in this chapter. Chapter two discussed about literature 
review, which are related to this project. The literature review divided into three 
sections which are web classification, Text Mining in Web, and HTML Parser for text 
extraction. Chapter three discussed methodology that been used in this project. The 
methodology consists of three phases namely: Web Page Extraction, Rule 
Formulation, and Algorithm Validation. The details of each phase discussed in this 
chapter. Chapter four discussed about the prototype implementation for HTML Parser 
and Text Extraction algorithm. Chapter five discussed the results and finding of the 
experiment conducted by using the HTML Parser and WEKA application. Chapter six 
presents the conclusion and recommendations to improve this study. 
 
 
 

6 
 
CHAPTER TWO 
LITERATURE REVIEW 
 
This chapter presents previous work on web classification, text extraction particularly 
in web mining and HTML parser techniques. 
 
2.1 Website classification 
Anagnostopoulos et al. (2002) proposed an automated website classification system. 
The study focuses on e-commerce sites and the pages were indexed and the extracted 
words represented using Vector Space Model (VSM) and Site Vector (SV). The 
proposed system compares the content of web pages using VSM with the mdimensional commerce descriptor vector (SV). The system performance reached 
around 96% accuracy, which indicates that their system has produced good results. 
However, their scope is limited to e-commerce only. 
 
Tomar, Shekhar, and Ashish (2006) proposed a classification concept for web pages, 
called “WebClassify”. They modified Naïve-Bayes algorithm with multi-nominal 
model to classify web pages into different categories. The proposed algorithm makes 
use of both Naïve-Bayes algorithm and term frequency – inverse document frequency 
factor. They did an experiment using the traditional Naïve-Bayes algorithm (NB) and 
their modified algorithm. The results for web page classification were 42.5% using 
NB and 55% using their algorithm. From the results, their approach shows 

7 
 
enhancement in the classification accuracy over traditional NB. Although the 
classification accuracy is not high, but the important point in their study, they have 
introduced a term frequency as class document frequency. This method takes into 
account only those words have a term frequency inverse class document frequency 
value above a threshold value. This technique produced small data set which needs 
less processing time. 
 
Hui, Yan, and Jian-Ping (2008) proposed a combined approach for news 
categorization. The combination includes decision tree with multilayer neural 
networks. Data set was created based on term frequency inverse document frequence 
(tf-idf) technique to weight each word (term) in the document. Attributes were 
selected by applying feature space reduction method called Chi Square (X2 statistic). 
In their experiment, only subject level (the upper hierarchical level) of web pages 
were classified. The classifier was built using Back Propagation Network and ID3 
algorithm. With 371 news items (pages) inclueded in the training set, the 
classification accuracy reached around 88%. The important point in their approach is 
the feature space reduction technique. By selecting significant attributes from the data 
set, the classification accuracy increased and the processing time decreased which 
means there is improvement in the performance of the algorithm. 
 
Meijuan, Jingwen and Shiru (2009) presented a research about web classification 
based on support vector machine (SVM). In the study, SVM network was structured 
for web text classification where genetic algorithm (GA) was used to optimize SVM 
parameters. Hence, enhance the convergence rate and the classification accuracy. 

8 
 
1000 web pages were classified into six categories: “news”, “sport”, “science and 
technology”, “education”, “amusement”, and “others”. The result of their experiment 
measured by mean squared errors and the best results were: (0.0045) in term of errors 
for training and (0.0682) in terms of errors for testing. It is clear from their 
experiment that they have solved the over-fitting problem of ANN by using SVM. 
However, the experiment was applied only on flat web page structure, which is easier 
for classification compared to hierarchical web pages. 
 
A comprehensive study on feature selection in automatic web pages categorization 
has been presented by (Sun, Sun, Chen, & Fu, 2009). The study focused on ReliefF 
and Symmetrical Uncertainty (SU). They addressed critical issues such as “the high 
dimensional text vocabulary space” in web pages classification. They used Hidden 
Naïve-Bayes classifier (HNB) and Complement class Naïve-Bayes (CNB). They 
compared their results with results produced using Support Vector Machine (SVM), K 
Nearest Neighbor, and C4.5. The experiment was conducted on 527 web pages 
partitioned into seven classes. Boolean data type was used to represent the data. The 
results showed that by using feature selection methods, higher accuracy can be 
obtained. The highest classification accuracy is around 88% that has been achieved 
using HNB classifier. They performed another experiment where “the words are 
sorted according to their SU scores”. In addition, some numbers of attributes selected 
through removing top N words. The results showed that CNB achieved the highest 
accuracy around 97% but HNB was more stable among the others. While SVM was 
not practical, it is time-consuming for big-size problem. They concluded that SU is 
better for feature selection method than ReliefF. From their study, it is clear that they 

9 
 
achieved good classification accuracy and they identified practical method for feature 
selection. However, the experiment was conducted only on flat web pages. 
Zhongda, Kun, and Yanfen (2007) proposed a research that deals with many 
important issues in web pages classification such as feature selection method and 
classification algorithms. In their research, they found two models that could 
represent the web page namely Vector Space Model (VSM) and Boolean Model. 
According to the research, VSM model can represent web pages better than Boolean 
because each word was assigned with specific value (weight). For feature selection, 
the algorithms used were Comparative Information, Latent Semantic Indexing (LSI), 
and Rough Set theory. LSI shows favor among the other algorithms. For classification 
algorithm, they investigate K Nearest Neighbor (KNN), Naïve-Bayes (NB), 
Relational Learning (RL), Support Vector Machines (SVM), Decision Trees DT, and 
Neural Networks Back-propagation (NNB). They concluded that, there is no 
classification algorithm better than the other. Each algorithm has advantages and 
disadvantages; sometime in order to achieve higher accuracy it may cost processing 
time or memory space. Each algorithm will behave differently with the type of data 
set. For example, DT algorithm provides better accuracy if the data set is represented 
by Boolean values. The classification accuracy depends on the algorithm used in 
extraction and feature selection more than the classifier model. 
 
Qi and Davison (2009) have done a comprehensive survey on web page classification. 
In their research, Feature extraction and classification algorithms were the focus of 
their study. They distinguish the web page classification as “subject classification, 
functional classification, sentiment classification, and other types of classification”. 

10 
 
They divided the classification problem into binary classification and multiclass 
classification. The survey shows another type of classification, that is flat 
classification and hierarchical classification. In flat classification, the categories 
organized parallel such that one category does not followed by another category, 
while in hierarchical classification the categories structured as hierarchical tree-like 
structure in which each category may have a number of subcategories. They identify 
the differences between web classification and text classification. Classification of 
web pages content is different from classification of standard text classification. The 
standard text is structured and written with consistent styles, while web pages do not 
have such properties. Another important aspect in their research is the feature 
selection. They divided the feature into two classes namely “on-page features” which 
are available directly on the page to be classified. The other class called “features of 
neighbors” which are available on the pages linked in some way to the page to be 
classified. The algorithms used for feature selection were N-gram, Dimension 
Reduction, and Relational Learning. They have pointed to critical problem about 
feature selection from the web pages that contain large images or flash objects, which 
is common in today’s web pages. For classification algorithms, the research 
investigated many algorithms such as K-Nearest Neighbor, Support Vector Machine 
(SVM), and Naïve-Bayes classifiers. The research concluded that in order to improve 
the classification accuracy, an appropriate method for feature extraction from web 
pages “directly or from neighbor web pages” could improve the classification 
performance. 
 
 

11 
 
2.2 Text Mining in Web 
According to Zhongda, Kun, and Yanfen (2007) text classification has been 
implemented before web page classification. They define text classification as “It is 
the process of algorithmically analyzing a text documents in order to assign a set of 
categories (or index terms) from a predefined vocabulary to succinctly describe the 
content of a document”. The followings are some works in the field of text mining 
specifically in web. 
 
By extracting the text from the web documents, a key phrase can extracted from the 
text. Zhang, Zincir-Heywood, and Milios (2005) have investigated the significance of 
narrative text classification in the task of automatic key phrase extraction in Web 
document collection. Their work consists of two steps i.e., long paragraph 
classification and narrative paragraph classification. They have used three methods, 
i.e., TF-IDF, KEA, and Keyterm. The processing steps include extracting all tokens in 
the text, removing punctuation marks, removing standard set of 425 stop-words, and 
applying Porter stemming to obtain word stems. The results show that Keyterm 
method gives the best output for all plain text and for narrative text. 
 
Shiqun, Yuhui, and Jike (2007) proposed an algorithm used in text mining on web. In 
their studies, they showed an algorithm of how to follow the appointed website or 
web page according to the user’s request by using the text mining. They also 
investigate on how to extract test characteristic, and how to classify the data 
information with feedback judgment combined with the Web page text contents for 

12 
 
later use. Their approach was related but different from data mining and different 
from information retrieval. The processing method includes four steps: lookup 
resources, pretreatment and information extraction, mode discovery, and mode 
analysis. They used vector space model for web text classification. Experiments 
conducted using 500 documents extracted from sport website. The text classification 
takes about 6 seconds and the accuracy of the classification attains 91%. 
 
Most of the websites contain text as well as noise. In order to perform text mining in 
the web text, a noise elimination process should first apply. Dey and Haque (2008) 
have proposed a paper focused on opinion extraction from noisy data. They used a 
hybrid approach, in which they initially employ a semi-supervised method to learn 
domain knowledge from a training repository, which contains both noisy and clean 
text. A localized linguistic technique has applied to extract opinion expression from 
noisy text. Their work comprises five steps namely “pre-processing of noisy text to 
enable opinion mining, locate opinion expressions in processed text, analyze opinion 
expression to identify features and opinions expressed about them, aggregate opinions 
at multiple levels of granularity as desired by user, and developing a Graphical User 
Interface to facilitate user interaction”. The performance measured used were standard 
evaluation measures of precision, recall, and F-score. The results reached 94% 
precision, and the recall value reached 75%. Their studies showed different concept of 
text mining by cleaning the text from noisy. By applying this technique, the results 
improved in terms of quality. 
 

13 
 
Lee, Baker, Song, and Wetherbe (2010) proposed a paper contains a comparison of 
four text-mining methods namely “Latent Semantic Analysis (LSA), Probabilistic 
Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and 
Correlation Topic Model (CTM)”. In their work, they identify the advantages and 
limitation of each one. Vector Space Model (VSM) used to represent the words in 
each document with term frequency inverse document frequency (tf-idf) technique. In 
their experiment, they collect 258 users’ comments on hardware product (Digital 
Camera type Canon) from Amazon website. The task is applying the four algorithms 
for Topic Detection from the users’ comments. From the experiment’s results, LSA 
gave best result in case of single topic in a document. PLSA gave best result in case of 
single topic in one document. LDA and CTM gave best result with both single and 
multiple topics for one document. CTM has the ability to model the relationships 
between topics. However, LSA shows better ability to deal with synonyms problem. 
 
2.3 HTML Parser for text extraction 
Jianli, Jie, and Cuihua (2004) have used HTML parser in their studies, initially; they 
developed a model to segment the web document with document object model. They 
adopted different sets of filters to filter non-text information and all links, the process 
of filtering achieved by scanning HTML pages and removing “src” and “herf” 
attributes with the value containing “img”, “swf”, “gif” etc. Their algorithm 
implemented by java program language and testing data comes from web document in 
English. Other algorithms have used in the process are removing stop-words and 
stemming the words with Paice/Husk Stemmer. The results varied from 78% - 100%, 

14 
 
and there were some errors in their experiment. However, the HTML parser works 
perfectly in their studies. 
 
There are many HTML Parser available freely for different purpose. Alonso-Rios, 
Luis-Vazquez, Mosqueira-Rey, Moret-Bonillo, and del Rio (2009) proposed in their 
studies an HTML analyzer for web usability. In their system, the HTML analyzer 
examines HTML code to extract usability information about web pages. The analyzer 
identifies usability problems, checks compliance with usability guidelines, and 
interprets usability information to facilitate the task of human usability experts. In 
their experiment, they were able to classify each component in the web page like 
images, tables and frames. However, they did not extract the text from the HTML 
source code. 
 
Ke, Yueming, Zongyan, and Xiang (2008) developed HTML Parser used in the 
embedded browser based on deterministic finite automaton. Several resource 
limitations such as CPU type, system function definition, display equipment, and 
electricity considered in the study. An embedded browser with four functional 
models: “network transport, memory management, HTML parser, and display model” 
was proposed. The parser has the capability to parse and download at the same time. 
By using this parallel process, a parsing time will decrease because it does not need to 
wait for the stream to fully downloaded. The architecture and design explained with 
some details but the authors did not provide any experiment result. 
Ziegler, Vogele, and Viermetz (2009) propsed an approach for web page information 
extraction. Their approach is “fully-automated extraction of news content from 

15 
 
HTML pages”. Their algorithm divided into three phases, namely “the merging of text 
blocks on a structural level, the computation of text block features, and the decision to 
keep or discard the current block”. By using those phases, the algorithm was able to 
calculate “average sentence length, number of sentences, character distribution, and 
stop-word ratio”. In their experiment, they have used 220 news sources in five 
western languages and the results compared with other two systems: Crunch and 
Body Text Extraction. For evaluation, they implemented classification models such as 
C4.5, One Rule, SVM, Naïve-Bayes, and Random Forest. The highest accuracy 
obtained by using Random Forest classifier around 78%. The results are acceptable 
because the Parser used to extract text from hierarchical websites, which is more 
difficult than flat websites in term of classification. 
 
2.4 Summary 
This chapter presents some of the previous works that have been done by researchers. 
The topics that been discussed are more about website classification techniques, text 
mining, and HTML parser. Many algorithms and techniques have been investigated to 
highlight the advantages and disadvantages of them. It is clear from all the mentioned 
works that there are three important factors in web page classification. First is the 
feature selection technique, second is how to present the extracted data, and third is 
the classifier algorithm. All the three factors are important but the most important one 
which has strong influence on the classification accuracy is the feature selection 
technique. If the selected features can represent the web page perfectly, then the 
classification accuracy will be high regardless of using any classifier algorithm. 
 

16 
 
CHAPTER THREE 
RESEARCH METHODOLOGY 
 
This chapter provides the methodology used in this study. There are two main phases 
i.e., formulating an algorithm for text extraction and developing the prototype. 
 
3.1 Text extraction algorithm 
There are three phases involved in formulating text extraction algorithm. The phases 
are web page extraction, rule formulation, and algorithm validation. Fig 3.1 illustrates 
the phases. 
 

17 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Figure 3.1: Text Extraction Algorithm 
 
3.2 Web Page Extraction 
The process starts by loading the web pages from specific folder or by loading the 
web page from the internet to the memory. The algorithm has the facility to handle 
any number of pages with any number of targets as long as there is enough physical 
 Start 
Algorithm 
Validation 
Web Page 
Extraction 
Rule Formulation 
Data 
Processing 
Data Set 
Creation 
Text 
Classification 

18 
 
memory in the machine. After that, extraction process is applied by reading the pages 
one by one and extracting source code with all the textual contents. The source code 
contains HTML tags, text paragraphs, hyperlinks, and any other text characters. The 
user will be asked for the target of the web page/s in order to label them in the data 
set. The user can specify how much percentage can be extracted from the page/s. The 
output of this phase is a string array. Each item in the array represents the extracted 
web page. The following algorithm was used for this phase: 
Extracting Page Source Code Algorithm (Online web page): 
Step 0: Initialize Uniform Resource Identifier (URI) object; 
Step 1: Send request to the server using (URI) address; 
Step 2: Send the Credentials with request (default credentials is Windows credentials). 
Step 3: Initialize stream object in the memory; 
Step 4: Store the stream from the server in stream object; 
Step 5: Initialize string; 
Step 6: Read and convert (to string) the stream object and save it in the String Array; 
Step 7: Return the String Array; 
Extracting Page Source Code Algorithm (From Documents in the Hard Disk): 
Step 0: Initialize Directory object; 
Step 1: Initialize array of File type; 
Step 2: Initialize array of String type; 
Step 3: Initialize Stream Reader object; 
Step 4: Give the directory path (on the hard disk) to the Directory object; 
Step 5: Get all the files information from the Directory object; 
Step 6: For each file in file array (filesInfoi, i=0,...,n) do steps 7-9; 
 Step 7: Read the file content into the Stream Reader object; 
 Step 8: Read the Stream until the end and convert it to String; 
 Step 9: Save the String in the String array; 
Step 10: Return the String Array; 
 
3.3 Rule Formulation 
Rule Formulation consist of three steps namely Data processing, Data set creation, 
and Text classification. The followings are the details of each step. 
 
 

19 
 
3.3.1 Data Processing 
After extracting the web page source code, the output contains text that includes noise 
data like HTML tags and some other components, which are not relevant for 
classification process. This step is called text transformation (Croft, Metzler, & 
Strohman, 2009), and it includes deleting HTML tags, digits and stop-words, and 
stemming the words using Porter stemming algorithm. The output of this step will be 
two dimensions string arrays representing all of the extracted words as columns and 
all of the web pages as rows. The following are data processing step: 
I. Remove HTML Tags: This process will start after collecting the source code 
from the web page. The input of this step will be in form of string array, which 
has both the text and HTML Tags. The algorithm will check each line in the 
string to remove the Tags. The output from this step is a string array contains 
only the page paragraph without any HTML Tags. The following algorithm 
describes the process. 
Remove HTML Tags Algorithm: 
Step 0: Initialize the input string array of extracted source code (output of the 
 previous process “SourceCode”); 
Step 1: Initialize output array type of string; 
Step 2: Replace any of this character (" < ( . | \ n ) * ? > ") with empty space and 
 save the output in the output array; 
Step 3: For each item in the output array (TextSourcei, i=0...,n) do steps 4- 
Step 4: Check if the item contains non-Letter character then, replace it 
 with empty space; 
Step 5: Return the output array; 
 
II. Remove Digits: For classification purpose, digits are not required because 
they do not reflect any important value to the model. Therefore, this step will 
remove the digits from the paragraph by checking the characters one by one; if 
it is a digit then it will remove from the string. The output of this step is the 

20 
 
same string array of the previous step but cleaned from any digit. The 
following algorithm describes the process: 
Remove Digits Algorithm: 
Step 0: For each item in the output array (TextSourcei, i=0...,n) do step 1; 
Step 1: Check if the item contains non-Letter character then, replace it 
 with empty space; 
Step 2: Return the output array; 
 
III. Split lines to words: Until this point, each string in the array contains lines of 
words; this step will split each line to words by using splitting criteria for 
example, empty space, comma, semi comma, or any kind of delimited 
character. The output of this step will be a list of string array, each item (array) 
in the list represent a web page and each item in the array (string) represent the 
word in that page The following algorithm describes the process: 
Split lines to words Algorithm: 
Step 0: Initialize the input (Text array from the previous step); 
Step 1: Initialize List of string; 
Step 2: For each input (Texti, i=0,...,n) do steps 3-4; 
 Step 3: Split the word from the text if one of the following character is 
 after the word: { ' ', '\n', '-', '_', '{', '}', '[', ']','=', '(', ')', ';', '"', ',', '.', '+', 
 '!' }; 
 Step 4: Add the split word to the List; 
Step 5: Return the List; 
 
IV. Remove duplicate words from the list: The list of string array, which is 
created from the previous step, has many duplicate words. In this step, a list of 
list type string created and each item in the list of string array added to the list 
of list if it is not identical to any item in the list of list. The output is a list of 
list type of string contains unique words. The following algorithm describes 
the process: 
 
Remove duplicate words Algorithm: 
Step 0: Initialize the output list (list of list type of string) 

21 
 
Step 1: For each page in the input list (pagei, i=0,..., n) do step 2; 
Step 2: For each item in the page (itemj, j=0,...,m) do step 3; 
Step 3: Check if the item[i] is not empty and does not exist in the 
 output list[i], then do step 4; 
Step 4: Add the item[j] to the output list[i]; 
Step 5: Return the output list; 
 
 
V. Remove the words have less than specific number of characters: The user 
will specify how many characters should be in each word in order to process at 
run time. Small words they only give overload to the data set and most of them 
will be in form of preposition, therefore it is better to remove those words and 
focus on bigger words only. This step will do this process by checking the 
length of character in each word, if it is less than what the user specified then 
it will remove. The user can change the number of character length at run 
time. The output of this step will be the same list of list type of string but 
contains only words that have passed the checking criteria. The following 
algorithm describes the process: 
Remove specific words Algorithm: 
Step 0: For each page in the input list (pagei, i=0,..., n) do step 1; 
Step 1: For each string in the page (stringj, j=0,...,m) do step 2; 
Step 2: Check if the string characters length is less than 2 (or any 
 number specified by the user), then do step 3; 
Step 3: Remove string[j] from the input list[i]; 
j--; 
Step 4: Return the input list; 
 
VI. Remove stop-words: Not every word in the page is important or can reflect 
any value to the model. For example if a word has occurred on most or all the 
pages, such a word doesn’t give any value to the model and doesn’t help to 
map the function, and some time will confuse the decision if it occurred 
alternatively in many classes. Therefore, it is better not to include such a word 
in the data set. For this purpose, a list of stop-words was created to use by this 

22 
 
step. The process will start by reading the file that contains the stop-words in 
the hard disk and adding the words to an array of string. After that a loop will 
start to check if there is any word from the stop-words match any word in the 
list of list, if there is a matching then this word will be deleted from the list of 
list. The output of this step will be a list of list type of string clean from stopwords. The following algorithm describes the process: 
Remove stop-words Algorithm: 
Step 0: Initialize stop-words array (string array) contains the stop-words; 
Step 1: For each word in the in the stop-words array (wordi, i=0,..., n) do step 2; 
Step 2: For each page in the pages-list (pagej, j=0,...,m) do step 3; 
Step 3: Check if the pages-list[j] contains stop-words[i], then do 
 step 4; 
Step 4: Remove stop-words[i] from the pages-list[j]; 
Step 5: Return the pages-list; 
 
VII. Stemming the words: In English language each word or verb has many 
formations for example the verb “play” can be formatted to (player, plays, 
playing, played) or irregular verb such as “eat” can be formatted to (ate, eaten, 
eats, eating). Another example of words variety is singular and plural words. 
Therefore, it is very essential to get the root of those words in order to unify 
them if they have the same root, this process called stemming. This step uses 
Porter Stemming Algorithm (PSA), each word in the list of list will be send to 
PSA, the return word from the algorithm is the stemmed word. The output of 
this step will be a list of list type of string contains stemmed words. Appendix 
B contains Porter Stemming Algorithm. 
VIII. Remove duplicate words after stemming: Although there was a process of 
removing duplicate words before stemming, but because of stemming process, 
many words, which appear un-identical at that time, are identical now after 
stemming. In addition, the reason why removing duplicate step is not omit if 

23 
 
there is a similar process at this step? Removing duplicate words before 
stemming and after stemming will save more time than stemming all the 
words including the duplicate and then removing the duplicate words. The 
output of this step will be a list of list type of string contains stemmed unique 
words. The following algorithm describes the process: 
Remove duplicate words Algorithm: 
Step 0: Initialize the output list (list of list type of string) 
Step 1: For each page in the input list (pagei, i=0,..., n) do step 2; 
Step 2: For each word in the page (wordj, j=0,...,m) do step 3; 
Step 3: Check if the word[i] does not exist in the output list[i], then do 
 step 4; 
Step 4: Add the word[j] to the output list[i]; 
Step 5: Return the output list; 
 
3.3.2 Data Set Creation 
At this point, a list of list type of string (equal to two dimensional array) contains 
processed words is ready. 
I. Create Binary Data Set 
Each page is presented as a record and passed to the algorithm. The algorithm 
will add a new row and check every column in the first row, if any word from 
the page exist as a column, then it will add 1 to that column in the new row 
otherwise it will add new columns and the word will be the title of that 
column. This loop will continue until all the pages included in the two 
dimensional array. The output of this step will be a two dimensional array 
contains all words from the extracted web pages. The following algorithm 
describes the process: 
Binary Data Set Algorithm: 
Step 0: Initialize a List of List (rows) (columns) type of string (equal to 
 multidimensional array or Table); 
Step 1: For each document (Di, i = 0,...,n) do steps 2 – 12; 
Step 2: Initialize string array with words extracted from the Di. 
Step 3: Add new row (item in the list); 

24 
 
Step 4: For each word in the word array (Wj, j=0,...,m) do steps from 5 – 
 11; 
 Step 5: Check if the word (Wj) is exist in the List. IF exist, do 
 steps from 6 to 7; 
 Step 6: integer r = row[n], integer c= column[i] of 
 (existed word); 
Step 7: Insert 1 to List[r,c]; 
IF not exist do steps 8 to 12; 
Step 8: Add new columns “The column title is the 
 new word”; 
Step 9: integer ro = row[n], integer col =column[n]; 
Step 10: Insert value of 1 to the List [ro, col]; 
Step 11: For each row (Rk, k=0,..., p-1) insert value of 0 to 
 the column[n]; 
Step 12: For each row (Rv, v=0,...,z) insert Target category. 
 
Change each item in the Binary data set contains Null to 0 
From the previous algorithm, the data set array contains all the extracted 
words as columns title and a value of “1” if the record contains that word or 
Null if it does not contain it. This step will convert all the Null values to zero 
values by looping through all the items in the array and checking if the value 
is Null then change it to “0”. The output of this step will be a two dimensional 
array contains words in the first row and values (1, or 0) in the other rows. The 
following algorithm describes the process: 
 
Convert Null to Zero Algorithm: 
Step 0: For each row in the dataSet (rowi, i=0,..., n) do step 1; 
Step 1: For each column in the dataSet (columnj, j=0,..., m) do step 2; 
Step 2: Check if dataSet [i, j] = Null, then do step 3; 
Step 3: dataSet [i, j] = 0; 
Step 4: Return the dataSet; 
 
II. Create tf-idf Data Set: 
In this type of data set, each word has a weight depends on its frequency in the 
document. The weight calculated by using the following formula: 
Each page is represented by “d” and the total number of pages represented by “D”, 
such that d 

 D. 

25 
 
The “tf” calculated using the following formula: 
tf
i.j 
= 
푁푗
∑
 푤푖푗
푛
푖=0
 
where: 
N is the total number of the same word (wi) appeared in document (j) 
i (i = 0,...,n) is the index of the word (w) in document (j) 
j(j=0,...,p) is the index of document (d) in (D) 
and the denominator is: 
∑
 푤푖푗
푛
푖=0
 is the sum of all the words (w) in document (dj). 
The calculation of idf is done by using the following formula: 
idfi
 
= log 
|퐷|
|1+ 
{
푑∶푤푖 ∈푑
}
|
 
where |D| is the total number of documents extracted. 
and |
{
푑∶푤푖 ∈푑
}
| is the number of documents where the word (wi) appeared. 
“1” added to avoid division by zero in case if the word did not exist in any other page. 
(tf-idf)i,j is computed by multiplying tf
i.j 
and idfi 
The idea behind this calculation is if the term occurred many times in the same page 
means, it is more important. However, if the same term occurred in many other pages 
means it is less important. By using this technique, each term has specific weight 
according to its occurrence and frequency. This approach gives better value for each 
term instead of giving 0 and 1 or simply counting the frequency of the word in the 

26 
 
document. This method will also create some relations between each word by 
counting the frequency of term in each document. The relations were destroyed 
previously when the algorithm extracted the text and splitting the words into 
individual term without preserving the location of the each word, which gives 
different meaning. 
The data set created by using this algorithm has two dimensions. Each row in the data 
set represent individual page and each column represent unique term. The following 
algorithms describe all the mentioned calculation and data set creation. 
Creating tfidf Data Set Algorithm (First part) 
Step 0: Initialize two-dimensional Array (DS) type of string; 
Step 1: For each document (di, i = 0,...,n) do steps 2 – 11; 
Step 2: Initialize string array (W) with words extracted from di. 
Step 3: For each word in the word array (Wj, j=0,...,m) do steps from 4 – 
 11; 
 Step 4: Check if the word (Wj) is exist in the DS. 
 IF exist, do steps from 5 to 6; 
 Step 5: integer r = row[n], integer c= column[i] of 
 (existed word Wi); 
Step 6: Add 1 to the value in DS[r,c]; 
 IF not exist do steps 7 to 10; 
Step 7: Add new columns “The column title is the 
 new word”; 
Step 8: integer ro = row[n], integer col =column[n]; 
Step 9: Insert value of 1 to the DS [ro, col]; 
Step 10: For each row (Rk, k=0,..., p-1) insert value 
 of 0 to the column[n]; 
Step 11: For each row (Rv, v=0,...,z) insert Target category. 
Step 12: Return DS. 
 
 
Change each item in DS contains Null to 0 (Second part) 
This step will convert all the Null values to zero values by looping through all 
the items in the array and checking if the value is Null then change it to “0”. 
The output of this step will be a two dimensional array contains words in the 

27 
 
first row and values (words count, or 0) in the other rows. The following 
algorithm describes the process: 
 
Convert Null to Zero Algorithm: 
Step 0: For each row in the DS (rowi, i=0,..., n) do step 1; 
Step 1: For each column in the DS (columnj, j=0,..., m) do step 2; 
Step 2: Check if DS [i, j] = Null, then do step 3; 
Step 3: DS [i, j] = 0; 
Step 4: Return the DS; 
 
Calculate tf-idf Algorithm (Third part): 
Step 0: Initialize two-dimensional Array (DS) type of string (Output of the second 
 part); 
Step 1: Initialize int N “represent all the same word in the page”; 
Step 2: For each row in the DS (rowi, i=1,..., n) do step 2; 
Step 3: For each column in the DS (columnj, j=0,..., m) do step 3; 
 Step 4: For each row in the DS (rowk, k=1,..., m) do step 4; 
Step 5: IF DS[k,c] != “0” do step 6; 
 Step 6: N++; 
 Step 7: IF DS[r,c] != “0” do step 8; 
 Step 8: DS[r,c] = CalcTFIDF(DS[r,c], N); 
Step 9: N = 0; 
 
 
CalcTFIDF (DS[r,c], N) Algorithm (Forth part): 
Step 0: Initialize double tfidf; 
Step 1: Initialize double N “represent all the same word in the page”; 
Step 2: Initialize double tf; 
Step 3: Initialize double idf; 
Step 4: tf = N / the sum of all the words in document; 
Step 5: idf = Log (total pages) / (1 + the number of pages where the word 
 appeared); 
Step 6: tfidf = tf * idf; 
Step 7: Return tfidf; 
 
III. Build Physical Data Set 
This step will create physical data set from the data set in the memory. The following 
sub-steps and algorithms will describe this step. 
a) Count how many columns has non 0 value in the data set 

28 
 
The initialization of the two-dimensional array was approximated to (1500, 15000) 
which is quite enough to hold all the data but it is not precisely represent the data set 
dimensions. This step is the first point towards calculating the exact size of 
dimensions by counting how many columns has not “0” values which means the 
columns contain words. It is clearly the process needs to loop only in the first row 
because it represents all the words in the array. The total number of counting plus “1” 
as a target column will represent the boundary of the second dimension in the data set 
while the first dimension can be obtained from the total number of pages extracted in 
the beginning plus “1” as a row (for columns title). The following algorithm describes 
the process: 
b) Counting Array Boundary Algorithm: 
Step 0: Initialize integer variable (counter); 
Step 1: For each column in the dataSet (columni, i=0,..., n) do step 2; 
Step 2: Check if dataSet [0, j] Not equal to 0, then do step 3; 
Step 3: counter++; 
Step 4: Return counter; 
 
c) Initialize the size of two dimensional array type of string: 
After obtaining the required dimensions for the data set, this step will initializing the 
new two dimensional array with the obtained dimensions. The output of this step will 
be a new two-dimensional array, which represent the data set size exactly. The 
following algorithm describes the process: 
Initialize the size of two-dimensional array Algorithm: 
Step 0: Initialize integer variable (dim1); 
Step 1: Initialize integer variable (dim2); 
Step 2: dim1 = number of page extracted + 1; 
Step 3: dim2 = counter + 1 (counter from previous algorithm); 
Step 4: Initialize 2 dimensional string array NewDataSet [dim1, dim2]; 
Step 5: Return NewDataSet; 
 
d) Copy every item from the old data set array to the new array: 

29 
 
This step will copy every item in the previous array to the new two dimensions array. 
The output of this step will be a two dimensions array contains all the words and 
values with specific and known size. The following algorithm describes the process: 
Copy all items Algorithm: 
Step 0: For each row in the old dataSet (rowi, i=0,..., n) do step 1; 
Step 1: For each column in the old dataSet (columnj, j=0,..., m) do step 2; 
Step 2: (Copy item from old array to new array) NewDataSet [i, j] = 
dataSet [i, j]; 
Step 3: Return the NewDataSet; 
 
 
e) Create CSV file 
This is the last step in creating data set. This process starts by creating a stream writer 
object to create and write a file on specific location with .CSV extension. After that, 
the process will loop through all the rows and columns in the data set array to read its 
content and write it to the stream object with a comma after each value. Between 
looping in each row, a new line is created in the file to represent the next row in the 
array. The output of this step will be a .CSV file which is ready for classification 
process. The following algorithm describes the process: 
Create CSV file Algorithm: 
Step 0: Initialize stream writer (SW) object to write into specific file; 
Step 1: For each row in the NewDataSet (rowi, i=0,..., n) do step 2; 
Step 2: For each column in the NewDataSet (columnj, j=0,...,m) do step 3; 
Step 3: Write the value of NewDataSet [i, j] to the file using SW 
 object; 
 Step 4: Write new line in the file using SW object; 
Step 5: Close SW object; 
 
3.3.3 Text Classification 
Two algorithms used in this experiment are the Naïve-Bayes (NB) and C 4.5. Both 
applied with different size of data set and extraction parameters to investigate which 
algorithm give better results in terms of accuracy and processing time. 

30 
 
The WEKA application was used in this project where the Explorer option was 
selected from the WEKA GUI. WEKA has the facility to handle (CSV) file therefore 
there is no need to convert the data set because it is already in CSV format. 
 
3.4 Algorithm Validation 
At the final phase, a validation technique applied to validate the accuracy of the rules 
and module built by WEKA. The module used to classify a new web page in the same 
scope. For this purpose, a prototype built using Visual C# 2008. It is a window 
application mixed with web connection protocol. It uses HTTP protocol for internet 
connection. The prototype has the ability to view the web page inside the User 
Interface and downloading the source code for further process. The prototype gives 
the user the facility of saving the data set in CSV or TEXT extension. 
 
3.5 Developing of the Prototype 
The methodology adopted for developing the prototype “HTML Parser” called Code 
and Fix methodology. 
Code and Fix methodology consists of two steps: Write some code and Fix the 
problem in the code (McConnell, 1996). 
This methodology is suitable for small projects like the developed one. Because of the 
iterative nature of the methodology, it was easy to change the prototype functionality 
when it is required. This methodology focuses on writing the code to implement the 
required functions, execute the functions, and fix the code if the functions do not work 
correctly as it is required. 

31 
 
 
3.6 Summary 
The methodologies discussed in this chapter are practical methodologies. The first 
methodology designed specifically to formulate the text extraction algorithm. All the 
algorithms mentioned in this chapter implemented as classes in the prototype. The 
second methodology adapted from (McConnell, 1996) which is perfect to develop a 
small prototype like this one. The methodologies were suitable and quite sufficient for 
this study. 
 
 
 

32 
 
CHAPTER FOUR 
PROTOTYPE DEVELOPMENT 
 
This chapter describes the methodology used in developing the prototype. The code 
and fix methodology adapted in developing the prototype “HTML_Parser”. 
 
4.1 Implementation Details 
The prototype implemented by using Visual C#. All the algorithms mentioned in 
chapter three are implemented as classes in the prototype HTML_Parser. 
HTML_Parser is windows application mixed with web connection protocol. It uses 
HTTP protocol for internet connection. The prototype has the facility to view the web 
page inside the User Interface and downloading the source code for further process. 
The prototype gives the user the facility of saving the data set in CSV extension. 
 Many considerations were taken into account especially the processing time in 
developing the HTML Parser prototype. The prototype gives the user many options 
while processing the documents like how much percentage to be processed from each 
page, remove stop-words, and stemming the words etc. 
The prototype can handle any number of web pages to process but the data set size 
will be limited to the free size available in the physical memory. By trying many 
number of pages (max 737), it was able to create a data set with (738 rows and 12847 
columns). 

33 
 
The processing time will be different from machine to machine depend on the 
processor speed and the number of pages. The machine used to run this prototype is 
Intel Core 2 Duo, 2.00 GHz with 3.00 GB RAM and the performance was quite 
acceptable in term of time. 
The prototype divides the algorithms into classes. Each class represents one 
algorithm. Fig 4.1 shows the classes diagram and their relationships with each other. 
 
Figure 4.1: Classes Diagram 
 

34 
 
4.2 Classes Implementation 
 
I. Get Source Code Algorithm Implementation 
There are two methods to extract the source code. First method is 
extracting online web page. And the second one is extracting web pages 
from the hard disk. Appendix A contains both methods implemented in 
Visual C#. 
 
II. Calculate Percentage Algorithm Implementation 
This class provides a method used for calculating how much percentage 
from each page to be extracted. Appendix A contains the method 
implemented in Visual C#. 
 
III. Remove HTML Tags Algorithm Implementation: 
The object of this class has a method called Remove. This method takes 
one parameter “array of string” as the extracted web pages. The method 
will remove any HTML tags in those pages. Appendix A contains the 
method implemented in Visual C#. 
 
IV. Remove Digits Algorithm Implementation 
The object of this class has a method called RemoveDigits. This method 
takes one parameter “array of string” as the extracted web pages. The 
method will remove any digits in those pages. Appendix A contains the 
method implemented in Visual C#. 
 

35 
 
V. Split lines to words Algorithm Implementation 
This class provides the functionality to split the text extracted from the 
web pages into separated words. Appendix A contains the method 
implemented in Visual C#. 
 
VI. Remove duplicate words Algorithm Implementation 
This class provides the methods to search and remove duplicate words 
before and after stemming. Appendix A contains the method implemented 
in Visual C#. 
 
VII. Remove the words have less than specific number of characters 
This class provides a method used to select only words have specific 
number of character. Appendix A contains the method implemented in 
Visual C#. 
 
VIII. Remove stop-words Algorithm Implementation 
This class provides the functionality to remove the stop-words from the 
extracted web pages. Appendix A contains the method implemented in 
Visual C#. 
 
IX. Stemming the words Algorithm Implementation 
This class implements Porter Stemming Algorithm. The class 
implementation adapted from (http://tartarus.org/~martin/PorterStemmer/). 
Appendix B provides the algorithm implementation. 
 

36 
 
X. Create Binary Data Set Algorithm Implementation 
This class provides methods for creating Binary data set. Appendix A 
contains the method implemented in Visual C#. 
 
XI. Create tf-idf Data Set Algorithm Implementation 
This class provides methods for creating tf-idf data set. Appendix A 
contains the method implemented in Visual C#. 
 
XII. Build Physical Data Set Algorithm Implementation 
This class provides a method to create the data set type .CSV on the hard 
disk. Appendix A contains the method implemented in Visual C#. 
 
4.3 Graphical User Interface 
The interface provides all the facilities required for text extraction. The extraction can 
be done from online web page by providing the web page address to the built-in 
browser or from web pages stored in the hard disk. Fig 4.3 shows the HTML_Parser 
interface. 

37 
 
 
Figure 4.2: HTML_Parser Interface. 
4.4 Summary 
This chapter describes the implementation tools and process of developing 
HTML_Parser prototype. The prototype provides all the facilities required for Web 
Text Extraction such as extracting text from web pages, processing the text, and 
creating the data set. The prototype performance was acceptable in term of processing 
time consumption and extraction accuracy. 
 
 

38 
 
CHAPTER FIVE 
RESULTS AND FINDINGS 
 
This chapter presents the experiment results conducted for this study. 
 
5.1 Experiment Design 
The experiment was divided into two parts. First, is the extraction process using 
HTML Parser to create a data set. Second, is the classification activities performed 
using WEKA application to classify the web pages. Two classifier algorithms were 
used in this study, Naïve-Bayes algorithm and C4.5 which is J48 in WEKA. Those 
algorithms were applied on two types of data set, Binary and tf-idf. 
 
5.2 Extraction process 
The experiment conducted used the web pages from Carnegie Mellon University 
called “CMU World Wide Knowledge Base (Web->KB) project”. This package 
contains WWW-pages collected from the universities of Cornell, Texas, Washington, 
and Wisconsin. The pages were manually classified into the following categories: 
student, faculty, staff, department, course, and project. The number of pages for each 
category is 1641, 1124, 137, 182, 930, 504, and 3764 respectively. Only four of the 
categories have been used in this experiment. The categories are course, department, 
project, and student. Where there are 150 pages for each category total of 600 web 
pages. 

39 
 
The files are organized into a directory structure, where one directory represents one 
class. Each of these four directories consists of five universities web pages. Some of 
the pages do not contain useful information. For example, about 80 pages only 
contain information for redirecting the browser to a different location; therefore the 
pages were selected after testing each one manually. 
After downloading and organizing the pages in the directories, the next step is reading 
those pages. There are two options for this step, either by reading the page 
individually one by one or by reading all the pages inside the folder in one shot, this 
experiment was conducted by using the second option in order to save time. 
The HTML Parser prototype used here has the facilities to read multi-pages, 
providing the folder path and the target “class type” manually. In this experiment, 
100% of the pages content was selected to be processed into a data set. Many words 
in the web pages have small number of characters, therefore the next option selected 
is processing only the words consist of more than 3 characters. To apply tf-idf 
technique, duplicate words are needed to weight the frequency of that word. For this 
requirement, the option remove duplicate is not selected. The other option remove 
stop-words selected to remove the words that don’t give any value for classification. 
Stemming option was selected to apply Porter Stemming Algorithm on the extracted 
words. The last option is to decide which technique needs to be used for creating data 
set. There are two techniques, one is creating binary data set and the other is creating 
tf-idf data set. In this experiment, both techniques are used to create two type of data 
set. The first data set is created by using tf-idf technique. After creating the data set, 
the file was in form of comma separated value (CSV). The final data set contains 600 
rows with 4 classes namely: course, department, project, and student. Fig 5.1 shows 
part of tf-idf data set. 

40 
 
 
Figure 5.1: tf-idf Data set. 
The second data set created by using binary technique instead of tf-idf technique. For 
this type of data set, no need to count the duplicate words, therefore the option 
remove duplicate was selected. Fig 5.2 shows part of binary data set. 
 
Figure 5.2: Binary data set. 
 

41 
 
5.3 Classification Activities 
WEKA 3.7.1 application used in this experiment, Explorer option selected from the 
Weka GUI chooser. WEKA has the facility to handle (CSV) file therefore there is no 
need to convert the data set because it is already in CSV format. The algorithm chosen 
in this experiment was Naïve-Bayes algorithm and C4.5 (J48). Many splitting rate 
was used to find out which rate is the best for training and test parts. Table 5.1 shows 
the classification results for 60% as training set and 40% as testing set using tf-idf 
data set. 
Data 
Representation 
No of 
attributes 
Classifier 
Algorithm 
Discretized 
Time to 
build model 
(second) 
Incorrectly 
classified 
Correctly 
classified 
tf-idf 7328 
Naïve-
Bayes 
No 4.42 8.3333 % 91.6667 % 
tf-idf 7328 
Naïve-
Bayes 
Yes 5.65 13.75 % 86.25 % 
tf-idf 7328 C4.5 N/A 40.43 22.9167 % 77.0833 % 
Table 5.1: classification results (60%-40%). 
By using tf-idf data set with (60% - 40%) as splitting rate and without applying 
Discretization option, Naïve-Bayes shows the highest accuracy around 91%. 
 
Table 5.2 shows the classification results for 65% as training set and 35% as testing 
set using tf-idf data set. 
Data 
Representation 
No of 
attributes 
Classifier 
Algorithm 
Discretized 
Time to 
build model 
(second) 
Incorrectly 
classified 
Correctly 
classified 
tf-idf 7328 
Naïve-
Bayes 
No 4.08 9.5238 % 90.4762 % 
tf-idf 7328 
Naïve-
Bayes 
Yes 5.76 15.7143 % 84.2857 % 
tf-idf 7328 C4.5 N/A 39.65 19.0476 % 80.9524 % 
Table 5.2: classification results (65%-35%). 

42 
 
By using tf-idf data set with (65% - 35%) as splitting rate and without applying 
Discretization option, Naïve-Bayes shows the highest accuracy around 90%. 
 
Table 5.3 shows the classification results for 70% as training set and 30% as testing 
set using tf-idf data set. 
Data 
Representation 
No of 
attributes 
Classifier 
Algorithm 
Discretized 
Time to 
build model 
(second) 
Incorrectly 
classified 
Correctly 
classified 
tf-idf 7328 
Naïve-
Bayes 
No 3.61 7.7778 % 92.2222 % 
tf-idf 7328 
Naïve-
Bayes 
Yes 5.3 17.2222 % 82.7778 % 
tf-idf 7328 C4.5 N/A 41.45 24.4444 % 75.5556 % 
Table 5.3: classification results (70%-30%). 
By using tf-idf data set with (70% - 30%) as splitting rate and without applying 
Discretization option, Naïve-Bayes shows the highest accuracy around 92%. 
 
Table 5.4 shows the classification results for 60% as training set and 40% as testing 
set using Binary data set. 
Data 
Representation 
No of 
attributes 
Classifier 
Algorithm 
Discretized 
Time to 
build model 
(second) 
Incorrectly 
classified 
Correctly 
classified 
Binary 7209 
Naïve-
Bayes 
No 4.34 10.8333 % 89.1667 % 
Binary 7209 
Naïve-
Bayes 
Yes 5.3 14.5833 % 85.4167 % 
Binary 7209 C4.5 N/A 49.42 20.4167 % 79.5833 % 
Table 5.4: classification results (60%-40%). 
By using Binary data set with (60% - 40%) as splitting rate and without applying 
Discretization option, Naïve-Bayes shows the highest accuracy around 89%. 
 

43 
 
Table 5.5 shows the classification results for 65% as training set and 35% as testing 
set using Binary data set. 
Data 
Representation 
No of 
attributes 
Classifier 
Algorithm 
Discretized 
Time to 
build model 
(second) 
Incorrectly 
classified 
Correctly 
classified 
Binary 7209 
Naïve-
Bayes 
No 3.63 11.9048 % 88.0952 % 
Binary 7209 
Naïve-
Bayes 
Yes 5.14 13.8095 % 86.1905 % 
Binary 7209 C4.5 NA 47.96 20 % 80 % 
Table 5.5: classification results (65%-35%). 
By using Binary data set with (65% - 35%) as splitting rate and without applying 
Discretization option, Naïve-Bayes shows the highest accuracy around 88%. 
 
Table 5.6 shows the classification results for 70% as training set and 30% as testing 
set using Binary data set. 
Data 
Representation 
No of 
attributes 
Classifier 
Algorithm 
Discretized 
Time to 
build model 
(second) 
Incorrectly 
classified 
Correctly 
classified 
Binary 7209 
Naïve-
Bayes 
No 4.18 12.2222 % 87.7778 % 
Binary 7209 
Naïve-
Bayes 
Yes 5.65 16.1111 % 83.8889 % 
Binary 7209 C4.5 NA 46.39 26.1111 % 73.8889 % 
Table 5.6: classification results (70%-30%). 
By using Binary data set with (70% - 30%) as splitting rate and without applying 
Discretization option, Naïve-Bayes shows the highest accuracy around 87%. 
 
The number of instance has a vital rule in the results of accuracy. By trying many data 
set sizes, the results are that neither a small number of instances nor a very big 
number of instances was useful in the classification accuracy. This is because a small 

44 
 
number of instances will build a weak model which doesn’t have the ability to classify 
new input to the correct class. In opposite case when using very big number of 
instances, the model will be very complex and the probability of over fitting will 
increase, therefore the data set in middle size (600 instances) gives the best results. 
From the results, it is clear that the highest accuracy around 92% obtained by using tfidf data set and Naïve-Bayes classifier without applying Discretization technique. By 
using tf-idf technique the classification accuracy increased around 3% from using 
Binary technique. This type of classification called Hierarchical classification 
(Xiaoguang & Brian, 2009). This type of classification is more difficult than flat 
classification because the task here is to classify the pages which are in the same area 
(educational in this case). Those pages have similar nature in structure and contents, 
therefore it is not an easy task to classify such pages and we need a sufficient 
algorithm to extract the text from those pages like the developed one. 
 
Songbo, Xueqi, Moustafa, Bin, and Hongbo (2005) have proposed a clssification 
algorithm called “DragPushing”. In their approach, they have used the same data set 
used in this study (WebKB) and seven classification algorithms namely: Refining 
Centroid Classifier (RCC), Centroid, Refining Naïve-Bayes (RNB), Naïve-Bayes 
(NB), K-Nearest Neighbor (KNN), Winnow, and Support Vector Machines (SVM). 
Table 5.7 shows the classification accuracy details for each algorithm. 
RCC Centroid RNB NB KNN Winnow SVM 
0.8802 0.7847 0.9063 0.8601 0.7015 0.7785 0.8766 
Table 5.7: Classification accuracy Details (DragPushing). 
 

45 
 
RCC and RNB algorithms gave the highest classification algorithm, while NB gave 
around 0.86 classification accuracy. 
Comparing their approach with this approach, this approach was able to produce 
classification accuracy approximately 2% higher than RNB and approximately 6% 
higher than NB in their approach. This indicated that, this study has improved the web 
page classification in term of accuracy. 
 
5.4 Summary 
Experiment was conducted by using Naïve-Bayes and C4.5 classifier. Both algorithms 
applied on Binary data set and tf-idf data set. Naïve-Bayes classifier shows accuracy 
around 92% which is higher than C4.5 around 80%. Another important factor is the 
time consumption. C4.5 consume around 41 second while Naïve-Bayes consume 
around 3 second. Therefore, the classifier Naïve-Bayes consider better than C4.5 in 
term of classification accuracy and time consumption. 
Extracting more words from each page gives more accuracy later in the classification, 
the other options like selecting the number of character in each word should be 
considered as critical option as well. 
Naïve-Bayes has good capability to handle large data set. By using hierarchical data 
set, the classification accuracy was acceptable. The accuracy will increased if the 
classification type is flat structure or if the data set classes are less than 4 classes. 
Educational data set is very challenging data set and it is good date set for evaluating 
an algorithm. The proposed approach is a promising approach for text extraction in 
web classification. 

46 
 
CHAPTER SIX 
CONCLUSION AND RECOMMENDATIONS 
 
This chapter concludes the study and discussion of this project for Text Extraction 
Algorithm. Finally, recommendation and further study in this area are given in this 
chapter. 
6.1 Project Conclusion 
The objective of this project is to develop a text extraction algorithm for web text 
classification. The HTML Parser that been developed provides all the facilities 
required for text extraction from the web pages. Two type of data set can be created 
using the HTML Parser, one is Binary data set and the other one is id tf-idf data set. 
The experiment validates the prototype by applying classifier algorithms on the data 
set. Two classifiers algorithms used, one is Naïve-Bayes algorithm and the other one 
is C4.5 algorithm. Naïve-Bayes classifier produced better classification results in term 
of accuracy and time consumption. 
 
 

47 
 
6.2 Recommendation for Future Research 
For more research on text extraction algorithm for web page classification, deep 
investigations on feature selection algorithms are required to improve the data set 
quality. The feature selection algorithm shows strong influence on classification 
accuracy. 
Furthermore, it is also suggested that in future research, how to represent the extracted 
data should be investigated. It is very vital to find a method to preserve the relations 
between extracted words. 
 

48 
 
REFERENCES 
Aaron, H. (2004). Introduction to Bayesian Learning. New York: University of 
Toronto. 
 
Alonso-Rios, D., Luis-Vazquez, I., Mosqueira-Rey, E., Moret-Bonillo, V., & del Rio, 
B. B. (2009, 11-14 Oct. 2009). An HTML analyzer for the study of web 
usability. Paper presented at the IEEE International Conference on Systems, 
Man and Cybernetics, 2009. SMC 2009, San Antonio, TX, 1224-1229. 
 
Anagnostopoulos, I., Kouzas, G., Anagnostopoulos, C., Vergados, D., 
Papaleonidopoulos, I., Generalis, A.,...Kayafas, E. (2002). Automatic Web 
site classification in a large repository under information filtering and retrieval 
techniques. Paper presented at the Electrotechnical Conference, 2002. 
MELECON. 11th Mediterranean, 279-283. 
Cooley, R., Mobasher, B., & Srivastava, J. (1997, 3-8 Nov 1997). Web mining: 
information and pattern discovery on the World Wide Web. Paper presented at 
the Proceedings Ninth IEEE International Conference on Tools with Artificial 
Intelligence, Newport Beach, CA, 558-567. 
 
Croft, B., Metzler, D., & Strohman, T. (2009). Search Engines: Information Retrieval 
in Practice: Addison Wesley. 
 
Dey, L., & Haque, S. K. M. (2008). Opinion mining from noisy text data. Paper 
presented at the Proceedings of the second workshop on Analytics for noisy 
unstructured text data, Singapore, 83-90. 
 
Feng, C., & Cherkassky, V. (2009). SVM+ regression and multi-task learning. Paper 
presented at the International Joint Conference on Neural Networks, 2009. 
IJCNN 2009, Atlanta, GA, 418-424. 
 
Fresno, V., & Ribeiro, A. (2004). An Analytical Approach to Concept Extraction in 
HTML Environments. Journal of Intelligent Information Systems, Volume 22, 
Number 3(0925-9902), 215-235. 
 
Hui, G., Yan, F., & Jian-Ping, L. (2008). Classification of Sensitive Web Documents. 
Paper presented at the International Conference on Apperceiving Computing 
and Intelligence Analysis, 2008. ICACIA 2008, Chengdu, 295-298. 
 
Jianli, L., Jie, S., & Cuihua, X. (2004, 15-18 Sept. 2004). Segmenting the Web 
document with document object model. Paper presented at the IEEE 
International Conference on Services Computing, 2004. (SCC 2004), China, 
449-452. 
 
Ke, Y., Yueming, L., Zongyan, L., & Xiang, D. (2008, 25-27 June 2008). 
Architecture and design of the parser used in the embedded browser based on 
DFA. Paper presented at the 7th World Congress on Intelligent Control and 
Automation, 2008. WCICA 2008, Chongqing, 6627-6630. 
 

49 
 
Lee, S., Baker, J., Song, J., & Wetherbe, J. C. (2010). An Empirical Comparison of 
Four Text Mining Methods. Paper presented at the International Conference on 
43rd Hawaii System Sciences, (HICSS), Honolulu, Hawaii, USA, 1-10. 
 
Liu, Y., A process-based search engine. Ph.D. dissertation, University of Kansas, 
 United States -- Kansas. Retrieved April 16, 2010, from Dissertations & 
 Theses: Full Text.(Publication No. AAT 3360489). 
McConnell, S. (1996). Rapid Development: Taming Wild Software Schedules. 
Washington: Microsoft Press. 
 
Meijuan, G., Jingwen, T., & Shiru, Z. (2009). Research of web classification mining 
based on classify support vector machine. Paper presented at the International 
Colloquium on Computing, Communication, Control, and Management, 2009. 
CCCM 2009. ISECS, Sanya, 21-24. 
 
Morales, S. d. P. B., Fandino, H. A. B., & Rodrivguez, J. R. (2009). Hypertext 
classification to filtrate information on the web. Paper presented at the 
Proceedings of the 2009 Euro American Conference on Telematics and 
Information Systems: New Opportunities to increase Digital Citizenship, 
Prague, Czech Republic, 1-7. 
 
Qi, X., & Davison, B. D. (2009). Web page classification: Features and algorithms. 
ACM Computing Surveys (CSUR), 41(2), 1-31. 
 
Ronen, F., & James, S. (2007). The text mining handbook. New York: Cambridge 
University Press. 
 
Shiqun, Y., Yuhui, Q., & Jike, G. (2007). Research and Realization of Text Mining 
Algorithm on Web. Paper presented at the International Conference on 
Computational Intelligence and Security Workshops, 2007. CISW 2007, 
Harbin, 413-416. 
 
Songbo, T., Xueqi, C., Moustafa, M. G., Bin, W., & Hongbo, X. (2005). A novel 
refinement approach for text categorization. Paper presented at the 
Proceedings of the 14th ACM international conference on Information and 
knowledge management, Bremen, Germany, 469-476. 
 
Suanmali, L., Salim, N., & Binwahlan, M. S. (2009). Feature-Based Sentence 
Extraction Using Fuzzy Inference Rules. Paper presented at the 2009 
International Conference on Signal Processing Systems, Singapore, 511- 515. 
 
Sun, B., Sun, Q., Chen, Z., & Fu, Z. (2009). A Study on Automatic Web Pages 
Categorization. Paper presented at the IEEE International Advance 
Computing Conference, 2009. IACC 2009, Patiala, 1423-1427. 
 
Tomar, G. S., Shekhar, V., & Ashish, J. (2006). Web Page Classification using 
Modified Naive Bayesian Approach. Paper presented at the TENCON 2006. 
IEEE Region 10 Conference, Hong Kong, 1-4. 
 

50 
 
Wang, J., Huang, Y., Wu, G., & Zhang, F. (1999). Web mining: knowledge discovery 
on the Web. Paper presented at the 1999 IEEE International Conference on 
Systems, Man, and Cybernetics, IEEE SMC '99 Conference Proceedings, 
Tokyo, 137-14. 
 
Xiaoguang, Q., & Brian, D. D. (2009). Web page classification: Features and 
algorithms. ACM Computing Surveys (CSUR) 41(2), 1-31. 
 
Yu, H., Han, J., & Chang, K. C.-C. (2002). PEBL: positive example based learning 
for Web page classification using SVM. Paper presented at the Proceedings of 
the eighth ACM SIGKDD international conference on Knowledge discovery 
and data mining, Edmonton, Alberta, Canada, 239-248. 
 
Zhang, Y., Zincir-Heywood, N., & Milios, E. (2005). Narrative text classification for 
automatic key phrase extraction in web document corpora. Paper presented at 
the Proceedings of the 7th annual ACM international workshop on Web 
information and data management, Bremen, Germany, 51-58. 
 
Zhongda, L., Kun, D., & Yanfen, H. (2007). Research of Web Pages Categorization. 
Paper presented at the IEEE International Conference on Granular Computing, 
2007. GRC 2007, Fremont, CA, 691-691. 
 
Ziegler, C.-N., Vogele, C., & Viermetz, M. (2009). Distilling Informative Content 
from HTML News Pages. Paper presented at the International Joint 
Conferences on Web Intelligence and Intelligent Agent Technologies, 2009. 
WI-IAT '09. IEEE/WIC/ACM Milan, Italy, 707-712. 
 

51 
 
Appendix A 
 
I. Get Source Code Algorithm Implementation 
class GetSourceCode 
 { 
 public StreamReader streamReader; 
 public string GetOnlineSource(string address) 
 { 
 Uri uri = new Uri(address); 
 WebRequest request = WebRequest.Create(uri); 
 request.Credentials = CredentialCache.DefaultCredentials; 
 HttpWebResponse response = (HttpWebResponse)request.GetResponse(); 
 Stream dataStream = response.GetResponseStream(); 
 streamReader = new StreamReader(dataStream); 
 string sourceCode = streamReader.ReadToEnd(); 
 response.Close(); 
 dataStream.Close(); 
 streamReader.Close(); 
 return sourceCode; 
 } 
 public string[] GetSourecCodeFromFolder() 
 { 
 FolderBrowserDialog folderDialog = new FolderBrowserDialog(); 
 folderDialog.ShowDialog(); 
 if (folderDialog.SelectedPath != "") 
 { 
 DirectoryInfo di = new DirectoryInfo(folderDialog.SelectedPath); 
 FileInfo[] filesInfo = di.GetFiles(); 
 string[] allPages = new string[filesInfo.Length]; 
 
 // 1- Collect files and read source code 
 for (int i = 0; i < filesInfo.Length; i++) 
 { 
 streamReader = new StreamReader(filesInfo[i].FullName); 
 allPages[i] = streamReader.ReadToEnd().ToLower(); 
 } 
 return allPages; 
 } 
 string[] allPages2 = new string[0]; 
 return allPages2; 
 } 
 } 
 
II. Calculate Percentage Algorithm Implementation 
class CalcPercentage 
 { 
 public void Calc(double percentage, string[] webPages) 
 { 
 for (int i = 0; i < webPages.Length; i++) 
 { 
 if (webPages[i].Length > webPages[i].Length * percentage) 
 { 
 int start = Convert.ToInt32(webPages[i].Length * percentage); 
 int charCount = webPages[i].Length - start; 
 webPages[i] = webPages[i].Remove(start, charCount); 
 } 

52 
 
 } 
 } 
 } 
III. Remove HTML Tags Algorithm Implementation: 
public class RemoveHTMLTags 
{ 
 public void Remove(string[] webPages) 
 { 
 for (int i = 0; i < webPages.Length; i++) 
 { 
 webPages[i] = Regex.Replace(webPages[i], @"<(.|\n)*?>", " "); 
 } 
 } 
 } 
IV. Remove Digits Algorithm Implementation 
class RemoveDigits 
 { 
 public void Remove(string[] webPages) 
 { 
 for (int i = 0; i < webPages.Length; i++) 
 { 
 for (int j = 0; j < webPages[i].Length; j++) 
 { 
 if (!char.IsLetter(webPages[i][j]) && webPages[i][j] != ' ') 
 { 
 webPages[i] = webPages[i].Replace(webPages[i][j], ' '); 
 } 
 } 
 } 
 } 
 } 
V. Split lines to words Algorithm Implementation 
class SplitText 
 { 
 public void Split(string[] allPages, List<string[]> listArray) 
 { 
 for (int i = 0; i < allPages.Length; i++) 
 { 
 string[] words = allPages[i].Split(new char[] { ' ', '\n', '\\','/', '-', '_', '{', '}', '[', 
']', '=', '(', ')', ';', '"', ',', '.', '+', '!' }); 
 List<string> wordList = new List<string>(); 
 for (int w = 0; w < words.Length; w++) 
 { 
 if (words[w].Trim() != "") 
 { 
 wordList.Add(words[w]); 
 } 
 } 
 string[] cleanWords = new string[wordList.Count]; 
 for (int cw = 0; cw < cleanWords.Length; cw++) 
 { 
 cleanWords[cw] = wordList[cw]; 
 } 
 listArray.Add(cleanWords); 
 } 

53 
 
 } 
 } 
VI. Remove duplicate words Algorithm Implementation 
class RemoveDuplicate 
 { 
 public void Remove(bool remove, List<string[]> pageList, List<List<string>> 
 webPagesList) 
 { 
 if (remove) 
 { 
 for (int i = 0; i < pageList.Count; i++) 
 { 
 for (int j = 0; j < pageList[i].Length; j++) 
 { 
 if (!webPagesList[i].Contains(pageList[i][j].Trim())) 
 { 
 webPagesList[i].Add(pageList[i][j]); 
 } 
 } 
 } 
 } 
 else 
 { 
 for (int i = 0; i < pageList.Count; i++) 
 { 
 for (int j = 0; j < pageList[i].Length; j++) 
 { 
 webPagesList[i].Add(pageList[i][j]); 
 } 
 } 
 } 
 } 
 public void InitializaeUniqeWords(List<List<string>> stemedList, 
 List<List<string>> uniqueWordsList) 
 { 
 for (int i = 0; i < stemedList.Count; i++) 
 { 
 uniqueWordsList.Add(new List<string>()); 
 } 
 } 
 
 internal void Remove(bool removeDuplicate, List<List<string>> stemedList, List<List<string>> 
uniqueWordsList) 
 { 
 if (removeDuplicate) 
 { 
 for (int i = 0; i < stemedList.Count; i++) 
 { 
 for (int j = 0; j < stemedList[i].Count; j++) 
 { 
 if (!uniqueWordsList[i].Contains(stemedList[i][j])) 
 { 
 uniqueWordsList[i].Add(stemedList[i][j]); 
 } 
 } 
 } 
 } 
 else 

54 
 
 { 
 for (int i = 0; i < stemedList.Count; i++) 
 { 
 for (int j = 0; j < stemedList[i].Count; j++) 
 { 
 uniqueWordsList[i].Add(stemedList[i][j]); 
 } 
 } 
 } 
 } 
 } 
VII. Remove the words have less than specific number of characters 
class RemoveMinNoOfChar 
 { 
 public void Remove(int charLength, List<List<string>> webPagesList) 
 { 
 if (charLength < 100) 
 { 
 for (int i = 0; i < webPagesList.Count; i++) { 
 for (int j = 0; j < webPagesList[i].Count; j++) { 
 if (webPagesList[i][j].Length < charLength){ 
 webPagesList[i].Remove(webPagesList[i][j]); 
 j--; 
 } 
 } 
 } 
 } 
 } 
 } 
 
VIII. Remove stop-words Algorithm Implementation 
public class RemoveStopWords 
 { 
 public void Remove(bool remove, List<List<string>> webPagesList) { 
 if (remove) { 
 string[] stop_words = File.ReadAllLines(@"stop.txt"); 
 for (int i = 0; i < stop_words.Length; i++) { 
 for (int j = 0; j < webPagesList.Count; j++) { 
 if (webPagesList[j].Contains(stop_words[i].ToLower())) { 
 webPagesList[j].Remove(stop_words[i].ToLower()); 
 } 
 } 
 } 
 } 
 } 
 } 
IX. Stemming the words Algorithm Implementation 
class Stemmer 
 { 
 public void InitializeStemmedlIST(bool steam, List<List<string>> webPagesList, 
 List<List<string>> stemedList,ProgressBar pb ) 
 { 
 if (steam) 
 { 

55 
 
 int max = 0; 
 for (int i = 0; i < webPagesList.Count; i++) 
 { 
 stemedList.Add(Steam(webPagesList[i])); 
 if (max <= 100) 
 { 
 pb.Value = max; 
 max++; 
 } 
 } 
 pb.Value = 100; 
 } 
 else 
 { 
 for (int i = 0; i < webPagesList.Count; i++) 
 { 
 stemedList.Add(webPagesList[i]); 
 } 
 } 
 } 
 
 static public List<string> Steam(List<string> list) 
 { 
 List<string> resultList = new List<string>(); 
 string fullString = ""; 
 for (int i = 0; i < list.Count; i++) 
 { 
 fullString += list[i] + " "; 
 } 
 char[] w = new char[501]; 
 Stemmer s = new Stemmer(); 
 Stream _in = new MemoryStream(ASCIIEncoding.Default.GetBytes(fullString)); 
 while (true) 
 { 
 int ch = _in.ReadByte(); 
 if (Char.IsLetter((char)ch)) 
 { 
 int j = 0; 
 while (true) 
 { 
 ch = Char.ToLower((char)ch); 
 w[j] = (char)ch; 
 if (j < 500) 
 j++; 
 ch = _in.ReadByte(); 
 if (!Char.IsLetter((char)ch)) 
 { 
 for (int c = 0; c < j; c++) 
 s.add(w[c]); 
 s.stem(); 
 String u; 
 u = s.ToString(); 
 
 resultList.Add(u); 
 break; 
 } 
 } 
 } 
 if (ch < 0) 
 break; 

56 
 
 } 
 return resultList; 
 } 
 
 private char[] b; 
 private int i, /* offset into b */ 
 i_end, /* offset to end of stemmed word */ 
 j, k; 
 private static int INC = 50; 
 /* unit of size whereby b is increased */ 
 
 public Stemmer() 
 { 
 b = new char[INC]; 
 i = 0; 
 i_end = 0; 
 } 
 
 /** 
 * Add a character to the word being stemmed. When you are finished 
 * adding characters, you can call stem(void) to stem the word. 
 */ 
 
 public void add(char ch) 
 { 
 if (i == b.Length) 
 { 
 char[] new_b = new char[i + INC]; 
 for (int c = 0; c < i; c++) 
 new_b[c] = b[c]; 
 b = new_b; 
 } 
 b[i++] = ch; 
 } 
 
 /** Adds wLen characters to the word being stemmed contained in a portion 
 * of a char[] array. This is like repeated calls of add(char ch), but 
 * faster. 
 */ 
 
 public void add(char[] w, int wLen) 
 { 
 if (i + wLen >= b.Length) 
 { 
 char[] new_b = new char[i + wLen + INC]; 
 for (int c = 0; c < i; c++) 
 new_b[c] = b[c]; 
 b = new_b; 
 } 
 for (int c = 0; c < wLen; c++) 
 b[i++] = w[c]; 
 } 
 
 /** 
 * After a word has been stemmed, it can be retrieved by toString(), 
 * or a reference to the internal buffer can be retrieved by getResultBuffer 
 * and getResultLength (which is generally more efficient.) 
 */ 
 public override string ToString() 
 { 

57 
 
 return new String(b, 0, i_end); 
 } 
 
 /** 
 * Returns the length of the word resulting from the stemming process. 
 */ 
 public int getResultLength() 
 { 
 return i_end; 
 } 
 
 /** 
 * Returns a reference to a character buffer containing the results of 
 * the stemming process. You also need to consult getResultLength() 
 * to determine the length of the result. 
 */ 
 public char[] getResultBuffer() 
 { 
 return b; 
 } 
 
 /* cons(i) is true <=> b[i] is a consonant. */ 
 private bool cons(int i) 
 { 
 switch (b[i]) 
 { 
 case 'a': 
 case 'e': 
 case 'i': 
 case 'o': 
 case 'u': return false; 
 case 'y': return (i == 0) ? true : !cons(i - 1); 
 default: return true; 
 } 
 } 
 
 /* m() measures the number of consonant sequences between 0 and j. if c is 
 a consonant sequence and v a vowel sequence, and <..> indicates arbitrary 
 presence, 
 
 <c><v> gives 0 
 <c>vc<v> gives 1 
 <c>vcvc<v> gives 2 
 <c>vcvcvc<v> gives 3 
 .... 
 */ 
 private int m() 
 { 
 int n = 0; 
 int i = 0; 
 while (true) 
 { 
 if (i > j) return n; 
 if (!cons(i)) break; i++; 
 } 
 i++; 
 while (true) 
 { 
 while (true) 
 { 

58 
 
 if (i > j) return n; 
 if (cons(i)) break; 
 i++; 
 } 
 i++; 
 n++; 
 while (true) 
 { 
 if (i > j) return n; 
 if (!cons(i)) break; 
 i++; 
 } 
 i++; 
 } 
 } 
 
 /* vowelinstem() is true <=> 0,...j contains a vowel */ 
 private bool vowelinstem() 
 { 
 int i; 
 for (i = 0; i <= j; i++) 
 if (!cons(i)) 
 return true; 
 return false; 
 } 
 
 /* doublec(j) is true <=> j,(j-1) contain a double consonant. */ 
 private bool doublec(int j) 
 { 
 if (j < 1) 
 return false; 
 if (b[j] != b[j - 1]) 
 return false; 
 return cons(j); 
 } 
 
 /* cvc(i) is true <=> i-2,i-1,i has the form consonant - vowel - consonant 
 and also if the second c is not w,x or y. this is used when trying to 
 restore an e at the end of a short word. e.g. 
 
 cav(e), lov(e), hop(e), crim(e), but 
 snow, box, tray. 
 
 */ 
 private bool cvc(int i) 
 { 
 if (i < 2 || !cons(i) || cons(i - 1) || !cons(i - 2)) 
 return false; 
 int ch = b[i]; 
 if (ch == 'w' || ch == 'x' || ch == 'y') 
 return false; 
 return true; 
 } 
 
 private bool ends(String s) 
 { 
 int l = s.Length; 
 int o = k - l + 1; 
 if (o < 0) 
 return false; 

59 
 
 char[] sc = s.ToCharArray(); 
 for (int i = 0; i < l; i++) 
 if (b[o + i] != sc[i]) 
 return false; 
 j = k - l; 
 return true; 
 } 
 
 /* setto(s) sets (j+1),...k to the characters in the string s, readjusting 
 k. */ 
 private void setto(String s) 
 { 
 int l = s.Length; 
 int o = j + 1; 
 char[] sc = s.ToCharArray(); 
 for (int i = 0; i < l; i++) 
 b[o + i] = sc[i]; 
 k = j + l; 
 } 
 
 /* r(s) is used further down. */ 
 private void r(String s) 
 { 
 if (m() > 0) 
 setto(s); 
 } 
 
 /* step1() gets rid of plurals and -ed or -ing. e.g. 
 caresses -> caress 
 ponies -> poni 
 ties -> ti 
 caress -> caress 
 cats -> cat 
 
 feed -> feed 
 agreed -> agree 
 disabled -> disable 
 
 matting -> mat 
 mating -> mate 
 meeting -> meet 
 milling -> mill 
 messing -> mess 
 
 meetings -> meet 
 
 */ 
 
 private void step1() 
 { 
 if (b[k] == 's') 
 { 
 if (ends("sses")) 
 k -= 2; 
 else if (ends("ies")) 
 setto("i"); 
 else if (b[k - 1] != 's') 
 k--; 
 } 
 if (ends("eed")) 

60 
 
 { 
 if (m() > 0) 
 k--; 
 } 
 else if ((ends("ed") || ends("ing")) && vowelinstem()) 
 { 
 k = j; 
 if (ends("at")) 
 setto("ate"); 
 else if (ends("bl")) 
 setto("ble"); 
 else if (ends("iz")) 
 setto("ize"); 
 else if (doublec(k)) 
 { 
 k--; 
 int ch = b[k]; 
 if (ch == 'l' || ch == 's' || ch == 'z') 
 k++; 
 } 
 else if (m() == 1 && cvc(k)) setto("e"); 
 } 
 } 
 
 /* step2() turns terminal y to i when there is another vowel in the stem. */ 
 private void step2() 
 { 
 if (ends("y") && vowelinstem()) 
 b[k] = 'i'; 
 } 
 
 /* step3() maps double suffices to single ones. so -ization ( = -ize plus 
 -ation) maps to -ize etc. note that the string before the suffix must give 
 m() > 0. */ 
 private void step3() 
 { 
 if (k == 0) 
 return; 
 
 /* For Bug 1 */ 
 switch (b[k - 1]) 
 { 
 case 'a': 
 if (ends("ational")) { r("ate"); break; } 
 if (ends("tional")) { r("tion"); break; } 
 break; 
 case 'c': 
 if (ends("enci")) { r("ence"); break; } 
 if (ends("anci")) { r("ance"); break; } 
 break; 
 case 'e': 
 if (ends("izer")) { r("ize"); break; } 
 break; 
 case 'l': 
 if (ends("bli")) { r("ble"); break; } 
 if (ends("alli")) { r("al"); break; } 
 if (ends("entli")) { r("ent"); break; } 
 if (ends("eli")) { r("e"); break; } 
 if (ends("ousli")) { r("ous"); break; } 
 break; 

61 
 
 case 'o': 
 if (ends("ization")) { r("ize"); break; } 
 if (ends("ation")) { r("ate"); break; } 
 if (ends("ator")) { r("ate"); break; } 
 break; 
 case 's': 
 if (ends("alism")) { r("al"); break; } 
 if (ends("iveness")) { r("ive"); break; } 
 if (ends("fulness")) { r("ful"); break; } 
 if (ends("ousness")) { r("ous"); break; } 
 break; 
 case 't': 
 if (ends("aliti")) { r("al"); break; } 
 if (ends("iviti")) { r("ive"); break; } 
 if (ends("biliti")) { r("ble"); break; } 
 break; 
 case 'g': 
 if (ends("logi")) { r("log"); break; } 
 break; 
 default: 
 break; 
 } 
 } 
 
 /* step4() deals with -ic-, -full, -ness etc. similar strategy to step3. */ 
 private void step4() 
 { 
 switch (b[k]) 
 { 
 case 'e': 
 if (ends("icate")) { r("ic"); break; } 
 if (ends("ative")) { r(""); break; } 
 if (ends("alize")) { r("al"); break; } 
 break; 
 case 'i': 
 if (ends("iciti")) { r("ic"); break; } 
 break; 
 case 'l': 
 if (ends("ical")) { r("ic"); break; } 
 if (ends("ful")) { r(""); break; } 
 break; 
 case 's': 
 if (ends("ness")) { r(""); break; } 
 break; 
 } 
 } 
 
 /* step5() takes off -ant, -ence etc., in context <c>vcvc<v>. */ 
 private void step5() 
 { 
 if (k == 0) 
 return; 
 
 /* for Bug 1 */ 
 switch (b[k - 1]) 
 { 
 case 'a': 
 if (ends("al")) break; return; 
 case 'c': 
 if (ends("ance")) break; 

62 
 
 if (ends("ence")) break; return; 
 case 'e': 
 if (ends("er")) break; return; 
 case 'i': 
 if (ends("ic")) break; return; 
 case 'l': 
 if (ends("able")) break; 
 if (ends("ible")) break; return; 
 case 'n': 
 if (ends("ant")) break; 
 if (ends("ement")) break; 
 if (ends("ment")) break; 
 /* element etc. not stripped before the m */ 
 if (ends("ent")) break; return; 
 case 'o': 
 if (ends("ion") && j >= 0 && (b[j] == 's' || b[j] == 't')) break; 
 /* j >= 0 fixes Bug 2 */ 
 if (ends("ou")) break; return; 
 /* takes care of -ous */ 
 case 's': 
 if (ends("ism")) break; return; 
 case 't': 
 if (ends("ate")) break; 
 if (ends("iti")) break; return; 
 case 'u': 
 if (ends("ous")) break; return; 
 case 'v': 
 if (ends("ive")) break; return; 
 case 'z': 
 if (ends("ize")) break; return; 
 default: 
 return; 
 } 
 if (m() > 1) 
 k = j; 
 } 
 
 /* step6() removes a final -e if m() > 1. */ 
 private void step6() 
 { 
 j = k; 
 
 if (b[k] == 'e') 
 { 
 int a = m(); 
 if (a > 1 || a == 1 && !cvc(k - 1)) 
 k--; 
 } 
 if (b[k] == 'l' && doublec(k) && m() > 1) 
 k--; 
 } 
 
 /** Stem the word placed into the Stemmer buffer through calls to add(). 
 * Returns true if the stemming process resulted in a word different 
 * from the input. You can retrieve the result with 
 * getResultLength()/getResultBuffer() or toString(). 
 */ 
 public void stem() 
 { 
 k = i - 1; 

63 
 
 if (k > 1) 
 { 
 step1(); 
 step2(); 
 step3(); 
 step4(); 
 step5(); 
 step6(); 
 } 
 i_end = k + 1; 
 i = 0; 
 } 
 } 
X. Create Binary Data Set Algorithm Implementation 
class CreateBinaryDataSet 
 { 
 static int col = 0; 
 static int row = 0; 
 static bool flag = false; 
 static int ColCount = 0; 
 
 public void Create(List<List<string>> uniqueWordsList, string[,] dataSet) 
 { 
 for (int i = 0; i < uniqueWordsList.Count; i++) 
 { 
 string[] arrayOfPageWords = new string[uniqueWordsList[i].Count]; 
 for (int j = 0; j < uniqueWordsList[i].Count; j++) 
 { 
 arrayOfPageWords[j] = uniqueWordsList[i][j]; 
 } 
 AddTODataSet(arrayOfPageWords, dataSet); 
 } 
 } 
 
 private void AddTODataSet(string[] doc, string[,] dataSet) 
 { 
 row++; 
 for (int i = 0; i < doc.GetLength(0); i++) 
 { 
 flag = false; 
 for (col = 0; col < dataSet.GetLength(1); col++) 
 { 
 if (doc[i] == dataSet[0, col]) 
 { 
 flag = true; 
 break; 
 } 
 } 
 if (flag == true) 
 { 
 AddValueTo(row, col,dataSet); 
 } 
 else if (flag == false) 
 { 
 AddNewCol(ColCount, doc[i], dataSet); 
 ColCount++; 
 } 
 } 

64 
 
 } 
 private void AddNewCol(int c, string w, string[,] dataSet) 
 { 
 dataSet[0, c] = w; 
 AddValueTo(row, c, dataSet); 
 } 
 private void AddValueTo(int row, int col, string[,] dataSet) 
 { 
 dataSet[row, col] = "1"; 
 } 
 } 
XI. Create tf-idf Data Set Algorithm Implementation 
class CreateTFIDFdataSet 
 { 
 static int col = 0; 
 static int row = 0; 
 static bool flag = false; 
 static int ColCount = 0; 
 
 public void Create(List<List<string>> uniqueWordsList, string[,] dataSet) 
 { 
 for (int i = 0; i < uniqueWordsList.Count; i++) 
 { 
 string[] arrayOfPageWords = new string[uniqueWordsList[i].Count]; 
 for (int j = 0; j < uniqueWordsList[i].Count; j++) 
 { 
 arrayOfPageWords[j] = uniqueWordsList[i][j]; 
 } 
 AddTODataSet(arrayOfPageWords, dataSet); 
 } 
 
 } 
 public void CalcTFIDF(string [,] dataSetMatrix, List<int> pagesSize) 
 { 
 int no_of_terms_in_all_documens = 0; 
 
 for (int r = 1; r < dataSetMatrix.GetLength(0); r++) 
 { 
 for (int c = 0; c < dataSetMatrix.GetLength(1) - 1; c++) 
 { 
 for (int cnt = 1; cnt < dataSetMatrix.GetLength(0); cnt++) 
 { 
 if (dataSetMatrix[cnt, c] != "0") 
 { 
 no_of_terms_in_all_documens++; 
 } 
 } 
 if (dataSetMatrix[r, c] != "0") 
 { 
 dataSetMatrix[r, c] = CalcTFIDF(dataSetMatrix[r, c], pagesSize[r - 1], 
dataSetMatrix.GetLength(0) - 1, no_of_terms_in_all_documens); 
 } 
 no_of_terms_in_all_documens = 0; 
 } 
 } 
 } 
 
 private void AddTODataSet(string[] doc, string[,] dataSet) 

65 
 
 { 
 row++; 
 for (int i = 0; i < doc.GetLength(0); i++) 
 { 
 flag = false; 
 for (col = 0; col < dataSet.GetLength(1); col++) 
 { 
 if (doc[i] == dataSet[0, col]) 
 { 
 flag = true; 
 break; 
 } 
 } 
 if (flag == true) 
 { 
 AddValueTo(row, col,dataSet); 
 } 
 else if (flag == false) 
 { 
 AddNewCol(ColCount, doc[i], dataSet); 
 ColCount++; 
 } 
 } 
 } 
 private void AddNewCol(int c, string w, string[,] dataSet) 
 { 
 dataSet[0, c] = w; 
 AddValueTo(row, c, dataSet); 
 } 
 private void AddValueTo(int row, int col,string[,] dataSet) 
 { 
 string cellValue = dataSet[row, col]; 
 if (cellValue == null) 
 { 
 dataSet[row, col] = "1"; 
 } 
 else 
 { 
 dataSet[row, col] = (double.Parse(cellValue) + 1).ToString(); ; 
 } 
 
 } 
 
 public void ChangeNullItems(bool binary, string[,] dataSet) 
 { 
 if (binary) 
 { 
 for (int i = 0; i < dataSet.GetLength(0); i++) 
 { 
 for (int j = 0; j < dataSet.GetLength(1); j++) 
 { 
 if (dataSet[i, j] == null) 
 { 
 dataSet[i, j] = "0"; 
 } 
 } 
 } 
 } 
 } 
 private string CalcTFIDF(string termAppears, int totalFreq, int totalPages, int totTermFreq) 

66 
 
 { 
 double tdidf; 
 double termApp = (double.Parse(termAppears)); 
 double tf = termApp / (double)totalFreq; 
 double idf1 = (double)totalPages / (1 + (double)totTermFreq); 
 double idf = Math.Log(idf1); 
 tdidf = tf * idf; 
 tdidf = Math.Round(tdidf, 4); 
 return tdidf.ToString(); 
 } 
 } 
XII. Build Physical Data Set Algorithm Implementation 
class SaveDataSet 
 { 
 public void Save(string[,] dataSetMatrix) 
 { 
 SaveFileDialog sfd = new SaveFileDialog(); 
 sfd.FileName = "dataSet" + DateTime.Now.Millisecond + ".csv"; 
 DialogResult result = sfd.ShowDialog(); 
 if (result == DialogResult.OK) 
 { 
 
 string sFilePath = sfd.FileName; 
 
 StreamWriter sw = File.CreateText(sFilePath); 
 
 for (int r = 0; r < dataSetMatrix.GetLength(0); r++) 
 { 
 for (int c = 0; c < dataSetMatrix.GetLength(1); c++) 
 { 
 sw.Write(dataSetMatrix[r, c] + ","); 
 
 } 
 sw.WriteLine(); 
 } 
 sw.Close(); 
 
 Process.Start(sFilePath); 
 } 
 } 
 } 
 

67 
 
 Appendix B 
 
 
/* This is the Porter stemming algorithm, coded up in ANSI C by the author. */ 
#include <string.h> /* for memmove */ 
#define TRUE 1 
#define FALSE 0 
static char * b; /* buffer for word to be stemmed */ 
static int k,k0,j; /* j is a general offset into the string */ 
/* cons(i) is TRUE <=> b[i] is a consonant. */ 
static int cons(int i) 
{ switch (b[i]) 
 { case 'a': case 'e': case 'i': case 'o': case 'u': return FALSE; 
 case 'y': return (i==k0) ? TRUE : !cons(i-1); 
 default: return TRUE; 
 } 
} 
 
/* m() measures the number of consonant sequences between k0 and j. if c is 
 a consonant sequence and v a vowel sequence, and <..> indicates arbitrary 
 presence, 
 <c><v> gives 0 
 <c>vc<v> gives 1 
 <c>vcvc<v> gives 2 
 <c>vcvcvc<v> gives 3 
 .... 
*/ 
static int m() 
{ int n = 0; 
 int i = k0; 
 while(TRUE) 
 { if (i > j) return n; 
 if (! cons(i)) break; i++; 
 } 
 i++; 
 while(TRUE) 
 { while(TRUE) 
 { if (i > j) return n; 
 if (cons(i)) break; 
 i++; 
 } 
 i++; 
 n++; 
 while(TRUE) 
 { if (i > j) return n; 
 if (! cons(i)) break; 
 i++; 
 } 
 i++; 
 } 
} 
/* vowelinstem() is TRUE <=> k0,...j contains a vowel */ 
static int vowelinstem() 
{ int i; for (i = k0; i <= j; i++) if (! cons(i)) return TRUE; 
 return FALSE; 
} 
/* doublec(j) is TRUE <=> j,(j-1) contain a double consonant. */ 
static int doublec(int j) 
{ if (j < k0+1) return FALSE; 

68 
 
 if (b[j] != b[j-1]) return FALSE; 
 return cons(j); 
} 
/* cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant 
 and also if the second c is not w,x or y. this is used when trying to 
 restore an e at the end of a short word. e.g. 
 cav(e), lov(e), hop(e), crim(e), but 
 snow, box, tray. 
*/ 
static int cvc(int i) 
{ if (i < k0+2 || !cons(i) || cons(i-1) || !cons(i-2)) return FALSE; 
 { int ch = b[i]; 
 if (ch == 'w' || ch == 'x' || ch == 'y') return FALSE; 
 } 
 return TRUE; 
} 
/* ends(s) is TRUE <=> k0,...k ends with the string s. */ 
static int ends(char * s) 
{ int length = s[0]; 
 if (s[length] != b[k]) return FALSE; /* tiny speed-up */ 
 if (length > k-k0+1) return FALSE; 
 if (memcmp(b+k-length+1,s+1,length) != 0) return FALSE; 
 j = k-length; 
 return TRUE; 
} 
/* setto(s) sets (j+1),...k to the characters in the string s, readjusting 
 k. */ 
static void setto(char * s) 
{ int length = s[0]; 
 memmove(b+j+1,s+1,length); 
 k = j+length; 
} 
/* r(s) is used further down. */ 
static void r(char * s) { if (m() > 0) setto(s); } 
/* step1ab() gets rid of plurals and -ed or -ing. e.g. 
 caresses -> caress 
 ponies -> poni 
 ties -> ti 
 caress -> caress 
 cats -> cat 
 feed -> feed 
 agreed -> agree 
 disabled -> disable 
 matting -> mat 
 mating -> mate 
 meeting -> meet 
 milling -> mill 
 messing -> mess 
 meetings -> meet 
*/ 
static void step1ab() 
{ if (b[k] == 's') 
 { if (ends("\04" "sses")) k -= 2; else 
 if (ends("\03" "ies")) setto("\01" "i"); else 
 if (b[k-1] != 's') k--; 
 } 
 if (ends("\03" "eed")) { if (m() > 0) k--; } else 
 if ((ends("\02" "ed") || ends("\03" "ing")) && vowelinstem()) 
 { k = j; 
 if (ends("\02" "at")) setto("\03" "ate"); else 

69 
 
 if (ends("\02" "bl")) setto("\03" "ble"); else 
 if (ends("\02" "iz")) setto("\03" "ize"); else 
 if (doublec(k)) 
 { k--; 
 { int ch = b[k]; 
 if (ch == 'l' || ch == 's' || ch == 'z') k++; 
 } 
 } 
 else if (m() == 1 && cvc(k)) setto("\01" "e"); 
 } 
} 
/* step1c() turns terminal y to i when there is another vowel in the stem. */ 
static void step1c() { if (ends("\01" "y") && vowelinstem()) b[k] = 'i'; } 
/* step2() maps double suffices to single ones. so -ization ( = -ize plus 
 -ation) maps to -ize etc. note that the string before the suffix must give 
 m() > 0. */ 
static void step2() { switch (b[k-1]) 
{ 
 case 'a': if (ends("\07" "ational")) { r("\03" "ate"); break; } 
 if (ends("\06" "tional")) { r("\04" "tion"); break; } 
 break; 
 case 'c': if (ends("\04" "enci")) { r("\04" "ence"); break; } 
 if (ends("\04" "anci")) { r("\04" "ance"); break; } 
 break; 
 case 'e': if (ends("\04" "izer")) { r("\03" "ize"); break; } 
 break; 
 case 'l': if (ends("\03" "bli")) { r("\03" "ble"); break; } /*-DEPARTURE-*/ 
 /* To match the published algorithm, replace this line with 
 case 'l': if (ends("\04" "abli")) { r("\04" "able"); break; } */ 
 if (ends("\04" "alli")) { r("\02" "al"); break; } 
 if (ends("\05" "entli")) { r("\03" "ent"); break; } 
 if (ends("\03" "eli")) { r("\01" "e"); break; } 
 if (ends("\05" "ousli")) { r("\03" "ous"); break; } 
 break; 
 case 'o': if (ends("\07" "ization")) { r("\03" "ize"); break; } 
 if (ends("\05" "ation")) { r("\03" "ate"); break; } 
 if (ends("\04" "ator")) { r("\03" "ate"); break; } 
 break; 
 case 's': if (ends("\05" "alism")) { r("\02" "al"); break; } 
 if (ends("\07" "iveness")) { r("\03" "ive"); break; } 
 if (ends("\07" "fulness")) { r("\03" "ful"); break; } 
 if (ends("\07" "ousness")) { r("\03" "ous"); break; } 
 break; 
 case 't': if (ends("\05" "aliti")) { r("\02" "al"); break; } 
 if (ends("\05" "iviti")) { r("\03" "ive"); break; } 
 if (ends("\06" "biliti")) { r("\03" "ble"); break; } 
 break; 
 case 'g': if (ends("\04" "logi")) { r("\03" "log"); break; } /*-DEPARTURE-*/ 
 /* To match the published algorithm, delete this line */ 
} } 
/* step3() deals with -ic-, -full, -ness etc. similar strategy to step2. */ 
static void step3() { switch (b[k]) 
{ 
 case 'e': if (ends("\05" "icate")) { r("\02" "ic"); break; } 
 if (ends("\05" "ative")) { r("\00" ""); break; } 
 if (ends("\05" "alize")) { r("\02" "al"); break; } 
 break; 
 case 'i': if (ends("\05" "iciti")) { r("\02" "ic"); break; } 
 break; 
 case 'l': if (ends("\04" "ical")) { r("\02" "ic"); break; } 

70 
 
 if (ends("\03" "ful")) { r("\00" ""); break; } 
 break; 
 case 's': if (ends("\04" "ness")) { r("\00" ""); break; } 
 break; 
} } 
/* step4() takes off -ant, -ence etc., in context <c>vcvc<v>. */ 
static void step4() 
{ switch (b[k-1]) 
 { case 'a': if (ends("\02" "al")) break; return; 
 case 'c': if (ends("\04" "ance")) break; 
 if (ends("\04" "ence")) break; return; 
 case 'e': if (ends("\02" "er")) break; return; 
 case 'i': if (ends("\02" "ic")) break; return; 
 case 'l': if (ends("\04" "able")) break; 
 if (ends("\04" "ible")) break; return; 
 case 'n': if (ends("\03" "ant")) break; 
 if (ends("\05" "ement")) break; 
 if (ends("\04" "ment")) break; 
 if (ends("\03" "ent")) break; return; 
 case 'o': if (ends("\03" "ion") && (b[j] == 's' || b[j] == 't')) break; 
 if (ends("\02" "ou")) break; return; 
 /* takes care of -ous */ 
 case 's': if (ends("\03" "ism")) break; return; 
 case 't': if (ends("\03" "ate")) break; 
 if (ends("\03" "iti")) break; return; 
 case 'u': if (ends("\03" "ous")) break; return; 
 case 'v': if (ends("\03" "ive")) break; return; 
 case 'z': if (ends("\03" "ize")) break; return; 
 default: return; 
 } 
 if (m() > 1) k = j; 
} 
 
/* step5() removes a final -e if m() > 1, and changes -ll to -l if 
 m() > 1. */ 
 
static void step5() 
{ j = k; 
 if (b[k] == 'e') 
 { int a = m(); 
 if (a > 1 || a == 1 && !cvc(k-1)) k--; 
 } 
 if (b[k] == 'l' && doublec(k) && m() > 1) k--; 
} 
 
 
int stem(char * p, int i, int j) 
{ b = p; k = j; k0 = i; /* copy the parameters into statics */ 
 if (k <= k0+1) return k; /*-DEPARTURE-*/ 
 
 /* With this line, strings of length 1 or 2 don't go through the 
 stemming process, although no mention is made of this in the 
 published algorithm. Remove the line to match the published 
 algorithm. */ 
 
 step1ab(); step1c(); step2(); step3(); step4(); step5(); 
 return k; 
}

Automatically extracted. Refer to the original PDF for figures, tables, and formatting.

Cite this

Alobaedy, M. M. (2010). Text Extraction Algorithm for Web Text Classification. *Master Project, Universiti Utara Malaysia*. https://doi.org/10.13140/rg.2.2.29447.66729