Explosive expand of web pages in the World Wide Web makes it difficult for search engine and web directory to give relevant results to the user requirements. Web pages need automatic classification techniques with high classification accuracy. This study provides a text extraction algorithm for web text classification. The extraction algorithm consists of three phases namely web page extraction, rule formulation, and algorithm validation. A text extraction prototype is built using Visual C# 2008 to validate the algorithm. It is a windows application mixed with web connection protocol. The prototype offers the creation of Binary data set as well as term frequency inverse document frequency (tf-idf) data set. In this study, the experiment was conducted on five English educational websites. The created data sets are then classified using Naïve-Bayes and C4.5 algorithms provided in WEKA application. The experimental results show that Naïve-Bayes classifier with web text extraction algorithm proves to be the best method for web text classification.
📄 Full text (109,199 characters)extracted from the PDF · click to expand
TEXT EXTRACTION ALGORITHM FOR WEB TEXT
CLASSIFICATION
MUSTAFA MUWAFAK THEAB
UNIVERSITI UTARA MALAYSIA
2010
2
TEXT EXTRACTION ALGORITHM FOR WEB TEXT
CLASSIFICATION
A thesis submitted to the Faculty of Information Technology
in partial fulfillment of the requirements for the degree
Master of Science (Intelligent Systems)
Universiti Utara Malaysia
By
Mustafa Muwafak Theab
© Mustafa Muwafak, 2010. All rights reserved.
i
PERMISSION TO USE
In presenting this thesis in partial fulfillment of the requirements for a
postgraduate degree from University Utara Malaysia, I agree that the University
Library may make it freely available for inspection. I further agree that permission for
copying of this thesis in any manner, in whole or in part, for scholarly purpose may be
granted by my supervisor(s) or, in their absence by the Dean of the Graduate School.
It is understood that any copying or publication or use of this thesis or parts thereof
for financial gain shall not be allowed without my written permission. It is also
understood that due recognition shall be given to me and to University Utara Malaysia
for any scholarly use which may be made of any material from my thesis.
Requests for permission to copy or to make other use of materials in this thesis, in
whole or in part should be addressed to:
Dean of Faculty of Information Technology
Universiti Utara Malaysia
06010 UUM Sintok
Kedah Darul Aman
ii
ABSTRACT
Explosive expand of web pages in the World Wide Web makes it difficult for
search engine and web directory to give relevant results to the user
requirements. Web pages need automatic classification techniques with high
classification accuracy. This study provides a text extraction algorithm for
web text classification. The extraction algorithm consists of three phases
namely web page extraction, rule formulation, and algorithm validation. A
text extraction prototype is built using Visual C# 2008 to validate the
algorithm. It is a windows application mixed with web connection protocol.
The prototype offers the creation of Binary data set as well as term frequency
inverse document frequency (tf-idf) data set. In this study, the experiment
was conducted on five English educational websites. The created data sets
are then classified using Naïve-Bayes and C4.5 algorithms provided in
WEKA application. The experimental results show that Naïve-Bayes
classifier with web text extraction algorithm proves to be the best method for
web text classification.
iii
ACKNOWLEDGEMENTS
I would like to express my gratitude sincere appreciation to both of my supervisor,
Prof. Dr. Ku Ruhana Ku Mahamud and Dr. Faudziah Ahmad for guiding the research
presented in this project. I appreciate and thank them for their continuous support,
advice, and comment. The knowledge I have learned from them is the most valuable
things I had learn in this world of research.
A special thanks to all my lecturers in Universiti Utara Malaysia for their support.
My deepest appreciation goes to my beloved family who has been very supportive,
patient and understanding all the time.
Lastly, this wish of thanks is dedicated to beloved friends and others who involve
direct or indirect with this project.
Thank you everyone.
iv
Table of Contents
PERMISSION TO USE.......................................................................i
ABSTRACT....................................................................................ii
ACKNOWLEDGEMENTS..................................................................iii
TABLE OF CONTENT......................................................................iv
LIST OF TABLES.............................................................................vii
LIST OF FIGURES...........................................................................viii
LIST OF APPENDICES.....................................................................ix
CHAPTER ONE: INTRODUCTION
1.0 Introduction........................................................................................................1
1.1 Problem Statement.............................................................................................2
1.2 Objective........................................................................................3
1.3 Significance of the study......................................................................3
1.4 Scope of the study...........................................................................4
1.5 Limitations of the study.....................................................................4
1.6 Organization of the report...................................................................5
CHAPTER TWO: LITERATURE REVIEW
2.1 Website classification.......................................................................6
2.2 Text Mining in Web........................................................................11
2.3 HTML Parser for text extraction.........................................................13
2.4 Summary.....................................................................................15
v
CHAPTER THREE: RESEARCH METHODOLOGY
3.1 Web Text extraction algorithm .........................................................16
3.2 Web Page Extraction ......................................................................17
3.3 Rule Formulation..........................................................................18
3.3.1 Data Processing...............................................................19
I. Remove HTML Tags................................................19
II. Remove Digits........................................................19
III. Split lines to words..................................................20
IV. Remove duplicate words from the list............................20
V. Remove the words have less than specific
number of characters........................................................21
VI. Remove stop-words...................................................21
VII. Stemming the words................................................22
VIII. Remove duplicate words after stemming........................23
3.3.2 Data Set Creation............................................................23
I. Create Binary Data Set.............................................23
II. Create tf-idf Data Set...............................................24
III. Build Physical Data Set............................................27
3.3.3 Text Classification...........................................................29
3.4 Algorithm Validation.....................................................................30
3.5 Developing of the prototype ...............................................................30
3.6 Summary....................................................................................31
vi
CHAPTER FOUR: PROTOTYPE DEVELOPMENT
4.1 Implementation Details.................................................................32
4.2 Classes Implementation.................................................................34
4.3 Graphical User Interface...............................................................36
4.4 Summary....................................................................................37
CHAPTER FIVE: RESULTS AND FINDINGS
5.1 Experiment Design .........................................................................38
5.2 Extraction processing.......................................................................38
5.3 Classification Activities.....................................................................41
5.4 Summary....................................................................................45
CHAPTER SIX: CONCLUSION
6.1 Project Conclusion .........................................................................46
6.2 Recommendation for Future Research ...................................................47
REFERENCES.................................................................................48
vii
LIST OF TABLES
Table 5.1: Classification results (60%-40%)..............................................41
Table 5.2: Classification results (65%-35%)..............................................41
Table 5.3: Classification results (70%-30%)..............................................42
Table 5.4: Classification results (60%-40%)..............................................42
Table 5.5: Classification results (65%-35%)..............................................43
Table 5.6: Classification results (70%-30%)..............................................43
Table 5.7: Classification accuracy Details (DragPushing)..............................44
viii
LIST OF FIGURES
Figure 3.1: Text Extraction Algorithm.......................................................17
Figure 4.1: Classes Diagram...................................................................33
Figure 4.2: HTML_Parser Interface...........................................................37
Figure 5.1: tf-idf Data set.......................................................................40
Figure 5.2: Binary data set.....................................................................40
ix
LIST OF APPENDICES
Appendix A....................................................................................51
Appendix B....................................................................................67
1
CHAPTER ONE
INTRODUCTION
The rapid growth of information on the internet has made it necessary for users to
identify relevant information that were obtained from the process of information
retrieval (Suanmali, Salim, & Binwahlan, 2009). One of the most important aspects in
any information retrieval (IR) process, text query, search engine, and web
classification is to distinguish each word to give better results to the query.
Cooley, Mobasher and Srivastava (1997) define Web mining as “the discovery and
analysis of useful information from the World Wide Web. This describes the
automatic search of information resources available on-line, i.e. Web content mining,
and the discovery of user access patterns from Web servers, i.e., Web usage mining”.
According to Wang, Huang, Wu, and Zhang (1999) “web mining is an integrated
technology in which several research fields are involved, such as data mining,
computational linguistics, statistics, information and so on” (p. 137).
From the previous definitions, Web mining has different methods from data mining
because web pages are semi-structure or non-structure documents. Therefore some
traditional data mining methods such as Linear Regression (Aaron, 2004) are not
applicable to Web mining.
Web classification is an important component in search engines and is use to classify
the type of web pages. Methods like clustering and support vector machine can be
used in Web classification (Yu, Han, & Chang, 2002). Web classification research
2
focuses on web text classification, web image classification, web hyperlinks
classification and web usage classification (Morales, Fandino, & Rodrivguez, 2009).
Text Mining is another important component in Web mining. According to Ronen and
James (2007) “Text Mining can be broadly defined as a knowledge-intensive process
in which a user interacts with a document collection over time by using a suite of
analysis tools”.
Text Mining deals with extracting useful information from text document through the
identification and exploration of patterns using different formulas. Text Mining uses
machine learning algorithms such as Naïve Bayse (NB) classifier, Bayesian logistic
regression, decision tree classifiers, Rocchio method and neural network (Ronen &
James, 2007) to extract useful information. Text mining also includes clustering
methods for unsupervised process through which objects are classified into groups
called clusters. Algorithms such as K-means, EM-based probabilistic clustering,
hierarchical agglomerative clustering and minimal spanning tree are used for
clustering activities (Ronen & James, 2007).
Each method and algorithm has advantages and disadvantages. Some applications
combine two or more algorithms in order to get better classification accuracy because
web mining involves many processing steps which require different algorithms for
each step (Feng & Cherkassky, 2009).
1.1 PROBLEM STATEMENT
Inefficient web classification algorithms will affect the results of IR. Web
classification needs efficient algorithm to produce high accuracy. The highest
3
accuracy from existing text extraction algorithm is around 80% (Fresno & Ribeiro,
2004). The problem in existing algorithm is the deficiency to extract meaningful
words from web pages (Liu, 2009). In Web searching, due to the growth of
information on the Web, matched web pages for a given query could be millions.
Ranking web pages in terms of normalized frequency of keywords does not work
properly, because short pages with only the query in them would likely been ranked
on the top.
1.2 OBJECTIVE
The main objective of this study is to develop a text extraction algorithm for web text
classification. Specific objectives are:
1. To develop feature extraction techniques to extract features from the web
pages.
2. To develop a HTML parser that can remove HTML tags.
3. To develop a prototype for web text extraction.
1.3 SIGNIFICANCE OF THE STUDY
An algorithm for web classification and text extraction will be formulated from this
project. This algorithm will be the contribution to the body of knowledge in the area
of web text classification.
The algorithm has the facility to create Binary data set as well as tf-idf data set from
the extracted text which are necessary for classification activity.
4
The developed prototype can automate the process of text extracting and date set
creation will be a contribution to the society particularly for users to perform web text
classification. This prototype can be easily used by any user with little experience for
example the search engine developer for web classification and web directory
developer.
1.4 SCOPE OF THE STUDY
The study focused on English educational websites. Five different Universities
Websites used in this study. The classification applied only on web text page.
1.5 LIMITATIONS OF THE STUDY
By using Core 2 Duo, 2.00 GHz processor with 3.00 GB RAM, the prototype took
around 3 minutes to process six hundred web pages. The maximum number of web
pages could be processed by the prototype without system hanging was around
thousand pages.
The rows and columns in the created data set should be within 1048576 and 16384
respectively otherwise; Microsoft Excel 2007 will not be able to view the data set.
5
1.6 ORGANIZATION OF THE REPORT
This report divided into six chapters. The chapters that are presented in this report are:
Chapter one gives an overview and introduction about Text Extraction algorithms and
techniques used in web classification. This chapter also discussed about problem
statement, objective, significance of the study, and scope of the study. Limitations of
the study are also presented in this chapter. Chapter two discussed about literature
review, which are related to this project. The literature review divided into three
sections which are web classification, Text Mining in Web, and HTML Parser for text
extraction. Chapter three discussed methodology that been used in this project. The
methodology consists of three phases namely: Web Page Extraction, Rule
Formulation, and Algorithm Validation. The details of each phase discussed in this
chapter. Chapter four discussed about the prototype implementation for HTML Parser
and Text Extraction algorithm. Chapter five discussed the results and finding of the
experiment conducted by using the HTML Parser and WEKA application. Chapter six
presents the conclusion and recommendations to improve this study.
6
CHAPTER TWO
LITERATURE REVIEW
This chapter presents previous work on web classification, text extraction particularly
in web mining and HTML parser techniques.
2.1 Website classification
Anagnostopoulos et al. (2002) proposed an automated website classification system.
The study focuses on e-commerce sites and the pages were indexed and the extracted
words represented using Vector Space Model (VSM) and Site Vector (SV). The
proposed system compares the content of web pages using VSM with the mdimensional commerce descriptor vector (SV). The system performance reached
around 96% accuracy, which indicates that their system has produced good results.
However, their scope is limited to e-commerce only.
Tomar, Shekhar, and Ashish (2006) proposed a classification concept for web pages,
called “WebClassify”. They modified Naïve-Bayes algorithm with multi-nominal
model to classify web pages into different categories. The proposed algorithm makes
use of both Naïve-Bayes algorithm and term frequency – inverse document frequency
factor. They did an experiment using the traditional Naïve-Bayes algorithm (NB) and
their modified algorithm. The results for web page classification were 42.5% using
NB and 55% using their algorithm. From the results, their approach shows
7
enhancement in the classification accuracy over traditional NB. Although the
classification accuracy is not high, but the important point in their study, they have
introduced a term frequency as class document frequency. This method takes into
account only those words have a term frequency inverse class document frequency
value above a threshold value. This technique produced small data set which needs
less processing time.
Hui, Yan, and Jian-Ping (2008) proposed a combined approach for news
categorization. The combination includes decision tree with multilayer neural
networks. Data set was created based on term frequency inverse document frequence
(tf-idf) technique to weight each word (term) in the document. Attributes were
selected by applying feature space reduction method called Chi Square (X2 statistic).
In their experiment, only subject level (the upper hierarchical level) of web pages
were classified. The classifier was built using Back Propagation Network and ID3
algorithm. With 371 news items (pages) inclueded in the training set, the
classification accuracy reached around 88%. The important point in their approach is
the feature space reduction technique. By selecting significant attributes from the data
set, the classification accuracy increased and the processing time decreased which
means there is improvement in the performance of the algorithm.
Meijuan, Jingwen and Shiru (2009) presented a research about web classification
based on support vector machine (SVM). In the study, SVM network was structured
for web text classification where genetic algorithm (GA) was used to optimize SVM
parameters. Hence, enhance the convergence rate and the classification accuracy.
8
1000 web pages were classified into six categories: “news”, “sport”, “science and
technology”, “education”, “amusement”, and “others”. The result of their experiment
measured by mean squared errors and the best results were: (0.0045) in term of errors
for training and (0.0682) in terms of errors for testing. It is clear from their
experiment that they have solved the over-fitting problem of ANN by using SVM.
However, the experiment was applied only on flat web page structure, which is easier
for classification compared to hierarchical web pages.
A comprehensive study on feature selection in automatic web pages categorization
has been presented by (Sun, Sun, Chen, & Fu, 2009). The study focused on ReliefF
and Symmetrical Uncertainty (SU). They addressed critical issues such as “the high
dimensional text vocabulary space” in web pages classification. They used Hidden
Naïve-Bayes classifier (HNB) and Complement class Naïve-Bayes (CNB). They
compared their results with results produced using Support Vector Machine (SVM), K
Nearest Neighbor, and C4.5. The experiment was conducted on 527 web pages
partitioned into seven classes. Boolean data type was used to represent the data. The
results showed that by using feature selection methods, higher accuracy can be
obtained. The highest classification accuracy is around 88% that has been achieved
using HNB classifier. They performed another experiment where “the words are
sorted according to their SU scores”. In addition, some numbers of attributes selected
through removing top N words. The results showed that CNB achieved the highest
accuracy around 97% but HNB was more stable among the others. While SVM was
not practical, it is time-consuming for big-size problem. They concluded that SU is
better for feature selection method than ReliefF. From their study, it is clear that they
9
achieved good classification accuracy and they identified practical method for feature
selection. However, the experiment was conducted only on flat web pages.
Zhongda, Kun, and Yanfen (2007) proposed a research that deals with many
important issues in web pages classification such as feature selection method and
classification algorithms. In their research, they found two models that could
represent the web page namely Vector Space Model (VSM) and Boolean Model.
According to the research, VSM model can represent web pages better than Boolean
because each word was assigned with specific value (weight). For feature selection,
the algorithms used were Comparative Information, Latent Semantic Indexing (LSI),
and Rough Set theory. LSI shows favor among the other algorithms. For classification
algorithm, they investigate K Nearest Neighbor (KNN), Naïve-Bayes (NB),
Relational Learning (RL), Support Vector Machines (SVM), Decision Trees DT, and
Neural Networks Back-propagation (NNB). They concluded that, there is no
classification algorithm better than the other. Each algorithm has advantages and
disadvantages; sometime in order to achieve higher accuracy it may cost processing
time or memory space. Each algorithm will behave differently with the type of data
set. For example, DT algorithm provides better accuracy if the data set is represented
by Boolean values. The classification accuracy depends on the algorithm used in
extraction and feature selection more than the classifier model.
Qi and Davison (2009) have done a comprehensive survey on web page classification.
In their research, Feature extraction and classification algorithms were the focus of
their study. They distinguish the web page classification as “subject classification,
functional classification, sentiment classification, and other types of classification”.
10
They divided the classification problem into binary classification and multiclass
classification. The survey shows another type of classification, that is flat
classification and hierarchical classification. In flat classification, the categories
organized parallel such that one category does not followed by another category,
while in hierarchical classification the categories structured as hierarchical tree-like
structure in which each category may have a number of subcategories. They identify
the differences between web classification and text classification. Classification of
web pages content is different from classification of standard text classification. The
standard text is structured and written with consistent styles, while web pages do not
have such properties. Another important aspect in their research is the feature
selection. They divided the feature into two classes namely “on-page features” which
are available directly on the page to be classified. The other class called “features of
neighbors” which are available on the pages linked in some way to the page to be
classified. The algorithms used for feature selection were N-gram, Dimension
Reduction, and Relational Learning. They have pointed to critical problem about
feature selection from the web pages that contain large images or flash objects, which
is common in today’s web pages. For classification algorithms, the research
investigated many algorithms such as K-Nearest Neighbor, Support Vector Machine
(SVM), and Naïve-Bayes classifiers. The research concluded that in order to improve
the classification accuracy, an appropriate method for feature extraction from web
pages “directly or from neighbor web pages” could improve the classification
performance.
11
2.2 Text Mining in Web
According to Zhongda, Kun, and Yanfen (2007) text classification has been
implemented before web page classification. They define text classification as “It is
the process of algorithmically analyzing a text documents in order to assign a set of
categories (or index terms) from a predefined vocabulary to succinctly describe the
content of a document”. The followings are some works in the field of text mining
specifically in web.
By extracting the text from the web documents, a key phrase can extracted from the
text. Zhang, Zincir-Heywood, and Milios (2005) have investigated the significance of
narrative text classification in the task of automatic key phrase extraction in Web
document collection. Their work consists of two steps i.e., long paragraph
classification and narrative paragraph classification. They have used three methods,
i.e., TF-IDF, KEA, and Keyterm. The processing steps include extracting all tokens in
the text, removing punctuation marks, removing standard set of 425 stop-words, and
applying Porter stemming to obtain word stems. The results show that Keyterm
method gives the best output for all plain text and for narrative text.
Shiqun, Yuhui, and Jike (2007) proposed an algorithm used in text mining on web. In
their studies, they showed an algorithm of how to follow the appointed website or
web page according to the user’s request by using the text mining. They also
investigate on how to extract test characteristic, and how to classify the data
information with feedback judgment combined with the Web page text contents for
12
later use. Their approach was related but different from data mining and different
from information retrieval. The processing method includes four steps: lookup
resources, pretreatment and information extraction, mode discovery, and mode
analysis. They used vector space model for web text classification. Experiments
conducted using 500 documents extracted from sport website. The text classification
takes about 6 seconds and the accuracy of the classification attains 91%.
Most of the websites contain text as well as noise. In order to perform text mining in
the web text, a noise elimination process should first apply. Dey and Haque (2008)
have proposed a paper focused on opinion extraction from noisy data. They used a
hybrid approach, in which they initially employ a semi-supervised method to learn
domain knowledge from a training repository, which contains both noisy and clean
text. A localized linguistic technique has applied to extract opinion expression from
noisy text. Their work comprises five steps namely “pre-processing of noisy text to
enable opinion mining, locate opinion expressions in processed text, analyze opinion
expression to identify features and opinions expressed about them, aggregate opinions
at multiple levels of granularity as desired by user, and developing a Graphical User
Interface to facilitate user interaction”. The performance measured used were standard
evaluation measures of precision, recall, and F-score. The results reached 94%
precision, and the recall value reached 75%. Their studies showed different concept of
text mining by cleaning the text from noisy. By applying this technique, the results
improved in terms of quality.
13
Lee, Baker, Song, and Wetherbe (2010) proposed a paper contains a comparison of
four text-mining methods namely “Latent Semantic Analysis (LSA), Probabilistic
Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and
Correlation Topic Model (CTM)”. In their work, they identify the advantages and
limitation of each one. Vector Space Model (VSM) used to represent the words in
each document with term frequency inverse document frequency (tf-idf) technique. In
their experiment, they collect 258 users’ comments on hardware product (Digital
Camera type Canon) from Amazon website. The task is applying the four algorithms
for Topic Detection from the users’ comments. From the experiment’s results, LSA
gave best result in case of single topic in a document. PLSA gave best result in case of
single topic in one document. LDA and CTM gave best result with both single and
multiple topics for one document. CTM has the ability to model the relationships
between topics. However, LSA shows better ability to deal with synonyms problem.
2.3 HTML Parser for text extraction
Jianli, Jie, and Cuihua (2004) have used HTML parser in their studies, initially; they
developed a model to segment the web document with document object model. They
adopted different sets of filters to filter non-text information and all links, the process
of filtering achieved by scanning HTML pages and removing “src” and “herf”
attributes with the value containing “img”, “swf”, “gif” etc. Their algorithm
implemented by java program language and testing data comes from web document in
English. Other algorithms have used in the process are removing stop-words and
stemming the words with Paice/Husk Stemmer. The results varied from 78% - 100%,
14
and there were some errors in their experiment. However, the HTML parser works
perfectly in their studies.
There are many HTML Parser available freely for different purpose. Alonso-Rios,
Luis-Vazquez, Mosqueira-Rey, Moret-Bonillo, and del Rio (2009) proposed in their
studies an HTML analyzer for web usability. In their system, the HTML analyzer
examines HTML code to extract usability information about web pages. The analyzer
identifies usability problems, checks compliance with usability guidelines, and
interprets usability information to facilitate the task of human usability experts. In
their experiment, they were able to classify each component in the web page like
images, tables and frames. However, they did not extract the text from the HTML
source code.
Ke, Yueming, Zongyan, and Xiang (2008) developed HTML Parser used in the
embedded browser based on deterministic finite automaton. Several resource
limitations such as CPU type, system function definition, display equipment, and
electricity considered in the study. An embedded browser with four functional
models: “network transport, memory management, HTML parser, and display model”
was proposed. The parser has the capability to parse and download at the same time.
By using this parallel process, a parsing time will decrease because it does not need to
wait for the stream to fully downloaded. The architecture and design explained with
some details but the authors did not provide any experiment result.
Ziegler, Vogele, and Viermetz (2009) propsed an approach for web page information
extraction. Their approach is “fully-automated extraction of news content from
15
HTML pages”. Their algorithm divided into three phases, namely “the merging of text
blocks on a structural level, the computation of text block features, and the decision to
keep or discard the current block”. By using those phases, the algorithm was able to
calculate “average sentence length, number of sentences, character distribution, and
stop-word ratio”. In their experiment, they have used 220 news sources in five
western languages and the results compared with other two systems: Crunch and
Body Text Extraction. For evaluation, they implemented classification models such as
C4.5, One Rule, SVM, Naïve-Bayes, and Random Forest. The highest accuracy
obtained by using Random Forest classifier around 78%. The results are acceptable
because the Parser used to extract text from hierarchical websites, which is more
difficult than flat websites in term of classification.
2.4 Summary
This chapter presents some of the previous works that have been done by researchers.
The topics that been discussed are more about website classification techniques, text
mining, and HTML parser. Many algorithms and techniques have been investigated to
highlight the advantages and disadvantages of them. It is clear from all the mentioned
works that there are three important factors in web page classification. First is the
feature selection technique, second is how to present the extracted data, and third is
the classifier algorithm. All the three factors are important but the most important one
which has strong influence on the classification accuracy is the feature selection
technique. If the selected features can represent the web page perfectly, then the
classification accuracy will be high regardless of using any classifier algorithm.
16
CHAPTER THREE
RESEARCH METHODOLOGY
This chapter provides the methodology used in this study. There are two main phases
i.e., formulating an algorithm for text extraction and developing the prototype.
3.1 Text extraction algorithm
There are three phases involved in formulating text extraction algorithm. The phases
are web page extraction, rule formulation, and algorithm validation. Fig 3.1 illustrates
the phases.
17
Figure 3.1: Text Extraction Algorithm
3.2 Web Page Extraction
The process starts by loading the web pages from specific folder or by loading the
web page from the internet to the memory. The algorithm has the facility to handle
any number of pages with any number of targets as long as there is enough physical
Start
Algorithm
Validation
Web Page
Extraction
Rule Formulation
Data
Processing
Data Set
Creation
Text
Classification
18
memory in the machine. After that, extraction process is applied by reading the pages
one by one and extracting source code with all the textual contents. The source code
contains HTML tags, text paragraphs, hyperlinks, and any other text characters. The
user will be asked for the target of the web page/s in order to label them in the data
set. The user can specify how much percentage can be extracted from the page/s. The
output of this phase is a string array. Each item in the array represents the extracted
web page. The following algorithm was used for this phase:
Extracting Page Source Code Algorithm (Online web page):
Step 0: Initialize Uniform Resource Identifier (URI) object;
Step 1: Send request to the server using (URI) address;
Step 2: Send the Credentials with request (default credentials is Windows credentials).
Step 3: Initialize stream object in the memory;
Step 4: Store the stream from the server in stream object;
Step 5: Initialize string;
Step 6: Read and convert (to string) the stream object and save it in the String Array;
Step 7: Return the String Array;
Extracting Page Source Code Algorithm (From Documents in the Hard Disk):
Step 0: Initialize Directory object;
Step 1: Initialize array of File type;
Step 2: Initialize array of String type;
Step 3: Initialize Stream Reader object;
Step 4: Give the directory path (on the hard disk) to the Directory object;
Step 5: Get all the files information from the Directory object;
Step 6: For each file in file array (filesInfoi, i=0,...,n) do steps 7-9;
Step 7: Read the file content into the Stream Reader object;
Step 8: Read the Stream until the end and convert it to String;
Step 9: Save the String in the String array;
Step 10: Return the String Array;
3.3 Rule Formulation
Rule Formulation consist of three steps namely Data processing, Data set creation,
and Text classification. The followings are the details of each step.
19
3.3.1 Data Processing
After extracting the web page source code, the output contains text that includes noise
data like HTML tags and some other components, which are not relevant for
classification process. This step is called text transformation (Croft, Metzler, &
Strohman, 2009), and it includes deleting HTML tags, digits and stop-words, and
stemming the words using Porter stemming algorithm. The output of this step will be
two dimensions string arrays representing all of the extracted words as columns and
all of the web pages as rows. The following are data processing step:
I. Remove HTML Tags: This process will start after collecting the source code
from the web page. The input of this step will be in form of string array, which
has both the text and HTML Tags. The algorithm will check each line in the
string to remove the Tags. The output from this step is a string array contains
only the page paragraph without any HTML Tags. The following algorithm
describes the process.
Remove HTML Tags Algorithm:
Step 0: Initialize the input string array of extracted source code (output of the
previous process “SourceCode”);
Step 1: Initialize output array type of string;
Step 2: Replace any of this character (" < ( . | \ n ) * ? > ") with empty space and
save the output in the output array;
Step 3: For each item in the output array (TextSourcei, i=0...,n) do steps 4-
Step 4: Check if the item contains non-Letter character then, replace it
with empty space;
Step 5: Return the output array;
II. Remove Digits: For classification purpose, digits are not required because
they do not reflect any important value to the model. Therefore, this step will
remove the digits from the paragraph by checking the characters one by one; if
it is a digit then it will remove from the string. The output of this step is the
20
same string array of the previous step but cleaned from any digit. The
following algorithm describes the process:
Remove Digits Algorithm:
Step 0: For each item in the output array (TextSourcei, i=0...,n) do step 1;
Step 1: Check if the item contains non-Letter character then, replace it
with empty space;
Step 2: Return the output array;
III. Split lines to words: Until this point, each string in the array contains lines of
words; this step will split each line to words by using splitting criteria for
example, empty space, comma, semi comma, or any kind of delimited
character. The output of this step will be a list of string array, each item (array)
in the list represent a web page and each item in the array (string) represent the
word in that page The following algorithm describes the process:
Split lines to words Algorithm:
Step 0: Initialize the input (Text array from the previous step);
Step 1: Initialize List of string;
Step 2: For each input (Texti, i=0,...,n) do steps 3-4;
Step 3: Split the word from the text if one of the following character is
after the word: { ' ', '\n', '-', '_', '{', '}', '[', ']','=', '(', ')', ';', '"', ',', '.', '+',
'!' };
Step 4: Add the split word to the List;
Step 5: Return the List;
IV. Remove duplicate words from the list: The list of string array, which is
created from the previous step, has many duplicate words. In this step, a list of
list type string created and each item in the list of string array added to the list
of list if it is not identical to any item in the list of list. The output is a list of
list type of string contains unique words. The following algorithm describes
the process:
Remove duplicate words Algorithm:
Step 0: Initialize the output list (list of list type of string)
21
Step 1: For each page in the input list (pagei, i=0,..., n) do step 2;
Step 2: For each item in the page (itemj, j=0,...,m) do step 3;
Step 3: Check if the item[i] is not empty and does not exist in the
output list[i], then do step 4;
Step 4: Add the item[j] to the output list[i];
Step 5: Return the output list;
V. Remove the words have less than specific number of characters: The user
will specify how many characters should be in each word in order to process at
run time. Small words they only give overload to the data set and most of them
will be in form of preposition, therefore it is better to remove those words and
focus on bigger words only. This step will do this process by checking the
length of character in each word, if it is less than what the user specified then
it will remove. The user can change the number of character length at run
time. The output of this step will be the same list of list type of string but
contains only words that have passed the checking criteria. The following
algorithm describes the process:
Remove specific words Algorithm:
Step 0: For each page in the input list (pagei, i=0,..., n) do step 1;
Step 1: For each string in the page (stringj, j=0,...,m) do step 2;
Step 2: Check if the string characters length is less than 2 (or any
number specified by the user), then do step 3;
Step 3: Remove string[j] from the input list[i];
j--;
Step 4: Return the input list;
VI. Remove stop-words: Not every word in the page is important or can reflect
any value to the model. For example if a word has occurred on most or all the
pages, such a word doesn’t give any value to the model and doesn’t help to
map the function, and some time will confuse the decision if it occurred
alternatively in many classes. Therefore, it is better not to include such a word
in the data set. For this purpose, a list of stop-words was created to use by this
22
step. The process will start by reading the file that contains the stop-words in
the hard disk and adding the words to an array of string. After that a loop will
start to check if there is any word from the stop-words match any word in the
list of list, if there is a matching then this word will be deleted from the list of
list. The output of this step will be a list of list type of string clean from stopwords. The following algorithm describes the process:
Remove stop-words Algorithm:
Step 0: Initialize stop-words array (string array) contains the stop-words;
Step 1: For each word in the in the stop-words array (wordi, i=0,..., n) do step 2;
Step 2: For each page in the pages-list (pagej, j=0,...,m) do step 3;
Step 3: Check if the pages-list[j] contains stop-words[i], then do
step 4;
Step 4: Remove stop-words[i] from the pages-list[j];
Step 5: Return the pages-list;
VII. Stemming the words: In English language each word or verb has many
formations for example the verb “play” can be formatted to (player, plays,
playing, played) or irregular verb such as “eat” can be formatted to (ate, eaten,
eats, eating). Another example of words variety is singular and plural words.
Therefore, it is very essential to get the root of those words in order to unify
them if they have the same root, this process called stemming. This step uses
Porter Stemming Algorithm (PSA), each word in the list of list will be send to
PSA, the return word from the algorithm is the stemmed word. The output of
this step will be a list of list type of string contains stemmed words. Appendix
B contains Porter Stemming Algorithm.
VIII. Remove duplicate words after stemming: Although there was a process of
removing duplicate words before stemming, but because of stemming process,
many words, which appear un-identical at that time, are identical now after
stemming. In addition, the reason why removing duplicate step is not omit if
23
there is a similar process at this step? Removing duplicate words before
stemming and after stemming will save more time than stemming all the
words including the duplicate and then removing the duplicate words. The
output of this step will be a list of list type of string contains stemmed unique
words. The following algorithm describes the process:
Remove duplicate words Algorithm:
Step 0: Initialize the output list (list of list type of string)
Step 1: For each page in the input list (pagei, i=0,..., n) do step 2;
Step 2: For each word in the page (wordj, j=0,...,m) do step 3;
Step 3: Check if the word[i] does not exist in the output list[i], then do
step 4;
Step 4: Add the word[j] to the output list[i];
Step 5: Return the output list;
3.3.2 Data Set Creation
At this point, a list of list type of string (equal to two dimensional array) contains
processed words is ready.
I. Create Binary Data Set
Each page is presented as a record and passed to the algorithm. The algorithm
will add a new row and check every column in the first row, if any word from
the page exist as a column, then it will add 1 to that column in the new row
otherwise it will add new columns and the word will be the title of that
column. This loop will continue until all the pages included in the two
dimensional array. The output of this step will be a two dimensional array
contains all words from the extracted web pages. The following algorithm
describes the process:
Binary Data Set Algorithm:
Step 0: Initialize a List of List (rows) (columns) type of string (equal to
multidimensional array or Table);
Step 1: For each document (Di, i = 0,...,n) do steps 2 – 12;
Step 2: Initialize string array with words extracted from the Di.
Step 3: Add new row (item in the list);
24
Step 4: For each word in the word array (Wj, j=0,...,m) do steps from 5 –
11;
Step 5: Check if the word (Wj) is exist in the List. IF exist, do
steps from 6 to 7;
Step 6: integer r = row[n], integer c= column[i] of
(existed word);
Step 7: Insert 1 to List[r,c];
IF not exist do steps 8 to 12;
Step 8: Add new columns “The column title is the
new word”;
Step 9: integer ro = row[n], integer col =column[n];
Step 10: Insert value of 1 to the List [ro, col];
Step 11: For each row (Rk, k=0,..., p-1) insert value of 0 to
the column[n];
Step 12: For each row (Rv, v=0,...,z) insert Target category.
Change each item in the Binary data set contains Null to 0
From the previous algorithm, the data set array contains all the extracted
words as columns title and a value of “1” if the record contains that word or
Null if it does not contain it. This step will convert all the Null values to zero
values by looping through all the items in the array and checking if the value
is Null then change it to “0”. The output of this step will be a two dimensional
array contains words in the first row and values (1, or 0) in the other rows. The
following algorithm describes the process:
Convert Null to Zero Algorithm:
Step 0: For each row in the dataSet (rowi, i=0,..., n) do step 1;
Step 1: For each column in the dataSet (columnj, j=0,..., m) do step 2;
Step 2: Check if dataSet [i, j] = Null, then do step 3;
Step 3: dataSet [i, j] = 0;
Step 4: Return the dataSet;
II. Create tf-idf Data Set:
In this type of data set, each word has a weight depends on its frequency in the
document. The weight calculated by using the following formula:
Each page is represented by “d” and the total number of pages represented by “D”,
such that d
D.
25
The “tf” calculated using the following formula:
tf
i.j
=
푁푗
∑
푤푖푗
푛
푖=0
where:
N is the total number of the same word (wi) appeared in document (j)
i (i = 0,...,n) is the index of the word (w) in document (j)
j(j=0,...,p) is the index of document (d) in (D)
and the denominator is:
∑
푤푖푗
푛
푖=0
is the sum of all the words (w) in document (dj).
The calculation of idf is done by using the following formula:
idfi
= log
|퐷|
|1+
{
푑∶푤푖 ∈푑
}
|
where |D| is the total number of documents extracted.
and |
{
푑∶푤푖 ∈푑
}
| is the number of documents where the word (wi) appeared.
“1” added to avoid division by zero in case if the word did not exist in any other page.
(tf-idf)i,j is computed by multiplying tf
i.j
and idfi
The idea behind this calculation is if the term occurred many times in the same page
means, it is more important. However, if the same term occurred in many other pages
means it is less important. By using this technique, each term has specific weight
according to its occurrence and frequency. This approach gives better value for each
term instead of giving 0 and 1 or simply counting the frequency of the word in the
26
document. This method will also create some relations between each word by
counting the frequency of term in each document. The relations were destroyed
previously when the algorithm extracted the text and splitting the words into
individual term without preserving the location of the each word, which gives
different meaning.
The data set created by using this algorithm has two dimensions. Each row in the data
set represent individual page and each column represent unique term. The following
algorithms describe all the mentioned calculation and data set creation.
Creating tfidf Data Set Algorithm (First part)
Step 0: Initialize two-dimensional Array (DS) type of string;
Step 1: For each document (di, i = 0,...,n) do steps 2 – 11;
Step 2: Initialize string array (W) with words extracted from di.
Step 3: For each word in the word array (Wj, j=0,...,m) do steps from 4 –
11;
Step 4: Check if the word (Wj) is exist in the DS.
IF exist, do steps from 5 to 6;
Step 5: integer r = row[n], integer c= column[i] of
(existed word Wi);
Step 6: Add 1 to the value in DS[r,c];
IF not exist do steps 7 to 10;
Step 7: Add new columns “The column title is the
new word”;
Step 8: integer ro = row[n], integer col =column[n];
Step 9: Insert value of 1 to the DS [ro, col];
Step 10: For each row (Rk, k=0,..., p-1) insert value
of 0 to the column[n];
Step 11: For each row (Rv, v=0,...,z) insert Target category.
Step 12: Return DS.
Change each item in DS contains Null to 0 (Second part)
This step will convert all the Null values to zero values by looping through all
the items in the array and checking if the value is Null then change it to “0”.
The output of this step will be a two dimensional array contains words in the
27
first row and values (words count, or 0) in the other rows. The following
algorithm describes the process:
Convert Null to Zero Algorithm:
Step 0: For each row in the DS (rowi, i=0,..., n) do step 1;
Step 1: For each column in the DS (columnj, j=0,..., m) do step 2;
Step 2: Check if DS [i, j] = Null, then do step 3;
Step 3: DS [i, j] = 0;
Step 4: Return the DS;
Calculate tf-idf Algorithm (Third part):
Step 0: Initialize two-dimensional Array (DS) type of string (Output of the second
part);
Step 1: Initialize int N “represent all the same word in the page”;
Step 2: For each row in the DS (rowi, i=1,..., n) do step 2;
Step 3: For each column in the DS (columnj, j=0,..., m) do step 3;
Step 4: For each row in the DS (rowk, k=1,..., m) do step 4;
Step 5: IF DS[k,c] != “0” do step 6;
Step 6: N++;
Step 7: IF DS[r,c] != “0” do step 8;
Step 8: DS[r,c] = CalcTFIDF(DS[r,c], N);
Step 9: N = 0;
CalcTFIDF (DS[r,c], N) Algorithm (Forth part):
Step 0: Initialize double tfidf;
Step 1: Initialize double N “represent all the same word in the page”;
Step 2: Initialize double tf;
Step 3: Initialize double idf;
Step 4: tf = N / the sum of all the words in document;
Step 5: idf = Log (total pages) / (1 + the number of pages where the word
appeared);
Step 6: tfidf = tf * idf;
Step 7: Return tfidf;
III. Build Physical Data Set
This step will create physical data set from the data set in the memory. The following
sub-steps and algorithms will describe this step.
a) Count how many columns has non 0 value in the data set
28
The initialization of the two-dimensional array was approximated to (1500, 15000)
which is quite enough to hold all the data but it is not precisely represent the data set
dimensions. This step is the first point towards calculating the exact size of
dimensions by counting how many columns has not “0” values which means the
columns contain words. It is clearly the process needs to loop only in the first row
because it represents all the words in the array. The total number of counting plus “1”
as a target column will represent the boundary of the second dimension in the data set
while the first dimension can be obtained from the total number of pages extracted in
the beginning plus “1” as a row (for columns title). The following algorithm describes
the process:
b) Counting Array Boundary Algorithm:
Step 0: Initialize integer variable (counter);
Step 1: For each column in the dataSet (columni, i=0,..., n) do step 2;
Step 2: Check if dataSet [0, j] Not equal to 0, then do step 3;
Step 3: counter++;
Step 4: Return counter;
c) Initialize the size of two dimensional array type of string:
After obtaining the required dimensions for the data set, this step will initializing the
new two dimensional array with the obtained dimensions. The output of this step will
be a new two-dimensional array, which represent the data set size exactly. The
following algorithm describes the process:
Initialize the size of two-dimensional array Algorithm:
Step 0: Initialize integer variable (dim1);
Step 1: Initialize integer variable (dim2);
Step 2: dim1 = number of page extracted + 1;
Step 3: dim2 = counter + 1 (counter from previous algorithm);
Step 4: Initialize 2 dimensional string array NewDataSet [dim1, dim2];
Step 5: Return NewDataSet;
d) Copy every item from the old data set array to the new array:
29
This step will copy every item in the previous array to the new two dimensions array.
The output of this step will be a two dimensions array contains all the words and
values with specific and known size. The following algorithm describes the process:
Copy all items Algorithm:
Step 0: For each row in the old dataSet (rowi, i=0,..., n) do step 1;
Step 1: For each column in the old dataSet (columnj, j=0,..., m) do step 2;
Step 2: (Copy item from old array to new array) NewDataSet [i, j] =
dataSet [i, j];
Step 3: Return the NewDataSet;
e) Create CSV file
This is the last step in creating data set. This process starts by creating a stream writer
object to create and write a file on specific location with .CSV extension. After that,
the process will loop through all the rows and columns in the data set array to read its
content and write it to the stream object with a comma after each value. Between
looping in each row, a new line is created in the file to represent the next row in the
array. The output of this step will be a .CSV file which is ready for classification
process. The following algorithm describes the process:
Create CSV file Algorithm:
Step 0: Initialize stream writer (SW) object to write into specific file;
Step 1: For each row in the NewDataSet (rowi, i=0,..., n) do step 2;
Step 2: For each column in the NewDataSet (columnj, j=0,...,m) do step 3;
Step 3: Write the value of NewDataSet [i, j] to the file using SW
object;
Step 4: Write new line in the file using SW object;
Step 5: Close SW object;
3.3.3 Text Classification
Two algorithms used in this experiment are the Naïve-Bayes (NB) and C 4.5. Both
applied with different size of data set and extraction parameters to investigate which
algorithm give better results in terms of accuracy and processing time.
30
The WEKA application was used in this project where the Explorer option was
selected from the WEKA GUI. WEKA has the facility to handle (CSV) file therefore
there is no need to convert the data set because it is already in CSV format.
3.4 Algorithm Validation
At the final phase, a validation technique applied to validate the accuracy of the rules
and module built by WEKA. The module used to classify a new web page in the same
scope. For this purpose, a prototype built using Visual C# 2008. It is a window
application mixed with web connection protocol. It uses HTTP protocol for internet
connection. The prototype has the ability to view the web page inside the User
Interface and downloading the source code for further process. The prototype gives
the user the facility of saving the data set in CSV or TEXT extension.
3.5 Developing of the Prototype
The methodology adopted for developing the prototype “HTML Parser” called Code
and Fix methodology.
Code and Fix methodology consists of two steps: Write some code and Fix the
problem in the code (McConnell, 1996).
This methodology is suitable for small projects like the developed one. Because of the
iterative nature of the methodology, it was easy to change the prototype functionality
when it is required. This methodology focuses on writing the code to implement the
required functions, execute the functions, and fix the code if the functions do not work
correctly as it is required.
31
3.6 Summary
The methodologies discussed in this chapter are practical methodologies. The first
methodology designed specifically to formulate the text extraction algorithm. All the
algorithms mentioned in this chapter implemented as classes in the prototype. The
second methodology adapted from (McConnell, 1996) which is perfect to develop a
small prototype like this one. The methodologies were suitable and quite sufficient for
this study.
32
CHAPTER FOUR
PROTOTYPE DEVELOPMENT
This chapter describes the methodology used in developing the prototype. The code
and fix methodology adapted in developing the prototype “HTML_Parser”.
4.1 Implementation Details
The prototype implemented by using Visual C#. All the algorithms mentioned in
chapter three are implemented as classes in the prototype HTML_Parser.
HTML_Parser is windows application mixed with web connection protocol. It uses
HTTP protocol for internet connection. The prototype has the facility to view the web
page inside the User Interface and downloading the source code for further process.
The prototype gives the user the facility of saving the data set in CSV extension.
Many considerations were taken into account especially the processing time in
developing the HTML Parser prototype. The prototype gives the user many options
while processing the documents like how much percentage to be processed from each
page, remove stop-words, and stemming the words etc.
The prototype can handle any number of web pages to process but the data set size
will be limited to the free size available in the physical memory. By trying many
number of pages (max 737), it was able to create a data set with (738 rows and 12847
columns).
33
The processing time will be different from machine to machine depend on the
processor speed and the number of pages. The machine used to run this prototype is
Intel Core 2 Duo, 2.00 GHz with 3.00 GB RAM and the performance was quite
acceptable in term of time.
The prototype divides the algorithms into classes. Each class represents one
algorithm. Fig 4.1 shows the classes diagram and their relationships with each other.
Figure 4.1: Classes Diagram
34
4.2 Classes Implementation
I. Get Source Code Algorithm Implementation
There are two methods to extract the source code. First method is
extracting online web page. And the second one is extracting web pages
from the hard disk. Appendix A contains both methods implemented in
Visual C#.
II. Calculate Percentage Algorithm Implementation
This class provides a method used for calculating how much percentage
from each page to be extracted. Appendix A contains the method
implemented in Visual C#.
III. Remove HTML Tags Algorithm Implementation:
The object of this class has a method called Remove. This method takes
one parameter “array of string” as the extracted web pages. The method
will remove any HTML tags in those pages. Appendix A contains the
method implemented in Visual C#.
IV. Remove Digits Algorithm Implementation
The object of this class has a method called RemoveDigits. This method
takes one parameter “array of string” as the extracted web pages. The
method will remove any digits in those pages. Appendix A contains the
method implemented in Visual C#.
35
V. Split lines to words Algorithm Implementation
This class provides the functionality to split the text extracted from the
web pages into separated words. Appendix A contains the method
implemented in Visual C#.
VI. Remove duplicate words Algorithm Implementation
This class provides the methods to search and remove duplicate words
before and after stemming. Appendix A contains the method implemented
in Visual C#.
VII. Remove the words have less than specific number of characters
This class provides a method used to select only words have specific
number of character. Appendix A contains the method implemented in
Visual C#.
VIII. Remove stop-words Algorithm Implementation
This class provides the functionality to remove the stop-words from the
extracted web pages. Appendix A contains the method implemented in
Visual C#.
IX. Stemming the words Algorithm Implementation
This class implements Porter Stemming Algorithm. The class
implementation adapted from (http://tartarus.org/~martin/PorterStemmer/).
Appendix B provides the algorithm implementation.
36
X. Create Binary Data Set Algorithm Implementation
This class provides methods for creating Binary data set. Appendix A
contains the method implemented in Visual C#.
XI. Create tf-idf Data Set Algorithm Implementation
This class provides methods for creating tf-idf data set. Appendix A
contains the method implemented in Visual C#.
XII. Build Physical Data Set Algorithm Implementation
This class provides a method to create the data set type .CSV on the hard
disk. Appendix A contains the method implemented in Visual C#.
4.3 Graphical User Interface
The interface provides all the facilities required for text extraction. The extraction can
be done from online web page by providing the web page address to the built-in
browser or from web pages stored in the hard disk. Fig 4.3 shows the HTML_Parser
interface.
37
Figure 4.2: HTML_Parser Interface.
4.4 Summary
This chapter describes the implementation tools and process of developing
HTML_Parser prototype. The prototype provides all the facilities required for Web
Text Extraction such as extracting text from web pages, processing the text, and
creating the data set. The prototype performance was acceptable in term of processing
time consumption and extraction accuracy.
38
CHAPTER FIVE
RESULTS AND FINDINGS
This chapter presents the experiment results conducted for this study.
5.1 Experiment Design
The experiment was divided into two parts. First, is the extraction process using
HTML Parser to create a data set. Second, is the classification activities performed
using WEKA application to classify the web pages. Two classifier algorithms were
used in this study, Naïve-Bayes algorithm and C4.5 which is J48 in WEKA. Those
algorithms were applied on two types of data set, Binary and tf-idf.
5.2 Extraction process
The experiment conducted used the web pages from Carnegie Mellon University
called “CMU World Wide Knowledge Base (Web->KB) project”. This package
contains WWW-pages collected from the universities of Cornell, Texas, Washington,
and Wisconsin. The pages were manually classified into the following categories:
student, faculty, staff, department, course, and project. The number of pages for each
category is 1641, 1124, 137, 182, 930, 504, and 3764 respectively. Only four of the
categories have been used in this experiment. The categories are course, department,
project, and student. Where there are 150 pages for each category total of 600 web
pages.
39
The files are organized into a directory structure, where one directory represents one
class. Each of these four directories consists of five universities web pages. Some of
the pages do not contain useful information. For example, about 80 pages only
contain information for redirecting the browser to a different location; therefore the
pages were selected after testing each one manually.
After downloading and organizing the pages in the directories, the next step is reading
those pages. There are two options for this step, either by reading the page
individually one by one or by reading all the pages inside the folder in one shot, this
experiment was conducted by using the second option in order to save time.
The HTML Parser prototype used here has the facilities to read multi-pages,
providing the folder path and the target “class type” manually. In this experiment,
100% of the pages content was selected to be processed into a data set. Many words
in the web pages have small number of characters, therefore the next option selected
is processing only the words consist of more than 3 characters. To apply tf-idf
technique, duplicate words are needed to weight the frequency of that word. For this
requirement, the option remove duplicate is not selected. The other option remove
stop-words selected to remove the words that don’t give any value for classification.
Stemming option was selected to apply Porter Stemming Algorithm on the extracted
words. The last option is to decide which technique needs to be used for creating data
set. There are two techniques, one is creating binary data set and the other is creating
tf-idf data set. In this experiment, both techniques are used to create two type of data
set. The first data set is created by using tf-idf technique. After creating the data set,
the file was in form of comma separated value (CSV). The final data set contains 600
rows with 4 classes namely: course, department, project, and student. Fig 5.1 shows
part of tf-idf data set.
40
Figure 5.1: tf-idf Data set.
The second data set created by using binary technique instead of tf-idf technique. For
this type of data set, no need to count the duplicate words, therefore the option
remove duplicate was selected. Fig 5.2 shows part of binary data set.
Figure 5.2: Binary data set.
41
5.3 Classification Activities
WEKA 3.7.1 application used in this experiment, Explorer option selected from the
Weka GUI chooser. WEKA has the facility to handle (CSV) file therefore there is no
need to convert the data set because it is already in CSV format. The algorithm chosen
in this experiment was Naïve-Bayes algorithm and C4.5 (J48). Many splitting rate
was used to find out which rate is the best for training and test parts. Table 5.1 shows
the classification results for 60% as training set and 40% as testing set using tf-idf
data set.
Data
Representation
No of
attributes
Classifier
Algorithm
Discretized
Time to
build model
(second)
Incorrectly
classified
Correctly
classified
tf-idf 7328
Naïve-
Bayes
No 4.42 8.3333 % 91.6667 %
tf-idf 7328
Naïve-
Bayes
Yes 5.65 13.75 % 86.25 %
tf-idf 7328 C4.5 N/A 40.43 22.9167 % 77.0833 %
Table 5.1: classification results (60%-40%).
By using tf-idf data set with (60% - 40%) as splitting rate and without applying
Discretization option, Naïve-Bayes shows the highest accuracy around 91%.
Table 5.2 shows the classification results for 65% as training set and 35% as testing
set using tf-idf data set.
Data
Representation
No of
attributes
Classifier
Algorithm
Discretized
Time to
build model
(second)
Incorrectly
classified
Correctly
classified
tf-idf 7328
Naïve-
Bayes
No 4.08 9.5238 % 90.4762 %
tf-idf 7328
Naïve-
Bayes
Yes 5.76 15.7143 % 84.2857 %
tf-idf 7328 C4.5 N/A 39.65 19.0476 % 80.9524 %
Table 5.2: classification results (65%-35%).
42
By using tf-idf data set with (65% - 35%) as splitting rate and without applying
Discretization option, Naïve-Bayes shows the highest accuracy around 90%.
Table 5.3 shows the classification results for 70% as training set and 30% as testing
set using tf-idf data set.
Data
Representation
No of
attributes
Classifier
Algorithm
Discretized
Time to
build model
(second)
Incorrectly
classified
Correctly
classified
tf-idf 7328
Naïve-
Bayes
No 3.61 7.7778 % 92.2222 %
tf-idf 7328
Naïve-
Bayes
Yes 5.3 17.2222 % 82.7778 %
tf-idf 7328 C4.5 N/A 41.45 24.4444 % 75.5556 %
Table 5.3: classification results (70%-30%).
By using tf-idf data set with (70% - 30%) as splitting rate and without applying
Discretization option, Naïve-Bayes shows the highest accuracy around 92%.
Table 5.4 shows the classification results for 60% as training set and 40% as testing
set using Binary data set.
Data
Representation
No of
attributes
Classifier
Algorithm
Discretized
Time to
build model
(second)
Incorrectly
classified
Correctly
classified
Binary 7209
Naïve-
Bayes
No 4.34 10.8333 % 89.1667 %
Binary 7209
Naïve-
Bayes
Yes 5.3 14.5833 % 85.4167 %
Binary 7209 C4.5 N/A 49.42 20.4167 % 79.5833 %
Table 5.4: classification results (60%-40%).
By using Binary data set with (60% - 40%) as splitting rate and without applying
Discretization option, Naïve-Bayes shows the highest accuracy around 89%.
43
Table 5.5 shows the classification results for 65% as training set and 35% as testing
set using Binary data set.
Data
Representation
No of
attributes
Classifier
Algorithm
Discretized
Time to
build model
(second)
Incorrectly
classified
Correctly
classified
Binary 7209
Naïve-
Bayes
No 3.63 11.9048 % 88.0952 %
Binary 7209
Naïve-
Bayes
Yes 5.14 13.8095 % 86.1905 %
Binary 7209 C4.5 NA 47.96 20 % 80 %
Table 5.5: classification results (65%-35%).
By using Binary data set with (65% - 35%) as splitting rate and without applying
Discretization option, Naïve-Bayes shows the highest accuracy around 88%.
Table 5.6 shows the classification results for 70% as training set and 30% as testing
set using Binary data set.
Data
Representation
No of
attributes
Classifier
Algorithm
Discretized
Time to
build model
(second)
Incorrectly
classified
Correctly
classified
Binary 7209
Naïve-
Bayes
No 4.18 12.2222 % 87.7778 %
Binary 7209
Naïve-
Bayes
Yes 5.65 16.1111 % 83.8889 %
Binary 7209 C4.5 NA 46.39 26.1111 % 73.8889 %
Table 5.6: classification results (70%-30%).
By using Binary data set with (70% - 30%) as splitting rate and without applying
Discretization option, Naïve-Bayes shows the highest accuracy around 87%.
The number of instance has a vital rule in the results of accuracy. By trying many data
set sizes, the results are that neither a small number of instances nor a very big
number of instances was useful in the classification accuracy. This is because a small
44
number of instances will build a weak model which doesn’t have the ability to classify
new input to the correct class. In opposite case when using very big number of
instances, the model will be very complex and the probability of over fitting will
increase, therefore the data set in middle size (600 instances) gives the best results.
From the results, it is clear that the highest accuracy around 92% obtained by using tfidf data set and Naïve-Bayes classifier without applying Discretization technique. By
using tf-idf technique the classification accuracy increased around 3% from using
Binary technique. This type of classification called Hierarchical classification
(Xiaoguang & Brian, 2009). This type of classification is more difficult than flat
classification because the task here is to classify the pages which are in the same area
(educational in this case). Those pages have similar nature in structure and contents,
therefore it is not an easy task to classify such pages and we need a sufficient
algorithm to extract the text from those pages like the developed one.
Songbo, Xueqi, Moustafa, Bin, and Hongbo (2005) have proposed a clssification
algorithm called “DragPushing”. In their approach, they have used the same data set
used in this study (WebKB) and seven classification algorithms namely: Refining
Centroid Classifier (RCC), Centroid, Refining Naïve-Bayes (RNB), Naïve-Bayes
(NB), K-Nearest Neighbor (KNN), Winnow, and Support Vector Machines (SVM).
Table 5.7 shows the classification accuracy details for each algorithm.
RCC Centroid RNB NB KNN Winnow SVM
0.8802 0.7847 0.9063 0.8601 0.7015 0.7785 0.8766
Table 5.7: Classification accuracy Details (DragPushing).
45
RCC and RNB algorithms gave the highest classification algorithm, while NB gave
around 0.86 classification accuracy.
Comparing their approach with this approach, this approach was able to produce
classification accuracy approximately 2% higher than RNB and approximately 6%
higher than NB in their approach. This indicated that, this study has improved the web
page classification in term of accuracy.
5.4 Summary
Experiment was conducted by using Naïve-Bayes and C4.5 classifier. Both algorithms
applied on Binary data set and tf-idf data set. Naïve-Bayes classifier shows accuracy
around 92% which is higher than C4.5 around 80%. Another important factor is the
time consumption. C4.5 consume around 41 second while Naïve-Bayes consume
around 3 second. Therefore, the classifier Naïve-Bayes consider better than C4.5 in
term of classification accuracy and time consumption.
Extracting more words from each page gives more accuracy later in the classification,
the other options like selecting the number of character in each word should be
considered as critical option as well.
Naïve-Bayes has good capability to handle large data set. By using hierarchical data
set, the classification accuracy was acceptable. The accuracy will increased if the
classification type is flat structure or if the data set classes are less than 4 classes.
Educational data set is very challenging data set and it is good date set for evaluating
an algorithm. The proposed approach is a promising approach for text extraction in
web classification.
46
CHAPTER SIX
CONCLUSION AND RECOMMENDATIONS
This chapter concludes the study and discussion of this project for Text Extraction
Algorithm. Finally, recommendation and further study in this area are given in this
chapter.
6.1 Project Conclusion
The objective of this project is to develop a text extraction algorithm for web text
classification. The HTML Parser that been developed provides all the facilities
required for text extraction from the web pages. Two type of data set can be created
using the HTML Parser, one is Binary data set and the other one is id tf-idf data set.
The experiment validates the prototype by applying classifier algorithms on the data
set. Two classifiers algorithms used, one is Naïve-Bayes algorithm and the other one
is C4.5 algorithm. Naïve-Bayes classifier produced better classification results in term
of accuracy and time consumption.
47
6.2 Recommendation for Future Research
For more research on text extraction algorithm for web page classification, deep
investigations on feature selection algorithms are required to improve the data set
quality. The feature selection algorithm shows strong influence on classification
accuracy.
Furthermore, it is also suggested that in future research, how to represent the extracted
data should be investigated. It is very vital to find a method to preserve the relations
between extracted words.
48
REFERENCES
Aaron, H. (2004). Introduction to Bayesian Learning. New York: University of
Toronto.
Alonso-Rios, D., Luis-Vazquez, I., Mosqueira-Rey, E., Moret-Bonillo, V., & del Rio,
B. B. (2009, 11-14 Oct. 2009). An HTML analyzer for the study of web
usability. Paper presented at the IEEE International Conference on Systems,
Man and Cybernetics, 2009. SMC 2009, San Antonio, TX, 1224-1229.
Anagnostopoulos, I., Kouzas, G., Anagnostopoulos, C., Vergados, D.,
Papaleonidopoulos, I., Generalis, A.,...Kayafas, E. (2002). Automatic Web
site classification in a large repository under information filtering and retrieval
techniques. Paper presented at the Electrotechnical Conference, 2002.
MELECON. 11th Mediterranean, 279-283.
Cooley, R., Mobasher, B., & Srivastava, J. (1997, 3-8 Nov 1997). Web mining:
information and pattern discovery on the World Wide Web. Paper presented at
the Proceedings Ninth IEEE International Conference on Tools with Artificial
Intelligence, Newport Beach, CA, 558-567.
Croft, B., Metzler, D., & Strohman, T. (2009). Search Engines: Information Retrieval
in Practice: Addison Wesley.
Dey, L., & Haque, S. K. M. (2008). Opinion mining from noisy text data. Paper
presented at the Proceedings of the second workshop on Analytics for noisy
unstructured text data, Singapore, 83-90.
Feng, C., & Cherkassky, V. (2009). SVM+ regression and multi-task learning. Paper
presented at the International Joint Conference on Neural Networks, 2009.
IJCNN 2009, Atlanta, GA, 418-424.
Fresno, V., & Ribeiro, A. (2004). An Analytical Approach to Concept Extraction in
HTML Environments. Journal of Intelligent Information Systems, Volume 22,
Number 3(0925-9902), 215-235.
Hui, G., Yan, F., & Jian-Ping, L. (2008). Classification of Sensitive Web Documents.
Paper presented at the International Conference on Apperceiving Computing
and Intelligence Analysis, 2008. ICACIA 2008, Chengdu, 295-298.
Jianli, L., Jie, S., & Cuihua, X. (2004, 15-18 Sept. 2004). Segmenting the Web
document with document object model. Paper presented at the IEEE
International Conference on Services Computing, 2004. (SCC 2004), China,
449-452.
Ke, Y., Yueming, L., Zongyan, L., & Xiang, D. (2008, 25-27 June 2008).
Architecture and design of the parser used in the embedded browser based on
DFA. Paper presented at the 7th World Congress on Intelligent Control and
Automation, 2008. WCICA 2008, Chongqing, 6627-6630.
49
Lee, S., Baker, J., Song, J., & Wetherbe, J. C. (2010). An Empirical Comparison of
Four Text Mining Methods. Paper presented at the International Conference on
43rd Hawaii System Sciences, (HICSS), Honolulu, Hawaii, USA, 1-10.
Liu, Y., A process-based search engine. Ph.D. dissertation, University of Kansas,
United States -- Kansas. Retrieved April 16, 2010, from Dissertations &
Theses: Full Text.(Publication No. AAT 3360489).
McConnell, S. (1996). Rapid Development: Taming Wild Software Schedules.
Washington: Microsoft Press.
Meijuan, G., Jingwen, T., & Shiru, Z. (2009). Research of web classification mining
based on classify support vector machine. Paper presented at the International
Colloquium on Computing, Communication, Control, and Management, 2009.
CCCM 2009. ISECS, Sanya, 21-24.
Morales, S. d. P. B., Fandino, H. A. B., & Rodrivguez, J. R. (2009). Hypertext
classification to filtrate information on the web. Paper presented at the
Proceedings of the 2009 Euro American Conference on Telematics and
Information Systems: New Opportunities to increase Digital Citizenship,
Prague, Czech Republic, 1-7.
Qi, X., & Davison, B. D. (2009). Web page classification: Features and algorithms.
ACM Computing Surveys (CSUR), 41(2), 1-31.
Ronen, F., & James, S. (2007). The text mining handbook. New York: Cambridge
University Press.
Shiqun, Y., Yuhui, Q., & Jike, G. (2007). Research and Realization of Text Mining
Algorithm on Web. Paper presented at the International Conference on
Computational Intelligence and Security Workshops, 2007. CISW 2007,
Harbin, 413-416.
Songbo, T., Xueqi, C., Moustafa, M. G., Bin, W., & Hongbo, X. (2005). A novel
refinement approach for text categorization. Paper presented at the
Proceedings of the 14th ACM international conference on Information and
knowledge management, Bremen, Germany, 469-476.
Suanmali, L., Salim, N., & Binwahlan, M. S. (2009). Feature-Based Sentence
Extraction Using Fuzzy Inference Rules. Paper presented at the 2009
International Conference on Signal Processing Systems, Singapore, 511- 515.
Sun, B., Sun, Q., Chen, Z., & Fu, Z. (2009). A Study on Automatic Web Pages
Categorization. Paper presented at the IEEE International Advance
Computing Conference, 2009. IACC 2009, Patiala, 1423-1427.
Tomar, G. S., Shekhar, V., & Ashish, J. (2006). Web Page Classification using
Modified Naive Bayesian Approach. Paper presented at the TENCON 2006.
IEEE Region 10 Conference, Hong Kong, 1-4.
50
Wang, J., Huang, Y., Wu, G., & Zhang, F. (1999). Web mining: knowledge discovery
on the Web. Paper presented at the 1999 IEEE International Conference on
Systems, Man, and Cybernetics, IEEE SMC '99 Conference Proceedings,
Tokyo, 137-14.
Xiaoguang, Q., & Brian, D. D. (2009). Web page classification: Features and
algorithms. ACM Computing Surveys (CSUR) 41(2), 1-31.
Yu, H., Han, J., & Chang, K. C.-C. (2002). PEBL: positive example based learning
for Web page classification using SVM. Paper presented at the Proceedings of
the eighth ACM SIGKDD international conference on Knowledge discovery
and data mining, Edmonton, Alberta, Canada, 239-248.
Zhang, Y., Zincir-Heywood, N., & Milios, E. (2005). Narrative text classification for
automatic key phrase extraction in web document corpora. Paper presented at
the Proceedings of the 7th annual ACM international workshop on Web
information and data management, Bremen, Germany, 51-58.
Zhongda, L., Kun, D., & Yanfen, H. (2007). Research of Web Pages Categorization.
Paper presented at the IEEE International Conference on Granular Computing,
2007. GRC 2007, Fremont, CA, 691-691.
Ziegler, C.-N., Vogele, C., & Viermetz, M. (2009). Distilling Informative Content
from HTML News Pages. Paper presented at the International Joint
Conferences on Web Intelligence and Intelligent Agent Technologies, 2009.
WI-IAT '09. IEEE/WIC/ACM Milan, Italy, 707-712.
51
Appendix A
I. Get Source Code Algorithm Implementation
class GetSourceCode
{
public StreamReader streamReader;
public string GetOnlineSource(string address)
{
Uri uri = new Uri(address);
WebRequest request = WebRequest.Create(uri);
request.Credentials = CredentialCache.DefaultCredentials;
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Stream dataStream = response.GetResponseStream();
streamReader = new StreamReader(dataStream);
string sourceCode = streamReader.ReadToEnd();
response.Close();
dataStream.Close();
streamReader.Close();
return sourceCode;
}
public string[] GetSourecCodeFromFolder()
{
FolderBrowserDialog folderDialog = new FolderBrowserDialog();
folderDialog.ShowDialog();
if (folderDialog.SelectedPath != "")
{
DirectoryInfo di = new DirectoryInfo(folderDialog.SelectedPath);
FileInfo[] filesInfo = di.GetFiles();
string[] allPages = new string[filesInfo.Length];
// 1- Collect files and read source code
for (int i = 0; i < filesInfo.Length; i++)
{
streamReader = new StreamReader(filesInfo[i].FullName);
allPages[i] = streamReader.ReadToEnd().ToLower();
}
return allPages;
}
string[] allPages2 = new string[0];
return allPages2;
}
}
II. Calculate Percentage Algorithm Implementation
class CalcPercentage
{
public void Calc(double percentage, string[] webPages)
{
for (int i = 0; i < webPages.Length; i++)
{
if (webPages[i].Length > webPages[i].Length * percentage)
{
int start = Convert.ToInt32(webPages[i].Length * percentage);
int charCount = webPages[i].Length - start;
webPages[i] = webPages[i].Remove(start, charCount);
}
52
}
}
}
III. Remove HTML Tags Algorithm Implementation:
public class RemoveHTMLTags
{
public void Remove(string[] webPages)
{
for (int i = 0; i < webPages.Length; i++)
{
webPages[i] = Regex.Replace(webPages[i], @"<(.|\n)*?>", " ");
}
}
}
IV. Remove Digits Algorithm Implementation
class RemoveDigits
{
public void Remove(string[] webPages)
{
for (int i = 0; i < webPages.Length; i++)
{
for (int j = 0; j < webPages[i].Length; j++)
{
if (!char.IsLetter(webPages[i][j]) && webPages[i][j] != ' ')
{
webPages[i] = webPages[i].Replace(webPages[i][j], ' ');
}
}
}
}
}
V. Split lines to words Algorithm Implementation
class SplitText
{
public void Split(string[] allPages, List<string[]> listArray)
{
for (int i = 0; i < allPages.Length; i++)
{
string[] words = allPages[i].Split(new char[] { ' ', '\n', '\\','/', '-', '_', '{', '}', '[',
']', '=', '(', ')', ';', '"', ',', '.', '+', '!' });
List<string> wordList = new List<string>();
for (int w = 0; w < words.Length; w++)
{
if (words[w].Trim() != "")
{
wordList.Add(words[w]);
}
}
string[] cleanWords = new string[wordList.Count];
for (int cw = 0; cw < cleanWords.Length; cw++)
{
cleanWords[cw] = wordList[cw];
}
listArray.Add(cleanWords);
}
53
}
}
VI. Remove duplicate words Algorithm Implementation
class RemoveDuplicate
{
public void Remove(bool remove, List<string[]> pageList, List<List<string>>
webPagesList)
{
if (remove)
{
for (int i = 0; i < pageList.Count; i++)
{
for (int j = 0; j < pageList[i].Length; j++)
{
if (!webPagesList[i].Contains(pageList[i][j].Trim()))
{
webPagesList[i].Add(pageList[i][j]);
}
}
}
}
else
{
for (int i = 0; i < pageList.Count; i++)
{
for (int j = 0; j < pageList[i].Length; j++)
{
webPagesList[i].Add(pageList[i][j]);
}
}
}
}
public void InitializaeUniqeWords(List<List<string>> stemedList,
List<List<string>> uniqueWordsList)
{
for (int i = 0; i < stemedList.Count; i++)
{
uniqueWordsList.Add(new List<string>());
}
}
internal void Remove(bool removeDuplicate, List<List<string>> stemedList, List<List<string>>
uniqueWordsList)
{
if (removeDuplicate)
{
for (int i = 0; i < stemedList.Count; i++)
{
for (int j = 0; j < stemedList[i].Count; j++)
{
if (!uniqueWordsList[i].Contains(stemedList[i][j]))
{
uniqueWordsList[i].Add(stemedList[i][j]);
}
}
}
}
else
54
{
for (int i = 0; i < stemedList.Count; i++)
{
for (int j = 0; j < stemedList[i].Count; j++)
{
uniqueWordsList[i].Add(stemedList[i][j]);
}
}
}
}
}
VII. Remove the words have less than specific number of characters
class RemoveMinNoOfChar
{
public void Remove(int charLength, List<List<string>> webPagesList)
{
if (charLength < 100)
{
for (int i = 0; i < webPagesList.Count; i++) {
for (int j = 0; j < webPagesList[i].Count; j++) {
if (webPagesList[i][j].Length < charLength){
webPagesList[i].Remove(webPagesList[i][j]);
j--;
}
}
}
}
}
}
VIII. Remove stop-words Algorithm Implementation
public class RemoveStopWords
{
public void Remove(bool remove, List<List<string>> webPagesList) {
if (remove) {
string[] stop_words = File.ReadAllLines(@"stop.txt");
for (int i = 0; i < stop_words.Length; i++) {
for (int j = 0; j < webPagesList.Count; j++) {
if (webPagesList[j].Contains(stop_words[i].ToLower())) {
webPagesList[j].Remove(stop_words[i].ToLower());
}
}
}
}
}
}
IX. Stemming the words Algorithm Implementation
class Stemmer
{
public void InitializeStemmedlIST(bool steam, List<List<string>> webPagesList,
List<List<string>> stemedList,ProgressBar pb )
{
if (steam)
{
55
int max = 0;
for (int i = 0; i < webPagesList.Count; i++)
{
stemedList.Add(Steam(webPagesList[i]));
if (max <= 100)
{
pb.Value = max;
max++;
}
}
pb.Value = 100;
}
else
{
for (int i = 0; i < webPagesList.Count; i++)
{
stemedList.Add(webPagesList[i]);
}
}
}
static public List<string> Steam(List<string> list)
{
List<string> resultList = new List<string>();
string fullString = "";
for (int i = 0; i < list.Count; i++)
{
fullString += list[i] + " ";
}
char[] w = new char[501];
Stemmer s = new Stemmer();
Stream _in = new MemoryStream(ASCIIEncoding.Default.GetBytes(fullString));
while (true)
{
int ch = _in.ReadByte();
if (Char.IsLetter((char)ch))
{
int j = 0;
while (true)
{
ch = Char.ToLower((char)ch);
w[j] = (char)ch;
if (j < 500)
j++;
ch = _in.ReadByte();
if (!Char.IsLetter((char)ch))
{
for (int c = 0; c < j; c++)
s.add(w[c]);
s.stem();
String u;
u = s.ToString();
resultList.Add(u);
break;
}
}
}
if (ch < 0)
break;
56
}
return resultList;
}
private char[] b;
private int i, /* offset into b */
i_end, /* offset to end of stemmed word */
j, k;
private static int INC = 50;
/* unit of size whereby b is increased */
public Stemmer()
{
b = new char[INC];
i = 0;
i_end = 0;
}
/**
* Add a character to the word being stemmed. When you are finished
* adding characters, you can call stem(void) to stem the word.
*/
public void add(char ch)
{
if (i == b.Length)
{
char[] new_b = new char[i + INC];
for (int c = 0; c < i; c++)
new_b[c] = b[c];
b = new_b;
}
b[i++] = ch;
}
/** Adds wLen characters to the word being stemmed contained in a portion
* of a char[] array. This is like repeated calls of add(char ch), but
* faster.
*/
public void add(char[] w, int wLen)
{
if (i + wLen >= b.Length)
{
char[] new_b = new char[i + wLen + INC];
for (int c = 0; c < i; c++)
new_b[c] = b[c];
b = new_b;
}
for (int c = 0; c < wLen; c++)
b[i++] = w[c];
}
/**
* After a word has been stemmed, it can be retrieved by toString(),
* or a reference to the internal buffer can be retrieved by getResultBuffer
* and getResultLength (which is generally more efficient.)
*/
public override string ToString()
{
57
return new String(b, 0, i_end);
}
/**
* Returns the length of the word resulting from the stemming process.
*/
public int getResultLength()
{
return i_end;
}
/**
* Returns a reference to a character buffer containing the results of
* the stemming process. You also need to consult getResultLength()
* to determine the length of the result.
*/
public char[] getResultBuffer()
{
return b;
}
/* cons(i) is true <=> b[i] is a consonant. */
private bool cons(int i)
{
switch (b[i])
{
case 'a':
case 'e':
case 'i':
case 'o':
case 'u': return false;
case 'y': return (i == 0) ? true : !cons(i - 1);
default: return true;
}
}
/* m() measures the number of consonant sequences between 0 and j. if c is
a consonant sequence and v a vowel sequence, and <..> indicates arbitrary
presence,
<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3
....
*/
private int m()
{
int n = 0;
int i = 0;
while (true)
{
if (i > j) return n;
if (!cons(i)) break; i++;
}
i++;
while (true)
{
while (true)
{
58
if (i > j) return n;
if (cons(i)) break;
i++;
}
i++;
n++;
while (true)
{
if (i > j) return n;
if (!cons(i)) break;
i++;
}
i++;
}
}
/* vowelinstem() is true <=> 0,...j contains a vowel */
private bool vowelinstem()
{
int i;
for (i = 0; i <= j; i++)
if (!cons(i))
return true;
return false;
}
/* doublec(j) is true <=> j,(j-1) contain a double consonant. */
private bool doublec(int j)
{
if (j < 1)
return false;
if (b[j] != b[j - 1])
return false;
return cons(j);
}
/* cvc(i) is true <=> i-2,i-1,i has the form consonant - vowel - consonant
and also if the second c is not w,x or y. this is used when trying to
restore an e at the end of a short word. e.g.
cav(e), lov(e), hop(e), crim(e), but
snow, box, tray.
*/
private bool cvc(int i)
{
if (i < 2 || !cons(i) || cons(i - 1) || !cons(i - 2))
return false;
int ch = b[i];
if (ch == 'w' || ch == 'x' || ch == 'y')
return false;
return true;
}
private bool ends(String s)
{
int l = s.Length;
int o = k - l + 1;
if (o < 0)
return false;
59
char[] sc = s.ToCharArray();
for (int i = 0; i < l; i++)
if (b[o + i] != sc[i])
return false;
j = k - l;
return true;
}
/* setto(s) sets (j+1),...k to the characters in the string s, readjusting
k. */
private void setto(String s)
{
int l = s.Length;
int o = j + 1;
char[] sc = s.ToCharArray();
for (int i = 0; i < l; i++)
b[o + i] = sc[i];
k = j + l;
}
/* r(s) is used further down. */
private void r(String s)
{
if (m() > 0)
setto(s);
}
/* step1() gets rid of plurals and -ed or -ing. e.g.
caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat
feed -> feed
agreed -> agree
disabled -> disable
matting -> mat
mating -> mate
meeting -> meet
milling -> mill
messing -> mess
meetings -> meet
*/
private void step1()
{
if (b[k] == 's')
{
if (ends("sses"))
k -= 2;
else if (ends("ies"))
setto("i");
else if (b[k - 1] != 's')
k--;
}
if (ends("eed"))
60
{
if (m() > 0)
k--;
}
else if ((ends("ed") || ends("ing")) && vowelinstem())
{
k = j;
if (ends("at"))
setto("ate");
else if (ends("bl"))
setto("ble");
else if (ends("iz"))
setto("ize");
else if (doublec(k))
{
k--;
int ch = b[k];
if (ch == 'l' || ch == 's' || ch == 'z')
k++;
}
else if (m() == 1 && cvc(k)) setto("e");
}
}
/* step2() turns terminal y to i when there is another vowel in the stem. */
private void step2()
{
if (ends("y") && vowelinstem())
b[k] = 'i';
}
/* step3() maps double suffices to single ones. so -ization ( = -ize plus
-ation) maps to -ize etc. note that the string before the suffix must give
m() > 0. */
private void step3()
{
if (k == 0)
return;
/* For Bug 1 */
switch (b[k - 1])
{
case 'a':
if (ends("ational")) { r("ate"); break; }
if (ends("tional")) { r("tion"); break; }
break;
case 'c':
if (ends("enci")) { r("ence"); break; }
if (ends("anci")) { r("ance"); break; }
break;
case 'e':
if (ends("izer")) { r("ize"); break; }
break;
case 'l':
if (ends("bli")) { r("ble"); break; }
if (ends("alli")) { r("al"); break; }
if (ends("entli")) { r("ent"); break; }
if (ends("eli")) { r("e"); break; }
if (ends("ousli")) { r("ous"); break; }
break;
61
case 'o':
if (ends("ization")) { r("ize"); break; }
if (ends("ation")) { r("ate"); break; }
if (ends("ator")) { r("ate"); break; }
break;
case 's':
if (ends("alism")) { r("al"); break; }
if (ends("iveness")) { r("ive"); break; }
if (ends("fulness")) { r("ful"); break; }
if (ends("ousness")) { r("ous"); break; }
break;
case 't':
if (ends("aliti")) { r("al"); break; }
if (ends("iviti")) { r("ive"); break; }
if (ends("biliti")) { r("ble"); break; }
break;
case 'g':
if (ends("logi")) { r("log"); break; }
break;
default:
break;
}
}
/* step4() deals with -ic-, -full, -ness etc. similar strategy to step3. */
private void step4()
{
switch (b[k])
{
case 'e':
if (ends("icate")) { r("ic"); break; }
if (ends("ative")) { r(""); break; }
if (ends("alize")) { r("al"); break; }
break;
case 'i':
if (ends("iciti")) { r("ic"); break; }
break;
case 'l':
if (ends("ical")) { r("ic"); break; }
if (ends("ful")) { r(""); break; }
break;
case 's':
if (ends("ness")) { r(""); break; }
break;
}
}
/* step5() takes off -ant, -ence etc., in context <c>vcvc<v>. */
private void step5()
{
if (k == 0)
return;
/* for Bug 1 */
switch (b[k - 1])
{
case 'a':
if (ends("al")) break; return;
case 'c':
if (ends("ance")) break;
62
if (ends("ence")) break; return;
case 'e':
if (ends("er")) break; return;
case 'i':
if (ends("ic")) break; return;
case 'l':
if (ends("able")) break;
if (ends("ible")) break; return;
case 'n':
if (ends("ant")) break;
if (ends("ement")) break;
if (ends("ment")) break;
/* element etc. not stripped before the m */
if (ends("ent")) break; return;
case 'o':
if (ends("ion") && j >= 0 && (b[j] == 's' || b[j] == 't')) break;
/* j >= 0 fixes Bug 2 */
if (ends("ou")) break; return;
/* takes care of -ous */
case 's':
if (ends("ism")) break; return;
case 't':
if (ends("ate")) break;
if (ends("iti")) break; return;
case 'u':
if (ends("ous")) break; return;
case 'v':
if (ends("ive")) break; return;
case 'z':
if (ends("ize")) break; return;
default:
return;
}
if (m() > 1)
k = j;
}
/* step6() removes a final -e if m() > 1. */
private void step6()
{
j = k;
if (b[k] == 'e')
{
int a = m();
if (a > 1 || a == 1 && !cvc(k - 1))
k--;
}
if (b[k] == 'l' && doublec(k) && m() > 1)
k--;
}
/** Stem the word placed into the Stemmer buffer through calls to add().
* Returns true if the stemming process resulted in a word different
* from the input. You can retrieve the result with
* getResultLength()/getResultBuffer() or toString().
*/
public void stem()
{
k = i - 1;
63
if (k > 1)
{
step1();
step2();
step3();
step4();
step5();
step6();
}
i_end = k + 1;
i = 0;
}
}
X. Create Binary Data Set Algorithm Implementation
class CreateBinaryDataSet
{
static int col = 0;
static int row = 0;
static bool flag = false;
static int ColCount = 0;
public void Create(List<List<string>> uniqueWordsList, string[,] dataSet)
{
for (int i = 0; i < uniqueWordsList.Count; i++)
{
string[] arrayOfPageWords = new string[uniqueWordsList[i].Count];
for (int j = 0; j < uniqueWordsList[i].Count; j++)
{
arrayOfPageWords[j] = uniqueWordsList[i][j];
}
AddTODataSet(arrayOfPageWords, dataSet);
}
}
private void AddTODataSet(string[] doc, string[,] dataSet)
{
row++;
for (int i = 0; i < doc.GetLength(0); i++)
{
flag = false;
for (col = 0; col < dataSet.GetLength(1); col++)
{
if (doc[i] == dataSet[0, col])
{
flag = true;
break;
}
}
if (flag == true)
{
AddValueTo(row, col,dataSet);
}
else if (flag == false)
{
AddNewCol(ColCount, doc[i], dataSet);
ColCount++;
}
}
64
}
private void AddNewCol(int c, string w, string[,] dataSet)
{
dataSet[0, c] = w;
AddValueTo(row, c, dataSet);
}
private void AddValueTo(int row, int col, string[,] dataSet)
{
dataSet[row, col] = "1";
}
}
XI. Create tf-idf Data Set Algorithm Implementation
class CreateTFIDFdataSet
{
static int col = 0;
static int row = 0;
static bool flag = false;
static int ColCount = 0;
public void Create(List<List<string>> uniqueWordsList, string[,] dataSet)
{
for (int i = 0; i < uniqueWordsList.Count; i++)
{
string[] arrayOfPageWords = new string[uniqueWordsList[i].Count];
for (int j = 0; j < uniqueWordsList[i].Count; j++)
{
arrayOfPageWords[j] = uniqueWordsList[i][j];
}
AddTODataSet(arrayOfPageWords, dataSet);
}
}
public void CalcTFIDF(string [,] dataSetMatrix, List<int> pagesSize)
{
int no_of_terms_in_all_documens = 0;
for (int r = 1; r < dataSetMatrix.GetLength(0); r++)
{
for (int c = 0; c < dataSetMatrix.GetLength(1) - 1; c++)
{
for (int cnt = 1; cnt < dataSetMatrix.GetLength(0); cnt++)
{
if (dataSetMatrix[cnt, c] != "0")
{
no_of_terms_in_all_documens++;
}
}
if (dataSetMatrix[r, c] != "0")
{
dataSetMatrix[r, c] = CalcTFIDF(dataSetMatrix[r, c], pagesSize[r - 1],
dataSetMatrix.GetLength(0) - 1, no_of_terms_in_all_documens);
}
no_of_terms_in_all_documens = 0;
}
}
}
private void AddTODataSet(string[] doc, string[,] dataSet)
65
{
row++;
for (int i = 0; i < doc.GetLength(0); i++)
{
flag = false;
for (col = 0; col < dataSet.GetLength(1); col++)
{
if (doc[i] == dataSet[0, col])
{
flag = true;
break;
}
}
if (flag == true)
{
AddValueTo(row, col,dataSet);
}
else if (flag == false)
{
AddNewCol(ColCount, doc[i], dataSet);
ColCount++;
}
}
}
private void AddNewCol(int c, string w, string[,] dataSet)
{
dataSet[0, c] = w;
AddValueTo(row, c, dataSet);
}
private void AddValueTo(int row, int col,string[,] dataSet)
{
string cellValue = dataSet[row, col];
if (cellValue == null)
{
dataSet[row, col] = "1";
}
else
{
dataSet[row, col] = (double.Parse(cellValue) + 1).ToString(); ;
}
}
public void ChangeNullItems(bool binary, string[,] dataSet)
{
if (binary)
{
for (int i = 0; i < dataSet.GetLength(0); i++)
{
for (int j = 0; j < dataSet.GetLength(1); j++)
{
if (dataSet[i, j] == null)
{
dataSet[i, j] = "0";
}
}
}
}
}
private string CalcTFIDF(string termAppears, int totalFreq, int totalPages, int totTermFreq)
66
{
double tdidf;
double termApp = (double.Parse(termAppears));
double tf = termApp / (double)totalFreq;
double idf1 = (double)totalPages / (1 + (double)totTermFreq);
double idf = Math.Log(idf1);
tdidf = tf * idf;
tdidf = Math.Round(tdidf, 4);
return tdidf.ToString();
}
}
XII. Build Physical Data Set Algorithm Implementation
class SaveDataSet
{
public void Save(string[,] dataSetMatrix)
{
SaveFileDialog sfd = new SaveFileDialog();
sfd.FileName = "dataSet" + DateTime.Now.Millisecond + ".csv";
DialogResult result = sfd.ShowDialog();
if (result == DialogResult.OK)
{
string sFilePath = sfd.FileName;
StreamWriter sw = File.CreateText(sFilePath);
for (int r = 0; r < dataSetMatrix.GetLength(0); r++)
{
for (int c = 0; c < dataSetMatrix.GetLength(1); c++)
{
sw.Write(dataSetMatrix[r, c] + ",");
}
sw.WriteLine();
}
sw.Close();
Process.Start(sFilePath);
}
}
}
67
Appendix B
/* This is the Porter stemming algorithm, coded up in ANSI C by the author. */
#include <string.h> /* for memmove */
#define TRUE 1
#define FALSE 0
static char * b; /* buffer for word to be stemmed */
static int k,k0,j; /* j is a general offset into the string */
/* cons(i) is TRUE <=> b[i] is a consonant. */
static int cons(int i)
{ switch (b[i])
{ case 'a': case 'e': case 'i': case 'o': case 'u': return FALSE;
case 'y': return (i==k0) ? TRUE : !cons(i-1);
default: return TRUE;
}
}
/* m() measures the number of consonant sequences between k0 and j. if c is
a consonant sequence and v a vowel sequence, and <..> indicates arbitrary
presence,
<c><v> gives 0
<c>vc<v> gives 1
<c>vcvc<v> gives 2
<c>vcvcvc<v> gives 3
....
*/
static int m()
{ int n = 0;
int i = k0;
while(TRUE)
{ if (i > j) return n;
if (! cons(i)) break; i++;
}
i++;
while(TRUE)
{ while(TRUE)
{ if (i > j) return n;
if (cons(i)) break;
i++;
}
i++;
n++;
while(TRUE)
{ if (i > j) return n;
if (! cons(i)) break;
i++;
}
i++;
}
}
/* vowelinstem() is TRUE <=> k0,...j contains a vowel */
static int vowelinstem()
{ int i; for (i = k0; i <= j; i++) if (! cons(i)) return TRUE;
return FALSE;
}
/* doublec(j) is TRUE <=> j,(j-1) contain a double consonant. */
static int doublec(int j)
{ if (j < k0+1) return FALSE;
68
if (b[j] != b[j-1]) return FALSE;
return cons(j);
}
/* cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant
and also if the second c is not w,x or y. this is used when trying to
restore an e at the end of a short word. e.g.
cav(e), lov(e), hop(e), crim(e), but
snow, box, tray.
*/
static int cvc(int i)
{ if (i < k0+2 || !cons(i) || cons(i-1) || !cons(i-2)) return FALSE;
{ int ch = b[i];
if (ch == 'w' || ch == 'x' || ch == 'y') return FALSE;
}
return TRUE;
}
/* ends(s) is TRUE <=> k0,...k ends with the string s. */
static int ends(char * s)
{ int length = s[0];
if (s[length] != b[k]) return FALSE; /* tiny speed-up */
if (length > k-k0+1) return FALSE;
if (memcmp(b+k-length+1,s+1,length) != 0) return FALSE;
j = k-length;
return TRUE;
}
/* setto(s) sets (j+1),...k to the characters in the string s, readjusting
k. */
static void setto(char * s)
{ int length = s[0];
memmove(b+j+1,s+1,length);
k = j+length;
}
/* r(s) is used further down. */
static void r(char * s) { if (m() > 0) setto(s); }
/* step1ab() gets rid of plurals and -ed or -ing. e.g.
caresses -> caress
ponies -> poni
ties -> ti
caress -> caress
cats -> cat
feed -> feed
agreed -> agree
disabled -> disable
matting -> mat
mating -> mate
meeting -> meet
milling -> mill
messing -> mess
meetings -> meet
*/
static void step1ab()
{ if (b[k] == 's')
{ if (ends("\04" "sses")) k -= 2; else
if (ends("\03" "ies")) setto("\01" "i"); else
if (b[k-1] != 's') k--;
}
if (ends("\03" "eed")) { if (m() > 0) k--; } else
if ((ends("\02" "ed") || ends("\03" "ing")) && vowelinstem())
{ k = j;
if (ends("\02" "at")) setto("\03" "ate"); else
69
if (ends("\02" "bl")) setto("\03" "ble"); else
if (ends("\02" "iz")) setto("\03" "ize"); else
if (doublec(k))
{ k--;
{ int ch = b[k];
if (ch == 'l' || ch == 's' || ch == 'z') k++;
}
}
else if (m() == 1 && cvc(k)) setto("\01" "e");
}
}
/* step1c() turns terminal y to i when there is another vowel in the stem. */
static void step1c() { if (ends("\01" "y") && vowelinstem()) b[k] = 'i'; }
/* step2() maps double suffices to single ones. so -ization ( = -ize plus
-ation) maps to -ize etc. note that the string before the suffix must give
m() > 0. */
static void step2() { switch (b[k-1])
{
case 'a': if (ends("\07" "ational")) { r("\03" "ate"); break; }
if (ends("\06" "tional")) { r("\04" "tion"); break; }
break;
case 'c': if (ends("\04" "enci")) { r("\04" "ence"); break; }
if (ends("\04" "anci")) { r("\04" "ance"); break; }
break;
case 'e': if (ends("\04" "izer")) { r("\03" "ize"); break; }
break;
case 'l': if (ends("\03" "bli")) { r("\03" "ble"); break; } /*-DEPARTURE-*/
/* To match the published algorithm, replace this line with
case 'l': if (ends("\04" "abli")) { r("\04" "able"); break; } */
if (ends("\04" "alli")) { r("\02" "al"); break; }
if (ends("\05" "entli")) { r("\03" "ent"); break; }
if (ends("\03" "eli")) { r("\01" "e"); break; }
if (ends("\05" "ousli")) { r("\03" "ous"); break; }
break;
case 'o': if (ends("\07" "ization")) { r("\03" "ize"); break; }
if (ends("\05" "ation")) { r("\03" "ate"); break; }
if (ends("\04" "ator")) { r("\03" "ate"); break; }
break;
case 's': if (ends("\05" "alism")) { r("\02" "al"); break; }
if (ends("\07" "iveness")) { r("\03" "ive"); break; }
if (ends("\07" "fulness")) { r("\03" "ful"); break; }
if (ends("\07" "ousness")) { r("\03" "ous"); break; }
break;
case 't': if (ends("\05" "aliti")) { r("\02" "al"); break; }
if (ends("\05" "iviti")) { r("\03" "ive"); break; }
if (ends("\06" "biliti")) { r("\03" "ble"); break; }
break;
case 'g': if (ends("\04" "logi")) { r("\03" "log"); break; } /*-DEPARTURE-*/
/* To match the published algorithm, delete this line */
} }
/* step3() deals with -ic-, -full, -ness etc. similar strategy to step2. */
static void step3() { switch (b[k])
{
case 'e': if (ends("\05" "icate")) { r("\02" "ic"); break; }
if (ends("\05" "ative")) { r("\00" ""); break; }
if (ends("\05" "alize")) { r("\02" "al"); break; }
break;
case 'i': if (ends("\05" "iciti")) { r("\02" "ic"); break; }
break;
case 'l': if (ends("\04" "ical")) { r("\02" "ic"); break; }
70
if (ends("\03" "ful")) { r("\00" ""); break; }
break;
case 's': if (ends("\04" "ness")) { r("\00" ""); break; }
break;
} }
/* step4() takes off -ant, -ence etc., in context <c>vcvc<v>. */
static void step4()
{ switch (b[k-1])
{ case 'a': if (ends("\02" "al")) break; return;
case 'c': if (ends("\04" "ance")) break;
if (ends("\04" "ence")) break; return;
case 'e': if (ends("\02" "er")) break; return;
case 'i': if (ends("\02" "ic")) break; return;
case 'l': if (ends("\04" "able")) break;
if (ends("\04" "ible")) break; return;
case 'n': if (ends("\03" "ant")) break;
if (ends("\05" "ement")) break;
if (ends("\04" "ment")) break;
if (ends("\03" "ent")) break; return;
case 'o': if (ends("\03" "ion") && (b[j] == 's' || b[j] == 't')) break;
if (ends("\02" "ou")) break; return;
/* takes care of -ous */
case 's': if (ends("\03" "ism")) break; return;
case 't': if (ends("\03" "ate")) break;
if (ends("\03" "iti")) break; return;
case 'u': if (ends("\03" "ous")) break; return;
case 'v': if (ends("\03" "ive")) break; return;
case 'z': if (ends("\03" "ize")) break; return;
default: return;
}
if (m() > 1) k = j;
}
/* step5() removes a final -e if m() > 1, and changes -ll to -l if
m() > 1. */
static void step5()
{ j = k;
if (b[k] == 'e')
{ int a = m();
if (a > 1 || a == 1 && !cvc(k-1)) k--;
}
if (b[k] == 'l' && doublec(k) && m() > 1) k--;
}
int stem(char * p, int i, int j)
{ b = p; k = j; k0 = i; /* copy the parameters into statics */
if (k <= k0+1) return k; /*-DEPARTURE-*/
/* With this line, strings of length 1 or 2 don't go through the
stemming process, although no mention is made of this in the
published algorithm. Remove the line to match the published
algorithm. */
step1ab(); step1c(); step2(); step3(); step4(); step5();
return k;
}
Automatically extracted. Refer to the original PDF for figures, tables, and formatting.