Year: 2010

Venue: Master Project, Universiti Utara Malaysia

Type: thesis

DOI: 10.13140/rg.2.2.29447.66729

External link: https://doi.org/10.13140/rg.2.2.29447.66729

Abstract

Explosive expand of web pages in the World Wide Web makes it difficult for search engine and web directory to give relevant results to the user requirements. Web pages need automatic classification techniques with high classification accuracy. This study provides a text extraction algorithm for web text classification. The extraction algorithm consists of three phases namely web page extraction, rule formulation, and algorithm validation. A text extraction prototype is built using Visual C# 2008 to validate the algorithm. It is a windows application mixed with web connection protocol. The prototype offers the creation of Binary data set as well as term frequency inverse document frequency (tf-idf) data set. In this study, the experiment was conducted on five English educational websites. The created data sets are then classified using Naïve-Bayes and C4.5 algorithms provided in WEKA application. The experimental results show that Naïve-Bayes classifier with web text extraction algorithm proves to be the best method for web text classification.

Keywords

Text Extraction AlgorithmWeb Text Classification
📄 Full text (109,199 characters)extracted from the PDF · click to expand
TEXT EXTRACTION ALGORITHM FOR WEB TEXT CLASSIFICATION MUSTAFA MUWAFAK THEAB UNIVERSITI UTARA MALAYSIA 2010 2 TEXT EXTRACTION ALGORITHM FOR WEB TEXT CLASSIFICATION A thesis submitted to the Faculty of Information Technology in partial fulfillment of the requirements for the degree Master of Science (Intelligent Systems) Universiti Utara Malaysia By Mustafa Muwafak Theab © Mustafa Muwafak, 2010. All rights reserved. i PERMISSION TO USE In presenting this thesis in partial fulfillment of the requirements for a postgraduate degree from University Utara Malaysia, I agree that the University Library may make it freely available for inspection. I further agree that permission for copying of this thesis in any manner, in whole or in part, for scholarly purpose may be granted by my supervisor(s) or, in their absence by the Dean of the Graduate School. It is understood that any copying or publication or use of this thesis or parts thereof for financial gain shall not be allowed without my written permission. It is also understood that due recognition shall be given to me and to University Utara Malaysia for any scholarly use which may be made of any material from my thesis. Requests for permission to copy or to make other use of materials in this thesis, in whole or in part should be addressed to: Dean of Faculty of Information Technology Universiti Utara Malaysia 06010 UUM Sintok Kedah Darul Aman ii ABSTRACT Explosive expand of web pages in the World Wide Web makes it difficult for search engine and web directory to give relevant results to the user requirements. Web pages need automatic classification techniques with high classification accuracy. This study provides a text extraction algorithm for web text classification. The extraction algorithm consists of three phases namely web page extraction, rule formulation, and algorithm validation. A text extraction prototype is built using Visual C# 2008 to validate the algorithm. It is a windows application mixed with web connection protocol. The prototype offers the creation of Binary data set as well as term frequency inverse document frequency (tf-idf) data set. In this study, the experiment was conducted on five English educational websites. The created data sets are then classified using Naïve-Bayes and C4.5 algorithms provided in WEKA application. The experimental results show that Naïve-Bayes classifier with web text extraction algorithm proves to be the best method for web text classification. iii ACKNOWLEDGEMENTS I would like to express my gratitude sincere appreciation to both of my supervisor, Prof. Dr. Ku Ruhana Ku Mahamud and Dr. Faudziah Ahmad for guiding the research presented in this project. I appreciate and thank them for their continuous support, advice, and comment. The knowledge I have learned from them is the most valuable things I had learn in this world of research. A special thanks to all my lecturers in Universiti Utara Malaysia for their support. My deepest appreciation goes to my beloved family who has been very supportive, patient and understanding all the time. Lastly, this wish of thanks is dedicated to beloved friends and others who involve direct or indirect with this project. Thank you everyone. iv Table of Contents PERMISSION TO USE.......................................................................i ABSTRACT....................................................................................ii ACKNOWLEDGEMENTS..................................................................iii TABLE OF CONTENT......................................................................iv LIST OF TABLES.............................................................................vii LIST OF FIGURES...........................................................................viii LIST OF APPENDICES.....................................................................ix CHAPTER ONE: INTRODUCTION 1.0 Introduction........................................................................................................1 1.1 Problem Statement.............................................................................................2 1.2 Objective........................................................................................3 1.3 Significance of the study......................................................................3 1.4 Scope of the study...........................................................................4 1.5 Limitations of the study.....................................................................4 1.6 Organization of the report...................................................................5 CHAPTER TWO: LITERATURE REVIEW 2.1 Website classification.......................................................................6 2.2 Text Mining in Web........................................................................11 2.3 HTML Parser for text extraction.........................................................13 2.4 Summary.....................................................................................15 v CHAPTER THREE: RESEARCH METHODOLOGY 3.1 Web Text extraction algorithm .........................................................16 3.2 Web Page Extraction ......................................................................17 3.3 Rule Formulation..........................................................................18 3.3.1 Data Processing...............................................................19 I. Remove HTML Tags................................................19 II. Remove Digits........................................................19 III. Split lines to words..................................................20 IV. Remove duplicate words from the list............................20 V. Remove the words have less than specific number of characters........................................................21 VI. Remove stop-words...................................................21 VII. Stemming the words................................................22 VIII. Remove duplicate words after stemming........................23 3.3.2 Data Set Creation............................................................23 I. Create Binary Data Set.............................................23 II. Create tf-idf Data Set...............................................24 III. Build Physical Data Set............................................27 3.3.3 Text Classification...........................................................29 3.4 Algorithm Validation.....................................................................30 3.5 Developing of the prototype ...............................................................30 3.6 Summary....................................................................................31 vi CHAPTER FOUR: PROTOTYPE DEVELOPMENT 4.1 Implementation Details.................................................................32 4.2 Classes Implementation.................................................................34 4.3 Graphical User Interface...............................................................36 4.4 Summary....................................................................................37 CHAPTER FIVE: RESULTS AND FINDINGS 5.1 Experiment Design .........................................................................38 5.2 Extraction processing.......................................................................38 5.3 Classification Activities.....................................................................41 5.4 Summary....................................................................................45 CHAPTER SIX: CONCLUSION 6.1 Project Conclusion .........................................................................46 6.2 Recommendation for Future Research ...................................................47 REFERENCES.................................................................................48 vii LIST OF TABLES Table 5.1: Classification results (60%-40%)..............................................41 Table 5.2: Classification results (65%-35%)..............................................41 Table 5.3: Classification results (70%-30%)..............................................42 Table 5.4: Classification results (60%-40%)..............................................42 Table 5.5: Classification results (65%-35%)..............................................43 Table 5.6: Classification results (70%-30%)..............................................43 Table 5.7: Classification accuracy Details (DragPushing)..............................44 viii LIST OF FIGURES Figure 3.1: Text Extraction Algorithm.......................................................17 Figure 4.1: Classes Diagram...................................................................33 Figure 4.2: HTML_Parser Interface...........................................................37 Figure 5.1: tf-idf Data set.......................................................................40 Figure 5.2: Binary data set.....................................................................40 ix LIST OF APPENDICES Appendix A....................................................................................51 Appendix B....................................................................................67 1 CHAPTER ONE INTRODUCTION The rapid growth of information on the internet has made it necessary for users to identify relevant information that were obtained from the process of information retrieval (Suanmali, Salim, & Binwahlan, 2009). One of the most important aspects in any information retrieval (IR) process, text query, search engine, and web classification is to distinguish each word to give better results to the query. Cooley, Mobasher and Srivastava (1997) define Web mining as “the discovery and analysis of useful information from the World Wide Web. This describes the automatic search of information resources available on-line, i.e. Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining”. According to Wang, Huang, Wu, and Zhang (1999) “web mining is an integrated technology in which several research fields are involved, such as data mining, computational linguistics, statistics, information and so on” (p. 137). From the previous definitions, Web mining has different methods from data mining because web pages are semi-structure or non-structure documents. Therefore some traditional data mining methods such as Linear Regression (Aaron, 2004) are not applicable to Web mining. Web classification is an important component in search engines and is use to classify the type of web pages. Methods like clustering and support vector machine can be used in Web classification (Yu, Han, & Chang, 2002). Web classification research 2 focuses on web text classification, web image classification, web hyperlinks classification and web usage classification (Morales, Fandino, & Rodrivguez, 2009). Text Mining is another important component in Web mining. According to Ronen and James (2007) “Text Mining can be broadly defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools”. Text Mining deals with extracting useful information from text document through the identification and exploration of patterns using different formulas. Text Mining uses machine learning algorithms such as Naïve Bayse (NB) classifier, Bayesian logistic regression, decision tree classifiers, Rocchio method and neural network (Ronen & James, 2007) to extract useful information. Text mining also includes clustering methods for unsupervised process through which objects are classified into groups called clusters. Algorithms such as K-means, EM-based probabilistic clustering, hierarchical agglomerative clustering and minimal spanning tree are used for clustering activities (Ronen & James, 2007). Each method and algorithm has advantages and disadvantages. Some applications combine two or more algorithms in order to get better classification accuracy because web mining involves many processing steps which require different algorithms for each step (Feng & Cherkassky, 2009). 1.1 PROBLEM STATEMENT Inefficient web classification algorithms will affect the results of IR. Web classification needs efficient algorithm to produce high accuracy. The highest 3 accuracy from existing text extraction algorithm is around 80% (Fresno & Ribeiro, 2004). The problem in existing algorithm is the deficiency to extract meaningful words from web pages (Liu, 2009). In Web searching, due to the growth of information on the Web, matched web pages for a given query could be millions. Ranking web pages in terms of normalized frequency of keywords does not work properly, because short pages with only the query in them would likely been ranked on the top. 1.2 OBJECTIVE The main objective of this study is to develop a text extraction algorithm for web text classification. Specific objectives are: 1. To develop feature extraction techniques to extract features from the web pages. 2. To develop a HTML parser that can remove HTML tags. 3. To develop a prototype for web text extraction. 1.3 SIGNIFICANCE OF THE STUDY An algorithm for web classification and text extraction will be formulated from this project. This algorithm will be the contribution to the body of knowledge in the area of web text classification. The algorithm has the facility to create Binary data set as well as tf-idf data set from the extracted text which are necessary for classification activity. 4 The developed prototype can automate the process of text extracting and date set creation will be a contribution to the society particularly for users to perform web text classification. This prototype can be easily used by any user with little experience for example the search engine developer for web classification and web directory developer. 1.4 SCOPE OF THE STUDY The study focused on English educational websites. Five different Universities Websites used in this study. The classification applied only on web text page. 1.5 LIMITATIONS OF THE STUDY By using Core 2 Duo, 2.00 GHz processor with 3.00 GB RAM, the prototype took around 3 minutes to process six hundred web pages. The maximum number of web pages could be processed by the prototype without system hanging was around thousand pages. The rows and columns in the created data set should be within 1048576 and 16384 respectively otherwise; Microsoft Excel 2007 will not be able to view the data set. 5 1.6 ORGANIZATION OF THE REPORT This report divided into six chapters. The chapters that are presented in this report are: Chapter one gives an overview and introduction about Text Extraction algorithms and techniques used in web classification. This chapter also discussed about problem statement, objective, significance of the study, and scope of the study. Limitations of the study are also presented in this chapter. Chapter two discussed about literature review, which are related to this project. The literature review divided into three sections which are web classification, Text Mining in Web, and HTML Parser for text extraction. Chapter three discussed methodology that been used in this project. The methodology consists of three phases namely: Web Page Extraction, Rule Formulation, and Algorithm Validation. The details of each phase discussed in this chapter. Chapter four discussed about the prototype implementation for HTML Parser and Text Extraction algorithm. Chapter five discussed the results and finding of the experiment conducted by using the HTML Parser and WEKA application. Chapter six presents the conclusion and recommendations to improve this study. 6 CHAPTER TWO LITERATURE REVIEW This chapter presents previous work on web classification, text extraction particularly in web mining and HTML parser techniques. 2.1 Website classification Anagnostopoulos et al. (2002) proposed an automated website classification system. The study focuses on e-commerce sites and the pages were indexed and the extracted words represented using Vector Space Model (VSM) and Site Vector (SV). The proposed system compares the content of web pages using VSM with the mdimensional commerce descriptor vector (SV). The system performance reached around 96% accuracy, which indicates that their system has produced good results. However, their scope is limited to e-commerce only. Tomar, Shekhar, and Ashish (2006) proposed a classification concept for web pages, called “WebClassify”. They modified Naïve-Bayes algorithm with multi-nominal model to classify web pages into different categories. The proposed algorithm makes use of both Naïve-Bayes algorithm and term frequency – inverse document frequency factor. They did an experiment using the traditional Naïve-Bayes algorithm (NB) and their modified algorithm. The results for web page classification were 42.5% using NB and 55% using their algorithm. From the results, their approach shows 7 enhancement in the classification accuracy over traditional NB. Although the classification accuracy is not high, but the important point in their study, they have introduced a term frequency as class document frequency. This method takes into account only those words have a term frequency inverse class document frequency value above a threshold value. This technique produced small data set which needs less processing time. Hui, Yan, and Jian-Ping (2008) proposed a combined approach for news categorization. The combination includes decision tree with multilayer neural networks. Data set was created based on term frequency inverse document frequence (tf-idf) technique to weight each word (term) in the document. Attributes were selected by applying feature space reduction method called Chi Square (X2 statistic). In their experiment, only subject level (the upper hierarchical level) of web pages were classified. The classifier was built using Back Propagation Network and ID3 algorithm. With 371 news items (pages) inclueded in the training set, the classification accuracy reached around 88%. The important point in their approach is the feature space reduction technique. By selecting significant attributes from the data set, the classification accuracy increased and the processing time decreased which means there is improvement in the performance of the algorithm. Meijuan, Jingwen and Shiru (2009) presented a research about web classification based on support vector machine (SVM). In the study, SVM network was structured for web text classification where genetic algorithm (GA) was used to optimize SVM parameters. Hence, enhance the convergence rate and the classification accuracy. 8 1000 web pages were classified into six categories: “news”, “sport”, “science and technology”, “education”, “amusement”, and “others”. The result of their experiment measured by mean squared errors and the best results were: (0.0045) in term of errors for training and (0.0682) in terms of errors for testing. It is clear from their experiment that they have solved the over-fitting problem of ANN by using SVM. However, the experiment was applied only on flat web page structure, which is easier for classification compared to hierarchical web pages. A comprehensive study on feature selection in automatic web pages categorization has been presented by (Sun, Sun, Chen, & Fu, 2009). The study focused on ReliefF and Symmetrical Uncertainty (SU). They addressed critical issues such as “the high dimensional text vocabulary space” in web pages classification. They used Hidden Naïve-Bayes classifier (HNB) and Complement class Naïve-Bayes (CNB). They compared their results with results produced using Support Vector Machine (SVM), K Nearest Neighbor, and C4.5. The experiment was conducted on 527 web pages partitioned into seven classes. Boolean data type was used to represent the data. The results showed that by using feature selection methods, higher accuracy can be obtained. The highest classification accuracy is around 88% that has been achieved using HNB classifier. They performed another experiment where “the words are sorted according to their SU scores”. In addition, some numbers of attributes selected through removing top N words. The results showed that CNB achieved the highest accuracy around 97% but HNB was more stable among the others. While SVM was not practical, it is time-consuming for big-size problem. They concluded that SU is better for feature selection method than ReliefF. From their study, it is clear that they 9 achieved good classification accuracy and they identified practical method for feature selection. However, the experiment was conducted only on flat web pages. Zhongda, Kun, and Yanfen (2007) proposed a research that deals with many important issues in web pages classification such as feature selection method and classification algorithms. In their research, they found two models that could represent the web page namely Vector Space Model (VSM) and Boolean Model. According to the research, VSM model can represent web pages better than Boolean because each word was assigned with specific value (weight). For feature selection, the algorithms used were Comparative Information, Latent Semantic Indexing (LSI), and Rough Set theory. LSI shows favor among the other algorithms. For classification algorithm, they investigate K Nearest Neighbor (KNN), Naïve-Bayes (NB), Relational Learning (RL), Support Vector Machines (SVM), Decision Trees DT, and Neural Networks Back-propagation (NNB). They concluded that, there is no classification algorithm better than the other. Each algorithm has advantages and disadvantages; sometime in order to achieve higher accuracy it may cost processing time or memory space. Each algorithm will behave differently with the type of data set. For example, DT algorithm provides better accuracy if the data set is represented by Boolean values. The classification accuracy depends on the algorithm used in extraction and feature selection more than the classifier model. Qi and Davison (2009) have done a comprehensive survey on web page classification. In their research, Feature extraction and classification algorithms were the focus of their study. They distinguish the web page classification as “subject classification, functional classification, sentiment classification, and other types of classification”. 10 They divided the classification problem into binary classification and multiclass classification. The survey shows another type of classification, that is flat classification and hierarchical classification. In flat classification, the categories organized parallel such that one category does not followed by another category, while in hierarchical classification the categories structured as hierarchical tree-like structure in which each category may have a number of subcategories. They identify the differences between web classification and text classification. Classification of web pages content is different from classification of standard text classification. The standard text is structured and written with consistent styles, while web pages do not have such properties. Another important aspect in their research is the feature selection. They divided the feature into two classes namely “on-page features” which are available directly on the page to be classified. The other class called “features of neighbors” which are available on the pages linked in some way to the page to be classified. The algorithms used for feature selection were N-gram, Dimension Reduction, and Relational Learning. They have pointed to critical problem about feature selection from the web pages that contain large images or flash objects, which is common in today’s web pages. For classification algorithms, the research investigated many algorithms such as K-Nearest Neighbor, Support Vector Machine (SVM), and Naïve-Bayes classifiers. The research concluded that in order to improve the classification accuracy, an appropriate method for feature extraction from web pages “directly or from neighbor web pages” could improve the classification performance. 11 2.2 Text Mining in Web According to Zhongda, Kun, and Yanfen (2007) text classification has been implemented before web page classification. They define text classification as “It is the process of algorithmically analyzing a text documents in order to assign a set of categories (or index terms) from a predefined vocabulary to succinctly describe the content of a document”. The followings are some works in the field of text mining specifically in web. By extracting the text from the web documents, a key phrase can extracted from the text. Zhang, Zincir-Heywood, and Milios (2005) have investigated the significance of narrative text classification in the task of automatic key phrase extraction in Web document collection. Their work consists of two steps i.e., long paragraph classification and narrative paragraph classification. They have used three methods, i.e., TF-IDF, KEA, and Keyterm. The processing steps include extracting all tokens in the text, removing punctuation marks, removing standard set of 425 stop-words, and applying Porter stemming to obtain word stems. The results show that Keyterm method gives the best output for all plain text and for narrative text. Shiqun, Yuhui, and Jike (2007) proposed an algorithm used in text mining on web. In their studies, they showed an algorithm of how to follow the appointed website or web page according to the user’s request by using the text mining. They also investigate on how to extract test characteristic, and how to classify the data information with feedback judgment combined with the Web page text contents for 12 later use. Their approach was related but different from data mining and different from information retrieval. The processing method includes four steps: lookup resources, pretreatment and information extraction, mode discovery, and mode analysis. They used vector space model for web text classification. Experiments conducted using 500 documents extracted from sport website. The text classification takes about 6 seconds and the accuracy of the classification attains 91%. Most of the websites contain text as well as noise. In order to perform text mining in the web text, a noise elimination process should first apply. Dey and Haque (2008) have proposed a paper focused on opinion extraction from noisy data. They used a hybrid approach, in which they initially employ a semi-supervised method to learn domain knowledge from a training repository, which contains both noisy and clean text. A localized linguistic technique has applied to extract opinion expression from noisy text. Their work comprises five steps namely “pre-processing of noisy text to enable opinion mining, locate opinion expressions in processed text, analyze opinion expression to identify features and opinions expressed about them, aggregate opinions at multiple levels of granularity as desired by user, and developing a Graphical User Interface to facilitate user interaction”. The performance measured used were standard evaluation measures of precision, recall, and F-score. The results reached 94% precision, and the recall value reached 75%. Their studies showed different concept of text mining by cleaning the text from noisy. By applying this technique, the results improved in terms of quality. 13 Lee, Baker, Song, and Wetherbe (2010) proposed a paper contains a comparison of four text-mining methods namely “Latent Semantic Analysis (LSA), Probabilistic Latent Semantic Analysis (PLSA), Latent Dirichlet Allocation (LDA), and Correlation Topic Model (CTM)”. In their work, they identify the advantages and limitation of each one. Vector Space Model (VSM) used to represent the words in each document with term frequency inverse document frequency (tf-idf) technique. In their experiment, they collect 258 users’ comments on hardware product (Digital Camera type Canon) from Amazon website. The task is applying the four algorithms for Topic Detection from the users’ comments. From the experiment’s results, LSA gave best result in case of single topic in a document. PLSA gave best result in case of single topic in one document. LDA and CTM gave best result with both single and multiple topics for one document. CTM has the ability to model the relationships between topics. However, LSA shows better ability to deal with synonyms problem. 2.3 HTML Parser for text extraction Jianli, Jie, and Cuihua (2004) have used HTML parser in their studies, initially; they developed a model to segment the web document with document object model. They adopted different sets of filters to filter non-text information and all links, the process of filtering achieved by scanning HTML pages and removing “src” and “herf” attributes with the value containing “img”, “swf”, “gif” etc. Their algorithm implemented by java program language and testing data comes from web document in English. Other algorithms have used in the process are removing stop-words and stemming the words with Paice/Husk Stemmer. The results varied from 78% - 100%, 14 and there were some errors in their experiment. However, the HTML parser works perfectly in their studies. There are many HTML Parser available freely for different purpose. Alonso-Rios, Luis-Vazquez, Mosqueira-Rey, Moret-Bonillo, and del Rio (2009) proposed in their studies an HTML analyzer for web usability. In their system, the HTML analyzer examines HTML code to extract usability information about web pages. The analyzer identifies usability problems, checks compliance with usability guidelines, and interprets usability information to facilitate the task of human usability experts. In their experiment, they were able to classify each component in the web page like images, tables and frames. However, they did not extract the text from the HTML source code. Ke, Yueming, Zongyan, and Xiang (2008) developed HTML Parser used in the embedded browser based on deterministic finite automaton. Several resource limitations such as CPU type, system function definition, display equipment, and electricity considered in the study. An embedded browser with four functional models: “network transport, memory management, HTML parser, and display model” was proposed. The parser has the capability to parse and download at the same time. By using this parallel process, a parsing time will decrease because it does not need to wait for the stream to fully downloaded. The architecture and design explained with some details but the authors did not provide any experiment result. Ziegler, Vogele, and Viermetz (2009) propsed an approach for web page information extraction. Their approach is “fully-automated extraction of news content from 15 HTML pages”. Their algorithm divided into three phases, namely “the merging of text blocks on a structural level, the computation of text block features, and the decision to keep or discard the current block”. By using those phases, the algorithm was able to calculate “average sentence length, number of sentences, character distribution, and stop-word ratio”. In their experiment, they have used 220 news sources in five western languages and the results compared with other two systems: Crunch and Body Text Extraction. For evaluation, they implemented classification models such as C4.5, One Rule, SVM, Naïve-Bayes, and Random Forest. The highest accuracy obtained by using Random Forest classifier around 78%. The results are acceptable because the Parser used to extract text from hierarchical websites, which is more difficult than flat websites in term of classification. 2.4 Summary This chapter presents some of the previous works that have been done by researchers. The topics that been discussed are more about website classification techniques, text mining, and HTML parser. Many algorithms and techniques have been investigated to highlight the advantages and disadvantages of them. It is clear from all the mentioned works that there are three important factors in web page classification. First is the feature selection technique, second is how to present the extracted data, and third is the classifier algorithm. All the three factors are important but the most important one which has strong influence on the classification accuracy is the feature selection technique. If the selected features can represent the web page perfectly, then the classification accuracy will be high regardless of using any classifier algorithm. 16 CHAPTER THREE RESEARCH METHODOLOGY This chapter provides the methodology used in this study. There are two main phases i.e., formulating an algorithm for text extraction and developing the prototype. 3.1 Text extraction algorithm There are three phases involved in formulating text extraction algorithm. The phases are web page extraction, rule formulation, and algorithm validation. Fig 3.1 illustrates the phases. 17 Figure 3.1: Text Extraction Algorithm 3.2 Web Page Extraction The process starts by loading the web pages from specific folder or by loading the web page from the internet to the memory. The algorithm has the facility to handle any number of pages with any number of targets as long as there is enough physical Start Algorithm Validation Web Page Extraction Rule Formulation Data Processing Data Set Creation Text Classification 18 memory in the machine. After that, extraction process is applied by reading the pages one by one and extracting source code with all the textual contents. The source code contains HTML tags, text paragraphs, hyperlinks, and any other text characters. The user will be asked for the target of the web page/s in order to label them in the data set. The user can specify how much percentage can be extracted from the page/s. The output of this phase is a string array. Each item in the array represents the extracted web page. The following algorithm was used for this phase: Extracting Page Source Code Algorithm (Online web page): Step 0: Initialize Uniform Resource Identifier (URI) object; Step 1: Send request to the server using (URI) address; Step 2: Send the Credentials with request (default credentials is Windows credentials). Step 3: Initialize stream object in the memory; Step 4: Store the stream from the server in stream object; Step 5: Initialize string; Step 6: Read and convert (to string) the stream object and save it in the String Array; Step 7: Return the String Array; Extracting Page Source Code Algorithm (From Documents in the Hard Disk): Step 0: Initialize Directory object; Step 1: Initialize array of File type; Step 2: Initialize array of String type; Step 3: Initialize Stream Reader object; Step 4: Give the directory path (on the hard disk) to the Directory object; Step 5: Get all the files information from the Directory object; Step 6: For each file in file array (filesInfoi, i=0,...,n) do steps 7-9; Step 7: Read the file content into the Stream Reader object; Step 8: Read the Stream until the end and convert it to String; Step 9: Save the String in the String array; Step 10: Return the String Array; 3.3 Rule Formulation Rule Formulation consist of three steps namely Data processing, Data set creation, and Text classification. The followings are the details of each step. 19 3.3.1 Data Processing After extracting the web page source code, the output contains text that includes noise data like HTML tags and some other components, which are not relevant for classification process. This step is called text transformation (Croft, Metzler, & Strohman, 2009), and it includes deleting HTML tags, digits and stop-words, and stemming the words using Porter stemming algorithm. The output of this step will be two dimensions string arrays representing all of the extracted words as columns and all of the web pages as rows. The following are data processing step: I. Remove HTML Tags: This process will start after collecting the source code from the web page. The input of this step will be in form of string array, which has both the text and HTML Tags. The algorithm will check each line in the string to remove the Tags. The output from this step is a string array contains only the page paragraph without any HTML Tags. The following algorithm describes the process. Remove HTML Tags Algorithm: Step 0: Initialize the input string array of extracted source code (output of the previous process “SourceCode”); Step 1: Initialize output array type of string; Step 2: Replace any of this character (" < ( . | \ n ) * ? > ") with empty space and save the output in the output array; Step 3: For each item in the output array (TextSourcei, i=0...,n) do steps 4- Step 4: Check if the item contains non-Letter character then, replace it with empty space; Step 5: Return the output array; II. Remove Digits: For classification purpose, digits are not required because they do not reflect any important value to the model. Therefore, this step will remove the digits from the paragraph by checking the characters one by one; if it is a digit then it will remove from the string. The output of this step is the 20 same string array of the previous step but cleaned from any digit. The following algorithm describes the process: Remove Digits Algorithm: Step 0: For each item in the output array (TextSourcei, i=0...,n) do step 1; Step 1: Check if the item contains non-Letter character then, replace it with empty space; Step 2: Return the output array; III. Split lines to words: Until this point, each string in the array contains lines of words; this step will split each line to words by using splitting criteria for example, empty space, comma, semi comma, or any kind of delimited character. The output of this step will be a list of string array, each item (array) in the list represent a web page and each item in the array (string) represent the word in that page The following algorithm describes the process: Split lines to words Algorithm: Step 0: Initialize the input (Text array from the previous step); Step 1: Initialize List of string; Step 2: For each input (Texti, i=0,...,n) do steps 3-4; Step 3: Split the word from the text if one of the following character is after the word: { ' ', '\n', '-', '_', '{', '}', '[', ']','=', '(', ')', ';', '"', ',', '.', '+', '!' }; Step 4: Add the split word to the List; Step 5: Return the List; IV. Remove duplicate words from the list: The list of string array, which is created from the previous step, has many duplicate words. In this step, a list of list type string created and each item in the list of string array added to the list of list if it is not identical to any item in the list of list. The output is a list of list type of string contains unique words. The following algorithm describes the process: Remove duplicate words Algorithm: Step 0: Initialize the output list (list of list type of string) 21 Step 1: For each page in the input list (pagei, i=0,..., n) do step 2; Step 2: For each item in the page (itemj, j=0,...,m) do step 3; Step 3: Check if the item[i] is not empty and does not exist in the output list[i], then do step 4; Step 4: Add the item[j] to the output list[i]; Step 5: Return the output list; V. Remove the words have less than specific number of characters: The user will specify how many characters should be in each word in order to process at run time. Small words they only give overload to the data set and most of them will be in form of preposition, therefore it is better to remove those words and focus on bigger words only. This step will do this process by checking the length of character in each word, if it is less than what the user specified then it will remove. The user can change the number of character length at run time. The output of this step will be the same list of list type of string but contains only words that have passed the checking criteria. The following algorithm describes the process: Remove specific words Algorithm: Step 0: For each page in the input list (pagei, i=0,..., n) do step 1; Step 1: For each string in the page (stringj, j=0,...,m) do step 2; Step 2: Check if the string characters length is less than 2 (or any number specified by the user), then do step 3; Step 3: Remove string[j] from the input list[i]; j--; Step 4: Return the input list; VI. Remove stop-words: Not every word in the page is important or can reflect any value to the model. For example if a word has occurred on most or all the pages, such a word doesn’t give any value to the model and doesn’t help to map the function, and some time will confuse the decision if it occurred alternatively in many classes. Therefore, it is better not to include such a word in the data set. For this purpose, a list of stop-words was created to use by this 22 step. The process will start by reading the file that contains the stop-words in the hard disk and adding the words to an array of string. After that a loop will start to check if there is any word from the stop-words match any word in the list of list, if there is a matching then this word will be deleted from the list of list. The output of this step will be a list of list type of string clean from stopwords. The following algorithm describes the process: Remove stop-words Algorithm: Step 0: Initialize stop-words array (string array) contains the stop-words; Step 1: For each word in the in the stop-words array (wordi, i=0,..., n) do step 2; Step 2: For each page in the pages-list (pagej, j=0,...,m) do step 3; Step 3: Check if the pages-list[j] contains stop-words[i], then do step 4; Step 4: Remove stop-words[i] from the pages-list[j]; Step 5: Return the pages-list; VII. Stemming the words: In English language each word or verb has many formations for example the verb “play” can be formatted to (player, plays, playing, played) or irregular verb such as “eat” can be formatted to (ate, eaten, eats, eating). Another example of words variety is singular and plural words. Therefore, it is very essential to get the root of those words in order to unify them if they have the same root, this process called stemming. This step uses Porter Stemming Algorithm (PSA), each word in the list of list will be send to PSA, the return word from the algorithm is the stemmed word. The output of this step will be a list of list type of string contains stemmed words. Appendix B contains Porter Stemming Algorithm. VIII. Remove duplicate words after stemming: Although there was a process of removing duplicate words before stemming, but because of stemming process, many words, which appear un-identical at that time, are identical now after stemming. In addition, the reason why removing duplicate step is not omit if 23 there is a similar process at this step? Removing duplicate words before stemming and after stemming will save more time than stemming all the words including the duplicate and then removing the duplicate words. The output of this step will be a list of list type of string contains stemmed unique words. The following algorithm describes the process: Remove duplicate words Algorithm: Step 0: Initialize the output list (list of list type of string) Step 1: For each page in the input list (pagei, i=0,..., n) do step 2; Step 2: For each word in the page (wordj, j=0,...,m) do step 3; Step 3: Check if the word[i] does not exist in the output list[i], then do step 4; Step 4: Add the word[j] to the output list[i]; Step 5: Return the output list; 3.3.2 Data Set Creation At this point, a list of list type of string (equal to two dimensional array) contains processed words is ready. I. Create Binary Data Set Each page is presented as a record and passed to the algorithm. The algorithm will add a new row and check every column in the first row, if any word from the page exist as a column, then it will add 1 to that column in the new row otherwise it will add new columns and the word will be the title of that column. This loop will continue until all the pages included in the two dimensional array. The output of this step will be a two dimensional array contains all words from the extracted web pages. The following algorithm describes the process: Binary Data Set Algorithm: Step 0: Initialize a List of List (rows) (columns) type of string (equal to multidimensional array or Table); Step 1: For each document (Di, i = 0,...,n) do steps 2 – 12; Step 2: Initialize string array with words extracted from the Di. Step 3: Add new row (item in the list); 24 Step 4: For each word in the word array (Wj, j=0,...,m) do steps from 5 – 11; Step 5: Check if the word (Wj) is exist in the List. IF exist, do steps from 6 to 7; Step 6: integer r = row[n], integer c= column[i] of (existed word); Step 7: Insert 1 to List[r,c]; IF not exist do steps 8 to 12; Step 8: Add new columns “The column title is the new word”; Step 9: integer ro = row[n], integer col =column[n]; Step 10: Insert value of 1 to the List [ro, col]; Step 11: For each row (Rk, k=0,..., p-1) insert value of 0 to the column[n]; Step 12: For each row (Rv, v=0,...,z) insert Target category. Change each item in the Binary data set contains Null to 0 From the previous algorithm, the data set array contains all the extracted words as columns title and a value of “1” if the record contains that word or Null if it does not contain it. This step will convert all the Null values to zero values by looping through all the items in the array and checking if the value is Null then change it to “0”. The output of this step will be a two dimensional array contains words in the first row and values (1, or 0) in the other rows. The following algorithm describes the process: Convert Null to Zero Algorithm: Step 0: For each row in the dataSet (rowi, i=0,..., n) do step 1; Step 1: For each column in the dataSet (columnj, j=0,..., m) do step 2; Step 2: Check if dataSet [i, j] = Null, then do step 3; Step 3: dataSet [i, j] = 0; Step 4: Return the dataSet; II. Create tf-idf Data Set: In this type of data set, each word has a weight depends on its frequency in the document. The weight calculated by using the following formula: Each page is represented by “d” and the total number of pages represented by “D”, such that d  D. 25 The “tf” calculated using the following formula: tf i.j = 푁푗 ∑ 푤푖푗 푛 푖=0 where: N is the total number of the same word (wi) appeared in document (j) i (i = 0,...,n) is the index of the word (w) in document (j) j(j=0,...,p) is the index of document (d) in (D) and the denominator is: ∑ 푤푖푗 푛 푖=0 is the sum of all the words (w) in document (dj). The calculation of idf is done by using the following formula: idfi = log |퐷| |1+ { 푑∶푤푖 ∈푑 } | where |D| is the total number of documents extracted. and | { 푑∶푤푖 ∈푑 } | is the number of documents where the word (wi) appeared. “1” added to avoid division by zero in case if the word did not exist in any other page. (tf-idf)i,j is computed by multiplying tf i.j and idfi The idea behind this calculation is if the term occurred many times in the same page means, it is more important. However, if the same term occurred in many other pages means it is less important. By using this technique, each term has specific weight according to its occurrence and frequency. This approach gives better value for each term instead of giving 0 and 1 or simply counting the frequency of the word in the 26 document. This method will also create some relations between each word by counting the frequency of term in each document. The relations were destroyed previously when the algorithm extracted the text and splitting the words into individual term without preserving the location of the each word, which gives different meaning. The data set created by using this algorithm has two dimensions. Each row in the data set represent individual page and each column represent unique term. The following algorithms describe all the mentioned calculation and data set creation. Creating tfidf Data Set Algorithm (First part) Step 0: Initialize two-dimensional Array (DS) type of string; Step 1: For each document (di, i = 0,...,n) do steps 2 – 11; Step 2: Initialize string array (W) with words extracted from di. Step 3: For each word in the word array (Wj, j=0,...,m) do steps from 4 – 11; Step 4: Check if the word (Wj) is exist in the DS. IF exist, do steps from 5 to 6; Step 5: integer r = row[n], integer c= column[i] of (existed word Wi); Step 6: Add 1 to the value in DS[r,c]; IF not exist do steps 7 to 10; Step 7: Add new columns “The column title is the new word”; Step 8: integer ro = row[n], integer col =column[n]; Step 9: Insert value of 1 to the DS [ro, col]; Step 10: For each row (Rk, k=0,..., p-1) insert value of 0 to the column[n]; Step 11: For each row (Rv, v=0,...,z) insert Target category. Step 12: Return DS. Change each item in DS contains Null to 0 (Second part) This step will convert all the Null values to zero values by looping through all the items in the array and checking if the value is Null then change it to “0”. The output of this step will be a two dimensional array contains words in the 27 first row and values (words count, or 0) in the other rows. The following algorithm describes the process: Convert Null to Zero Algorithm: Step 0: For each row in the DS (rowi, i=0,..., n) do step 1; Step 1: For each column in the DS (columnj, j=0,..., m) do step 2; Step 2: Check if DS [i, j] = Null, then do step 3; Step 3: DS [i, j] = 0; Step 4: Return the DS; Calculate tf-idf Algorithm (Third part): Step 0: Initialize two-dimensional Array (DS) type of string (Output of the second part); Step 1: Initialize int N “represent all the same word in the page”; Step 2: For each row in the DS (rowi, i=1,..., n) do step 2; Step 3: For each column in the DS (columnj, j=0,..., m) do step 3; Step 4: For each row in the DS (rowk, k=1,..., m) do step 4; Step 5: IF DS[k,c] != “0” do step 6; Step 6: N++; Step 7: IF DS[r,c] != “0” do step 8; Step 8: DS[r,c] = CalcTFIDF(DS[r,c], N); Step 9: N = 0; CalcTFIDF (DS[r,c], N) Algorithm (Forth part): Step 0: Initialize double tfidf; Step 1: Initialize double N “represent all the same word in the page”; Step 2: Initialize double tf; Step 3: Initialize double idf; Step 4: tf = N / the sum of all the words in document; Step 5: idf = Log (total pages) / (1 + the number of pages where the word appeared); Step 6: tfidf = tf * idf; Step 7: Return tfidf; III. Build Physical Data Set This step will create physical data set from the data set in the memory. The following sub-steps and algorithms will describe this step. a) Count how many columns has non 0 value in the data set 28 The initialization of the two-dimensional array was approximated to (1500, 15000) which is quite enough to hold all the data but it is not precisely represent the data set dimensions. This step is the first point towards calculating the exact size of dimensions by counting how many columns has not “0” values which means the columns contain words. It is clearly the process needs to loop only in the first row because it represents all the words in the array. The total number of counting plus “1” as a target column will represent the boundary of the second dimension in the data set while the first dimension can be obtained from the total number of pages extracted in the beginning plus “1” as a row (for columns title). The following algorithm describes the process: b) Counting Array Boundary Algorithm: Step 0: Initialize integer variable (counter); Step 1: For each column in the dataSet (columni, i=0,..., n) do step 2; Step 2: Check if dataSet [0, j] Not equal to 0, then do step 3; Step 3: counter++; Step 4: Return counter; c) Initialize the size of two dimensional array type of string: After obtaining the required dimensions for the data set, this step will initializing the new two dimensional array with the obtained dimensions. The output of this step will be a new two-dimensional array, which represent the data set size exactly. The following algorithm describes the process: Initialize the size of two-dimensional array Algorithm: Step 0: Initialize integer variable (dim1); Step 1: Initialize integer variable (dim2); Step 2: dim1 = number of page extracted + 1; Step 3: dim2 = counter + 1 (counter from previous algorithm); Step 4: Initialize 2 dimensional string array NewDataSet [dim1, dim2]; Step 5: Return NewDataSet; d) Copy every item from the old data set array to the new array: 29 This step will copy every item in the previous array to the new two dimensions array. The output of this step will be a two dimensions array contains all the words and values with specific and known size. The following algorithm describes the process: Copy all items Algorithm: Step 0: For each row in the old dataSet (rowi, i=0,..., n) do step 1; Step 1: For each column in the old dataSet (columnj, j=0,..., m) do step 2; Step 2: (Copy item from old array to new array) NewDataSet [i, j] = dataSet [i, j]; Step 3: Return the NewDataSet; e) Create CSV file This is the last step in creating data set. This process starts by creating a stream writer object to create and write a file on specific location with .CSV extension. After that, the process will loop through all the rows and columns in the data set array to read its content and write it to the stream object with a comma after each value. Between looping in each row, a new line is created in the file to represent the next row in the array. The output of this step will be a .CSV file which is ready for classification process. The following algorithm describes the process: Create CSV file Algorithm: Step 0: Initialize stream writer (SW) object to write into specific file; Step 1: For each row in the NewDataSet (rowi, i=0,..., n) do step 2; Step 2: For each column in the NewDataSet (columnj, j=0,...,m) do step 3; Step 3: Write the value of NewDataSet [i, j] to the file using SW object; Step 4: Write new line in the file using SW object; Step 5: Close SW object; 3.3.3 Text Classification Two algorithms used in this experiment are the Naïve-Bayes (NB) and C 4.5. Both applied with different size of data set and extraction parameters to investigate which algorithm give better results in terms of accuracy and processing time. 30 The WEKA application was used in this project where the Explorer option was selected from the WEKA GUI. WEKA has the facility to handle (CSV) file therefore there is no need to convert the data set because it is already in CSV format. 3.4 Algorithm Validation At the final phase, a validation technique applied to validate the accuracy of the rules and module built by WEKA. The module used to classify a new web page in the same scope. For this purpose, a prototype built using Visual C# 2008. It is a window application mixed with web connection protocol. It uses HTTP protocol for internet connection. The prototype has the ability to view the web page inside the User Interface and downloading the source code for further process. The prototype gives the user the facility of saving the data set in CSV or TEXT extension. 3.5 Developing of the Prototype The methodology adopted for developing the prototype “HTML Parser” called Code and Fix methodology. Code and Fix methodology consists of two steps: Write some code and Fix the problem in the code (McConnell, 1996). This methodology is suitable for small projects like the developed one. Because of the iterative nature of the methodology, it was easy to change the prototype functionality when it is required. This methodology focuses on writing the code to implement the required functions, execute the functions, and fix the code if the functions do not work correctly as it is required. 31 3.6 Summary The methodologies discussed in this chapter are practical methodologies. The first methodology designed specifically to formulate the text extraction algorithm. All the algorithms mentioned in this chapter implemented as classes in the prototype. The second methodology adapted from (McConnell, 1996) which is perfect to develop a small prototype like this one. The methodologies were suitable and quite sufficient for this study. 32 CHAPTER FOUR PROTOTYPE DEVELOPMENT This chapter describes the methodology used in developing the prototype. The code and fix methodology adapted in developing the prototype “HTML_Parser”. 4.1 Implementation Details The prototype implemented by using Visual C#. All the algorithms mentioned in chapter three are implemented as classes in the prototype HTML_Parser. HTML_Parser is windows application mixed with web connection protocol. It uses HTTP protocol for internet connection. The prototype has the facility to view the web page inside the User Interface and downloading the source code for further process. The prototype gives the user the facility of saving the data set in CSV extension. Many considerations were taken into account especially the processing time in developing the HTML Parser prototype. The prototype gives the user many options while processing the documents like how much percentage to be processed from each page, remove stop-words, and stemming the words etc. The prototype can handle any number of web pages to process but the data set size will be limited to the free size available in the physical memory. By trying many number of pages (max 737), it was able to create a data set with (738 rows and 12847 columns). 33 The processing time will be different from machine to machine depend on the processor speed and the number of pages. The machine used to run this prototype is Intel Core 2 Duo, 2.00 GHz with 3.00 GB RAM and the performance was quite acceptable in term of time. The prototype divides the algorithms into classes. Each class represents one algorithm. Fig 4.1 shows the classes diagram and their relationships with each other. Figure 4.1: Classes Diagram 34 4.2 Classes Implementation I. Get Source Code Algorithm Implementation There are two methods to extract the source code. First method is extracting online web page. And the second one is extracting web pages from the hard disk. Appendix A contains both methods implemented in Visual C#. II. Calculate Percentage Algorithm Implementation This class provides a method used for calculating how much percentage from each page to be extracted. Appendix A contains the method implemented in Visual C#. III. Remove HTML Tags Algorithm Implementation: The object of this class has a method called Remove. This method takes one parameter “array of string” as the extracted web pages. The method will remove any HTML tags in those pages. Appendix A contains the method implemented in Visual C#. IV. Remove Digits Algorithm Implementation The object of this class has a method called RemoveDigits. This method takes one parameter “array of string” as the extracted web pages. The method will remove any digits in those pages. Appendix A contains the method implemented in Visual C#. 35 V. Split lines to words Algorithm Implementation This class provides the functionality to split the text extracted from the web pages into separated words. Appendix A contains the method implemented in Visual C#. VI. Remove duplicate words Algorithm Implementation This class provides the methods to search and remove duplicate words before and after stemming. Appendix A contains the method implemented in Visual C#. VII. Remove the words have less than specific number of characters This class provides a method used to select only words have specific number of character. Appendix A contains the method implemented in Visual C#. VIII. Remove stop-words Algorithm Implementation This class provides the functionality to remove the stop-words from the extracted web pages. Appendix A contains the method implemented in Visual C#. IX. Stemming the words Algorithm Implementation This class implements Porter Stemming Algorithm. The class implementation adapted from (http://tartarus.org/~martin/PorterStemmer/). Appendix B provides the algorithm implementation. 36 X. Create Binary Data Set Algorithm Implementation This class provides methods for creating Binary data set. Appendix A contains the method implemented in Visual C#. XI. Create tf-idf Data Set Algorithm Implementation This class provides methods for creating tf-idf data set. Appendix A contains the method implemented in Visual C#. XII. Build Physical Data Set Algorithm Implementation This class provides a method to create the data set type .CSV on the hard disk. Appendix A contains the method implemented in Visual C#. 4.3 Graphical User Interface The interface provides all the facilities required for text extraction. The extraction can be done from online web page by providing the web page address to the built-in browser or from web pages stored in the hard disk. Fig 4.3 shows the HTML_Parser interface. 37 Figure 4.2: HTML_Parser Interface. 4.4 Summary This chapter describes the implementation tools and process of developing HTML_Parser prototype. The prototype provides all the facilities required for Web Text Extraction such as extracting text from web pages, processing the text, and creating the data set. The prototype performance was acceptable in term of processing time consumption and extraction accuracy. 38 CHAPTER FIVE RESULTS AND FINDINGS This chapter presents the experiment results conducted for this study. 5.1 Experiment Design The experiment was divided into two parts. First, is the extraction process using HTML Parser to create a data set. Second, is the classification activities performed using WEKA application to classify the web pages. Two classifier algorithms were used in this study, Naïve-Bayes algorithm and C4.5 which is J48 in WEKA. Those algorithms were applied on two types of data set, Binary and tf-idf. 5.2 Extraction process The experiment conducted used the web pages from Carnegie Mellon University called “CMU World Wide Knowledge Base (Web->KB) project”. This package contains WWW-pages collected from the universities of Cornell, Texas, Washington, and Wisconsin. The pages were manually classified into the following categories: student, faculty, staff, department, course, and project. The number of pages for each category is 1641, 1124, 137, 182, 930, 504, and 3764 respectively. Only four of the categories have been used in this experiment. The categories are course, department, project, and student. Where there are 150 pages for each category total of 600 web pages. 39 The files are organized into a directory structure, where one directory represents one class. Each of these four directories consists of five universities web pages. Some of the pages do not contain useful information. For example, about 80 pages only contain information for redirecting the browser to a different location; therefore the pages were selected after testing each one manually. After downloading and organizing the pages in the directories, the next step is reading those pages. There are two options for this step, either by reading the page individually one by one or by reading all the pages inside the folder in one shot, this experiment was conducted by using the second option in order to save time. The HTML Parser prototype used here has the facilities to read multi-pages, providing the folder path and the target “class type” manually. In this experiment, 100% of the pages content was selected to be processed into a data set. Many words in the web pages have small number of characters, therefore the next option selected is processing only the words consist of more than 3 characters. To apply tf-idf technique, duplicate words are needed to weight the frequency of that word. For this requirement, the option remove duplicate is not selected. The other option remove stop-words selected to remove the words that don’t give any value for classification. Stemming option was selected to apply Porter Stemming Algorithm on the extracted words. The last option is to decide which technique needs to be used for creating data set. There are two techniques, one is creating binary data set and the other is creating tf-idf data set. In this experiment, both techniques are used to create two type of data set. The first data set is created by using tf-idf technique. After creating the data set, the file was in form of comma separated value (CSV). The final data set contains 600 rows with 4 classes namely: course, department, project, and student. Fig 5.1 shows part of tf-idf data set. 40 Figure 5.1: tf-idf Data set. The second data set created by using binary technique instead of tf-idf technique. For this type of data set, no need to count the duplicate words, therefore the option remove duplicate was selected. Fig 5.2 shows part of binary data set. Figure 5.2: Binary data set. 41 5.3 Classification Activities WEKA 3.7.1 application used in this experiment, Explorer option selected from the Weka GUI chooser. WEKA has the facility to handle (CSV) file therefore there is no need to convert the data set because it is already in CSV format. The algorithm chosen in this experiment was Naïve-Bayes algorithm and C4.5 (J48). Many splitting rate was used to find out which rate is the best for training and test parts. Table 5.1 shows the classification results for 60% as training set and 40% as testing set using tf-idf data set. Data Representation No of attributes Classifier Algorithm Discretized Time to build model (second) Incorrectly classified Correctly classified tf-idf 7328 Naïve- Bayes No 4.42 8.3333 % 91.6667 % tf-idf 7328 Naïve- Bayes Yes 5.65 13.75 % 86.25 % tf-idf 7328 C4.5 N/A 40.43 22.9167 % 77.0833 % Table 5.1: classification results (60%-40%). By using tf-idf data set with (60% - 40%) as splitting rate and without applying Discretization option, Naïve-Bayes shows the highest accuracy around 91%. Table 5.2 shows the classification results for 65% as training set and 35% as testing set using tf-idf data set. Data Representation No of attributes Classifier Algorithm Discretized Time to build model (second) Incorrectly classified Correctly classified tf-idf 7328 Naïve- Bayes No 4.08 9.5238 % 90.4762 % tf-idf 7328 Naïve- Bayes Yes 5.76 15.7143 % 84.2857 % tf-idf 7328 C4.5 N/A 39.65 19.0476 % 80.9524 % Table 5.2: classification results (65%-35%). 42 By using tf-idf data set with (65% - 35%) as splitting rate and without applying Discretization option, Naïve-Bayes shows the highest accuracy around 90%. Table 5.3 shows the classification results for 70% as training set and 30% as testing set using tf-idf data set. Data Representation No of attributes Classifier Algorithm Discretized Time to build model (second) Incorrectly classified Correctly classified tf-idf 7328 Naïve- Bayes No 3.61 7.7778 % 92.2222 % tf-idf 7328 Naïve- Bayes Yes 5.3 17.2222 % 82.7778 % tf-idf 7328 C4.5 N/A 41.45 24.4444 % 75.5556 % Table 5.3: classification results (70%-30%). By using tf-idf data set with (70% - 30%) as splitting rate and without applying Discretization option, Naïve-Bayes shows the highest accuracy around 92%. Table 5.4 shows the classification results for 60% as training set and 40% as testing set using Binary data set. Data Representation No of attributes Classifier Algorithm Discretized Time to build model (second) Incorrectly classified Correctly classified Binary 7209 Naïve- Bayes No 4.34 10.8333 % 89.1667 % Binary 7209 Naïve- Bayes Yes 5.3 14.5833 % 85.4167 % Binary 7209 C4.5 N/A 49.42 20.4167 % 79.5833 % Table 5.4: classification results (60%-40%). By using Binary data set with (60% - 40%) as splitting rate and without applying Discretization option, Naïve-Bayes shows the highest accuracy around 89%. 43 Table 5.5 shows the classification results for 65% as training set and 35% as testing set using Binary data set. Data Representation No of attributes Classifier Algorithm Discretized Time to build model (second) Incorrectly classified Correctly classified Binary 7209 Naïve- Bayes No 3.63 11.9048 % 88.0952 % Binary 7209 Naïve- Bayes Yes 5.14 13.8095 % 86.1905 % Binary 7209 C4.5 NA 47.96 20 % 80 % Table 5.5: classification results (65%-35%). By using Binary data set with (65% - 35%) as splitting rate and without applying Discretization option, Naïve-Bayes shows the highest accuracy around 88%. Table 5.6 shows the classification results for 70% as training set and 30% as testing set using Binary data set. Data Representation No of attributes Classifier Algorithm Discretized Time to build model (second) Incorrectly classified Correctly classified Binary 7209 Naïve- Bayes No 4.18 12.2222 % 87.7778 % Binary 7209 Naïve- Bayes Yes 5.65 16.1111 % 83.8889 % Binary 7209 C4.5 NA 46.39 26.1111 % 73.8889 % Table 5.6: classification results (70%-30%). By using Binary data set with (70% - 30%) as splitting rate and without applying Discretization option, Naïve-Bayes shows the highest accuracy around 87%. The number of instance has a vital rule in the results of accuracy. By trying many data set sizes, the results are that neither a small number of instances nor a very big number of instances was useful in the classification accuracy. This is because a small 44 number of instances will build a weak model which doesn’t have the ability to classify new input to the correct class. In opposite case when using very big number of instances, the model will be very complex and the probability of over fitting will increase, therefore the data set in middle size (600 instances) gives the best results. From the results, it is clear that the highest accuracy around 92% obtained by using tfidf data set and Naïve-Bayes classifier without applying Discretization technique. By using tf-idf technique the classification accuracy increased around 3% from using Binary technique. This type of classification called Hierarchical classification (Xiaoguang & Brian, 2009). This type of classification is more difficult than flat classification because the task here is to classify the pages which are in the same area (educational in this case). Those pages have similar nature in structure and contents, therefore it is not an easy task to classify such pages and we need a sufficient algorithm to extract the text from those pages like the developed one. Songbo, Xueqi, Moustafa, Bin, and Hongbo (2005) have proposed a clssification algorithm called “DragPushing”. In their approach, they have used the same data set used in this study (WebKB) and seven classification algorithms namely: Refining Centroid Classifier (RCC), Centroid, Refining Naïve-Bayes (RNB), Naïve-Bayes (NB), K-Nearest Neighbor (KNN), Winnow, and Support Vector Machines (SVM). Table 5.7 shows the classification accuracy details for each algorithm. RCC Centroid RNB NB KNN Winnow SVM 0.8802 0.7847 0.9063 0.8601 0.7015 0.7785 0.8766 Table 5.7: Classification accuracy Details (DragPushing). 45 RCC and RNB algorithms gave the highest classification algorithm, while NB gave around 0.86 classification accuracy. Comparing their approach with this approach, this approach was able to produce classification accuracy approximately 2% higher than RNB and approximately 6% higher than NB in their approach. This indicated that, this study has improved the web page classification in term of accuracy. 5.4 Summary Experiment was conducted by using Naïve-Bayes and C4.5 classifier. Both algorithms applied on Binary data set and tf-idf data set. Naïve-Bayes classifier shows accuracy around 92% which is higher than C4.5 around 80%. Another important factor is the time consumption. C4.5 consume around 41 second while Naïve-Bayes consume around 3 second. Therefore, the classifier Naïve-Bayes consider better than C4.5 in term of classification accuracy and time consumption. Extracting more words from each page gives more accuracy later in the classification, the other options like selecting the number of character in each word should be considered as critical option as well. Naïve-Bayes has good capability to handle large data set. By using hierarchical data set, the classification accuracy was acceptable. The accuracy will increased if the classification type is flat structure or if the data set classes are less than 4 classes. Educational data set is very challenging data set and it is good date set for evaluating an algorithm. The proposed approach is a promising approach for text extraction in web classification. 46 CHAPTER SIX CONCLUSION AND RECOMMENDATIONS This chapter concludes the study and discussion of this project for Text Extraction Algorithm. Finally, recommendation and further study in this area are given in this chapter. 6.1 Project Conclusion The objective of this project is to develop a text extraction algorithm for web text classification. The HTML Parser that been developed provides all the facilities required for text extraction from the web pages. Two type of data set can be created using the HTML Parser, one is Binary data set and the other one is id tf-idf data set. The experiment validates the prototype by applying classifier algorithms on the data set. Two classifiers algorithms used, one is Naïve-Bayes algorithm and the other one is C4.5 algorithm. Naïve-Bayes classifier produced better classification results in term of accuracy and time consumption. 47 6.2 Recommendation for Future Research For more research on text extraction algorithm for web page classification, deep investigations on feature selection algorithms are required to improve the data set quality. The feature selection algorithm shows strong influence on classification accuracy. Furthermore, it is also suggested that in future research, how to represent the extracted data should be investigated. It is very vital to find a method to preserve the relations between extracted words. 48 REFERENCES Aaron, H. (2004). Introduction to Bayesian Learning. New York: University of Toronto. Alonso-Rios, D., Luis-Vazquez, I., Mosqueira-Rey, E., Moret-Bonillo, V., & del Rio, B. B. (2009, 11-14 Oct. 2009). An HTML analyzer for the study of web usability. Paper presented at the IEEE International Conference on Systems, Man and Cybernetics, 2009. SMC 2009, San Antonio, TX, 1224-1229. Anagnostopoulos, I., Kouzas, G., Anagnostopoulos, C., Vergados, D., Papaleonidopoulos, I., Generalis, A.,...Kayafas, E. (2002). Automatic Web site classification in a large repository under information filtering and retrieval techniques. Paper presented at the Electrotechnical Conference, 2002. MELECON. 11th Mediterranean, 279-283. Cooley, R., Mobasher, B., & Srivastava, J. (1997, 3-8 Nov 1997). Web mining: information and pattern discovery on the World Wide Web. Paper presented at the Proceedings Ninth IEEE International Conference on Tools with Artificial Intelligence, Newport Beach, CA, 558-567. Croft, B., Metzler, D., & Strohman, T. (2009). Search Engines: Information Retrieval in Practice: Addison Wesley. Dey, L., & Haque, S. K. M. (2008). Opinion mining from noisy text data. Paper presented at the Proceedings of the second workshop on Analytics for noisy unstructured text data, Singapore, 83-90. Feng, C., & Cherkassky, V. (2009). SVM+ regression and multi-task learning. Paper presented at the International Joint Conference on Neural Networks, 2009. IJCNN 2009, Atlanta, GA, 418-424. Fresno, V., & Ribeiro, A. (2004). An Analytical Approach to Concept Extraction in HTML Environments. Journal of Intelligent Information Systems, Volume 22, Number 3(0925-9902), 215-235. Hui, G., Yan, F., & Jian-Ping, L. (2008). Classification of Sensitive Web Documents. Paper presented at the International Conference on Apperceiving Computing and Intelligence Analysis, 2008. ICACIA 2008, Chengdu, 295-298. Jianli, L., Jie, S., & Cuihua, X. (2004, 15-18 Sept. 2004). Segmenting the Web document with document object model. Paper presented at the IEEE International Conference on Services Computing, 2004. (SCC 2004), China, 449-452. Ke, Y., Yueming, L., Zongyan, L., & Xiang, D. (2008, 25-27 June 2008). Architecture and design of the parser used in the embedded browser based on DFA. Paper presented at the 7th World Congress on Intelligent Control and Automation, 2008. WCICA 2008, Chongqing, 6627-6630. 49 Lee, S., Baker, J., Song, J., & Wetherbe, J. C. (2010). An Empirical Comparison of Four Text Mining Methods. Paper presented at the International Conference on 43rd Hawaii System Sciences, (HICSS), Honolulu, Hawaii, USA, 1-10. Liu, Y., A process-based search engine. Ph.D. dissertation, University of Kansas, United States -- Kansas. Retrieved April 16, 2010, from Dissertations & Theses: Full Text.(Publication No. AAT 3360489). McConnell, S. (1996). Rapid Development: Taming Wild Software Schedules. Washington: Microsoft Press. Meijuan, G., Jingwen, T., & Shiru, Z. (2009). Research of web classification mining based on classify support vector machine. Paper presented at the International Colloquium on Computing, Communication, Control, and Management, 2009. CCCM 2009. ISECS, Sanya, 21-24. Morales, S. d. P. B., Fandino, H. A. B., & Rodrivguez, J. R. (2009). Hypertext classification to filtrate information on the web. Paper presented at the Proceedings of the 2009 Euro American Conference on Telematics and Information Systems: New Opportunities to increase Digital Citizenship, Prague, Czech Republic, 1-7. Qi, X., & Davison, B. D. (2009). Web page classification: Features and algorithms. ACM Computing Surveys (CSUR), 41(2), 1-31. Ronen, F., & James, S. (2007). The text mining handbook. New York: Cambridge University Press. Shiqun, Y., Yuhui, Q., & Jike, G. (2007). Research and Realization of Text Mining Algorithm on Web. Paper presented at the International Conference on Computational Intelligence and Security Workshops, 2007. CISW 2007, Harbin, 413-416. Songbo, T., Xueqi, C., Moustafa, M. G., Bin, W., & Hongbo, X. (2005). A novel refinement approach for text categorization. Paper presented at the Proceedings of the 14th ACM international conference on Information and knowledge management, Bremen, Germany, 469-476. Suanmali, L., Salim, N., & Binwahlan, M. S. (2009). Feature-Based Sentence Extraction Using Fuzzy Inference Rules. Paper presented at the 2009 International Conference on Signal Processing Systems, Singapore, 511- 515. Sun, B., Sun, Q., Chen, Z., & Fu, Z. (2009). A Study on Automatic Web Pages Categorization. Paper presented at the IEEE International Advance Computing Conference, 2009. IACC 2009, Patiala, 1423-1427. Tomar, G. S., Shekhar, V., & Ashish, J. (2006). Web Page Classification using Modified Naive Bayesian Approach. Paper presented at the TENCON 2006. IEEE Region 10 Conference, Hong Kong, 1-4. 50 Wang, J., Huang, Y., Wu, G., & Zhang, F. (1999). Web mining: knowledge discovery on the Web. Paper presented at the 1999 IEEE International Conference on Systems, Man, and Cybernetics, IEEE SMC '99 Conference Proceedings, Tokyo, 137-14. Xiaoguang, Q., & Brian, D. D. (2009). Web page classification: Features and algorithms. ACM Computing Surveys (CSUR) 41(2), 1-31. Yu, H., Han, J., & Chang, K. C.-C. (2002). PEBL: positive example based learning for Web page classification using SVM. Paper presented at the Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, Edmonton, Alberta, Canada, 239-248. Zhang, Y., Zincir-Heywood, N., & Milios, E. (2005). Narrative text classification for automatic key phrase extraction in web document corpora. Paper presented at the Proceedings of the 7th annual ACM international workshop on Web information and data management, Bremen, Germany, 51-58. Zhongda, L., Kun, D., & Yanfen, H. (2007). Research of Web Pages Categorization. Paper presented at the IEEE International Conference on Granular Computing, 2007. GRC 2007, Fremont, CA, 691-691. Ziegler, C.-N., Vogele, C., & Viermetz, M. (2009). Distilling Informative Content from HTML News Pages. Paper presented at the International Joint Conferences on Web Intelligence and Intelligent Agent Technologies, 2009. WI-IAT '09. IEEE/WIC/ACM Milan, Italy, 707-712. 51 Appendix A I. Get Source Code Algorithm Implementation class GetSourceCode { public StreamReader streamReader; public string GetOnlineSource(string address) { Uri uri = new Uri(address); WebRequest request = WebRequest.Create(uri); request.Credentials = CredentialCache.DefaultCredentials; HttpWebResponse response = (HttpWebResponse)request.GetResponse(); Stream dataStream = response.GetResponseStream(); streamReader = new StreamReader(dataStream); string sourceCode = streamReader.ReadToEnd(); response.Close(); dataStream.Close(); streamReader.Close(); return sourceCode; } public string[] GetSourecCodeFromFolder() { FolderBrowserDialog folderDialog = new FolderBrowserDialog(); folderDialog.ShowDialog(); if (folderDialog.SelectedPath != "") { DirectoryInfo di = new DirectoryInfo(folderDialog.SelectedPath); FileInfo[] filesInfo = di.GetFiles(); string[] allPages = new string[filesInfo.Length]; // 1- Collect files and read source code for (int i = 0; i < filesInfo.Length; i++) { streamReader = new StreamReader(filesInfo[i].FullName); allPages[i] = streamReader.ReadToEnd().ToLower(); } return allPages; } string[] allPages2 = new string[0]; return allPages2; } } II. Calculate Percentage Algorithm Implementation class CalcPercentage { public void Calc(double percentage, string[] webPages) { for (int i = 0; i < webPages.Length; i++) { if (webPages[i].Length > webPages[i].Length * percentage) { int start = Convert.ToInt32(webPages[i].Length * percentage); int charCount = webPages[i].Length - start; webPages[i] = webPages[i].Remove(start, charCount); } 52 } } } III. Remove HTML Tags Algorithm Implementation: public class RemoveHTMLTags { public void Remove(string[] webPages) { for (int i = 0; i < webPages.Length; i++) { webPages[i] = Regex.Replace(webPages[i], @"<(.|\n)*?>", " "); } } } IV. Remove Digits Algorithm Implementation class RemoveDigits { public void Remove(string[] webPages) { for (int i = 0; i < webPages.Length; i++) { for (int j = 0; j < webPages[i].Length; j++) { if (!char.IsLetter(webPages[i][j]) && webPages[i][j] != ' ') { webPages[i] = webPages[i].Replace(webPages[i][j], ' '); } } } } } V. Split lines to words Algorithm Implementation class SplitText { public void Split(string[] allPages, List<string[]> listArray) { for (int i = 0; i < allPages.Length; i++) { string[] words = allPages[i].Split(new char[] { ' ', '\n', '\\','/', '-', '_', '{', '}', '[', ']', '=', '(', ')', ';', '"', ',', '.', '+', '!' }); List<string> wordList = new List<string>(); for (int w = 0; w < words.Length; w++) { if (words[w].Trim() != "") { wordList.Add(words[w]); } } string[] cleanWords = new string[wordList.Count]; for (int cw = 0; cw < cleanWords.Length; cw++) { cleanWords[cw] = wordList[cw]; } listArray.Add(cleanWords); } 53 } } VI. Remove duplicate words Algorithm Implementation class RemoveDuplicate { public void Remove(bool remove, List<string[]> pageList, List<List<string>> webPagesList) { if (remove) { for (int i = 0; i < pageList.Count; i++) { for (int j = 0; j < pageList[i].Length; j++) { if (!webPagesList[i].Contains(pageList[i][j].Trim())) { webPagesList[i].Add(pageList[i][j]); } } } } else { for (int i = 0; i < pageList.Count; i++) { for (int j = 0; j < pageList[i].Length; j++) { webPagesList[i].Add(pageList[i][j]); } } } } public void InitializaeUniqeWords(List<List<string>> stemedList, List<List<string>> uniqueWordsList) { for (int i = 0; i < stemedList.Count; i++) { uniqueWordsList.Add(new List<string>()); } } internal void Remove(bool removeDuplicate, List<List<string>> stemedList, List<List<string>> uniqueWordsList) { if (removeDuplicate) { for (int i = 0; i < stemedList.Count; i++) { for (int j = 0; j < stemedList[i].Count; j++) { if (!uniqueWordsList[i].Contains(stemedList[i][j])) { uniqueWordsList[i].Add(stemedList[i][j]); } } } } else 54 { for (int i = 0; i < stemedList.Count; i++) { for (int j = 0; j < stemedList[i].Count; j++) { uniqueWordsList[i].Add(stemedList[i][j]); } } } } } VII. Remove the words have less than specific number of characters class RemoveMinNoOfChar { public void Remove(int charLength, List<List<string>> webPagesList) { if (charLength < 100) { for (int i = 0; i < webPagesList.Count; i++) { for (int j = 0; j < webPagesList[i].Count; j++) { if (webPagesList[i][j].Length < charLength){ webPagesList[i].Remove(webPagesList[i][j]); j--; } } } } } } VIII. Remove stop-words Algorithm Implementation public class RemoveStopWords { public void Remove(bool remove, List<List<string>> webPagesList) { if (remove) { string[] stop_words = File.ReadAllLines(@"stop.txt"); for (int i = 0; i < stop_words.Length; i++) { for (int j = 0; j < webPagesList.Count; j++) { if (webPagesList[j].Contains(stop_words[i].ToLower())) { webPagesList[j].Remove(stop_words[i].ToLower()); } } } } } } IX. Stemming the words Algorithm Implementation class Stemmer { public void InitializeStemmedlIST(bool steam, List<List<string>> webPagesList, List<List<string>> stemedList,ProgressBar pb ) { if (steam) { 55 int max = 0; for (int i = 0; i < webPagesList.Count; i++) { stemedList.Add(Steam(webPagesList[i])); if (max <= 100) { pb.Value = max; max++; } } pb.Value = 100; } else { for (int i = 0; i < webPagesList.Count; i++) { stemedList.Add(webPagesList[i]); } } } static public List<string> Steam(List<string> list) { List<string> resultList = new List<string>(); string fullString = ""; for (int i = 0; i < list.Count; i++) { fullString += list[i] + " "; } char[] w = new char[501]; Stemmer s = new Stemmer(); Stream _in = new MemoryStream(ASCIIEncoding.Default.GetBytes(fullString)); while (true) { int ch = _in.ReadByte(); if (Char.IsLetter((char)ch)) { int j = 0; while (true) { ch = Char.ToLower((char)ch); w[j] = (char)ch; if (j < 500) j++; ch = _in.ReadByte(); if (!Char.IsLetter((char)ch)) { for (int c = 0; c < j; c++) s.add(w[c]); s.stem(); String u; u = s.ToString(); resultList.Add(u); break; } } } if (ch < 0) break; 56 } return resultList; } private char[] b; private int i, /* offset into b */ i_end, /* offset to end of stemmed word */ j, k; private static int INC = 50; /* unit of size whereby b is increased */ public Stemmer() { b = new char[INC]; i = 0; i_end = 0; } /** * Add a character to the word being stemmed. When you are finished * adding characters, you can call stem(void) to stem the word. */ public void add(char ch) { if (i == b.Length) { char[] new_b = new char[i + INC]; for (int c = 0; c < i; c++) new_b[c] = b[c]; b = new_b; } b[i++] = ch; } /** Adds wLen characters to the word being stemmed contained in a portion * of a char[] array. This is like repeated calls of add(char ch), but * faster. */ public void add(char[] w, int wLen) { if (i + wLen >= b.Length) { char[] new_b = new char[i + wLen + INC]; for (int c = 0; c < i; c++) new_b[c] = b[c]; b = new_b; } for (int c = 0; c < wLen; c++) b[i++] = w[c]; } /** * After a word has been stemmed, it can be retrieved by toString(), * or a reference to the internal buffer can be retrieved by getResultBuffer * and getResultLength (which is generally more efficient.) */ public override string ToString() { 57 return new String(b, 0, i_end); } /** * Returns the length of the word resulting from the stemming process. */ public int getResultLength() { return i_end; } /** * Returns a reference to a character buffer containing the results of * the stemming process. You also need to consult getResultLength() * to determine the length of the result. */ public char[] getResultBuffer() { return b; } /* cons(i) is true <=> b[i] is a consonant. */ private bool cons(int i) { switch (b[i]) { case 'a': case 'e': case 'i': case 'o': case 'u': return false; case 'y': return (i == 0) ? true : !cons(i - 1); default: return true; } } /* m() measures the number of consonant sequences between 0 and j. if c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence, <c><v> gives 0 <c>vc<v> gives 1 <c>vcvc<v> gives 2 <c>vcvcvc<v> gives 3 .... */ private int m() { int n = 0; int i = 0; while (true) { if (i > j) return n; if (!cons(i)) break; i++; } i++; while (true) { while (true) { 58 if (i > j) return n; if (cons(i)) break; i++; } i++; n++; while (true) { if (i > j) return n; if (!cons(i)) break; i++; } i++; } } /* vowelinstem() is true <=> 0,...j contains a vowel */ private bool vowelinstem() { int i; for (i = 0; i <= j; i++) if (!cons(i)) return true; return false; } /* doublec(j) is true <=> j,(j-1) contain a double consonant. */ private bool doublec(int j) { if (j < 1) return false; if (b[j] != b[j - 1]) return false; return cons(j); } /* cvc(i) is true <=> i-2,i-1,i has the form consonant - vowel - consonant and also if the second c is not w,x or y. this is used when trying to restore an e at the end of a short word. e.g. cav(e), lov(e), hop(e), crim(e), but snow, box, tray. */ private bool cvc(int i) { if (i < 2 || !cons(i) || cons(i - 1) || !cons(i - 2)) return false; int ch = b[i]; if (ch == 'w' || ch == 'x' || ch == 'y') return false; return true; } private bool ends(String s) { int l = s.Length; int o = k - l + 1; if (o < 0) return false; 59 char[] sc = s.ToCharArray(); for (int i = 0; i < l; i++) if (b[o + i] != sc[i]) return false; j = k - l; return true; } /* setto(s) sets (j+1),...k to the characters in the string s, readjusting k. */ private void setto(String s) { int l = s.Length; int o = j + 1; char[] sc = s.ToCharArray(); for (int i = 0; i < l; i++) b[o + i] = sc[i]; k = j + l; } /* r(s) is used further down. */ private void r(String s) { if (m() > 0) setto(s); } /* step1() gets rid of plurals and -ed or -ing. e.g. caresses -> caress ponies -> poni ties -> ti caress -> caress cats -> cat feed -> feed agreed -> agree disabled -> disable matting -> mat mating -> mate meeting -> meet milling -> mill messing -> mess meetings -> meet */ private void step1() { if (b[k] == 's') { if (ends("sses")) k -= 2; else if (ends("ies")) setto("i"); else if (b[k - 1] != 's') k--; } if (ends("eed")) 60 { if (m() > 0) k--; } else if ((ends("ed") || ends("ing")) && vowelinstem()) { k = j; if (ends("at")) setto("ate"); else if (ends("bl")) setto("ble"); else if (ends("iz")) setto("ize"); else if (doublec(k)) { k--; int ch = b[k]; if (ch == 'l' || ch == 's' || ch == 'z') k++; } else if (m() == 1 && cvc(k)) setto("e"); } } /* step2() turns terminal y to i when there is another vowel in the stem. */ private void step2() { if (ends("y") && vowelinstem()) b[k] = 'i'; } /* step3() maps double suffices to single ones. so -ization ( = -ize plus -ation) maps to -ize etc. note that the string before the suffix must give m() > 0. */ private void step3() { if (k == 0) return; /* For Bug 1 */ switch (b[k - 1]) { case 'a': if (ends("ational")) { r("ate"); break; } if (ends("tional")) { r("tion"); break; } break; case 'c': if (ends("enci")) { r("ence"); break; } if (ends("anci")) { r("ance"); break; } break; case 'e': if (ends("izer")) { r("ize"); break; } break; case 'l': if (ends("bli")) { r("ble"); break; } if (ends("alli")) { r("al"); break; } if (ends("entli")) { r("ent"); break; } if (ends("eli")) { r("e"); break; } if (ends("ousli")) { r("ous"); break; } break; 61 case 'o': if (ends("ization")) { r("ize"); break; } if (ends("ation")) { r("ate"); break; } if (ends("ator")) { r("ate"); break; } break; case 's': if (ends("alism")) { r("al"); break; } if (ends("iveness")) { r("ive"); break; } if (ends("fulness")) { r("ful"); break; } if (ends("ousness")) { r("ous"); break; } break; case 't': if (ends("aliti")) { r("al"); break; } if (ends("iviti")) { r("ive"); break; } if (ends("biliti")) { r("ble"); break; } break; case 'g': if (ends("logi")) { r("log"); break; } break; default: break; } } /* step4() deals with -ic-, -full, -ness etc. similar strategy to step3. */ private void step4() { switch (b[k]) { case 'e': if (ends("icate")) { r("ic"); break; } if (ends("ative")) { r(""); break; } if (ends("alize")) { r("al"); break; } break; case 'i': if (ends("iciti")) { r("ic"); break; } break; case 'l': if (ends("ical")) { r("ic"); break; } if (ends("ful")) { r(""); break; } break; case 's': if (ends("ness")) { r(""); break; } break; } } /* step5() takes off -ant, -ence etc., in context <c>vcvc<v>. */ private void step5() { if (k == 0) return; /* for Bug 1 */ switch (b[k - 1]) { case 'a': if (ends("al")) break; return; case 'c': if (ends("ance")) break; 62 if (ends("ence")) break; return; case 'e': if (ends("er")) break; return; case 'i': if (ends("ic")) break; return; case 'l': if (ends("able")) break; if (ends("ible")) break; return; case 'n': if (ends("ant")) break; if (ends("ement")) break; if (ends("ment")) break; /* element etc. not stripped before the m */ if (ends("ent")) break; return; case 'o': if (ends("ion") && j >= 0 && (b[j] == 's' || b[j] == 't')) break; /* j >= 0 fixes Bug 2 */ if (ends("ou")) break; return; /* takes care of -ous */ case 's': if (ends("ism")) break; return; case 't': if (ends("ate")) break; if (ends("iti")) break; return; case 'u': if (ends("ous")) break; return; case 'v': if (ends("ive")) break; return; case 'z': if (ends("ize")) break; return; default: return; } if (m() > 1) k = j; } /* step6() removes a final -e if m() > 1. */ private void step6() { j = k; if (b[k] == 'e') { int a = m(); if (a > 1 || a == 1 && !cvc(k - 1)) k--; } if (b[k] == 'l' && doublec(k) && m() > 1) k--; } /** Stem the word placed into the Stemmer buffer through calls to add(). * Returns true if the stemming process resulted in a word different * from the input. You can retrieve the result with * getResultLength()/getResultBuffer() or toString(). */ public void stem() { k = i - 1; 63 if (k > 1) { step1(); step2(); step3(); step4(); step5(); step6(); } i_end = k + 1; i = 0; } } X. Create Binary Data Set Algorithm Implementation class CreateBinaryDataSet { static int col = 0; static int row = 0; static bool flag = false; static int ColCount = 0; public void Create(List<List<string>> uniqueWordsList, string[,] dataSet) { for (int i = 0; i < uniqueWordsList.Count; i++) { string[] arrayOfPageWords = new string[uniqueWordsList[i].Count]; for (int j = 0; j < uniqueWordsList[i].Count; j++) { arrayOfPageWords[j] = uniqueWordsList[i][j]; } AddTODataSet(arrayOfPageWords, dataSet); } } private void AddTODataSet(string[] doc, string[,] dataSet) { row++; for (int i = 0; i < doc.GetLength(0); i++) { flag = false; for (col = 0; col < dataSet.GetLength(1); col++) { if (doc[i] == dataSet[0, col]) { flag = true; break; } } if (flag == true) { AddValueTo(row, col,dataSet); } else if (flag == false) { AddNewCol(ColCount, doc[i], dataSet); ColCount++; } } 64 } private void AddNewCol(int c, string w, string[,] dataSet) { dataSet[0, c] = w; AddValueTo(row, c, dataSet); } private void AddValueTo(int row, int col, string[,] dataSet) { dataSet[row, col] = "1"; } } XI. Create tf-idf Data Set Algorithm Implementation class CreateTFIDFdataSet { static int col = 0; static int row = 0; static bool flag = false; static int ColCount = 0; public void Create(List<List<string>> uniqueWordsList, string[,] dataSet) { for (int i = 0; i < uniqueWordsList.Count; i++) { string[] arrayOfPageWords = new string[uniqueWordsList[i].Count]; for (int j = 0; j < uniqueWordsList[i].Count; j++) { arrayOfPageWords[j] = uniqueWordsList[i][j]; } AddTODataSet(arrayOfPageWords, dataSet); } } public void CalcTFIDF(string [,] dataSetMatrix, List<int> pagesSize) { int no_of_terms_in_all_documens = 0; for (int r = 1; r < dataSetMatrix.GetLength(0); r++) { for (int c = 0; c < dataSetMatrix.GetLength(1) - 1; c++) { for (int cnt = 1; cnt < dataSetMatrix.GetLength(0); cnt++) { if (dataSetMatrix[cnt, c] != "0") { no_of_terms_in_all_documens++; } } if (dataSetMatrix[r, c] != "0") { dataSetMatrix[r, c] = CalcTFIDF(dataSetMatrix[r, c], pagesSize[r - 1], dataSetMatrix.GetLength(0) - 1, no_of_terms_in_all_documens); } no_of_terms_in_all_documens = 0; } } } private void AddTODataSet(string[] doc, string[,] dataSet) 65 { row++; for (int i = 0; i < doc.GetLength(0); i++) { flag = false; for (col = 0; col < dataSet.GetLength(1); col++) { if (doc[i] == dataSet[0, col]) { flag = true; break; } } if (flag == true) { AddValueTo(row, col,dataSet); } else if (flag == false) { AddNewCol(ColCount, doc[i], dataSet); ColCount++; } } } private void AddNewCol(int c, string w, string[,] dataSet) { dataSet[0, c] = w; AddValueTo(row, c, dataSet); } private void AddValueTo(int row, int col,string[,] dataSet) { string cellValue = dataSet[row, col]; if (cellValue == null) { dataSet[row, col] = "1"; } else { dataSet[row, col] = (double.Parse(cellValue) + 1).ToString(); ; } } public void ChangeNullItems(bool binary, string[,] dataSet) { if (binary) { for (int i = 0; i < dataSet.GetLength(0); i++) { for (int j = 0; j < dataSet.GetLength(1); j++) { if (dataSet[i, j] == null) { dataSet[i, j] = "0"; } } } } } private string CalcTFIDF(string termAppears, int totalFreq, int totalPages, int totTermFreq) 66 { double tdidf; double termApp = (double.Parse(termAppears)); double tf = termApp / (double)totalFreq; double idf1 = (double)totalPages / (1 + (double)totTermFreq); double idf = Math.Log(idf1); tdidf = tf * idf; tdidf = Math.Round(tdidf, 4); return tdidf.ToString(); } } XII. Build Physical Data Set Algorithm Implementation class SaveDataSet { public void Save(string[,] dataSetMatrix) { SaveFileDialog sfd = new SaveFileDialog(); sfd.FileName = "dataSet" + DateTime.Now.Millisecond + ".csv"; DialogResult result = sfd.ShowDialog(); if (result == DialogResult.OK) { string sFilePath = sfd.FileName; StreamWriter sw = File.CreateText(sFilePath); for (int r = 0; r < dataSetMatrix.GetLength(0); r++) { for (int c = 0; c < dataSetMatrix.GetLength(1); c++) { sw.Write(dataSetMatrix[r, c] + ","); } sw.WriteLine(); } sw.Close(); Process.Start(sFilePath); } } } 67 Appendix B /* This is the Porter stemming algorithm, coded up in ANSI C by the author. */ #include <string.h> /* for memmove */ #define TRUE 1 #define FALSE 0 static char * b; /* buffer for word to be stemmed */ static int k,k0,j; /* j is a general offset into the string */ /* cons(i) is TRUE <=> b[i] is a consonant. */ static int cons(int i) { switch (b[i]) { case 'a': case 'e': case 'i': case 'o': case 'u': return FALSE; case 'y': return (i==k0) ? TRUE : !cons(i-1); default: return TRUE; } } /* m() measures the number of consonant sequences between k0 and j. if c is a consonant sequence and v a vowel sequence, and <..> indicates arbitrary presence, <c><v> gives 0 <c>vc<v> gives 1 <c>vcvc<v> gives 2 <c>vcvcvc<v> gives 3 .... */ static int m() { int n = 0; int i = k0; while(TRUE) { if (i > j) return n; if (! cons(i)) break; i++; } i++; while(TRUE) { while(TRUE) { if (i > j) return n; if (cons(i)) break; i++; } i++; n++; while(TRUE) { if (i > j) return n; if (! cons(i)) break; i++; } i++; } } /* vowelinstem() is TRUE <=> k0,...j contains a vowel */ static int vowelinstem() { int i; for (i = k0; i <= j; i++) if (! cons(i)) return TRUE; return FALSE; } /* doublec(j) is TRUE <=> j,(j-1) contain a double consonant. */ static int doublec(int j) { if (j < k0+1) return FALSE; 68 if (b[j] != b[j-1]) return FALSE; return cons(j); } /* cvc(i) is TRUE <=> i-2,i-1,i has the form consonant - vowel - consonant and also if the second c is not w,x or y. this is used when trying to restore an e at the end of a short word. e.g. cav(e), lov(e), hop(e), crim(e), but snow, box, tray. */ static int cvc(int i) { if (i < k0+2 || !cons(i) || cons(i-1) || !cons(i-2)) return FALSE; { int ch = b[i]; if (ch == 'w' || ch == 'x' || ch == 'y') return FALSE; } return TRUE; } /* ends(s) is TRUE <=> k0,...k ends with the string s. */ static int ends(char * s) { int length = s[0]; if (s[length] != b[k]) return FALSE; /* tiny speed-up */ if (length > k-k0+1) return FALSE; if (memcmp(b+k-length+1,s+1,length) != 0) return FALSE; j = k-length; return TRUE; } /* setto(s) sets (j+1),...k to the characters in the string s, readjusting k. */ static void setto(char * s) { int length = s[0]; memmove(b+j+1,s+1,length); k = j+length; } /* r(s) is used further down. */ static void r(char * s) { if (m() > 0) setto(s); } /* step1ab() gets rid of plurals and -ed or -ing. e.g. caresses -> caress ponies -> poni ties -> ti caress -> caress cats -> cat feed -> feed agreed -> agree disabled -> disable matting -> mat mating -> mate meeting -> meet milling -> mill messing -> mess meetings -> meet */ static void step1ab() { if (b[k] == 's') { if (ends("\04" "sses")) k -= 2; else if (ends("\03" "ies")) setto("\01" "i"); else if (b[k-1] != 's') k--; } if (ends("\03" "eed")) { if (m() > 0) k--; } else if ((ends("\02" "ed") || ends("\03" "ing")) && vowelinstem()) { k = j; if (ends("\02" "at")) setto("\03" "ate"); else 69 if (ends("\02" "bl")) setto("\03" "ble"); else if (ends("\02" "iz")) setto("\03" "ize"); else if (doublec(k)) { k--; { int ch = b[k]; if (ch == 'l' || ch == 's' || ch == 'z') k++; } } else if (m() == 1 && cvc(k)) setto("\01" "e"); } } /* step1c() turns terminal y to i when there is another vowel in the stem. */ static void step1c() { if (ends("\01" "y") && vowelinstem()) b[k] = 'i'; } /* step2() maps double suffices to single ones. so -ization ( = -ize plus -ation) maps to -ize etc. note that the string before the suffix must give m() > 0. */ static void step2() { switch (b[k-1]) { case 'a': if (ends("\07" "ational")) { r("\03" "ate"); break; } if (ends("\06" "tional")) { r("\04" "tion"); break; } break; case 'c': if (ends("\04" "enci")) { r("\04" "ence"); break; } if (ends("\04" "anci")) { r("\04" "ance"); break; } break; case 'e': if (ends("\04" "izer")) { r("\03" "ize"); break; } break; case 'l': if (ends("\03" "bli")) { r("\03" "ble"); break; } /*-DEPARTURE-*/ /* To match the published algorithm, replace this line with case 'l': if (ends("\04" "abli")) { r("\04" "able"); break; } */ if (ends("\04" "alli")) { r("\02" "al"); break; } if (ends("\05" "entli")) { r("\03" "ent"); break; } if (ends("\03" "eli")) { r("\01" "e"); break; } if (ends("\05" "ousli")) { r("\03" "ous"); break; } break; case 'o': if (ends("\07" "ization")) { r("\03" "ize"); break; } if (ends("\05" "ation")) { r("\03" "ate"); break; } if (ends("\04" "ator")) { r("\03" "ate"); break; } break; case 's': if (ends("\05" "alism")) { r("\02" "al"); break; } if (ends("\07" "iveness")) { r("\03" "ive"); break; } if (ends("\07" "fulness")) { r("\03" "ful"); break; } if (ends("\07" "ousness")) { r("\03" "ous"); break; } break; case 't': if (ends("\05" "aliti")) { r("\02" "al"); break; } if (ends("\05" "iviti")) { r("\03" "ive"); break; } if (ends("\06" "biliti")) { r("\03" "ble"); break; } break; case 'g': if (ends("\04" "logi")) { r("\03" "log"); break; } /*-DEPARTURE-*/ /* To match the published algorithm, delete this line */ } } /* step3() deals with -ic-, -full, -ness etc. similar strategy to step2. */ static void step3() { switch (b[k]) { case 'e': if (ends("\05" "icate")) { r("\02" "ic"); break; } if (ends("\05" "ative")) { r("\00" ""); break; } if (ends("\05" "alize")) { r("\02" "al"); break; } break; case 'i': if (ends("\05" "iciti")) { r("\02" "ic"); break; } break; case 'l': if (ends("\04" "ical")) { r("\02" "ic"); break; } 70 if (ends("\03" "ful")) { r("\00" ""); break; } break; case 's': if (ends("\04" "ness")) { r("\00" ""); break; } break; } } /* step4() takes off -ant, -ence etc., in context <c>vcvc<v>. */ static void step4() { switch (b[k-1]) { case 'a': if (ends("\02" "al")) break; return; case 'c': if (ends("\04" "ance")) break; if (ends("\04" "ence")) break; return; case 'e': if (ends("\02" "er")) break; return; case 'i': if (ends("\02" "ic")) break; return; case 'l': if (ends("\04" "able")) break; if (ends("\04" "ible")) break; return; case 'n': if (ends("\03" "ant")) break; if (ends("\05" "ement")) break; if (ends("\04" "ment")) break; if (ends("\03" "ent")) break; return; case 'o': if (ends("\03" "ion") && (b[j] == 's' || b[j] == 't')) break; if (ends("\02" "ou")) break; return; /* takes care of -ous */ case 's': if (ends("\03" "ism")) break; return; case 't': if (ends("\03" "ate")) break; if (ends("\03" "iti")) break; return; case 'u': if (ends("\03" "ous")) break; return; case 'v': if (ends("\03" "ive")) break; return; case 'z': if (ends("\03" "ize")) break; return; default: return; } if (m() > 1) k = j; } /* step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1. */ static void step5() { j = k; if (b[k] == 'e') { int a = m(); if (a > 1 || a == 1 && !cvc(k-1)) k--; } if (b[k] == 'l' && doublec(k) && m() > 1) k--; } int stem(char * p, int i, int j) { b = p; k = j; k0 = i; /* copy the parameters into statics */ if (k <= k0+1) return k; /*-DEPARTURE-*/ /* With this line, strings of length 1 or 2 don't go through the stemming process, although no mention is made of this in the published algorithm. Remove the line to match the published algorithm. */ step1ab(); step1c(); step2(); step3(); step4(); step5(); return k; }

Automatically extracted. Refer to the original PDF for figures, tables, and formatting.


Cite this