Defining Metrics to Automate the Quantitative Analysis of Textual Information within a Web Page
Vasile Avram
Academy of Economic Studies in Bucharest, Member IEEE Region 8, 71 Unirii Blvd, Bl G2C, Sc 1, Ap 25, Sector 3, Bucharest, Code 030829, ROMANIA
Abstract – The aim of this paper is to define a system of metrics, and experimental ways to evaluate them, for quantitatively analyzing the textual information contained in a web page and for determining the differences between the textual information displayed to the human visitor and the information extracted by search engine robots. These metrics are the foundation for specifying, designing and realizing an autonomous intelligent robot that finds, signals, and interprets the textual information within a web page, hidden or not. For each defined metric a computation formula is specified and, where necessary, a procedure to be followed to obtain the elements involved. For each metric an interpretation and its usefulness are given, depending on the range of its value. The metrics are independent of the search engine type, commercial or semantic.
I. INTRODUCTION
The problem addressed in this paper is the existence, in some circumstances, of a quantitative difference between the textual information contained in the source of a web page, accessible to web robots, and the quantity of textual information effectively displayed on screen and accessible to humans. It does not matter whether the source page is written in one language or another, or whether it follows the rules for defining and structuring a classic web page or a semantic web page. Search engines use concepts, rules, and algorithms to determine whether a page appears in the result list of a search and, if so, in which position. Whether keywords and relationships are used or the content itself is interpreted does not change the nature of the problem: the user sees one thing while the search engine robot is given something else to analyze and classify. Today we can distinguish two major types of search engines: the classic or commercial search engines, which are dominant, and the semantic search engines, which are experimental or only partly implemented (such as Yahoo!). The technology of semantic search engines is perceived as the next step towards understanding the huge amount of data placed online. The commercial search engines, among which Google is by far the leader, are based on search directed by humans through databases of keywords, concepts, and references (links) [1]. Commercial search engines try to find pages containing similar
keywords, while semantic search engines try to find documents containing similar concepts.

A. Ranking and SEO
For commercial search engines, the most important element that determines the position, in the result page, of a web page satisfying the criteria specified in the user search query is the page rank, or ranking, given to the page by the search engine used [2]. Generally, when a site is published for the first time, the major search engines direct their spiders and crawlers (robots) to download the web pages of the website and, possibly, follow their links. To speed up the indexing process and to make the website known, the webmaster registers the site with search engines and web directories. Obtaining a better page ranking is the main goal of search engine optimization (SEO). For commercial search engines, search engine optimization is a collection of strategies that improve the level at which a website is ranked in the results returned when a user searches for a keyword or phrase [3]. Different search engines use different page ranking criteria, of which the most important common criteria are the following [2]:
- Location – the position of the keyword in the structure of the page, for example in the title (and the title level) or in the body, the style applied to the keyword (strong, regular, etc.), at the beginning of a title or paragraph or somewhere inside it, and so on;
- Frequency – the frequency with which the search term appears on the page. After some webmasters exploited this criterion and added too many repetitions of a keyword, search engines began to classify high frequencies, whether the keywords are hidden or shown, as keyword spamming and to ignore the pages in which they appear. The frequency must therefore be kept at a reasonable value;
- Links – the type and number of links on a web page: the quality and number of links coming into the site's pages and the quality and number of links going out of the site's pages;
- Click-throughs – the number of click-throughs the site receives versus the click-throughs of the other pages shown in the ranking.
Location and frequency, as measures of keyword relevance, are a corollary of the guidelines used for good publication design, web pages being one category of publication.
Webmasters run SEO campaigns to improve the page rank of their pages by attracting more clicks and more exposure from search engines and directories. The exposure is realized when pages contain keywords included in the user query. Webmasters can hide successful keywords that have nothing in common with the subject of their page and/or can hide page links in order to attract exposure from the organic search engines. I can also say that the same webmasters will be able to play under the new rules stipulated by semantic search engines and that they will be able to preserve their behavior (including trying to trick the robots of those engines).

B. Hidden text
We define hidden text, by means of two web actors, the human visitor and the robot visitor, as "textual content that the human visitor cannot see but that is readable to the search engine robot" [3]. Textual information can be:
- unstructured text, that is, text without a specific meaning to the robots;
- structured text, in the form of a URI (Uniform Resource Identifier) address that is recognized as such by the robots and that is generally followed by them.
The user perceives the non-hidden textual information of a web page together with the textual information that may be contained in the graphic images included and displayed by the page. Robots perceive all the textual information that the web page tags and/or their attributes may contain. As outlined previously, two major elements of web pages are used by search engines to rank them: the textual content (including position and frequency) and the links to/from other web pages. Both elements therefore become attractive for webmasters looking for ways to trick the robots and obtain a better rank. The major search engines specify the punishment they apply to sites that use spam techniques. I give here the positions of Google and Yahoo, but they may be considered the official position of all major search engines. Yahoo defines pages using such techniques as spam: "Some pages are created deliberately to trick search engines into offering inappropriate, redundant or poor quality search results; this is often called spam" [10]. The Google team's position is: "If your site is perceived to contain hidden text and links that are deceptive in intent, your site may be removed from the Google index, and will not appear in search results pages" [4]. Despite this uncompromising position of the search engine teams, spammers carry on a perpetual fight with the search engines: they try to trick the robots, and the robots try to detect them and to exclude the sites involved from appearing in the result page.

II. TECHNIQUES TO HIDE TEXTUAL INFORMATION WITHIN A WEB PAGE
Here is a brief survey of the techniques used by spammers and of the possibilities of detecting them by robots (a minimal detection sketch in PHP follows this list):
- Using the same color for both the foreground (text, image, graphic, etc.) and the background (this case is detected automatically by almost all robots of the major search engines);
- Overwriting the text with a graphic object (a graphic shape such as a line or rectangle, or an image, a picture, etc.) or with layout elements. This case is not yet solved automatically. The user can set the browser not to display graphic elements in order to reveal the hidden information;
- Using tiny font sizes, invisible to the human eye, by means of the Cascading Style Sheets (CSS) language. This case is very difficult to detect automatically; if the font size is declared as fixed, it is difficult to detect even manually by zooming. All CSS-based hiding techniques can be revealed manually by using, in the Mozilla Firefox browser for example, the available "Web Developer" add-on, which allows disabling CSS. The user can also zoom in to enlarge the font and see the hidden text, if the fonts are scalable. If the font size is set to 0 the text becomes invisible, but that situation is generally detected by almost all robots;
- Hiding a link in a small character (a letter, dash, underscore, dot, etc. used as the text displayed between the <a> and </a> tags, without underlining, color change, or reaction to the mouse-over event);
- Using the <div> tag with one of the following techniques:
  - matching the colors of the foreground and the background, or hiding with the div attributes "visibility: hidden" and "display: none"; both are detectable and generally mean illegal hiding. The keywords "hidden" and "none" must still be usable in that "attribute: value" context, since they are vital for much other processing within the web page. For example, the hidden attribute is required for filling information into form input fields that must be hidden from the user but are required by the CGI server-side scripts for their processing;
  - overwriting divisions (div tags) by exploiting the z-index attribute of CSS and using the same coordinates for the div tags (this situation is generally detected);
  - positioning off the viewable section by using CSS, which can be legal in many circumstances, such as hiding or showing information depending on user actions and choices (for example, hiding the "name before marriage" field in a person card if the gender is male and revealing it if the gender is female, or hiding information viewable only after login). This technique is applied in Fig. 1 to hide the textual information highlighted in Fig. 2, the image of the source code of the page;
- Using the <noscript> tag and placing the desired information in it. The usage of this tag is legal, and what is written there is generally presumed to be visible to at most 5% of Internet users (since 95% of users have JavaScript on, according to the Browser Statistics published by www.w3schools.com). You can set the JavaScript option of your browser to off to see what the tag, if used, contains;
- Using Flash methods to hide, such as:
  - Scalable Inman Flash Replacement (sIFR), which uses JavaScript to read the HTML code and to transform that code into a Flash file. Google does not recommend the use of this technique;
  - SWFObject, which transforms the HTML document into a Flash document without any guarantee about the content (it is possible for the HTML and Flash versions not to be the same). This document type is detectable by Google.

Fig. 1. The web page as displayed by the browser

Fig. 2. The source code of the page
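Several of the CSS-based techniques above leave traces that can be found directly in the page source. The fragment below is a minimal detection sketch in PHP (the language already used for the spider described later), not the autonomous robot proposed in this paper: it inspects only inline style attributes, and the function name, the regular expressions and the threshold for off-screen positioning are assumptions of this sketch rather than rules used by any search engine.

<?php
// Minimal sketch: flag inline styles that match common text-hiding patterns.
// Assumption: only inline style="..." attributes are inspected; external and
// embedded style sheets would need a CSS parser, which is out of scope here.

function findSuspiciousStyles($html)
{
    $patterns = array(
        'display:none'        => '/display\s*:\s*none/i',
        'visibility:hidden'   => '/visibility\s*:\s*hidden/i',
        'font-size near 0'    => '/font-size\s*:\s*0*(\.\d+)?\s*(px|pt|em)?\s*[;"]/i',
        'off-screen position' => '/(left|top)\s*:\s*-\d{3,}px/i',
    );

    $report = array();
    // Collect every inline style attribute value (double-quoted form only).
    if (preg_match_all('/style\s*=\s*"([^"]*)"/i', $html, $matches)) {
        foreach ($matches[1] as $style) {
            foreach ($patterns as $label => $regex) {
                if (preg_match($regex, $style . ';')) {
                    $report[] = array('pattern' => $label, 'style' => $style);
                }
            }
        }
    }
    return $report;
}

// Usage example (the URL is hypothetical):
$html = file_get_contents('http://www.example.com/page.html');
foreach (findSuspiciousStyles($html) as $hit) {
    echo $hit['pattern'] . ' found in: ' . $hit['style'] . "\n";
}

A full robot would also have to resolve external style sheets and computed styles, which is exactly why the quantitative metrics defined below are useful as a complementary, rendering-based check.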
III. DETERMINING THE "EFFECTIVE AMOUNT OF TEXTUAL INFORMATION" (EATI) METRIC
In [2] I defined a new metric, acting as a flag, whose value can automatically signal whether hidden text is used or not (without considering the reasons, or whether the use is legal or illegal). The metric I define and propose to use is called the effective amount of textual information (EATI) and is determined as the ratio between the amount of textual information obtained by applying optical character recognition (OCR) to a snapshot of the web page (denoted ATIOCR) and the textual information extracted by the spider (denoted TIES):

EATI = ATIOCR / TIES    (1)

The textual information extracted by the spider (TIES) and the amount of textual information obtained by OCR (ATIOCR) can be measured in bytes or in multiples of this unit (kilobytes, megabytes, etc.). The value of the ratio can be:
- Less than 1, in which case the page contains hidden textual information in inverse proportion to the value of the metric (the smaller the metric, the larger the amount of hidden text);
- Equal to 1, the ideal case, in which what is shown is what the page contains;
- Greater than 1, in which case there is extra textual information, signaling that the page may contain images or, generally, graphic elements carrying textual information which, in most cases, is not considered when ranking. The larger the ratio, the more extra text there is.
Experimentally, until a complete intelligent autonomous web robot is specified and realized, I use the following procedure (a sketch of these steps is given after Fig. 4):
1. Use a spider to extract the textual information within a web page and determine the TIES value required in formula (1). The spider I built is based on the theory in [5], on the libraries available in [6], and on my own functions for cleaning up the extracted text. The spider is written in the PHP [7; 8; 9] scripting language and requires PHP version 5 or greater. The information extracted for the web page shown in Fig. 1 and Fig. 2 is presented in Fig. 3;
Fig. 3. The image of textual information extracted
2. Use a snapshot application program, one that can be called from within the robot body, to take a snapshot of the page involved in step 1 and save it in an image format accepted as input by the OCR tool. The snapshot of the page is presented in Fig. 4;
3. Apply an OCR tool to the image saved in the previous step and obtain the recognized text, as shown in Fig. 5, required to determine the amount of textual information obtained by applying optical character recognition (ATIOCR in formula (1)). The resolution of the snapshot must be at least the one required by the OCR component, since recognition is not correct when the resolution is poor.
Fig. 4. The snapshot of the page as a .jpg file
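As a rough experimental sketch of the three steps above and of formula (1), the fragment below downloads a page, strips the tags to approximate TIES, then shells out to two common open-source tools, wkhtmltoimage for the snapshot and Tesseract for the OCR, to approximate ATIOCR, and finally evaluates the ratio. The URL, the tool choice and the byte/character-count measure are assumptions of this sketch; the actual robot described in the paper uses its own spider [5; 6] and cleanup functions.

<?php
// Experimental sketch of steps 1-3 and formula (1).
// Assumptions: wkhtmltoimage and tesseract are installed and on the PATH;
// text amounts are measured as byte counts after whitespace normalization.

$url = 'http://www.example.com/page.html';   // hypothetical page

// Step 1: spider-like extraction of the textual information (TIES).
$html = file_get_contents($url);
$text = strip_tags($html);                        // drop the tags, keep the text
$text = html_entity_decode($text);                // resolve &amp;, &nbsp;, ...
$text = preg_replace('/\s+/', ' ', trim($text));  // collapse whitespace
$ties = strlen($text);

// Step 2: snapshot of the rendered page (external tool, assumed available).
shell_exec('wkhtmltoimage ' . escapeshellarg($url) . ' page.jpg');

// Step 3: OCR of the snapshot (Tesseract writes its output to page-ocr.txt).
shell_exec('tesseract page.jpg page-ocr');
$ocrText = file_get_contents('page-ocr.txt');
$ocrText = preg_replace('/\s+/', ' ', trim($ocrText));
$atiocr  = strlen($ocrText);

// Formula (1): EATI = ATIOCR / TIES.
$eati = ($ties > 0) ? $atiocr / $ties : 0;

printf("TIES=%d  ATIOCR=%d  EATI=%.2f\n", $ties, $atiocr, $eati);
if ($eati < 1) {
    echo "Possible hidden text (less is shown than is contained).\n";
} elseif ($eati > 1) {
    echo "Extra text shown, probably carried by graphic elements.\n";
}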
Fig. 5. The textual information recognized from the snapshot of the web page (a .rtf file)

IV. DEFINING NEW METRICS FOR TEXTUAL INFORMATION CONTAINED WITHIN WEB PAGES
The comparison made between the extracted texts refers only to quantity, without involving any qualitative aspect, so it can be used only as a flag telling whether hidden information is used or whether extra textual information is present in the graphic elements of the page. The quality invoked here refers to the fact that, while a spider is able to read and extract information in the direction in which it must be read (vertical, horizontal, left to right, etc.), as stipulated in the tag attributes or as the default when not specified, the OCR application extracts characters as they appear in the image, from top to bottom and left to right. Moreover, it is possible for the amount of text extracted from the snapshot and corresponding to the graphic elements within the page to be greater than the sum of the text recognized in the individual graphic elements. For example, the last character in one image may be something like a \ (backslash) that appears in the snapshot glued to a / (slash) character from the next image, producing \/ (a V-like shape) that is recognized as such by the OCR application and counted as one character; if the individual images are recognized separately, two characters are obtained, a \ and a /, respectively. The accuracy of the recognized text depends on the quality of the snapshot (resolution, contrast, size, etc.). The better way is to take the snapshot of the downloaded page displayed with a large font, similar to the way Internet Explorer displays a web page.

A. Determining the textual information contained by graphic elements (TIG) metric
The elements analyzed here aim at approximating how much of the information corresponds to the page code, how much comes from hidden text, and how much corresponds to the images contained and displayed by the page. In my first approach I ignored the textual information contained by graphic elements, since it is not considered hidden by search engines and is ignored (it is not indexed, so it does not contribute to a better rank). The extracted text can be matched against the web page keywords to see whether they can be retrieved or not, or it can be the object of a semantic analysis allowing the detection of, for example, porn words. This semantic analysis will be the subject of another approach and research. The procedure used to determine the textual information contained by graphic elements (denoted TIG) within a web page is the following (a sketch of step 1 follows this procedure):
1. Use a spider to extract the graphic elements (images, pictures, shapes, etc.) together with their positional coordinates and recompose a working web page of the same size as the original, containing only those graphic elements positioned at their proper coordinates;
2. Use a snapshot application program, callable from within the robot body, to take a snapshot of the page built in step 1 and save it in an image format accepted as input by the OCR tool;
3. Apply an OCR tool to the image saved in the previous step and obtain the recognized text required to determine the value of the textual information contained by graphic elements (TIG).
Even if other visual effects recognizable as characters by the OCR application are created between the text in the original page and the graphic elements it contains, we can say that this amount approximates the quantity of text characters incorporated in the figures and shown (revealed) to the human viewer.
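A minimal sketch of step 1 is given below, for pages whose images use explicit width and height attributes. The URL, the output file name and the decision to keep the images in document order rather than at their rendered coordinates are simplifying assumptions of this sketch; a full implementation would reuse the computed position of each element.

<?php
// Sketch of step 1: recompose a page that contains only the graphic elements.
// Assumption: images carry explicit width/height attributes; a complete robot
// would also extract and reuse the positional coordinates of each element.

$url  = 'http://www.example.com/page.html';   // hypothetical page
$html = file_get_contents($url);

$doc = new DOMDocument();
libxml_use_internal_errors(true);             // tolerate real-world markup
$doc->loadHTML($html);

$imagesOnly = '<html><body>';
foreach ($doc->getElementsByTagName('img') as $img) {
    $src    = $img->getAttribute('src');
    $width  = $img->getAttribute('width');
    $height = $img->getAttribute('height');
    $imagesOnly .= sprintf(
        '<img src="%s" width="%s" height="%s" alt="">',
        htmlspecialchars($src), htmlspecialchars($width), htmlspecialchars($height)
    );
}
$imagesOnly .= '</body></html>';

// The working page is then snapshotted and passed to the OCR tool
// (steps 2 and 3), exactly as in the EATI procedure, giving the TIG value.
file_put_contents('images-only.html', $imagesOnly);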
B. Determining the quantity of textual information shown to the user (QTISU) metric
Having the amount of textual information contained by graphic elements (TIG), we can determine the quantity of textual information shown to the user (denoted QTISU) that originates in the code tags, as the difference between the amount of textual information obtained by applying optical character recognition (ATIOCR) and the textual information contained by graphic elements (TIG):
QTISU = ATIOCR - TIG    (2)

The values of ATIOCR and TIG must be expressed in the same unit of measurement (bytes, kilobytes, megabytes, etc.).

C. Determining the text information shown to the user (TISU) from tags metric
Having the quantity of textual information shown to the user (QTISU), we can now determine the exact ratio between it and the textual information extracted by the spider (TIES), which I call the text information shown to the user (TISU):
TISU = (QTISU / TIES) × 100    (3)
Its value can be:
- 100, when what is shown to the user corresponds to what is extracted by the spider (no hidden information is used);
- less than 100, which shows that part of the textual information contained in the tags is hidden; the smaller the value, the more textual information is hidden.

D. The percent of textual information revealed by graphic elements to the user (TIRGU) metric
The ratio between the textual information contained by graphic elements (TIG) and the amount of textual information obtained by applying optical character recognition (ATIOCR) shows the percentage of textual information revealed to the user by graphic elements (TIRGU):
TIRGU = (TIG / ATIOCR) × 100    (4)
Its value can be:
- 100, when the entire textual information shown to the user is contained only in graphic elements;
- less than 100, which shows the percentage of the textual information revealed to the user by graphic elements; the smaller it is, the more of the shown textual information comes from tags.
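Taken together, formulas (2), (3) and (4) reduce to a few lines of arithmetic once TIES, ATIOCR and TIG have been measured. The following sketch assumes the three amounts are already available, in the same unit, from the procedures above; the function name textMetrics and the numeric values in the usage example are illustrative assumptions.

<?php
// Formulas (2)-(4), assuming TIES, ATIOCR and TIG are already measured
// in the same unit (here: characters) by the procedures described above.

function textMetrics($ties, $atiocr, $tig)
{
    $qtisu = $atiocr - $tig;                               // (2) text shown that comes from tags
    $tisu  = ($ties   > 0) ? ($qtisu / $ties)   * 100 : 0; // (3) percent of tag text shown
    $tirgu = ($atiocr > 0) ? ($tig   / $atiocr) * 100 : 0; // (4) percent shown via graphics

    return array('QTISU' => $qtisu, 'TISU' => $tisu, 'TIRGU' => $tirgu);
}

// Hypothetical measurements: 4000 characters in the tags (TIES), 3500 visible
// on screen (ATIOCR), 500 of which come from images (TIG).
print_r(textMetrics(4000, 3500, 500));
// TISU = 75%   -> a quarter of the tag text is hidden from the visitor;
// TIRGU ~ 14%  -> about one seventh of the visible text comes from graphics.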
V. CONCLUSIONS
The main utility of all these metrics for webmasters is when they want to analyze a site in order to decide whether or not to accept a link to that website, or to analyze the quality of the web pages that point to their own website. Even though a link in a page acts as a vote in favor of the referenced site, poor quality links will, in time, decrease the rank. Another use is as the foundation for building robots that automatically detect whether the references returned when the URL of one of the site's web pages is queried are quality references. Generally, all the elements signaled must be interpreted by a human, but the process becomes faster than the one based on discovering bad references, announcing the Google team for example, waiting for the bad references to be removed from the query results, and even waiting for the site rank to be reconsidered. The future research and development I intend to carry out is the specification, design and realization of a complete autonomous intelligent web robot capable of performing all the steps and of giving a semantic interpretation to the textual information extracted from the web page. The extraction of information from the web page does not exclude the use of "text mining" techniques.

REFERENCES
[1] Jorge Cardoso (ed.), Semantic Web Services: Theory, Tools and Applications, IGI Global, 2007.
[2] Vasile Avram, "Effective Amount of Text Information (EATI) in a Web Page – A Proposal for a New Metric and Method to Determine", Proceedings of the 9th International Conference on Informatics in Economy, May 2009, Editura Economică, ISBN 978-606-505-172-2, pp. 163-168.
[3] Jerri L. Ledford, SEO: Search Engine Optimization Bible, Wiley Publishing, 2008.
[4] Google, "Hidden text and links", Webmaster Tools, www.google.com
[5] Michael Schrenk, Webbots, Spiders, and Screen Scrapers: A Guide to Developing Internet Agents with PHP/cURL, No Starch Press, 2007.
[6] http://www.sourceforge.org – open source PHP libraries for robot development
[7] P.J. Deitel, H.M. Deitel, Internet and World Wide Web: How to Program, fourth edition, Prentice Hall, 2008, pp. 160-190.
[8] World Wide Web Consortium, specifications of the standards for HTML, XHTML, CSS, XML: http://www.w3.org
[9] Vasile Avram, Internet Technologies for Business: Documents and Websites – Structure and Description Languages, http://www.avrams.ro/lecturenotes.htm
[10] Yahoo! Search Content Quality Guidelines, www.yahoo.com
[11] SEO Tools – Search Engine Marketing, www.seologic.com