Automated Internal Web Page Clustering for Improved ...

Automated Internal Web Page Clustering for Improved Data Extraction

CORNELIA GYŐRÖDI, ROBERT GYŐRÖDI, GEORGE PECHERLE, GEORGE MIHAI CORNEA
Department of Computer Science
Faculty of Electrical Engineering and Information Technology, University of Oradea
Str. Universitatii 1, 410087, Oradea
ROMANIA

In this paper, we present an algorithm to determine the repeating patterns inside the DOM tree of a web page. By doing this, we can cluster the content inside a web page and obtain more relevant structured data. The determined DOM structure can also be used to mine other web pages that are similar in structure and one hop away from the initially targeted web page.

A key element in determining sub clusters inside the web page is to consider that a given cluster has all its roots at a constant depth inside the main tree. On the other hand, every cluster element can be slightly different from the others. Starting from the initial condition that all the elements of a cluster have to start at the same depth, we can begin by creating clusters made of elements that perfectly match one another in structure.

By taking the highest rated clusters in order, based on a formula that combines the number of repetitions and the depth of the subtree contained in the cluster, we can determine similarity ratings with other perfectly matching clusters. If the comparison returns a high enough score, the two clusters are joined, and further comparisons take both scores into account. The comparison formula can also contain a parameter that refers to the number of elements in the targeted cluster, so that two large clusters can be connected more easily to one another in order to obtain a larger resulting cluster. This comes in handy for rejecting single-element clusters that only appear to match the main base cluster.
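The first stage described above (perfectly matching clusters whose roots share a depth, ranked by repetitions and subtree depth) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the `signature` serialisation, and the rating formula (repetitions multiplied by subtree height) are all assumptions, since the text does not give the exact formula.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus its child nodes."""
    tag: str
    children: list["Node"] = field(default_factory=list)


def signature(node: Node) -> str:
    """Serialise the structure of a subtree, e.g. 'li(a,span)'."""
    if not node.children:
        return node.tag
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def height(node: Node) -> int:
    """Number of levels in the subtree rooted at node (a leaf has height 1)."""
    return 1 + max((height(c) for c in node.children), default=0)


def perfect_clusters(root: Node) -> dict[tuple[int, str], list[Node]]:
    """Group subtrees whose roots sit at the same depth and match exactly."""
    groups = defaultdict(list)
    stack = [(root, 0)]
    while stack:
        node, depth = stack.pop()
        groups[(depth, signature(node))].append(node)
        stack.extend((child, depth + 1) for child in node.children)
    # Only structures that repeat at least twice form a cluster.
    return {key: members for key, members in groups.items() if len(members) > 1}


def rating(members: list[Node]) -> int:
    """Assumed rating: number of repetitions times subtree height."""
    return len(members) * height(members[0])


# Usage: a list with three identical items and one slightly different item.
def li(*tags: str) -> Node:
    return Node("li", [Node(t) for t in tags])


page = Node("body", [Node("ul", [li("a", "span"), li("a", "span"),
                                 li("a", "span"), li("a")])])
clusters = perfect_clusters(page)
best = max(clusters, key=lambda key: rating(clusters[key]))
print(best, rating(clusters[best]))
```

In this toy page the repeated `li(a,span)` items at depth 2 receive the highest rating, while the lone `li(a)` item forms no perfect cluster of its own and would have to be attached later through the similarity comparison.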
All the other perfectly matching sub clusters must now satisfy two conditions: the average score they obtain in comparison with all the perfectly matching clusters has to be higher than the predetermined matching score, and every individual matching score has to be higher than a lower predetermined threshold. In practice, the second condition proved less significant, because the number of elements in the average is usually small, so a single low score drastically affects the average. The way in which the matching score should be calculated is still open to discussion, but the most promising method is to traverse the DOM tree of each of the two perfectly matching clusters and create a corresponding string based on the nodes encountered. By computing the Levenshtein distance between the two strings, we can determine the matching score between the two perfectly matching sub clusters.
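A minimal sketch of this scoring step is given below, assuming each sub cluster has already been serialised to a string. The normalisation of the Levenshtein distance into a score between 0 and 1, the function names, and the parameter names `avg_min` and `floor` are assumptions made for illustration; the text leaves the exact score calculation open.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def matching_score(a: str, b: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for disjoint ones."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def accepts(candidate: str, members: list[str],
            avg_min: float, floor: float) -> bool:
    """Both conditions: average above avg_min and every score above floor."""
    if not members:
        return False
    scores = [matching_score(candidate, m) for m in members]
    return sum(scores) / len(scores) > avg_min and min(scores) > floor


# Usage with hypothetical serialised sub clusters.
print(matching_score("li(a,span)", "li(a,span,b)"))
print(accepts("abcd", ["abcd", "abcx"], avg_min=0.8, floor=0.7))
```

With only a few members in the cluster, one poor score pulls the average down sharply, which is consistent with the observation that the lower threshold rarely decides the outcome on its own.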

The drawback of the method is that it can produce slightly different outputs for the same page, depending on the order in which the sub clusters appear. This can be solved, at the cost of some computational overhead, by repeating the whole process once a complex candidate cluster is formed. This way, we can include other perfectly matching sub clusters that were missed.
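The order dependence and its remedy can be illustrated with a small sketch. It is one possible reading of "repeating the whole process", not the authors' procedure: candidates are re-examined until a full pass adds nothing new. The `positional_similarity` function is a toy stand-in for the real matching score, and all names and thresholds here are hypothetical.

```python
from typing import Callable


def positional_similarity(a: str, b: str) -> float:
    """Toy stand-in score: share of positions holding the same character."""
    length = max(len(a), len(b))
    if length == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / length


def grow_cluster(base: str, candidates: list[str],
                 score: Callable[[str, str], float],
                 avg_min: float, repeat: bool = True) -> list[str]:
    """Join candidates whose average score against the cluster exceeds avg_min.

    With repeat=True, passes over the rejected candidates continue until a
    full pass joins nothing, so the result no longer depends on input order.
    """
    cluster = [base]
    remaining = list(candidates)
    while True:
        joined = False
        for cand in list(remaining):
            avg = sum(score(cand, m) for m in cluster) / len(cluster)
            if avg > avg_min:
                cluster.append(cand)
                remaining.remove(cand)
                joined = True
        if not (repeat and joined):
            return cluster


# "aabb" is rejected while the cluster holds only "aaaa", but it qualifies
# once "aaab" has joined, so a single pass misses it.
single = grow_cluster("aaaa", ["aabb", "aaab"], positional_similarity, 0.6,
                      repeat=False)
stable = grow_cluster("aaaa", ["aabb", "aaab"], positional_similarity, 0.6)
print(single)
print(stable)
```

The extra passes are the computational overhead mentioned above: each repetition rescans the candidates that were rejected earlier against the enlarged cluster.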
