Optimizing Web Page Layout Using an Annealed Genetic Algorithm as Client-Side Script
J. González Peñalver, J. J. Merelo
Department of Architecture and Computer Technology, University of Granada (Spain)
{jesus|jmerelo}@kal-el.ugr.es, http://kal-el.ugr.es/geneura
Abstract. The high volume of information available on the Internet makes it necessary to use search and organization tools to filter and display it. This presentation must make efficient use of the surface the browser leaves; this is even more true in the case of personalized newspapers, in which all the news and publicity must be presented in only one "coup d'oeil", to make them as effective as possible. In this paper, a system to automatically paginate web newspapers on the browser is presented. The system uses a genetic algorithm with integer representation and variable mutation amplitude, fine-tuned by a greedy algorithm. This combination proves to be much better than the genetic algorithm alone. The algorithm is shown to be able to lay out the web page in real time, that is, in a time that is insignificant compared with the time it takes to load an average page. The system will be embedded in several personalized news sites that are being developed at Granada University.
1 Introduction
In these days of information overload, one of the few resources that are in short supply is screen real estate. When a request is made to an Internet search engine, answers are dumped out onto the user's browser, with the user having to scroll down or check several pages to find what she wants. If the search engine is specialized in news, the ideal answer should look as much as possible like the original newspaper, that is, all news headers should appear on screen, with short paragraphs describing them in more depth. The same applies to a personalized web newspaper: if a person describes the kind of news she is interested in, and these news items fall in several categories, all categories should be presented at the same time when the web page is loaded. Real estate occupation must be minimized, and at the same time empty screen spaces should be avoided, since the more the screen surface is used, the better. Furthermore, when push technologies start to be more fashionable (and standard), laying out all user windows and channels on the screen will be a challenge, and it will have to be done in real time.
The problem of laying out text rectangles in a limited surface, minimizing the empty space in it, is not really new; only the medium is new. It is the same problem that, for instance, yellow pages typesetters faced: if they had to typeset thousands of pages by hand, it would also take thousands of man-hours. That is why it has been tackled using automated procedures by several firms, one of them in Finland, using the so-called V.I.PE system [12], and another one in Germany, with the YPSS++ system (which is quoted, but not presented, in Graf's web page [4]). A similar problem appears in fax newspapers, in which news items are clipped and sent by fax to customers. Both problems were treated in Lagus et al.'s paper [12]. In a general context, it also occurs in multimedia and hypermedia systems, as reviewed in Hower et al.'s paper [8]. Besides, the automatic layout of a personalized newspaper has been dealt with in the Krakatoa project [10].
In general, laying out the different articles that form a newspaper can be considered a two-dimensional bin-packing problem (if the content is fixed, and the space can be varied) or a two-dimensional multiple-knapsack problem (if the space is fixed and the contents can be selected). These problems are in general NP-hard, but in the case of a newspaper that is going to be laid out in a grid of a fixed number of columns and rows, it becomes linear in time [12].
The newspaper layout or pagination problem can then be formulated as the problem of laying out a fixed set of news items contained in rectangular boxes, with no overlapping, putting all of them within the limits of the browser, in a minimum surface and in a minimum amount of time. Since a web page can have an a priori infinite length, but a finite width, all available articles can be put on a page (with, of course, a weight limit, but if articles have no graphics, that is not really a problem). This means that laying out a web page is a fixed-content, variable-space problem, that is, a two-dimensional bin-packing problem. This problem has been widely studied, and has been solved using genetic algorithms before [2].
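The formulation above can be written down compactly. The notation is introduced here for illustration only and is not part of the original formulation: box $i$ has fixed width $w_i$ and height $h_i$, its upper left corner is at $(x_i, y_i)$, and $W$ is the available browser width. Under those assumptions, one possible statement of the problem is:

```latex
\begin{aligned}
\min_{(x_i,\, y_i)} \quad
  & \Bigl(\max_i \,(x_i + w_i)\Bigr)\,\Bigl(\max_i \,(y_i + h_i)\Bigr) \\
\text{subject to} \quad
  & x_i \ge 0, \quad y_i \ge 0, \quad x_i + w_i \le W
    \qquad \text{for all } i, \\
  & x_i + w_i \le x_j \;\lor\; x_j + w_j \le x_i \;\lor\;
    y_i + h_i \le y_j \;\lor\; y_j + h_j \le y_i
    \qquad \text{for all } i \ne j .
\end{aligned}
```

The objective is the surface of the smallest rectangle, anchored at the page origin, that contains every box; the last constraint states that any two boxes are separated horizontally or vertically, that is, that they do not overlap.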
The approach presented in this paper is to combine a genetic algorithm with variable mutation amplitude and a greedy algorithm to solve that problem, and to do it using a program that is delivered to the web client along with the content. There are two ways of paginating a web page that is going to be delivered to a client: do it on the web server, or do it on the client, that is, the user's own computer. Since a personalized news site is expected to be a high-volume site, many requests could pile up at any one time. If each delivered page is laid out on the server, this will most probably result in server overload and in an overall slowdown of server operation. On the other hand, if a lean chunk of code can be delivered along with the information, and that code can carry out pagination with enough speed on the client, no server resources are consumed (other than those used to process the query and send the content), and the whole process can have a much higher throughput, being able to process many more requests per time unit.
In this paper, we describe how the layout of a web page served by a personalized news site is optimized, on the client, using a genetic algorithm implemented in JavaScript. The rest of the paper is organized as follows: the state of the art in layout optimization, especially page layout optimization, is described in section 2; the genetic algorithm used is presented in section 3, followed by results in section 4 and a discussion and presentation of future lines of work in section 5.
2 State of the art
Some of the results obtained in facility layout [7, 6, 3, 11] can be applied to automatic pagination, although the formulation of a space layout problem is slightly different, since relations among the different objects are assumed. Different news items in a page can be related by their content, but the page layout algorithm is not concerned with this, at least not at this stage of research. In any case, since it is an NP-hard problem which starts to be intractable for more than 10 objects, genetic algorithms have been used to find suboptimal solutions by Kochhar et al. [11] and Gero et al. [3]. In facility layout problems, the objects placed can have shapes other than rectangular, they are usually placed in a fixed grid, and the restrictions are that, besides not overlapping, machines that work in sequence must be close to one another, paths that lead from one machine to another should not overlap, and so on. These requirements are different from what we find in page layout optimization. Besides, facilities are usually divided into a grid, and objects within those facilities occupy a small number of cells, which allows the problem to be treated as a combinatorial optimization problem, instead of a function optimization problem: minimization of the surface occupied by the articles.
Some papers have also focused on automatic layout of multimedia presentations [8], but with an emphasis on a correct presentation of information more than on cramming the maximum amount of information into the available screen space. Automatic generation of presentations is treated as a constrained optimization problem by some researchers like Graf et al. [5], whose approach is to take into account constraints such as semantic and pragmatic relations specified by a presentation planner.
The problem is much more complex than page layout, but some of its results could be used when applying automatic layout to real newspapers; for instance, placing related articles together, or the most important articles at the top.
Personalized newspaper layout, as such, has been treated in several papers. The Krakatoa project [10] is a personalized newspaper that, presented in the form of a Java applet, customizes the layout for each user. The newspaper layout does not depend on the size of each article, but on the user and community preferences; thus, it does not really optimize layout: it typesets the newspaper in two columns, with the available space divided among articles depending on the user and user community profile.
Another group of workers in the Finnish research institute VTT applied simulated annealing (SA) to the optimization of page layout, for paginating fax newspapers and the yellow pages of several countries [12]. In that paper, a heuristic and two simulated annealing methods are presented. The best simulated annealing algorithm (SA2) selects which articles are going to be included, and places them on the page at the same time. Overlapping is allowed, and sometimes a slight overlap of articles is observed in the final result. The two SA algorithms are the following:
- In algorithm SA1, the articles used are fixed; they can be changed in shape and placed in different positions. Total surface and overlap are minimized. Results mentioned in the paper indicate that the method is not too good, taking too much time and resulting in bad layouts.
- In algorithm SA2, the articles are chosen from the pool and placed using a best-fit strategy. The operators that move the SA algorithm from one configuration to another are insert article and remove article. This method is the best of all, and it takes only 30 seconds on a Sun Microsystems SPARC20 workstation. This delay is probably too much for setting a web page, but most of today's machines are faster than that, so it can be considered acceptable.
The method presented in this paper tries to solve a problem very similar to that of SA1, but using a fine-tuned genetic algorithm instead of simulated annealing. GAs are usually slower than SA, but manage to find better solutions; that is one of the reasons why a GA is used in this paper. Besides, the time requirement is also important: layout should be around one order of magnitude faster than what is achieved by both SA algorithms. The object of this paper is to present a method that is, at the same time, faster and more accurate than SA.
3 Method
The surface of the window is divided into columns and rows, for instance, 4 columns and a row every 50 pixels. This has the effect of reducing the search space and, at the same time, making the result more similar to real newspapers. Article boxes have a fixed width (a multiple of the column width) and a fixed height (a multiple of the row height), but the position of their upper left corner is variable. The chromosome includes the representation of an (x, y) pair for each article, where x represents the column and y the row where its upper left corner lies.
In principle, a bitstring representation could have been used; but JavaScript does not feature packed bitstrings as a native data type, and besides, a representation adequate for the data type that is going to be evolved was chosen, as proposed by Michalewicz [13] and other authors. Thus, each box position is represented by a pair of integers: the genetic representation includes the coordinates of each box as such, so it does not have to be decoded to compute fitness. Each chromosome is an array of 2*(number of article boxes) integers. To our knowledge, this representation has not been used before for this problem, although we cannot rule it out.
Once the genetic representation has been chosen, the genetic operators should be able to act on it. Classical bit-flip mutation is not an option, since a binary representation is not used, and neither is gene-disrupting crossover; but a diversity-generating, mutation-like operator will be needed, as well as a feature-combination operator like crossover.
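The integer representation described above can be sketched in JavaScript as follows. This is a minimal illustration, not the authors' script: the grid dimensions, the article sizes and all function names are hypothetical, and sizes are expressed in grid cells (columns and rows) rather than pixels.

```javascript
// Hypothetical grid and article sizes, in columns and rows (not from the paper).
const COLUMNS = 4;
const MAX_ROWS = 20;
const boxes = [
  { w: 2, h: 3 },
  { w: 1, h: 2 },
  { w: 1, h: 4 },
];

// Uniform random integer in [0, n).
function randomInt(n, rng = Math.random) {
  return Math.floor(rng() * n);
}

// Build one chromosome: [x0, y0, x1, y1, ...], the upper left corner of each box.
function randomChromosome(boxList, rng = Math.random) {
  const genes = [];
  for (const box of boxList) {
    genes.push(randomInt(COLUMNS - box.w + 1, rng)); // column, keeps the box inside the width
    genes.push(randomInt(MAX_ROWS, rng));            // row
  }
  return genes;
}

// Read the coordinates of box i directly; no decoding step is needed.
function positionOf(chromosome, i) {
  return { x: chromosome[2 * i], y: chromosome[2 * i + 1] };
}

const chromosome = randomChromosome(boxes);
console.log(chromosome.length); // 6, that is, 2 * (number of article boxes)
```

Storing the pairs in a flat array keeps the chromosome a plain JavaScript array of integers, which is the data type the operators act on directly.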
In this case, mutation changes the x or y coordinate so that the article box moves an integer number of columns or an integer number of rows. The amplitude of this change decreases with time following a hyperbolic function (that is why this algorithm is called annealed); at the beginning, positions change widely, while only small changes are allowed by the end of the run.
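A possible sketch of such an annealed mutation is given below. The exact hyperbolic schedule is not specified in the text, so the form A(g) = A0 / (1 + g), rounded and never below one cell, is an assumption; the parameter names (initialAmplitude, maxColumn, maxRow) are likewise hypothetical. For simplicity, the same coordinate limits are used for every box, regardless of its size.

```javascript
// Amplitude schedule: an ASSUMED hyperbolic decay, never below one grid cell.
function amplitude(generation, initialAmplitude) {
  return Math.max(1, Math.round(initialAmplitude / (1 + generation)));
}

// Return a copy of the chromosome in which one randomly chosen coordinate
// has moved by 1..amplitude cells in a random direction, clamped to the grid.
function mutate(chromosome, generation, opts, rng = Math.random) {
  const { initialAmplitude, maxColumn, maxRow } = opts; // hypothetical parameters
  const child = chromosome.slice();
  const gene = Math.floor(rng() * child.length);
  const amp = amplitude(generation, initialAmplitude);
  const step = (1 + Math.floor(rng() * amp)) * (rng() < 0.5 ? -1 : 1);
  const limit = gene % 2 === 0 ? maxColumn : maxRow; // even index: x, odd index: y
  // Clamping means a box already at the border may stay where it is.
  child[gene] = Math.min(limit, Math.max(0, child[gene] + step));
  return child;
}

const mutated = mutate([0, 0, 2, 5], 0, { initialAmplitude: 8, maxColumn: 3, maxRow: 19 });
console.log(mutated);
```

With this schedule and an initial amplitude of 8, a box can move up to 8 cells in generation 0, up to 4 in generation 1, and a single cell from generation 7 onwards.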
Fig. 1. How the mutation operator works: it moves one of the article boxes, in one direction, by an integer number of rows or columns; in this case, the article box in the lower right corner has moved several rows down the page.
The crossover operator interchanges the genes of two chromosomes that are situated between two randomly chosen points. It is quite similar to normal 2-point crossover, except that coordinates are fully transferred to the offspring. Two-point crossover is considered more efficient than single-point crossover in normal genetic algorithms; the same holds for integer-representation algorithms. Besides, this crossover operator combined with the integer representation avoids the gene-disrupting actions of normal binary crossover, and makes it act as a pure crossover operator, without mutating any gene at the same time, as happens when a binary crossover operator splits some genes.
Selection, reproduction and elimination procedures correspond to a steady-state algorithm: a part of the population is eliminated and substituted by the offspring of the remaining individuals. The proportion of the population that is eliminated is the sum of the proportion that undergoes mutation (Pm) and the proportion that undergoes crossover (Px), with Px + Pm < 1. The genetic algorithm is run for a prefixed number of generations, usually 20, before a greedy algorithm is applied for fine-tuning. If there is overlapping after the prefixed number of generations, the GA continues until there is no more overlapping. A population is always composed of 50 tentative layouts, which have been found to be enough for the problem, and which fit well within the memory constraints of a JavaScript script, which cannot use more than 30K. The JavaScript GA source code is delivered together with the page, is public domain, and is available from the demo web page http://gargamel.ugr.es/~jesus/layout.
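The crossover operator and the steady-state replacement can be sketched as follows. Several details are assumptions made for the sake of a runnable example and are not stated in the text: cut points are placed on box boundaries so that (x, y) pairs stay together, the eliminated part of the population is the worst one, and parents are drawn uniformly from the survivors. The function names are hypothetical.

```javascript
// Two-point crossover on whole (x, y) pairs: cut points fall between boxes
// (an assumption), so no coordinate pair is ever split.
function crossover(parentA, parentB, rng = Math.random) {
  const nBoxes = parentA.length / 2;
  let a = Math.floor(rng() * (nBoxes + 1));
  let b = Math.floor(rng() * (nBoxes + 1));
  if (a > b) [a, b] = [b, a];
  const childA = parentA.slice();
  const childB = parentB.slice();
  for (let g = 2 * a; g < 2 * b; g++) {
    childA[g] = parentB[g];
    childB[g] = parentA[g];
  }
  return [childA, childB];
}

// One steady-state generation: the worst (pm + px) fraction of the population
// is replaced by mutated and recombined copies of the survivors.
// `fitness` is a cost to be minimized; `mutateFn` is any mutation operator.
function nextGeneration(population, fitness, pm, px, mutateFn, rng = Math.random) {
  const sorted = population.slice().sort((p, q) => fitness(p) - fitness(q));
  const nMut = Math.floor(pm * sorted.length);
  const nCross = Math.floor(px * sorted.length);
  const survivors = sorted.slice(0, sorted.length - nMut - nCross);
  const pick = () => survivors[Math.floor(rng() * survivors.length)];
  const offspring = [];
  for (let i = 0; i < nMut; i++) offspring.push(mutateFn(pick()));
  while (offspring.length < nMut + nCross) {
    offspring.push(...crossover(pick(), pick(), rng));
  }
  // Crossover yields two children at a time, so trim any surplus.
  return survivors.concat(offspring.slice(0, nMut + nCross));
}
```

Because Px + Pm < 1, at least part of the population always survives, and the population size stays constant from one generation to the next.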
3.1 Fitness function
The function to be minimized in this problem is the total area, taking into account the restrictions of no overlapping and of using only the available width. The width restriction is already taken into account by minimizing the surface (that is, minimal surface already means minimal width), but the no-overlapping restriction must also be included in the fitness function.
Fig. 2. Overlapping article boxes. t is the vertical dimension of the overlapped rectangle, s the horizontal one. The pictures on the right-hand side show the two possible layouts without overlapping that are estimated to be as far from the best one as the first one.
Taking that into account, the fitness function F calculates the surface of the smallest rectangle that can contain the layout stored in a chromosome. Once this rectangle is computed, a penalty is applied for each pair of boxes that overlap. As shown in figure 2, the function estimates the two layouts without overlapping that are as far from the optimal one as the original one, and chooses the one with the bigger surface. This function can be defined as follows:
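$$F = \Bigl(x + \sum_{i=1}^{n} x_i\Bigr)\Bigl(y + \sum_{i=1}^{n} y_i\Bigr) \qquad (1)$$

$$(x_i,\, y_i) = \begin{cases} (2 s_i,\ 0) & \text{if } (x + 2 s_i)\, y > x\, (y + 2 t_i) \\ (0,\ 2 t_i) & \text{otherwise} \end{cases}$$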
where x and y are the width and height of the smallest rectangle that contains the layout, n is the number of overlaps in the layout, x_i and y_i are the horizontal and vertical penalty terms, and s_i and t_i measure the extent of the horizontal and vertical overlap, respectively, for each overlap i. That is, the fitness function includes a penalty term that varies with the amount of overlapping; that way, good solutions with no overlapping can evolve from solutions with a lot of overlapping, through solutions with less overlapping. Using a penalty function usually obtains better results than simply eliminating invalid solutions, as has already been pointed out by some researchers [1]. In almost every optimization problem, the optimal solution is close to a lot of bad solutions. If these bad solutions are eliminated in the search process, it will be nearly impossible to find the optimal one, because the search process will look for solutions far from it [9]. That is why the fitness function estimates the distance from a solution with overlapping to the best one and adds twice this distance, as shown in figure 2, to get the surface of a solution that is as far from the best one as the original, but has no overlapping. Fitness, which is equivalent to area plus penalty, is minimized.
The usual evolution of a GA run is shown in figure 3, with the average and minimum cost plotted against the number of generations. The cost or fitness after the greedy algorithm is not plotted. The algorithm usually starts by eliminating solutions with overlap, and then minimizes the surface that all boxes occupy.
[Figure 3: plot titled "Cost Evolution", showing best cost and average cost against generations; vertical axis from 200000 to 1800000.]
Fig. 3. Evolution of fitness during a typical GA run. The y axis is the surface plus penalty, in square pixels, and the x axis is the generation number. The fitness usually goes down steeply at the beginning of the run, with a slow decrease towards the end.
The main problem of this fitness function is that empty space within the rectangle that surrounds all article boxes is not minimized by the GA. This is usually taken care of by the greedy algorithm, and will be taken into account in further versions of the algorithm. For instance, in figure 2, the rectangle in the lower right corner could move freely without overlapping with any other box, and without its right-hand side going beyond x, and the layout would still have the same fitness. However, the resulting layouts are not equally good: the layout is much better if that box is as close as possible to the boxes on the left and top.
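The fitness computation of section 3.1 can be sketched as below, as we read equation (1). The sketch assumes that the containing rectangle is anchored at the page origin, that box sizes are given in the same units as the coordinates, and that boxes[i] = { w, h } holds the fixed size of article i; these names are hypothetical.

```javascript
// chromosome = [x0, y0, x1, y1, ...]; boxes[i] = { w, h } (fixed sizes).
function fitness(chromosome, boxes) {
  // Smallest rectangle, anchored at the origin (an assumption), containing every box.
  let x = 0;
  let y = 0;
  for (let i = 0; i < boxes.length; i++) {
    x = Math.max(x, chromosome[2 * i] + boxes[i].w);
    y = Math.max(y, chromosome[2 * i + 1] + boxes[i].h);
  }
  // Penalty terms: one per pair of overlapping boxes.
  let sumX = 0;
  let sumY = 0;
  for (let i = 0; i < boxes.length; i++) {
    for (let j = i + 1; j < boxes.length; j++) {
      const xi = chromosome[2 * i], yi = chromosome[2 * i + 1];
      const xj = chromosome[2 * j], yj = chromosome[2 * j + 1];
      const s = Math.min(xi + boxes[i].w, xj + boxes[j].w) - Math.max(xi, xj); // horizontal overlap
      const t = Math.min(yi + boxes[i].h, yj + boxes[j].h) - Math.max(yi, yj); // vertical overlap
      if (s <= 0 || t <= 0) continue; // the two boxes do not overlap
      // Keep the estimate with the bigger surface: widen by 2s or lengthen by 2t.
      if ((x + 2 * s) * y > x * (y + 2 * t)) sumX += 2 * s;
      else sumY += 2 * t;
    }
  }
  return (x + sumX) * (y + sumY);
}

const demoBoxes = [{ w: 2, h: 2 }, { w: 2, h: 2 }];
console.log(fitness([0, 0, 2, 0], demoBoxes)); // 8: no overlap, 4 x 2 rectangle
console.log(fitness([0, 0, 1, 1], demoBoxes)); // 15: 3 x 3 rectangle plus one overlap penalty
```

In the second call the two boxes share a 1 x 1 region, so s = t = 1; both estimates give the same surface, the "otherwise" branch applies, and F = 3 * (3 + 2) = 15.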
3.2 Improving the solution by means of a greedy algorithm
Since genetic algorithms are not gradient-descent algorithms, it usually takes them a long time to reach the best solution, and even then, finding the best solution is not guaranteed. In this application, real-time operation is a must; that is why a greedy surface-size gradient-descent algorithm is applied at the end of the genetic algorithm to reach the best solution. Using only a greedy algorithm would not always find the best solution from the initial values, and a genetic algorithm would take a long time to reach the global solution; using a genetic algorithm followed by a greedy one takes the best of both worlds.
There could be other ways of combining stochastic optimization algorithms like GAs with gradient-descent algorithms like a greedy algorithm: for instance, a greedy algorithm could have been applied to every solution in each generation, but this would make the genetic algorithm much slower. The greedy algorithm applied after the genetic search just moves each article box to the left and to the top for as long as it does not overlap with others, eliminating the gaps between boxes.
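A minimal sketch of such a greedy pass is shown below. It assumes that the input layout is already free of overlaps (the GA runs until this is the case), that boxes move one grid cell at a time, and that passes are repeated until no box can move; the order in which boxes are visited and the function names are hypothetical choices, not taken from the authors' script.

```javascript
// True if box a placed at (ax, ay) overlaps box b placed at (bx, by).
function overlaps(ax, ay, a, bx, by, b) {
  return ax < bx + b.w && bx < ax + a.w && ay < by + b.h && by < ay + a.h;
}

// True if box i can be placed at (x, y) without leaving the page or touching another box.
function isFree(layout, boxes, i, x, y) {
  if (x < 0 || y < 0) return false;
  for (let j = 0; j < boxes.length; j++) {
    if (j !== i && overlaps(x, y, boxes[i], layout[2 * j], layout[2 * j + 1], boxes[j])) {
      return false;
    }
  }
  return true;
}

// Slide every box left, then up, one cell at a time, until nothing moves.
function greedyCompact(chromosome, boxes) {
  const layout = chromosome.slice();
  let moved = true;
  while (moved) {
    moved = false;
    for (let i = 0; i < boxes.length; i++) {
      while (isFree(layout, boxes, i, layout[2 * i] - 1, layout[2 * i + 1])) {
        layout[2 * i]--;
        moved = true;
      }
      while (isFree(layout, boxes, i, layout[2 * i], layout[2 * i + 1] - 1)) {
        layout[2 * i + 1]--;
        moved = true;
      }
    }
  }
  return layout;
}

const squares = [{ w: 2, h: 2 }, { w: 2, h: 2 }];
console.log(greedyCompact([1, 3, 3, 5], squares)); // [ 0, 0, 0, 2 ]
```

Every move lowers one coordinate by one cell and coordinates cannot go below zero, so the loop always terminates. The result depends on the visiting order, which is why this pass is only a local fine-tuning step after the genetic search.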
4 Results
The first result is that the GA+greedy algorithm combination finds web pages with an optimal layout; a demo is running at http://gargamel.ugr.es/~jesus/layout (not yet integrated in a personalized news site). An example of the web page after the algorithm is run is shown in figure 4. Second, we wanted to check how well the algorithm performs depending on how crowded the page is; it should be more difficult, or at least it should take longer, when the surface occupied by the articles in relation to the total web page surface is higher. We used the same set of articles, reducing the web page surface, so that articles occupied 50% to 90% of the available surface in 10% increments. Results are shown in table 1.
Table 1.
Surface%   50%              60%              70%              80%              90%
T          6500 ± 180       6700 ± 100       6730 ± 121       6666 ± 233       6664 ± 175
F          280000 ± 26000   279000 ± 24000   279000 ± 24000   330000 ± 70000   318000 ± 43000
In this table, T is the mean and standard deviation of execution times, measured in milliseconds on an Intel Pentium II 233MHz with 64MB of main memory, and F is the mean and standard deviation of the fitness over 10 executions. The time needed to find the solution is virtually the same in all cases, but the surfaces of the solutions found are slightly bigger for more crowded environments, with a higher standard deviation too. However, solutions are good enough; it should be taken into account that the default surface in most web browsers (without resizing) is 640x480 = 307200 pixels, a size that is lately being replaced by the bigger 800x600 = 480000 pixels.
Another set of tests was made to prove the need for the greedy algorithm applied to the best chromosome found by the genetic search. A pure genetic algorithm without greedy fine-tuning was run 5 times, and needed 200 generations on average, with a range of (24, 695); on average, a genetic algorithm alone would take around 5 times longer than the genetic+greedy combination to find a solution. It could take up to 10 times as long in the worst case.
Fig. 4. Final appearance, after the GA and the greedy algorithm, of a simulated newspaper page with 7 articles.
5 Discussion
The Internet has opened a wide field of applications in the last few years, but few evolutionary algorithms have been applied to it so far. In particular, this paper opens the possibility of sending a genetic algorithm along with web pages so that it might perform many duties within them, from layout optimization, through any tutorial application we might come up with, to functional optimization. The JavaScript genetic algorithm script presented in this paper is able to perform the layout of a news web page in real time on a Pentium machine, that is, in a time significantly shorter than the typical load time for a web page (which could be placed at around one minute). The algorithm's speed is of the same order of magnitude as, or faster than, that of the simulated annealing algorithms presented in Lagus et al.'s paper [12], although this might basically be due to the use of a faster computer; performance seems a bit better, but this is difficult to compare.
The JavaScript implementation of the genetic algorithm is available from the authors, and besides, the script will be integrated in several personalized news sites that are already working at Granada University, like, for instance, the Spanish newspaper search engine at http://www-etsi2.ugr.es/hermes.
In the future, the fitness function will be improved to take into account the empty spaces between the articles, so that the greedy algorithm can be eliminated. Other applications will also be investigated: for instance, application to interactive chat web pages, and layout of personal ads pages. It would also be interesting to implement simulated annealing in the same web page, and compare time and performance for the same setup: same language, same computer.
6 Acknowledgements
This work has been supported in part by CICYT's project BIO96-0895 (Spain) and DGICYT's project PB-95-0502.
References
1. A. E. Smith and D. M. Tate. Genetic optimization using a penalty function. In Stephanie Forrest, editor, Proceedings of the 5th International Conference on Genetic Algorithms, pages 499-505. University of Illinois at Urbana-Champaign, Morgan Kaufmann, July 17-21, 1993.
2. G. Bilchev. Evolutionary metamorphs for the bin packing problem. In Proceedings of the Fifth Annual Conference on Evolutionary Computing, 1996.
3. J. S. Gero and V. A. Kazakov. Evolving design genes in space layout planning problems. Technical report, Dept. of Architectural and Design Science, University of Sydney, 1997.
4. W. H. Graf. Graf's home page. Web address: http://www.dfki.de/~graf/.
5. W. H. Graf. Constraint-based graphical layout of multimodal presentations. In Levialdi, Catarci, Costabile, editors, Advanced Visual Interfaces, Procs. of the Int. Workshop AVI92, World Scientific Series in Computer Science. World Scientific Press, 1992.
6. S. S. Heragu. Recent models and techniques for solving the layout problem. European Journal of Operations Research, (57):136-144, 1992.
7. S. S. Heragu and A. S. Alfa. Experimental analysis of simulated annealing based algorithms for the layout problem. European Journal of Operations Research, (57):190-202, 1992.
8. W. Hower and W. H. Graf. Research in constraint-based layout, visualization, CAD, and related topics: A bibliographical survey. Technical report, Deutsches Forschungszentrum für Künstliche Intelligenz GmbH, 1995. Research Report RR95-12.
9. J. T. Richardson, M. R. Palmer, G. Liepins, and M. Hilliard. Some guidelines for genetic algorithms with penalty functions. In J. David Schaffer, editor, Proceedings of the Third International Conference on Genetic Algorithms, pages 191-197, San Mateo, California, June 4-7, 1989. George Mason University, Morgan Kaufmann.
10. Tomonari Kamba, Krishna Bharat, and Michael C. Albers. The Krakatoa chronicle - an interactive, personalized newspaper on the web. Technical Report 95-25, Graphics, Visualisation and Usability Center, Georgia Institute of Technology, USA, 1995.
11. J. S. Kochhar, B. T. Foster, and S. S. Heragu. Hope: A genetic algorithm for the unequal area facility layout problem. Computers and Operations Research, 1997.
12. K. Lagus, I. Karanta, and J. Ylä-Jääski. Paginating the generalized newspaper - a comparison of simulated annealing and a heuristic method. In Hans-Michael Voigt, Werner Ebeling, Ingo Rechenberg, and Hans-Paul Schwefel, editors, Parallel Problem Solving From Nature - PPSN IV, volume 1141 of Lecture Notes in Computer Science, pages 595-603, Dortmund, Germany, September 1996. Springer-Verlag.
13. Zbigniew Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, 2nd edition, 1994.