FOUNDATIONS OF COMPUTING AND DECISION SCIENCES, Vol. 36 (2011), No. 2
DATABASE SCHEME OPTIMIZATION FOR ONLINE APPLICATIONS

Jakub MARSZAŁKOWSKI ∗, Jędrzej M. MARSZAŁKOWSKI †, Jędrzej MUSIAŁ ‡
Abstract. With the rapid growth of Internet usage, e-commerce projects such as content management systems, online shops, social networking pages, and online web-based gaming face new performance challenges. This note presents information on MySQL database systems working in the background of web-based applications. The effect of table name length on simple queries is measured in an experimental environment, and conclusions are drawn about execution time and the amount of data sent. Moreover, comparative performance tests for long and short table names are carried out on a model and two real web-based applications. Finally, a simple yet effective database scheme optimization is proposed.

Keywords: Database, database scheme optimization, MySQL, e-commerce, web-based applications.
1 Introduction
The number of Internet users has been growing rapidly during the last decade. According to the latest usage statistics [6], 28.7% (1966 million) of the world population has access to the Internet. North America has the highest penetration, equal to 77.4%, followed by Oceania/Australia (61.3%), Europe (58.4%), Latin America (34.5%), the Middle East (29.8%), Asia (21.5%) and Africa (10.9%). Most importantly, growth over the last decade (2000-2010) reached 444.8%, from 360 million to the above-mentioned 1966 million.

∗ Poznan University of Technology, ul. Piotrowo 2, 60-965 Poznan, Poland ([email protected])
† alias.net ([email protected])
‡ Poznan University of Technology, ul. Piotrowo 2, 60-965 Poznan, Poland ([email protected])
According to the GUS [4] report, Internet penetration in Poland reaches 57% (7.1 million people). Another survey states that the penetration is equal to 50%; the difference may result from research methodologies.

So many Internet users generate a tremendous number of web page views and requests, which implies the need to optimize all possible elements of web-based applications. The database is among the most important ones. A survey of American Internet users' behaviour published by the Pew Internet & American Life Project [11] shows that e-commerce revenue grew from $7.4 billion in the middle of 2000 to $34.7 billion in the third quarter of 2007.

The rest of the paper is organized as follows. In section 2 we present and describe the usage of web-based applications. Literature references, as well as surveys, were analysed to provide the most important information on this part of the e-commerce market. Section 3 presents information on database systems working in the background of web-based applications. The database scheme for the measurement experiment is described in section 4, and it is tested and analysed in section 5. Tests on a model and two real applications, together with their analysis, are presented in section 6. Conclusions, proposals of good practice for the performance of web-based applications, and suggestions for future research are given in section 7.
2 Web-based applications usage
An increasing number of Internet users, as well as new software agent technologies [2, 3, 5, 15], influence e-commerce growth and affect the world of online shopping. A forecast [13] predicts 70% growth in the proportion of people shopping online. By 2013, almost half of Europeans are expected to shop online, up from 21% in 2006.

The benefit of shopping online, compared to traditional shops, comes from lower prices and the possibility of shopping in distant locations. However, cheaper shopping over the Internet carries the extra cost of product delivery and the hidden cost of search effort. This extra work is time-consuming and requires manual interaction with price comparison systems, in which products are not defined in an identical manner. Often a human is needed to decide whether the compared offers relate to identical products. Searching for a bargain on a single product is fairly well supported by so-called price comparison applications. According to Alexa Rank [1], the most popular price comparators and Internet stores belong to the top 1000 most visited websites worldwide. MySQL is a very popular database system for these applications.

However, aside from online shopping, it is impossible to overlook other extremely fast-evolving e-commerce projects, such as content management systems, social networking pages and online web-based gaming.

A Content Management System (CMS) is typically a database application that enables the user to publish and administer web page content, where content comes in different forms: articles, papers, video streaming, audio data, press releases and company news, pictures, product information, question/answer information, and many more. Many good-quality CMSs are open source applications that can be downloaded and used without any purchase cost. The most popular systems include WordPress, Joomla!, Mambo, MODx and PHP-Nuke [12]. Most of these projects allow the use of several database systems, but encourage using the most popular one, MySQL, which is usually set by default.

Social Networking Services (SNSs) are websites where users establish an online personal portfolio, a so-called profile. SNSs make it possible to interact with friends (people added to the personal friend list): messaging them, sending photos, commenting on their posts and more. In effect, they replace real-life interaction with a computer and keyboard. The major advantage of SNSs is that people staying far away from each other can still interact pleasantly. The most popular SNSs are without a doubt Facebook and Myspace. Facebook has more than 500 million active users [1], and 50% of them log in on any given day, with daily page views of around 7 billion. It means that out of every 39 Internet users worldwide, 10 use Facebook, visiting 14 of its pages every day on average. Myspace, with 61 million users, is the second largest SNS.

According to Online Gaming 2010 (The NPD Group), 54% of people report that they personally play games online [10]. The average time spent on online gaming rose from 7.3 hours per week in the 2009 study to 8.0 hours per week in the 2010 study. Browser-based games are also available to players who have never bought a single computer game and have no intention of ever doing so. Many of those game projects are built using PHP and MySQL.
3 Database system for web-based applications
LAMP is a commonly used platform for web application development, where each letter of the acronym stands for a part of the system: Linux or another Unix-like open source operating system, the Apache web server, MySQL as the database system, and the PHP scripting language. By 2005 MySQL was used in about 6 million such applications [7]. With MySQL as the database system there are many well-known performance issues, some of them widely described [14]. Meanwhile, other problems that have never been addressed can still be found. The one considered in this paper, found accidentally during other MySQL performance research, is how table name length affects commonly used queries.

One way to classify systems for web-based applications is by database placement. Three levels can be distinguished, proceeding from the smallest to the largest web-based applications: shared hosting, a dedicated (or virtual private) server, and server farms. With a dedicated single server, the database system runs on the same machine as the rest of the server software, while in both other cases the database server is a separate machine that communicates with the web server, usually over TCP/IP. At the first level, shared hosting, performance is far from being the most important factor, but as web traffic grows, performance problems usually force migration to a dedicated server. This type of server is the main concern of the research in this paper; however, some hints for systems with TCP/IP communication between a separate web server and database server are also given.
4 Experimental database environment
We want to create a database experiment that reflects well how real web-based applications usually communicate with a database. To make it representative, we observed many applications belonging to the groups of price comparison sites, social networking sites, online shops, and CMSs. Some of them are briefly described in section 2. The most common type of query used there is a simple SELECT query selecting usually one row, more rarely a few rows, of data. Applications use from nearly a dozen to several dozen such queries to read the data necessary for generating a page. Data types are numeric or text, from the shortest to quite long.

The database scheme we created can be defined as follows:

n ∈ {1, 64} - length of the table name, one short and one long,
a = 32 - number of columns,
t ∈ {t1, t2, t3, t4} - type of attributes in columns:
    t1 tinyint (1 byte)
    t2 int (4 bytes)
    t3 char (32 bytes)
    t4 text (512 bytes)
r ∈ {1000, 2000, ..., 10000} - number of rows.

The chosen attribute types cover the widest set of typical needs of web services. The shortest one (tinyint) stores one byte of information and works as a flag or option selector where there are up to 256 possible values, such as gender, spoken language, country, etc. The integer attribute (int, 4 bytes) is commonly used for most numbers, such as values, numerical identifiers, timestamps and so on. The 32-byte char attribute can be used to store short text strings: names of persons and items, logins, passwords and other hash strings such as session identifiers are just some examples. The last field, text with 512 bytes, is useful for describing a product or person, or for other general text storage.
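The scheme above can be illustrated with a minimal sketch that generates the table definitions and the single-row SELECT queries. The column names (`id`, `c1` ... `c32`), the randomly generated table names and the helper function names are hypothetical; the paper does not give the actual DDL, so this is only one plausible reading of the scheme.

```python
import random
import string

# Attribute types t1..t4 as listed above. Using a TEXT column for
# "text (512 bytes)" is an assumption of this sketch.
ATTRIBUTE_TYPES = {
    "t1": "TINYINT",
    "t2": "INT",
    "t3": "CHAR(32)",
    "t4": "TEXT",
}
NUM_COLUMNS = 32  # a = 32


def make_table_name(length, seed=0):
    """Return a lowercase table name of exactly `length` characters."""
    rng = random.Random(seed)
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


def create_table_sql(table, type_key):
    """Build a CREATE TABLE statement: a primary key plus 32 data columns."""
    sql_type = ATTRIBUTE_TYPES[type_key]
    columns = ",\n".join(
        f"  `c{i}` {sql_type} NOT NULL" for i in range(1, NUM_COLUMNS + 1)
    )
    return (
        f"CREATE TABLE `{table}` (\n"
        f"  `id` INT NOT NULL PRIMARY KEY,\n"
        f"{columns}\n)"
    )


def select_sql(table, num_selected, row_id):
    """Build a SELECT of the first c columns of one row found by primary key."""
    cols = ", ".join(f"`c{i}`" for i in range(1, num_selected + 1))
    return f"SELECT {cols} FROM `{table}` WHERE `id` = {row_id}"


if __name__ == "__main__":
    short_name = make_table_name(1)   # n = 1
    long_name = make_table_name(64)   # n = 64
    print(create_table_sql(short_name, "t1"))
    print(select_sql(long_name, 3, 42))
```

Generating the names programmatically keeps the short and long variants identical in everything except name length, which is the only factor the experiment varies.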
5 Measurement experiment
All tests were performed on an IBM S50 computer with an Intel Pentium IV 2.8 GHz processor and 1 GB of RAM. The operating system was FreeBSD 8.1-RELEASE with the GENERIC kernel. From the many Unix-like operating systems we chose this one because of its popularity and the trust that many companies put in it [9]. The report states that “FreeBSD remains a favoured operating system for web hosting services. In Netcraft’s survey of the most reliable Web hosting companies for May 2009 FreeBSD was the host operating system for three of the top five.” The web server for our project was Apache 1.3.42 running PHP 5.3.3_2 with APC 3.1.4. MySQL distribution 5.1.51_1 was tested; at the time of the tests it was the latest stable version of the commonly used 5.1 branch.

After preliminary tests we confirmed that the number of rows r ∈ {1000, ..., 10000} is irrelevant for this test case, as each query selects one row found using the primary key index.

For the model described in section 4 we performed experimental measurements. For each problem instance we fix the table name length n ∈ {1, 64}, the attribute type t ∈ {t1, t2, t3, t4} and the number of columns selected c ∈ {1, 2, ..., 32}. For each triple (n, t, c), 100 tests were made to avoid random abnormal results. In total, 25600 queries were executed. Table 1 and figure 1 present average processing times, measured in µs with the MySQL profiler [8], a tool available since MySQL version 5.0.37. The results were confirmed with PHP native time measurement.

The number of bytes sent with each query was also tested with another tool, SHOW SESSION STATUS. The results were deterministic and linear, showing that the difference in transfer between the same query on tables with short and long names equals c ∗ (64 − 1) ∗ 2, where c is the number of columns selected. This led to the conclusion that for each column selected in a query the table name is sent twice in the header row; in many later tests no evidence contradicting this was found. This explains the measurement results: the overhead of a long table name is largest when a query selects a large number of columns holding small values. When c columns are selected, values taking b bytes each make up cb bytes of data in total, while an n-character table name adds c ∗ n ∗ 2 bytes to this in the form of a header.
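The byte counts in this model are easy to check numerically. The following minimal sketch only restates the formulas cb and c ∗ n ∗ 2 from the text; the function names are hypothetical and nothing here queries a real MySQL server.

```python
def data_bytes(columns, bytes_per_value):
    """Payload size c * b for c selected columns of b bytes each."""
    return columns * bytes_per_value


def header_bytes(columns, name_length):
    """Table name bytes in the header: the name is sent twice per column."""
    return columns * name_length * 2


def transfer_difference(columns, long_len=64, short_len=1):
    """Extra bytes sent for the long name: c * (64 - 1) * 2 by default."""
    return header_bytes(columns, long_len) - header_bytes(columns, short_len)


if __name__ == "__main__":
    for c in (1, 8, 32):
        print(
            f"c={c:2d}  data(tinyint)={data_bytes(c, 1):3d} B  "
            f"header(n=64)={header_bytes(c, 64):4d} B  "
            f"difference={transfer_difference(c):4d} B"
        )
```

The difference grows linearly with c and does not depend on the attribute type, which matches the deterministic, linear results reported above.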
With 32 columns of type t1 tinyint (1 byte) and a 64-character table name, there are 32 bytes of data and 4096 bytes of redundant table names; the processing time overhead exceeded 30%. On the other hand, with fields taking many bytes, such as t4 text (512), and few columns selected, the differences in processing times are almost negligible.

For systems with a separate web server and database server, not taken into account in these tests, this excessive table header can affect the number of TCP/IP packets sent. Many queries selecting a small amount of data, mostly numerical values that take a few bytes to store, can fit into one packet. With long table names, the header row grows to two or more packets, requiring costly packet splitting.

Figure 1 presents the data (values taken from table 1) in the form of the ratio short name / long name, that is, the processing time for short table names as a percentage of that for long names, which shows the time gained by using short table names.
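As a small illustration of how the ratio plotted in figure 1 and the overhead quoted above relate to the raw measurements, the sketch below uses four (short, long) pairs copied from table 1. The function names are hypothetical, and expressing the overhead relative to the short-name time is an assumption that agrees with the "over 30%" figure.

```python
# (short, long) average processing times in microseconds, copied from table 1
SAMPLES = {
    ("tinyint", 1): (270, 278),
    ("tinyint", 32): (464, 614),
    ("text", 1): (316, 321),
    ("text", 32): (595, 721),
}


def time_ratio(short_us, long_us):
    """Short-name time as a fraction of long-name time (figure 1 axis)."""
    return short_us / long_us


def overhead(short_us, long_us):
    """Extra time caused by the long name, relative to the short-name time."""
    return (long_us - short_us) / short_us


if __name__ == "__main__":
    for (type_name, columns), (short_us, long_us) in SAMPLES.items():
        print(
            f"{type_name:8s} c={columns:2d}  "
            f"ratio={time_ratio(short_us, long_us):.1%}  "
            f"overhead={overhead(short_us, long_us):.1%}"
        )
```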
6 Practical use experiment
The experiment was conducted on a model of a web-based application and on two real applications. The model was built to reflect a set of the most basic queries made by a simple application, e.g. reading antiflood data, configuration, user data, news and article data, and so on. In total it consisted of 12 tables containing from 5 to 40 columns
[Figure 1: Processing time ratio for short and long table names. Vertical axis: processing time (short name / long name), from 70% to 100%. Horizontal axis: number of columns retrieved. Series: text(512), char(32), int(4), tinyint(1).]
with data types being a fair mix of those used in section 5. For each table it made one query selecting a row identified by the primary key.

As examples of web-based applications, two leading CMSs were chosen: PHP-Nuke and MODx [12]. Both offer all the basic and many advanced features characteristic of CMSs, while differing in weight and in their methods of communicating with the database. Both also make a few dozen SELECT queries to show a single web page. For PHP-Nuke almost all of them are simple queries for a single row, just like in the previous experiment and the proposed model, while MODx uses table joins in a large number of its queries.

For each short table name experiment, names 2 characters long were used, as we assumed this is the smallest length that allows the use of a large number of tables and some basic identification of them. For the long table name experiment the model used names 34 characters long, and for both CMSs their standard names were used with a 16-character prefix¹ set by their own install scripts, as an example of bad practice.

¹ A set of characters used to distinguish a group of tables in a database, widely used in open source web-based applications.

Each measurement was repeated 100 times. A summary of results is shown in table 2, where the gain is the difference between the performance with long table names and that achieved with short ones. The smallest gain was for MODx, which uses a separate object wrapper in an MVC architecture to communicate with the database. This made changing table names very simple, as only one string in one definition file had to be changed for each table, but it also builds complicated queries with aliases for every name in a query. That and
Table 1: Query processing time in µs by variable type (bytes used) and number of columns selected, for tables with short and long names

          tinyint (1)       int (4)     char (32)    text (512)
columns  short   long  short   long  short   long  short   long
      1    270    278    270    279    271    281    316    321
      2    280    299    278    292    282    295    331    335
      3    287    312    286    306    289    305    336    345
      4    299    320    290    316    295    315    345    357
      5    301    332    297    327    302    323    353    368
      6    308    345    304    338    307    336    362    388
      7    315    354    310    346    312    343    366    399
      8    323    370    317    358    322    354    385    408
      9    328    387    325    368    325    364    385    421
     10    335    390    331    378    333    376    393    431
     11    339    402    336    390    338    385    403    444
     12    347    413    343    400    341    393    406    451
     13    350    424    346    407    348    403    411    482
     14    355    437    350    417    352    415    417    493
     15    361    455    360    432    359    424    438    506
     16    366    460    364    437    364    433    454    512
     17    370    470    371    447    367    442    461    528
     18    380    483    377    458    375    451    468    534
     19    387    494    389    469    382    462    477    560
     20    392    510    388    479    388    471    486    560
     21    396    524    406    490    396    485    485    582
     22    403    532    404    503    400    494    496    586
     23    407    547    407    511    405    509    503    611
     24    416    556    416    524    411    521    516    616
     25    423    547    423    533    419    528    525    648
     26    427    554    426    543    421    535    537    650
     27    434    568    436    550    433    547    540    676
     28    439    572    440    563    438    553    547    669
     29    445    583    444    572    440    564    553    689
     30    457    594    451    582    449    573    583    687
     31    459    609    454    593    456    585    583    714
     32    464    614    460    606    457    594    595    721
the excessive use of table joins probably affected the gain. MODx also uses several layers of caching that had to be turned off to allow any database tests. PHP-Nuke uses simple queries, so its gain was better, although modifying it for short table names required a search-and-replace operation over many files.
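The gain column of table 2 can be reproduced from the average page generation times. The sketch below assumes that the gain is the time saved expressed as a percentage of the long-name time; the text does not state the formula explicitly, but this reading matches the published percentages. The function name is hypothetical.

```python
# (long-name time, short-name time) in milliseconds, copied from table 2
RESULTS_MS = {
    "model": (5.19, 4.72),
    "PHP-Nuke": (38.26, 35.91),
    "MODx": (473.53, 464.63),
}


def gain_percent(long_ms, short_ms):
    """Time saved by short table names, as a percentage of the long-name time."""
    return (long_ms - short_ms) / long_ms * 100


if __name__ == "__main__":
    for name, (long_ms, short_ms) in RESULTS_MS.items():
        print(f"{name:9s} gain = {gain_percent(long_ms, short_ms):.1f}%")
```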
7 Conclusions and suggestions for future research
As shown in this paper, table name length in MySQL makes a big difference in simple query execution time, as the name is sent twice in each column header. First of all, this should be treated with caution in any further measurement experiments on MySQL performance, as it can distort the results: over 30% overhead was measured in
this experiment for long table names.

Table 2: Web page generation times in ms for long and short table names in a model and two real e-commerce projects

           long table names         short table names
           name length   avg. time  name length   avg. time  gain
model      34 chars           5.19  2 chars            4.72  9.1%
PHP-Nuke   25-47 chars       38.26  2 chars           35.91  6.1%
MODx       21-45 chars      473.53  2 chars          464.63  1.9%

For web-based applications, the use of short table names can also be recommended. When installing ready-made applications, instead of long table name prefixes (for example, ones based on the site name), empty prefixes or at most single-character ones should be used. With slightly greater effort, but still at no cost, the table names used there can be shortened, offering a somewhat greater performance gain.

When designing new applications, the advantages of short table names should be considered. While two-character names can be inconvenient, this can be offset by storing them in variables with more descriptive names. In any case, using table names that consist of several words connected with dashes is a bad practice. For object wrappers that completely encapsulate database communication, a construction that takes advantage of short table names while keeping them invisible to programmers should be proposed. The creation of such a wrapper and its performance tests remain open for future research. The influence of this performance issue on database systems communicating with web servers over a TCP/IP network, only pointed out theoretically here, should also be tested separately.
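The suggestion of hiding short table names behind descriptive variables can be sketched as follows. All table names, column names and the helper function are hypothetical illustrations, not taken from any of the tested applications, and the `%s` placeholder style is an assumption about the database driver in use.

```python
# Descriptive constant names hold the two-character table names, so the
# application code stays readable while queries carry the short names.
USERS_TABLE = "us"
ARTICLES_TABLE = "ar"
SESSIONS_TABLE = "se"


def select_by_id(table, columns):
    """Build a single-row SELECT on the primary key for the given table."""
    cols = ", ".join(f"`{column}`" for column in columns)
    return f"SELECT {cols} FROM `{table}` WHERE `id` = %s"


if __name__ == "__main__":
    # The programmer writes USERS_TABLE; the server only ever sees `us`.
    print(select_by_id(USERS_TABLE, ["login", "email"]))
```

Keeping the mapping in one place also means that a table can be renamed by changing a single line, similar to the single definition file that made the MODx modification simple.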
References

[1] Alexa Rank, www.alexa.com, (last seen: October 2010).
[2] J. Błażewicz, M.Y. Kovalyov, J. Musiał, A.P. Urbański and A. Wojciechowski. Internet Shopping Optimization Problem. International Journal of Applied Mathematics and Computer Science, 20, 2, 2010, 385-390.
[3] Chu W., Choi B., Song MR, The Role of On-line Retailer Brand and Infomediary Reputation in Increasing Consumer Purchase Intention, International Journal of Electronic Commerce, 9, 3, 2005, 115-127.
[4] Główny Urząd Statystyczny (Central Statistical Office of Poland), http://www.stat.gov.pl/gus, (last seen: October 2010).
[5] Holsapple C. et al., Decision Support Applications in Electronic Commerce, in: Shaw M. et al. (eds.), Handbook on Electronic Commerce, Springer-Verlag, Berlin Heidelberg, 2000.
[6] Internet Usage Statistics, Internet World Stats, http://www.internetworldstats.com/stats.htm, (last seen: October 2010).
[7] George Lawton, LAMP Lights Enterprise Development Efforts, Computer, 9, 38, 2005, 18-20.
[8] MySQL AB, MySQL: The World's Most Popular Open Source Database, http://www.mysql.org (last seen: November 2010).
[9] Netcraft Ltd., Most Reliable Hosting Company Sites in May 2009, http://news.netcraft.com/archives/2009/06/02/ (last seen: November 2010).
[10] Ng B.D., Wiemer-Hastings P., Addiction to the Internet and Online Gaming, CyberPsychology & Behavior, 8, 2, 2005, 110-113.
[11] Pew Internet & American Life Project. On-line Shopping, http://www.pewinternet.org/Reports/2008/Online-Shopping.aspx, (last seen: October 2010).
[12] Shreves, R., Open Source CMS Market Share, Water & Stone, 2008.
[13] The Future Foundation. E-commerce across Europe - Progress and prospects, 2008.
[14] Zawodny, J., Balling, D., High Performance MySQL, O'Reilly Media, 2004.
[15] Wojciechowski A., Musial J., Towards Optimal Multi-item Shopping Basket Management: Heuristic Approach, in: R. Meersman et al. (eds.), OTM 2010 Workshops, LNCS, 6428, Springer, Heidelberg, 2010, 349-357.

Received January, 2011