Bioinformatics Advance Access published November 23, 2007
Databases and Ontologies
Automatic Synchronization and Distribution of Biological Databases and Software over Low-Bandwidth Networks among Developing Countries
Unitsa Sangket1,*, Amornrat Phongdara1, Wilaiwan Chotigeat1, Darran Nathan2, Woo-Yeon Kim3, Jong Bhak3, Chumpol Ngamphiw4, Sissades Tongsima4, Asif M. Khan5, Honghuang Lin5 and Tin Wee Tan2,5
1Center for Genomics and Bioinformatics Research, Prince of Songkla University, Thailand, 2Asia-Pacific Bioinformatics Network, 3Korean BioInformation Center (KOBIC), KRIBB, Korea, 4Biostatistics and Informatics Lab., Genome Institute, National Center for Genetic Engineering and Biotechnology, Thailand and 5Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore
Associate Editor: Prof. John Quackenbush
ABSTRACT
Summary: Bioinformatics involves the collection, organization, and analysis of large amounts of biological data, using networks of computers and databases. Developing countries in the Asia-Pacific region are just moving into this new field of information-based biotechnology. However, the computational infrastructure and network bandwidths available in these countries are still at a basic level compared to that in developed countries. In this study, we assessed the utility of a BitTorrent-based Peer-to-Peer (btP2P) file distribution model for automatic synchronization and distribution of large amounts of biological data among developing countries. The initial country-level nodes in the Asia-Pacific region comprised Thailand, Korea, and Singapore. The results showed a significant improvement in download performance using btP2P - three times faster overall download performance than conventional File Transfer Protocol (FTP). This study demonstrated the reliability of btP2P in the dissemination of continuously growing multi-gigabyte biological databases across the three Asia-Pacific countries. The download performance for btP2P can be further improved by including more nodes from other countries into the network. This suggests that the btP2P technology is appropriate for automatic synchronization and distribution of biological databases and software over low-bandwidth networks among developing countries in the Asia-Pacific region.
Availability: http://everest.bic.nus.edu.sg/p2p/
Contact:
[email protected]
INTRODUCTION
Bioinformatics research often involves processing large amounts of biological data (Lim et al., 2003; Gilbert et al., 2004), which are updated regularly, on schedules ranging from daily to quarterly. Consequently, bioinformatics centers around the world must frequently update their database repositories with the latest releases. These updates are normally carried out over the Internet using traditional client/server file distribution protocols such as File Transfer Protocol (FTP) and Hypertext Transfer Protocol (HTTP). However, this process requires large network bandwidth to ensure that the latest database releases are downloaded reliably and within a reasonable timeframe. In 1997, the Bio-Mirrors project (Gilbert et al., 2004; http://bio-mirror.net), which links a network of database mirror sites in the Asia-Pacific, was established to assist in this dissemination of data.
Developing countries in the Asia-Pacific region have in recent years started moving into the field of bioinformatics (Ranganathan et al., 2002; Ranganathan et al., 2006), but the computational infrastructure and network bandwidth available between and within these countries are still at a primitive level compared to those in more developed countries. Network bandwidth within these countries is very low, and the low reliability of connections means that breaks or aborts in downloads are common. Therefore, in spite of the availability of Bio-Mirrors nodes, many developing countries still face a major problem in regularly updating these databases. With the growing sizes of these Bio-Mirror databases (approximately 10 GB in 1998, 150 GB in 2003 and 707 GB as of 18 August 2007) (Gilbert et al., 2004; http://bio-mirror.net/biomirror/docs/about-databanks.txt), the problem will only worsen as the growth of the databases outpaces the rate of bandwidth increase. For example, even with the Asia Pacific Advanced Network (APAN; http://www.apan.net) and the Trans-Eurasia Information Network (TEIN2; http://tein2.net), universities in developing countries still have significant difficulty obtaining bandwidth that can guarantee regular nightly updates of these databases.
In the late 1990s, the Internet community witnessed the start of a major revolution in the way people shared files. The Peer-to-Peer (P2P) file sharing model was introduced with the widely popular Napster. Since then, the technology has continued to evolve and improve. BitTorrent (btP2P) is a more recent P2P communications protocol that has become very popular (Hales et al., 2005). Any program that uses the BitTorrent protocol is termed a btP2P client.
The protocol allows the client to prepare, request and transmit any type of computer file over a network. A peer is any computer hosting the client, and other peers can connect to it to transfer data. Peers usually do not have the complete file; those that do have the complete file and offer it for upload to other peers are called seeds. In contrast, a leech is a client which has the complete file, but does not share it with other peers in the network (for more definitions visit: http://www.azureuswiki.com/index.php/This_funny_word). A peer starts sharing file(s) by creating a “.torrent” file, which contains the meta-data about the files to be shared and the tracker that coordinates the file distribution. The client divides the file to be shared into smaller fragments, typically a quarter of a megabyte each. Clients requesting to download the file first obtain the torrent file for it, through which they connect to the specified tracker, which responds with information on the peers that can be contacted to download the fragments of the file.
The difference between the traditional client/server file distribution model and the P2P file distribution model using btP2P is illustrated in Figures 1 and 2. As the number of downloading clients in the traditional distribution architecture increases, the demand for bandwidth placed on the servers increases dramatically, which eventually leads to network bottlenecks. With the btP2P architectural model, a single site no longer bears the sole burden of supplying the data for others to download. The more peers there are, the more nodes are available to distribute fragments of the file (Guo et al., 2007). High demand actually leads to greater throughput as more bandwidth from additional nodes becomes available to the group.
btP2P technology therefore simultaneously addresses two major problems plaguing the distribution of biological data to developing countries: 1) low international bandwidth and 2) unreliable connections. With a btP2P architecture, downloads need not come from a central server in one country; every network-connected peer that synchronizes its databases or software, whether from the same institute, state, country or region, acts as a server and provides additional bandwidth that speeds up the overall download rate for all the peers. In the conventional server/client architecture, all downloads are from a single server, and if this connection becomes very slow or unreliable, there is no failover to automatically continue downloading from another source.
Given the benefits of btP2P over the traditional file transfer model, in this study we assessed the utility of btP2P for automatic synchronization and distribution of large amounts of biological data among developing countries in the Asia-Pacific region. The initial country-level nodes in the region comprise Thailand, Korea and Singapore.
* To whom correspondence should be addressed.
© 2007 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
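The scaling argument made in the Introduction can be illustrated with a simple fluid model. The sketch below is not part of this study: it uses the standard textbook lower bounds on the time needed to deliver one file to N clients, and every number in it (file size, upload and download rates) is a hypothetical value chosen only for illustration. It also assumes that all clients have the same upload and download rates, which real networks rarely satisfy.

```python
def client_server_time(file_bits, n_clients, server_up, client_down):
    """Lower bound (seconds) when one server uploads a full copy to every client."""
    return max(n_clients * file_bits / server_up, file_bits / client_down)


def p2p_time(file_bits, n_clients, server_up, client_down, peer_up):
    """Lower bound (seconds) when every client also re-uploads fragments."""
    # Total upload capacity: the original seed plus all participating peers.
    total_up = server_up + n_clients * peer_up
    return max(file_bits / server_up,
               file_bits / client_down,
               n_clients * file_bits / total_up)


if __name__ == "__main__":
    # Hypothetical values, not measurements from this study.
    FILE_BITS = 8e9      # a 1 GB database release
    SERVER_UP = 10e6     # 10 Mbit/s server uplink
    CLIENT_DOWN = 4e6    # 4 Mbit/s client downlink
    PEER_UP = 1e6        # 1 Mbit/s client uplink

    for n in (1, 10, 100):
        cs = client_server_time(FILE_BITS, n, SERVER_UP, CLIENT_DOWN)
        p2p = p2p_time(FILE_BITS, n, SERVER_UP, CLIENT_DOWN, PEER_UP)
        print(f"N={n:>3}  client/server: {cs / 3600:6.2f} h   P2P: {p2p / 3600:6.2f} h")
```

Under these assumed rates, the client/server bound grows linearly with the number of clients, whereas the P2P bound grows much more slowly because each additional peer contributes upload capacity to the group.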
Fig. 1. Traditional Client/Server distribution of files.
Fig. 2. P2P distribution of files using BitTorrent protocol (btP2P).
METHODS
After extensive analyses and trials of various btP2P software, such as BitTorrent (http://www.bittorrent.com/download), uTorrent (http://www.utorrent.com/download.php), BitComet (http://www.bitcomet.com/doc/download.htm) and Azureus (http://azureus.sourceforge.net/download.php), the Azureus suite version 2.5.0.4 was selected for the following reasons: 1) it is open-source and has a large active development community, 2) it runs on Java, allowing it to be deployed on any operating system, and 3) it has a well-documented plug-in interface that makes it easy to develop additional enhancements that may be necessary for this work.
Four trial nodes were set up for the first phase of testing the btP2P network in the Asia-Pacific region. These sites comprised 1) Prince of Songkla University (PSU, Thailand), 2) Korean Bioinformation Center (KOBIC, Korea), 3) National University of Singapore (NUS, Singapore), and 4) National Center for Genetic Engineering and Biotechnology (BIOTEC, Thailand). We set up three tracker sites to publish the .torrent files, as shown in Table 1. An RSSFeed Scanner Plugin (http://azureus.sourceforge.net/plugin_list.php) was used to trigger automatic synchronization of data at regular intervals. This allows the Azureus client program to download data automatically from the seed nodes without user intervention.
To compare the download performance of FTP and btP2P, we downloaded biological databases using both methods at the PSU node over seven days. To achieve uniformity, we performed the downloads with both methods on the same machine, on the same dates and over the same network; this ensured that the load on the network was the same for both methods during the test period. For evaluation of FTP performance, the PSU node was set to download data from the KOBIC FTP server (ftp://ftp.kobic.re.kr/, ProFTPD version 1.2.10) using FileZilla version 3.0.0-beta7 (http://filezilla.sourceforge.net/). Because the FTP client does not provide a function for automatic downloads as the btP2P client does, we had to check manually for new files every day and download them. For evaluation of btP2P, the PSU node was set to download data from three seeds or peers: the KOBIC, NUS and BIOTEC nodes. The btP2P client was set to download new .torrent files automatically every 2 hours from the KOBIC tracker.
Table 1. Tracker sites to publish torrents in the Asia-Pacific region.
Node    Tracker URL
PSU     http://biotracker.psu.ac.th:6969
KOBIC   http://ftp.kobic.re.kr:6969
BIOTEC  http://protcluster.biotec.or.th:6969
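The automatic synchronization step can be pictured as a polling loop: fetch the tracker's RSS feed, pick out the .torrent links that have not been seen before, and hand them to the client. The sketch below is a hedged illustration in Python, not the RSSFeed Scanner Plugin itself. The feed layout (RSS 2.0 items carrying an `enclosure` URL), the sample feed contents, the `tracker.example.org` host and the `new_torrents` helper are all assumptions made for this example; only the 2-hour interval comes from the text.

```python
import xml.etree.ElementTree as ET

# The study polled every 2 hours; a real loop would sleep this long between polls.
POLL_INTERVAL_SECONDS = 2 * 60 * 60

# Hypothetical RSS 2.0 feed; a real tracker's feed may be laid out differently.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Example tracker</title>
    <item>
      <title>genbank-daily.torrent</title>
      <enclosure url="http://tracker.example.org/genbank-daily.torrent"
                 type="application/x-bittorrent" length="1024"/>
    </item>
    <item>
      <title>release-notes</title>
      <enclosure url="http://tracker.example.org/notes.txt"
                 type="text/plain" length="10"/>
    </item>
  </channel>
</rss>"""


def new_torrents(feed_xml, seen):
    """Return .torrent URLs in the feed that are not already in `seen`."""
    root = ET.fromstring(feed_xml)
    found = []
    for enclosure in root.iter("enclosure"):
        url = enclosure.get("url", "")
        if url.endswith(".torrent") and url not in seen:
            found.append(url)
            seen.add(url)  # remember it so the next poll skips it
    return found


if __name__ == "__main__":
    seen_urls = set()
    print(new_torrents(SAMPLE_FEED, seen_urls))  # first poll: one new torrent
    print(new_torrents(SAMPLE_FEED, seen_urls))  # second poll: nothing new
```

Keeping the set of already-seen URLs between polls is what makes repeated scans cheap: each 2-hourly poll only triggers downloads for releases published since the previous poll.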
RESULTS AND DISCUSSION
Trace results from the PSU node only are shown in Figure 3. With only three nodes available as peers/seeds to the PSU node, the results demonstrate a significant improvement in download performance using btP2P over FTP. After seven days, 23.2 GB of data had been successfully downloaded using FTP and about 70 GB using btP2P. The download throughput of both FTP and btP2P dropped slightly after the third day and increased slightly after the fifth day. The variations in transmission rate are quite similar for both protocols and are likely due to fluctuations in daily Internet traffic and the underlying network quality of service. In this test, btP2P was approximately three times faster than conventional FTP in terms of overall download performance (about 70 GB versus 23.2 GB over the same seven days). Our results are in agreement with the findings of Scully L (http://cs.winona.edu/CSConference/2007proceedings/lincoln.pdf), who compared the download performance between FTP and btP2P on different subnets of the Saint Mary’s University network in the USA.
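The factor of three follows directly from the seven-day totals. As a worked check, the sketch below converts the reported totals into average throughput. It assumes decimal gigabytes (1 GB = 10^9 bytes) and exactly seven 24-hour days, neither of which is stated in the text, and it treats the btP2P total as exactly 70 GB although it is reported only as "about 70 GB".

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_mbit_per_s(total_gb, days):
    """Average throughput in Mbit/s, assuming 1 GB = 10**9 bytes."""
    bits = total_gb * 1e9 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    # Totals reported for the PSU node; 70.0 is an assumed exact value.
    ftp_gb, btp2p_gb, days = 23.2, 70.0, 7
    print(f"FTP:   {average_mbit_per_s(ftp_gb, days):.2f} Mbit/s")
    print(f"btP2P: {average_mbit_per_s(btp2p_gb, days):.2f} Mbit/s")
    print(f"ratio: {btp2p_gb / ftp_gb:.2f}")
```

Under these assumptions the sustained averages are roughly 0.31 Mbit/s for FTP and 0.93 Mbit/s for btP2P, which gives a sense of how constrained the link at the test node was.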
Fig. 3. Download performance comparison between FTP and btP2P in the Asia-Pacific region.
In conclusion, the btP2P protocol appears significantly more effective than traditional FTP for synchronizing large multi-gigabyte public biological databases across the three Asia-Pacific countries, Thailand, Korea and Singapore. The download performance of btP2P can be further improved by including more nodes/seeds from other countries in the network (http://cs.winona.edu/CSConference/2007proceedings/lincoln.pdf). This suggests that btP2P technology may be appropriate for automatic synchronization and distribution of biological databases and software over low-bandwidth networks among developing countries in the Asia-Pacific region. In the next test phase, more nodes from various Asia-Pacific countries will be included for large-scale real-time tests of the performance of the btP2P network.
A potential drawback of btP2P, purely from a network traffic point of view, is that it can waste network resources (http://portal.acm.org/citation.cfm?id=1146882). For example, because there is no assurance that the seeds listed by the tracker are “good” (that is, offer fast downloads), a peer searches for the best set of seeds by trying the different seeds in the list, which creates many unnecessary network connections and thus increases network traffic (http://portal.acm.org/citation.cfm?id=1146882). The Azureus program provides a function to limit this traffic by allowing the user to set the maximum number of connections. Additionally, recent research suggests that, in future, this waste of network resources can be avoided by selecting the best set of seeds using a best-neighbors list from Internet Service Providers (ISPs) (Aggarwal et al., 2007) or by using a Content Distribution Network (CDN) mechanism (http://portal.acm.org/citation.cfm?id=1146882).
ACKNOWLEDGEMENTS
This work was supported by the International Development Research Centre (IDRC) Canada, the Asia-Pacific Bioinformatics Network (APBioNet), and MOST under grant number M10508040002-07N0804-0021 and KADO of MIC in Korea.
REFERENCES
Aggarwal,D. et al. (2007) Can ISPs and P2P users cooperate for improved performance? ACM SIGCOMM Computer Communication Review, 37, 29-40.
Gilbert,D. et al. (2004) Bio-Mirror project for public bio-data distribution. Bioinformatics, 20, 3238-3240.
Guo,L. et al. (2007) A Performance Study of BitTorrent-like Peer-to-Peer Systems. IEEE Journal on Selected Areas in Communications, 25, 155-169.
Hales,D. and Patarin,S. (2005) Computational Sociology for Systems “In the Wild”: The Case of BitTorrent. IEEE Distributed systems, 6, 1-6.
Lim,Y.P. et al. (2003) The S-Star trial bioinformatics course - An on-line learning success. Biochemistry and Molecular Biology Education, 31, 20-23.
Ranganathan,S. et al. (2002) APBioNet: The Asia Pacific regional consortium for bioinformatics. Applied Bioinformatics, 1, 101-105.
Ranganathan,S. et al. (2006) Establishing bioinformatics research in the Asia Pacific. BMC Bioinformatics, 7 (Suppl. 5).