A method to measure software adoption in organizations: a preliminary study

Bruno Rossi, Barbara Russo, Giancarlo Succi
Free University of Bolzano-Bozen
{Bruno.Rossi,Barbara.Russo,Giancarlo.Succi}@unibz.it

Abstract

The decision about the adoption of Free/Libre/Open Source Software (FLOSS) is a key issue in Small and Medium Enterprises (SMEs). Indeed, such organizations often do not have the resources needed to fully evaluate a migration from existing legacy systems. To support the decision process of these organizations, we propose a preliminary study of an instrument based on the analysis of the generation of files in targeted data standards. We model the file generation process as a self-reinforcing mechanism through the use of urn models. By applying the instrument to a large dataset in the office automation field, we found a first confirmation of the importance of network externalities, as reported by theory, and of the importance of past file generation for the subsequent file generation process.

Keywords: FLOSS, software adoption, measurement, data standards, path dependent process.

1. Introduction

Free/Libre/Open Source Software (FLOSS) has grown in popularity in recent years. The term was adopted by the European Commission (EC) to avoid the controversy between two different views, that of the supporters of the Open Source definition and that of the proponents of the Free Software movement [13].

Specifically, the Open Source Initiative (OSI) defines Open Source Software as software that complies with a set of ten criteria: the free redistribution of the software, free access to the source code, the absence of any constraint on modification and derived works, the integrity of the author's source code, non-discrimination against any person or group, the absence of restrictions on the field of usage, the automatic distribution of the license, the non-restriction of the license to a product, the non-restriction of other software, and the technology-neutral assumption of the license [17]. Free Software is instead focused on the ethical view of the problem, as the term open does not clearly imply the term free. The Free Software Foundation (FSF) refers to a set of freedoms that the software must grant in order to be included in the Free Software category [12]:
- the freedom to run the program, for any purpose (freedom 0);
- the freedom to study how the program works, and adapt it to your needs (freedom 1); access to the source code is a precondition for this;
- the freedom to redistribute copies so you can help your neighbor (freedom 2);
- the freedom to improve the program, and release your improvements to the public, so that the whole community benefits (freedom 3); access to the source code is a precondition for this.

In the rest of the paper we use the neutral term FLOSS, avoiding taking a position for either of the two views. In any case, FLOSS constitutes an appealing solution, particularly for Small and Medium Enterprises (SMEs) that want to renew their IT infrastructure with highly customizable and transparent systems. In this context, another aspect often overlooked in the IT world is the importance of Open Data Standards (ODS) as a way to store data in order to increase the interoperability of software and the transparency of the services offered. Many different definitions of ODS have been given (among others, see [16, 18]). We use the definition given in 2004 by the Danish Government [4]:
- an open standard is accessible to everyone free of charge (i.e. there is no discrimination between users, and no payment or other considerations are required as a condition of use of the standard);
- an open standard of necessity remains accessible and free of charge (i.e. owners renounce their options, if indeed such exist, to limit access to the standard at a later date, for example by committing themselves to openness during the remainder of a possible patent's life);
- an open standard is accessible free of charge and documented in all its details (i.e. all aspects of the standard are transparent and documented, and both access to and use of the documentation are free).
In this sense, relying on a standard that is accessible free of charge and is guaranteed to remain so over time is of key importance, particularly for organizations that store large quantities of documents.

FLOSS does not always adopt ODS as a standard for data exchange, as can be seen in Table 1.

Table 1. Connection between data standards and software

                        ODS    Proprietary standards
FLOSS                   I      III
Proprietary software    II     IV

We have the typical cases of FLOSS using ODS (quadrant I) and of proprietary software connected to proprietary formats (IV). Less frequent cases involve proprietary software using ODS (II) and FLOSS using proprietary standards (III). In any case, although the terms FLOSS and ODS are not always coupled, it is a common idea that a standard cannot be truly defined as open if it is not supported by FLOSS [23]. In this sense, we consider ODS an important and fundamental complement to FLOSS, one that can act as a facilitator for the introduction of this software typology on the market. For this reason, in the rest of the work we consider FLOSS and ODS as strongly connected.

In this paper, we focus on software adoption, an important field in current FLOSS research. The field investigates in particular the factors that act as facilitators of and inhibitors to software adoption. In this area, we provide an investigation of the impact of network externalities and file size on the adoption process. A network externality is an effect that causes a particular information good to be valued by potential buyers not only on the basis of its intrinsic value, but also on the basis of the number of other people that possess the same good [15]. This effect is peculiar to many information goods, in particular software [21]. Furthermore, Shapiro and Varian (1999) attribute large importance to the amount of files generated in information markets. They classify files as one of the categories that can potentially lead to high switching costs and lock-in (information and databases), due to the fact that maintaining large amounts of files in proprietary formats can lead to high costs during the migration process [21].

The rest of the paper is structured in the following way. Section 2 gives an overview of the literature about the adoption of technology and of FLOSS in particular. Section 3 provides a description of the dataset used for the construction of the model, while Sections 4 and 5 investigate path dependent systems, urn models, and the fitting of the identified model to the dataset of Section 3. The evaluation of different factors, namely file size and network externalities, by means of a multi-urn design is provided in Section 6. Conclusions, limitations and future work conclude the paper.

2. Background

Technology adoption literature studies how technological innovation spreads inside society and organizations. Rogers' (1995) pioneering work lays the foundations for the study of the diffusion process that characterizes technology adoption. In his seminal work, technology adopters are categorized according to the phase in which they make the adoption decision [19]. Arthur (1989) and David (1985) started discussing the question of whether, due to increasing returns, it may not be possible to easily switch from a certain

technology once a certain critical level of adoption has been reached. This can lead to the adoption of what the authors consider an inferior technology [2, 5].

Regarding FLOSS adoption, not many studies are available, but it is a growing research field. Among the interesting studies, Bitzer and Schroder (2003) analyze the innovation performance of FLOSS versus proprietary solutions, showing the results of the competition between FLOSS and proprietary software [3]; the focus is more on innovation than on the adoption process itself. An interesting overview of the factors that impact FLOSS adoption is given in Dedrick and West (2004), where the identified factors are categorized as technological, organizational and environmental [7]. Glynn, Fitzgerald and Exton (2005) study another framework, with environmental, organizational, individual and technological factors, which has been validated in the context of a large-scale adoption of FLOSS; a large investigation of the factors has been conducted on a large sample of organizations [14]. More recently, Economides and Katsamakas (2006) studied the incentives that lead to platform innovation, providing a case study of Linux versus Windows [10].

3. Empirical dataset analyzed

To collect information about software usage, we developed and deployed a custom application called FLEA (FiLe Extension Analyzer). The purpose of FLEA is to collect information about the data formats of the files available on target systems. The data collected include the type of extension, the date of creation, the date of last access, the size of the file, and, for

particular extensions, also information about the macros contained. In our case, FLEA was deployed to perform a scan of the data standards available on the fileservers typically used by the employees of one large European Public Administration, collecting detailed information on over 5 million files. The data collection activities covered a period of time in which the cited Public Administration was migrating towards OpenOffice.org. More details about the experimental protocol, experimental design, data collection and analysis can be found in [20]. In the next section, we give an overview of the concept of path dependence, an important concept for the model that has been applied to the dataset.

4. Path dependence and urn models

Path dependence is a broad concept that has been studied in many fields, from economics to biological evolution, physics, history and the social sciences. In all fields, it carries the common meaning that "history matters". A process is considered path dependent when its outcome depends on past historical decisions; in particular, when choices made on the basis of transitory conditions persist long after the initial conditions change [6]. But how does a path dependent process emerge? According to studies in different fields, positive feedbacks are one of the main causes of this particular outcome: they create a self-reinforcing mechanism that, once established, is difficult to change, due to the high switching costs involved. To model path dependent systems, several approaches have been used. In particular, different stochastic path

dependent processes have been used, like the Polya urn process, certain kinds of Markov chain models, branching processes, and reversible spin systems [1]. With the term urn model, we refer to a set of models that, starting from the generalized Polya process [11], model an outcome function on the basis of the selection of balls of different colours from urns. Specifically, the rule states the following: given an urn that contains K different colours, we start with w balls for each colour k, where k = 1, 2, ..., K. We then draw a ball randomly from the urn, note the colour drawn, return the ball to the urn, and denote the drawn colour by k'. We proceed by adding to the urn α balls of colour k' and β balls of each colour different from k'. Such a design can be denoted UD(w, α, β). Furthermore, we define Wk,n as the number of balls of colour k at step n, and Nk,n as the number of balls of type k assigned after n steps. The relationship between Wk,n and Nk,n is the following:

Wk,n = w + α Nk,n + β (n − Nk,n)

In this sense, Nk,n forms a Markov chain with transition probabilities of the form

Pr(Nk,n+1 = l+1 | Nk,n = l) = (w + αl + β(n − l)) / εn

where

εn = Kw + (α + (K − 1)β) n

We consider the files generated as the balls of the model, and restrict the analysis to a K = 2 model, in which k' maps to the OpenOffice.org Writer sxw format and k'' maps to the Microsoft Word doc format. In the next section we show the results of a simulation of the process by using the data acquired during data

collection activity and the fitting of the distribution to the dataset.

5. Urn model simulations and fitting to the dataset

In order to provide an example of the process of adoption, we applied the standardized Polya urn process to the available dataset. As a model design, we applied UD(21, 1, 0) with w1 = 20 and w2 = 1, meaning that in the starting conditions the number of proprietary files is 20 times larger than the number of files in open data standards. These parameters were taken from the gathered dataset; clearly, the starting conditions will differ depending on the institution examined and the status of the migration. Subsequently, 10,000 simulations of the file generation process were performed. Assuming that at some point the file generation process will end, we are interested in examining which final outcomes of file generation are more likely to happen, given the initial conditions. In Figure 1, we show the distribution of the possible outcomes as proportions of files generated.
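The urn scheme above is straightforward to simulate. The following is a minimal sketch of the two-colour UD(w, α, β) draw-and-reinforce rule under the paper's starting conditions (w1 = 20 proprietary balls, w2 = 1 open ball); the function and variable names are ours, and for brevity we use fewer runs and a finite horizon than the 10,000 simulations reported:

```python
import random

def polya_urn(w_open, w_prop, alpha=1, beta=0, steps=2000, rng=random):
    """Simulate a two-colour Polya urn UD(w, alpha, beta).

    At each step a ball is drawn with probability proportional to its
    colour's current count; alpha balls of the drawn colour and beta
    balls of the other colour are then added.  Returns the final
    proportion of proprietary balls (files).
    """
    counts = {"prop": float(w_prop), "open": float(w_open)}
    for _ in range(steps):
        total = counts["prop"] + counts["open"]
        drawn = "prop" if rng.random() < counts["prop"] / total else "open"
        other = "open" if drawn == "prop" else "prop"
        counts[drawn] += alpha
        counts[other] += beta
    return counts["prop"] / (counts["prop"] + counts["open"])

# starting conditions from the paper: 20 proprietary files per open file
random.seed(42)
outcomes = [polya_urn(w_open=1, w_prop=20) for _ in range(500)]
mean_share = sum(outcomes) / len(outcomes)
print(f"mean final share of proprietary files: {mean_share:.3f}")
```

With α = 1 and β = 0 this is the classical reinforcement scheme: each run converges to a random limiting proportion, and with such a lopsided start the mass of outcomes sits heavily on the proprietary side, as the distribution in Figure 1 shows.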

Figure 1. Distribution of file proportion outcomes obtained by applying the Polya urn process. Frequency on the y-axis, proportion of proprietary over open files generated on the x-axis. Arrows show the prevalence of one format over the other.

The arrows in the figure show the trend of file generation: towards the right, proprietary formats increase in number; towards the left, open formats are adopted at a higher rate. From our starting point it is clear that proprietary formats, namely doc files, are at an advantage. The phenomenon that the distribution depicts is interesting in particular for two reasons:
1. Given the initial conditions, no single run of the simulation reached a level at which the generation of open formats outperformed in numbers the generation of proprietary files. We note from the simulations performed that the final outcome most favorable to open data standards is a ratio of 0.54 proprietary files over open data standards. In this case the generation rate of proprietary files is still higher than that of files in open data standards. Given no change in strategy, this is the maximum level of usage of the new data standard that can be foreseen.
2. The most likely result in terms of file generation was a ratio of 0.99 (proprietary over open data standards). According to the simulations by means of the Polya urn process, this outcome has a probability of 0.1462. Most importantly, we must note that all cases in which the generation of open data standards is greater than that of proprietary formats are not contemplated by the model, given the initial settings and the urn design.
The simulations performed give us a starting point to evaluate the fitting of the model to our dataset. To be more precise, by using the Polya probability distribution function

and minimizing the distance function

we found a fitting of 0.0246 for UD(21, 1, 0). Furthermore, the fitting of the observed data distribution to the theoretical distribution represented by the identified urn model could not be rejected by a Kolmogorov-Smirnov goodness-of-fit test with a two-tailed significance level of 0.01. In the next section we will use the model with the parameters identified in this section and apply a more advanced selection of the urns, in order to evaluate the effect of two variables, file size and network externalities, on the generation process.

6. Relevant factors for file generation

We follow a naive approach in studying the adoption process of FLOSS. First, we study the introduction and the level of acceptance by examining the data standards that have been generated through time; specifically, we compare the creation of office automation .sxw files versus .doc files. Second, we apply the urn schema to the definition of the probability of creation of a new file, in this way modeling file generation as a self-reinforcing mechanism. Third, as a process of adoption does not occur in isolation, we analyze the factors, or variables, that impact file creation. In this sense, we follow an approach similar to the one in [24], subdividing the factors considered into levels and evaluating the selection process. We then follow a multi-urn selection to identify the factors and levels that most influence the generation process when assuming a balancing strategy, a strategy in which the final goal is to try to reach a situation in which the proportion of files generated is equally distributed between the two formats. In summary, these are the steps followed:
1. identification of the variables;
2. definition of the urns according to factors and levels;
3. selection of the urn;
4. analysis of the pattern of selected urns.
We start with the identification of the factors to consider. After an exploratory overview, the following aspects have been selected for evaluation by the model.
a) Network externalities, considered as the number of files generated in the other format and their impact on the generation of further files. The Durbin-Watson autocorrelation analysis [9] reports a d-value of 0.8474, which leads us to reject H0 (non-correlation) at a significance level of 0.01. In the rest of the paper, the term network externality refers to this measure.
b) File size, to evaluate how the file size impacts the number of files generated. The non-parametric Spearman's rank correlation [22] reports in particular a significant correlation between the file count and the sum of file sizes in the same day.
As a next step, we use the factors considered for evaluation by the model. We suppose, in our case, that there are two possible outcomes for each observation n: the generation of a file in an open or in a proprietary data standard. We further subdivided each factor into three levels. This is an arbitrary choice, as is the range of each level, since no particular statistical technique or consideration has been employed. In Table 2, the ranges of values for the different levels are displayed.

Table 2. Factors and different levels considered
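As an illustration of the autocorrelation check described in point a), the Durbin-Watson d-statistic has a simple closed form; the sketch below computes it on synthetic residuals, since the actual daily file-count series is not reproduced in the text. Values near 0 indicate strong positive autocorrelation (consistent with the reported d = 0.8474), values near 2 indicate none, and values near 4 indicate strong negative autocorrelation.

```python
def durbin_watson(residuals):
    """Durbin-Watson d-statistic over a residual series:
    d = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    num = sum((residuals[i] - residuals[i - 1]) ** 2
              for i in range(1, len(residuals)))
    den = sum(e ** 2 for e in residuals)
    return num / den

# hypothetical, slowly drifting residuals: adjacent values are close,
# so d comes out small (strong positive autocorrelation)
trend = [float(i) for i in range(1, 11)]
print(f"d for drifting residuals: {durbin_watson(trend):.4f}")

# alternating residuals give d near 4 (strong negative autocorrelation)
print(f"d for alternating residuals: {durbin_watson([1.0, -1.0, 1.0, -1.0]):.4f}")
```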

Each file has subsequently been mapped to one category according to the level it fell into at the time it was generated. We constructed a table from these data and then selected the urn by using our specific urn design. By using α = 1 and β = 0, we get Pr(Nk,n+1 = l+1 | Nk,n = l) = (w + l)/εn. The results are available in Table 3, where we show an example of selection in one observation. Each urn is classified as Ufactor,level, where factor 1 is the network externality and factor 2 refers to the file size.

Table 3. Derived distributions of files by urns. Urns defined as Ufactor,level
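With α = 1 and β = 0 the selection rule reduces to Pr = (w + l)/εn, so ranking the urns amounts to ranking their current counts. A sketch of the multi-urn selection with purely illustrative counts (the real values of Tables 2 and 3 are not reproduced in the text, so the numbers below are our own placeholders) is:

```python
def selection_probability(w, l, eps_n):
    """Pr(N_{k,n+1} = l+1 | N_{k,n} = l) for the alpha=1, beta=0 design."""
    return (w + l) / eps_n

# hypothetical urns: (factor, level) -> current count l for the urn,
# where factor 1 = network externality and factor 2 = file size
urns = {(1, 1): 140, (1, 2): 155, (1, 3): 40,
        (2, 1): 120, (2, 2): 120, (2, 3): 50}
w, eps_n = 1, 160.0

# rank urns by selection probability, highest first
ranking = sorted(urns,
                 key=lambda u: selection_probability(w, urns[u], eps_n),
                 reverse=True)
best = ranking[0]
print(f"selected urn U{best[0]},{best[1]}")
```

With these illustrative counts, the urn for factor 1 (network externality) at level 2 wins the selection, mirroring the example in the text; under a balancing strategy, selecting that urn then triggers the addition of α = 1 files of the under-represented format.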

We then ranked the urns according to a distance function that maps (W1,n − W2,n)/εn, as in Table 4.

Table 4. Ranking of urns according to the probability of selection of the urn

Urn    d
1,2    0.99891
1,1    0.99154
2,2    0.98533
2,1    0.98533
2,3    0.94898
1,3    0.94232

At the current step, we select urn U1,2 (network externality at level 2), as it has the highest probability and leads to the highest possibility of 'balancing' our data. Thus, in our specific case and at this step, we add α = 1 files of type sxw. We then replicated the urn selection process 30 times in order to evaluate the influence of the factors. At each step we chose the urn that maximized the balancing of the final sample. In 24 cases (80%) the network externality factor was leading the urn choice, while in 6 cases (20%) file size was the factor forcing the decision. This shows that network externalities are an important factor impacting the file generation process.

7. Conclusions

We have proposed a new method to measure the use of software in real environments. The method relies on a sophisticated stochastic technique applied to the generation of data standards. We have applied the method to compare the usage of the Microsoft Office and OpenOffice.org suites for office automation. In particular, we analyzed empirically different factors that impact the adoption process of FLOSS through the generation of data standards, namely the size of the files available and the presence of network externalities in the file generation process. As a confirmation of the general theory stating that a large number of files in one format can have a large impact in case of a migration [21], we found the effect of network externalities to be significant, and such as to lead the adoption process towards certain well-defined equilibria. In situations such as the one examined, certain strategies should be adopted in order to ease the transition to the newly introduced data standard. One proposed approach is that managers introduce the FLOSS solution in parallel with the legacy systems already available. By monitoring the evolution of file generation by means of constant measurement, and by adopting strategies that take into account the effects of network externalities, the target is to reach a certain level of balance in the file generation process. Once the balancing of file generation has been reached, it can serve as a starting point from which to attempt a full migration. In any case, we showed by applying the model that, without a strategy that takes network externalities into account, it is difficult to reach a complete balance of file generation when large amounts of proprietary files have already been generated. In this sense, it is easy to see how the history of past file generation can influence the generation of new files and thus the process of software adoption.

8. Limitations

As a major limitation of the work, we must note that the modification and creation dates reported by the Microsoft Windows system are not always precise and can sometimes be misleading. An example of this behaviour has been reported also in [8], where the authors were interested in collecting various statistics about file usage across different file systems.

Another limitation of the study is that the data collection activity refers to a single institution; generalization requires more datasets. In this sense, policies peculiar to the institution, such as training, can have an impact on the analysis performed.

9. Future work

Future work can be seen as evolving in three directions. First, we would like to perform the analysis on files belonging to data standards different from those of office automation, and to cross the results with more datasets. Second, the strength of the path dependent process depends on the model selected. In this paper, only a weaker form of path dependence, in which past events lead the evolution of the current generation, has been investigated. A stronger dependence, in which the order of events is also important, has not been considered. Whether this kind of model is appropriate for modeling the situation of file evolution is part of future work. Third, the possible strategies applied by organizations that potentially impact the file generation process have not been fully considered. In this sense, it will be interesting to perform a first evaluation of more advanced models, in which the specific urn model changes depending on internal and external factors.

10. Acknowledgements

This work has been partially supported by COSPA (Consortium for Open Source Software in the Public Administration), EU IST FP6 project nr. 2002-2164.

11. References

[1] Antonelli, C. (1997). The economics of path-dependence in industrial organization. International Journal of Industrial Organization, 15.
[2] Arthur, W.B. (1989). Competing technologies, increasing returns, and lock-in by historical events. Economic Journal, 99, 116-131.
[3] Bitzer, J. and Schroder, P.J.H. (2003). Competition and innovation in a technology setting software duopoly. DIW Discussion Paper No. 363.
[4] Danish Board of Technology, The. (2002). Definition of Open Standards. Retrieved 14th January 2007, from http://www.oio.dk/files/040622_Definition_of_open_standards.pdf
[5] David, P.A. (1985). Clio and the economics of QWERTY. American Economic Review, 75(2), May.
[6] David, P.A. (2000). Path dependence, its critics and the quest for 'historical economics'. In P. Garrouste and S. Ioannides (eds), Evolution and Path Dependence in Economic Ideas: Past and Present. Edward Elgar Publishing, Cheltenham, England.
[7] Dedrick, J. and West, J. (2004). An exploratory study into open source platform adoption. In Proceedings of the 37th Annual Hawaii International Conference on System Sciences (HICSS'04), Track 8, p. 80265b.
[8] Douceur, J. and Bolosky, W. (1999). A large-scale study of file-system contents. In Proceedings of the 1999 ACM Sigmetrics Conference, pp. 59-70, June 1999.
[9] Durbin, J. and Watson, G.S. (1950). Testing for serial correlation in least squares regression, I. Biometrika, 37, 409-428.
[10] Economides, N. and Katsamakas, E. (2006). Linux vs. Windows: a comparison of application and platform innovation incentives for open source and proprietary software platforms. In J. Bitzer and P.J.H. Schroeder (eds), The Economics of Open Source Software Development. Elsevier.
[11] Eggenberger, F. and Polya, G. (1923). Ueber die Statistik verketteter Vorgaenge. Z. Angew. Math. Mech., 3, 279-289.
[12] Free Software Foundation. The Free Software Definition. Retrieved 10th August 2007, from http://www.gnu.org/philosophy/free-sw.html
[13] Ghosh, R., Glott, R., Krieger, B. and Robles, G. (2002). Free/Libre and Open Source Software: Survey and Study. International Institute of Infonomics. Retrieved 10th August 2007, from http://www.infonomics.nl/FLOSS/report/
[14] Glynn, Fitzgerald and Exton (2005). Commercial adoption of open source software: an empirical study. In Proceedings of the International Symposium on Empirical Software Engineering. IEEE.
[15] Katz, M.L. and Shapiro, C. (1985). Network externalities, competition, and compatibility. American Economic Review, 75, 424-440.
[16] Krechmer, K. (2005). The meaning of open standards. In Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05), Track 7. IEEE Computer Society, Washington, DC.
[17] Perens, B. (1999). Open Sources: Voices from the Open Source Revolution. O'Reilly Media.
[18] Perens, B. Open Standards: Principles and Practice. Retrieved 10th August 2007, from http://perens.com/OpenStandards/Definition.html
[19] Rogers, E. (1995). Diffusion of Innovations. New York: The Free Press.
[20] Rossi, B., Russo, B. and Succi, G. (2007). Evaluation of a migration to open source software. In K. St.Amant and B. Still (eds), Handbook of Research on Open Source Software: Technological, Economic, and Social Perspectives (Chapter XXIV). Hershey, PA: Idea Group Reference.
[21] Shapiro, C. and Varian, H.R. (1999). Information Rules: A Strategic Guide to the Network Economy. Harvard Business School Press.
[22] Siegel, S. and Castellan, N.J. (1988). Nonparametric Statistics for the Behavioral Sciences, second edition. McGraw-Hill, New York.
[23] Simcoe, T. (2006). Open standards and intellectual property rights. In H. Chesbrough, W. Vanhaverbeke and J. West (eds), Open Innovation: Researching a New Paradigm. Oxford University Press.
[24] Wei, L.J. (1978). An application of an urn model to the design of sequential controlled clinical trials. Journal of the American Statistical Association, 73(363), 559-563.
