Computational Statistics (2007) 22:91–108
DOI 10.1007/s00180-007-0023-6
ORIGINAL PAPER

Excel :: COM :: R
Thomas Baier · Erich Neuwirth
Published online: 24 February 2007 © Springer-Verlag 2007
Abstract R is a powerful system for statistical computing. Its great flexibility makes it the perfect tool for a wide range of applications. Unfortunately this flexibility also leads to a level of complexity which is hard to handle for the casual user. On the other hand tools like Microsoft Excel are very easy to handle but are not well-suited for more complex applications. This article describes how to make use of the flexibility of R while still providing a familiar and easy to use GUI in Microsoft Excel. We will provide a description of the design and show the various ways of installation and user interaction with R using Excel.

1 R

An in-depth discussion of R (R Development Core Team 2005b) is far beyond the scope of this article. We will provide a short description of R and show some of the advantages of this system and also some of its disadvantages.

“The R FAQ” (Hornik 2005) provides a short description of R. It starts with the following paragraph:

R is a system for statistical computation and graphics. It consists of a language plus a run-time environment with graphics, a debugger, access to certain system functions, and the ability to run programs stored in script files.
T. Baier, Department of Statistics, Vienna University of Technology, Vienna, Austria
E. Neuwirth (corresponding author), Department of Scientific Computing, University of Vienna, Vienna, Austria
The programming language implemented in R is the S language version 4 (see Chambers 1998). This language is also implemented in a software product called S-Plus by Insightful Corporation (Insightful Corporation 2005). The specific variant implemented in R is described in R Development Core Team (2005e).

R runs on many different platforms, including various flavours of Unix, Microsoft Windows and MacOS X. More detailed information can be found in R Development Core Team (2005d) and Hornik (2005). The “look & feel” on the various platforms may be different, but on all platforms R provides a command prompt where the user can enter textual commands. The character output of the commands is shown in the same (console) window, while the graphical output shows up in distinct graphics windows. The command language used in the command prompt is S version 4. All global symbols or functions can be used by simply typing their names.

In addition to the built-in functions, R can be extended by means of packages. Hundreds of packages are available for public download from one of the CRAN¹ sites. Many of the available packages provide useful functions and objects for use in computational statistics. At the same time other packages provide connectivity to a broad range of applications and data formats. Some examples are

• import and export from/to text files, which is a built-in functionality (found in the base package)
• reading and writing XML files using the package XML (Lang 2005c)
• importing data from EpiInfo, Minitab, S-PLUS, SAS, SPSS, Stata and Systat or exporting data to Stata [package foreign, R-core members et al. (2005)]
• data base connectivity, with e.g., MySQL (DuBois 2000) using RMySQL (James and DebRoy 2005) or via ODBC using RODBC (Lapsley and Ripley 2005)
• the packages RDCOMClient (Lang 2005a), RDCOMServer (Lang 2005b) and rcom (Baier 2005), which can be used to connect R to third party applications using COM on Microsoft Windows operating systems (see Sect. 3 and Microsoft Corporation & Digital Equipment Corporation 1995)
An introduction to R’s connectivity options, mostly focused on data import and export, can be found in R Development Core Team (2005c).

R is developed as an open-source project and distributed under the “GNU GENERAL PUBLIC LICENSE” (Free Software Foundation 1991). Everyone is allowed to use R free of charge, even in commercial projects. The same license or a license similar to GPL applies to many of the packages available for R.

As has been mentioned before, the primary way of user interaction with the system is via R’s command prompt. While this allows very fast interaction for the advanced user and a high degree of flexibility using the S language, this kind of user interface is very hard to use for the casual user. Novice users are presented with a GUI which is not very common for modern applications. User interaction does not follow the model of menu and dialog driven application software. In the case of R, the user is interacting with an interpreter for an object oriented statistical programming language. A command prompt provides a very steep learning curve for the beginner and requires quite an investment in time by users new to the system. Packages like Rcommander (Fox et al. 2005) try to assist in learning R by providing a menu-driven GUI, but from the user’s point of view, the command prompt is still the main user interface.

Many users familiar with applications like OpenOffice (OpenOffice.org 2006) or Microsoft Office do not want to invest the time necessary to learn R. So there is a large group of potential users who cannot use R because it either is too complex or at least seems to be too complex.

In addition to that, there are tasks where the command line is not the ideal way of user interaction. While R contains an integrated spreadsheet component which can be used to comfortably enter and edit tabular data, users familiar with Microsoft Excel or other full featured spreadsheet programs will miss many features they have grown accustomed to.

Therefore we decided to connect Microsoft Excel with R. The software package implementing (among other things) this connection is called R(D)COM Server V2.01 (Baier and Neuwirth 2005) and can be downloaded from http://www.cran.r-project.org/ from section Other.

¹ Comprehensive R Archive Network, available via http://www.cran.r-project.org/.

2 Excel

Microsoft Excel is a current state of the art spreadsheet program. Spreadsheet programs are very convenient tools for numerical computations and in fact computationally equivalent to many programming language based software systems for numerical computation. Spreadsheet programs are distinguished by some important properties:

• creation of formulas with a point and click user interface,
• relative and absolute cell references instead of named variables,
• iteration and multiple computations by copying formulas,
• automatic recalculation when input values change
These properties create a unique way of interaction with data. Since changing cell contents immediately also changes computed results, spreadsheets support a very explorative way of analyzing data. The fact that the algebraic notation for formulas is not the primary way of interacting with formulas makes spreadsheets accessible for a much wider audience than programming languages for numerical computations. Additionally, it is also not too difficult for the average user to change formulas, and therefore spreadsheet programs are not closed application programs (like accounting systems) but allow end users a scaled down version of programming and even software development. These aspects of modeling allow smooth progress from simple tasks like invoicing, accounting, and very simplistic statistics to quite complex statistical and mathematical models. These end user programming ideas are discussed in Nardi (1993), and a
detailed account of the modeling aspects of spreadsheets is given in Neuwirth and Arganbright (2003).

An additional advantage of Excel is its integration into the Windows desktop. Transferring data and images between Excel and other applications is very easy, and it is even possible to embed parts of Excel sheets into text documents in a way that the text document will be updated automatically when the Excel sheet contents change.

Excel even has some support for statistics, both in the form of worksheet functions and in the form of menu based procedures, but these methods are not well accepted by the statistical community. The numerical precision of some of these methods (e.g. methods based on matrix inversion) is not very good, and in many cases, the parametrization of the arguments of the functions is somewhat strange. Performing complex statistical analyses with Excel without any extensions is not advisable.

To alleviate the situation, Excel has an Add-In mechanism. All current versions of Microsoft Office programs have a built-in programming language (VBA) and an integrated development environment (IDE) for this language. The language is reasonably complete, and it allows access to other libraries installed either on the same or even on a different computer accessible by a network connection. The Add-In mechanism allows programmers to define new worksheet functions seamlessly integrated into Excel (they can be used like the functions built into Excel’s core engine). Add-Ins also can add menus and dialog boxes to Excel, making procedures supplied by other libraries accessible through an extension of Excel’s user interface.

Our Excel extension package uses this Add-In mechanism to allow Excel to call R directly from formulas in cells, and we also allow users and programmers to call R methods from their own VBA programs. As a third way of connecting Excel and R we added operations started from menu items on additional menus integrated into Excel’s main menu and the cell context menu (available when a cell is right-clicked).
3 COM

For embedding R into Microsoft Excel, the DCOM technology is used. COM (Microsoft Corporation & Digital Equipment Corporation 1995) is shorthand for component object model, a technology widely used on Microsoft Windows platforms to encapsulate functionality in a common way. This makes it possible to use such a component exposed by a so-called server application in a client application. A component is a set of functionality and data (an object). It can be as simple as, e.g. the encapsulation of a simple integer value and as complex as a whole application like Microsoft Excel. The services COM provides and its whole architectural model are very similar to CORBA (Object Management Group 2002), an object model used mostly on Unix platforms. COM

• defines a way for server applications to expose objects to their clients,
• defines methods to handle object life-time by enforcing the use of a simple reference-counting mechanism and
• provides standardized mechanisms for object creation and sharing objects between different processes.
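The reference-counting rule in the second item can be pictured with a small model. The following is only a sketch in Python, not COM code: the class and its attribute names are invented for illustration, and the two methods merely mirror the roles that AddRef and Release play on a real COM object.

```python
class RefCountedComponent:
    """Toy stand-in for a COM object; not a real COM implementation."""

    def __init__(self, name):
        self.name = name
        self.ref_count = 0
        self.alive = True

    def add_ref(self):
        # A client that keeps a pointer to the object increments the count.
        if not self.alive:
            raise RuntimeError("object already destroyed")
        self.ref_count += 1
        return self.ref_count

    def release(self):
        # A client that drops its pointer decrements the count.
        if self.ref_count == 0:
            raise RuntimeError("release without matching add_ref")
        self.ref_count -= 1
        if self.ref_count == 0:
            # Last reference gone: the server may now free the object.
            self.alive = False
        return self.ref_count


component = RefCountedComponent("StatConnector")
component.add_ref()      # first client pointer
component.add_ref()      # second client pointer
component.release()      # one pointer dropped, object still alive
still_alive_after_first_release = component.alive
component.release()      # last pointer dropped, object destroyed
```

The point of the model is that no client decides on its own when the object dies; the object goes away exactly when the last holder lets go of it.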
COM components are used very similarly to objects from a class library. The component author can decide whether the component is integrated into the client application (running in the same process context as the client application) or if it runs in a separate process. In the latter case, COM transparently handles sharing components across process boundaries and so allows components provided by one executable file or program to be integrated into another one.

For a client application (like Microsoft Excel) the component itself is treated just like an object provided by the application itself. Using this component then is similar to calling an internal Visual Basic for Applications (VBA, Microsoft Corporation 2001b) function or object. Therefore, with respect to the programming language, nothing new has to be learnt; the programmer just has access to additional objects. Of course, the properties and methods of the object itself are new. The integration into the VBA IDE, however, is so tight that it is even possible to use the integrated object browser in the IDE to browse objects provided via COM.

One of the major advantages of COM itself is its wide support on the Microsoft Windows platform. Nearly every programming language, e.g. Visual Basic (Microsoft Corporation 2001c), Delphi (McNab et al. 1996) or C++, scripting languages like JavaScript (Flanagan 2001), Perl (Wall et al. 1996, or more specifically ActiveState Tool Corporation 2000 for a Windows version able to use COM) or Python (Martelli 2003), and even applications providing macro support such as the Microsoft Office family of products (VBA) provide support for using the functionality exposed through COM objects by a server application.

DCOM stands for distributed COM and extends the COM model with a very important feature. While COM itself provides methods for performing function or method calls across process boundaries, DCOM goes one step further. DCOM makes COM objects transparently available in a network of computers. In our previous example of Microsoft Excel utilizing R as a computational component, DCOM now allows the component (R) and the client application (Microsoft Excel) to run on different machines. DCOM will take care of the necessary communications over the network to make the services exposed by the R component available to Excel.

COM requires developers to separate interface and implementation. Along these lines we have defined a COM interface called IStatConnector (see Sect. 4) which formally defines the functionality our COM server provides. Client applications work with the COM interface and the true binding to the implementation is done when the COM object is instantiated (while the application is running). This separation of interface and implementation and the run-time binding mechanism make it possible to ensure compatibility between different versions
of the COM server. For example, client applications created in 1999 for the first release of our COM server for R still work without modifications with the current version released in October 2005. The COM interface defines the functions (and variables) a COM object provides. An implementation of a COM interface is called a coclass.

In the last few years, Microsoft has developed a new component technology as part of the .NET (Microsoft Corporation 2001a) system. Why we chose COM as our component technology is easy to explain:

• COM and DCOM can be used on all 32 Bit Windows platforms (Windows 9x, ME, NT 4, 2000, XP and even Windows CE/Pocket PC)
• mature COM support is found in most programming languages and applications, while good support for .NET’s component technology is still not found very often. Even Microsoft Excel does not have native .NET support at the moment.
• when the concepts for integrating R into Excel were developed back in 1999, .NET was not available at all
• the .NET → COM bridge technology allows our COM components to be used from .NET applications quite well.
For the near future we want to stay with COM as our base component technology, but to make it easier for the new .NET environment, we will develop native .NET components both for the computational components and for the controls and applications. Since the .NET technology is fully documented and also implemented on non-Microsoft platforms (Mono Project 2006), this will allow our mechanism to be used on a wider range of operating systems.

4 Integrating Excel and R

Our goal was to make R’s computational engine available to third party applications in general, and to Microsoft Excel in particular. In COM terminology, this makes Excel a COM client application and R a COM server. When talking about COM, this also includes the DCOM technology. COM clients do not distinguish between COM and DCOM when accessing an object’s methods and properties. Only the client machine’s configuration and the process of object creation may be different.

In this integrated system with components Excel and R, Excel (the client) is the controlling part, whereas R (the server) offers its services on request to Excel. Figure 1 shows the connection between the two applications including data flow.

Fig. 1 Microsoft Excel uses R’s computational engine via COM (diagram: Excel sends eval(expr)/set data to R via COM; R sends result data back to Excel)

The COM server provides R’s functionality through a COM interface called IStatConnector. Below we show relevant parts of this COM interface.

    interface IStatConnector : IDispatch
    {
      // starting and stopping the interpreter
      HRESULT Init([in] BSTR bstrConnectorName);
      HRESULT Close();

      // setting and retrieving symbol data
      HRESULT GetSymbol([in] BSTR bstrSymbolName, [out,retval] VARIANT* pvData);
      HRESULT SetSymbol([in] BSTR bstrSymbolName, [in] VARIANT vData);

      // evaluating an expression in the interpreter
      HRESULT Evaluate([in] BSTR bstrExpression, [out,retval] VARIANT* pvData);
      HRESULT EvaluateNoReturn([in] BSTR bstrExpression);
      ...
    };
Our work is centered around the coclass StatConnector, which is an implementation of the IStatConnector COM interface. Excel is connected to R by an Add-In for Microsoft Excel called RExcel. From StatConnector’s point of view, the Add-In performs the following tasks to integrate R into the spreadsheet application:

1. Excel (by means of the Add-In) creates an instance of the IStatConnector interface, and
2. calls Init on the COM object to start up R
3. normal operation: either interactively or using Excel’s recalculation loop, transfer data to R, perform computations in R and get result data back to Excel
4. shut down R by calling Close
5. release the COM object
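The five steps can be traced with a small mock. This is a hedged sketch only: the class below is an in-memory stand-in written in Python, it does not talk to COM or to R, and it evaluates Python expressions where the real server would evaluate R code. Only the method names Init, SetSymbol, GetSymbol, Evaluate and Close are taken from the interface shown above; the argument passed to Init and everything else are assumptions made for illustration.

```python
class MockStatConnector:
    """In-memory stand-in using the method names of IStatConnector.

    Hypothetical: it evaluates Python expressions instead of starting R.
    """

    def __init__(self):
        self._symbols = None  # None means "interpreter not started"

    def Init(self, connector_name):
        # Step 2: start a fresh, empty interpreter environment.
        self._symbols = {}

    def SetSymbol(self, name, value):
        self._require_started()
        self._symbols[name] = value

    def GetSymbol(self, name):
        self._require_started()
        return self._symbols[name]

    def Evaluate(self, expression):
        # Returns only after the expression has been computed.
        self._require_started()
        allowed = {"__builtins__": {}, "sum": sum, "len": len}
        return eval(expression, allowed, self._symbols)

    def Close(self):
        # Step 4: shut the interpreter down and discard its environment.
        self._symbols = None

    def _require_started(self):
        if self._symbols is None:
            raise RuntimeError("call Init before using the connector")


connector = MockStatConnector()                 # step 1: create the object
connector.Init("R")                             # step 2: start the interpreter
connector.SetSymbol("x", [1.0, 2.0, 3.0])       # step 3: transfer data,
mean_x = connector.Evaluate("sum(x) / len(x)")  #         compute, get result
connector.Close()                               # step 4: shut down
connector = None                                # step 5: release the object
```

The mock makes the ordering constraint visible: any call made before Init or after Close fails, which is the reason the Add-In brackets all normal operation between those two calls.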
The functions (methods) will return on completion of the requested operation. E.g., Evaluate takes an expression as its input argument. Control is then handed over to R. The client application waits until R has finished computing the expression and the result of the computation is returned to Excel (see Fig. 1). The COM server (R) only reacts to Excel’s requests.

When initializing the COM object, a fresh R environment is created and initialized. In our case, this “R instance” only belongs to the Excel Add-In. If another application (or another instance of Excel) is started and wants to use R for its own purpose, a different R process is created. All applications are using separate R processes. This is similar to running R multiple times from the command line at the same time.

Our COM server is a true back-end application. It does not provide any kind of user interface. Even the command prompt described in the first section is invisible to the user. This makes it possible to truly embed R’s computational engine into
another application (Excel in this case) and completely hide R’s own GUI from users.

The COM interface IStatConnector used by Excel (or more specifically: RExcel) is separated from the interface’s implementation, the coclass StatConnector. The package rcom makes use of this concept and provides an alternative implementation of the IStatConnector interface which displays R’s “normal” GUI and allows manual interaction with R in parallel to using the COM client. By simply changing the object creation mechanism in Microsoft Excel, we can exchange the COM implementation and provide access to an R process with its own GUI accessible for the user, while R is still integrated into Excel.
5 Concepts of the implementation in and around R

The COM server (the coclass StatConnector) mainly consists of two parts. The first part is tightly coupled with the implementation of R itself; the second is the “real” implementation of the COM interface IStatConnector. It is not the goal of this article to give a detailed description of either the R implementation or of the COM implementation. We will only provide a short description of the design goals and the advantages (and disadvantages) of our approach.

On Windows platforms, as well as for most other operating systems, GCC (see Stallman 2005) is used to build the R executables and libraries from source code. When we started our project, there was not much support for creating COM servers using GCC. Commercial compilers, on the other hand, provided good support to create COM server applications and contained class libraries making creation of COM servers an easy task. Unfortunately, unlike on most Unix-alike platforms, interoperability between different vendors’ C and C++ compilers is not easily possible. Only when using the so-called system calling convention, which is defined for C code only, was it possible to create implementations with one compiler (e.g., with GCC) which could safely be called from an implementation compiled with a different compiler (in our case Microsoft VC++). We therefore placed an abstraction layer between R and the COM server. This abstraction layer only uses C functions (no COM or C++) and utilizes Microsoft Windows’ system calling convention. This guarantees that the functions can safely be called from a C program compiled with any C compiler for Windows.

The abstraction layer (referred to below as the proxy object SC_Proxy_Object) consists of a set of pointers to C functions stored in a structure. It defines not only functions to access R but also a data format which maps the R-specific internal storage format (SEXPs, see R Development Core Team 2005f for more information) to the so-called BDX data format (Binary Data eXchange format), designed specially for this goal. This data format has been designed for efficient (structured) data exchange with as few memory/conversion operations as possible.

The proxy object SC_Proxy_Object is a general interface object and its definition is independent from R. The same interface object (definition) and the data format could be used for other systems than R, too (e.g., GNU Octave,
see Eaton 2005). The implementation in rproxy.dll makes extensive use of the R API as described in R Development Core Team (2005f) and is tightly coupled to R. Its implementation is part of the R source code and the binary for rproxy.dll is delivered with the Windows binary distributions of R.

SC_Proxy_Object and BDX are both stable interfaces and make the COM server itself (the coclass StatConnector) independent from the R version. This decouples the COM server from R in a way that release cycles for R and the COM server can be independent from each other.

StatConnector is an out-of-process server written with Microsoft Visual C++ 6.0 using Microsoft’s Active Template Library (ATL). Its connection to R is based on the SC_Proxy_Object interface. The interface’s implementation is loaded dynamically from rproxy.dll. Data transfer is performed using the BDX format. The COM server’s main goals are

• provide an implementation of IStatConnector
• start and stop R
• convert from BDX to the VARIANT data format used in the COM interface and vice versa
• perform error handling
• allow callback objects to be installed for R for e.g., graphics or text output (see Sect. 7) and handle the callback functionality
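The idea of a proxy object as a table of entry points, which lets the server stay unaware of what sits behind it, can be illustrated with a short Python sketch. It is not the actual SC_Proxy_Object definition, whose members are not listed in this article: the field names init, evaluate and terminate are invented here, and the "interpreter" is a trivial stand-in that only adds two numbers.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ProxyObject:
    """Hypothetical table of entry points, loosely modelled on the idea
    of SC_Proxy_Object; the field names are invented for this sketch."""
    init: Callable[[], None]
    evaluate: Callable[[str], Any]
    terminate: Callable[[], None]


def make_arithmetic_backend(log):
    """Build a proxy whose 'interpreter' only understands 'a+b'."""
    def init():
        log.append("init")

    def evaluate(expression):
        left, right = expression.split("+")
        return float(left) + float(right)

    def terminate():
        log.append("terminate")

    return ProxyObject(init=init, evaluate=evaluate, terminate=terminate)


def run_server(proxy, expression):
    """Server-side code: it only ever touches the proxy's entry points."""
    proxy.init()
    try:
        return proxy.evaluate(expression)
    finally:
        proxy.terminate()


calls = []
result = run_server(make_arithmetic_backend(calls), "1.5+2")
```

Because run_server depends only on the three entries of the table, a different backend could be supplied without touching it, which is the kind of decoupling the text attributes to the proxy object.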
StatConnector is implemented as a single-use out-of-process server. This means that every application creating a StatConnector object gets its own server process where the COM object lives. R is running in the process context of the StatConnector object. Therefore, separate R processes exist for every client application.

The overhead created by the COM server is that of an out-of-process COM call (including marshaling and VARIANT data transfer) vs. a direct function call to achieve the results. The cost of converting the data from VARIANT format to BDX format and then to R’s SEXP storage format is negligible for most applications. The same holds true for forwarding the COM method calls to R. The COM server itself provides very fast and easy access to R. When creating interactive applications, the performance bottleneck is mostly found in long-running scripts or expressions on the R side.

The advantages and disadvantages of this design and implementation approach are obvious now:

+ COM makes it easy to access R from many different applications and programming languages.
+ StatConnector provides an implementation of the stable interface IStatConnector. This guarantees compatibility for client applications using IStatConnector over time while still benefiting from improvements found in new versions of the StatConnector implementation.
− Because of two different interfaces (COM interface IStatConnector and C interface SC_Proxy_Object) special care must be taken to ensure compatibility. Since the first release in 1999, full compatibility for client applications relying on IStatConnector could be guaranteed. In this time many new versions of both StatConnector and R have been released.
+ SC_Proxy_Object provides a stable interface for C and C++ programmers using any Windows C compiler. For C programmers this may be easier than going the COM way via IStatConnector but still decouples the implementations from a specific R version.
+ Changes in R may require code to be changed. As the R interface is implemented in rproxy.dll and rproxy.dll is part of R (and the R distribution/setup) this can be done easily. It is very unlikely to have to make changes to the StatConnector implementation because of changes in R. This helps keep maintenance cost low.
+ Installing a new R version does not require switching to a new version of StatConnector in most cases.
+ Multiple versions of R can be installed at the same time and used by the client applications by simply setting a registry key to point to the version of R which shall be used.
+ Different applications use different R processes. This makes the client applications independent from each other and does not require any cooperation between them.
− A small overhead for data conversion and additional function call overhead is imposed by this architecture. Practically, this does not have any impact on a typical application’s performance. Minimizing data transfer and function calls keeps this overhead low.
+ Problems in R code or maybe some bug in a package (resulting in a crash) do not affect the client application. The client application always only gets an error code from the COM implementation and can handle faults like those gracefully.
+ The same infrastructure which is used by StatConnector can be used by alternative implementations, too. rcom (see Baier 2005) is another COM server for R implementing IStatConnector. The implementation reuses rproxy.dll to provide a different level of integration between the COM client and R and also makes it possible to implement a different user interface paradigm.
Next, we will describe how the Add-In for Microsoft Excel uses StatConnector to integrate both spreadsheet and R functionality.

6 Excel implementation

The COM interface IStatConnector only supplies a very basic mechanism for communication between Excel and R. Besides the administrative tasks of initiating the R process and shutting it down, it allows the client to

• send data to R,
• retrieve data from R and
• send a string containing R commands to be executed
R has many complex data types implemented in R’s object system. Excel essentially only has vectors (columns or rows) and matrices. The more complex data types of R are conceptually incompatible with Excel’s tabular paradigm. Therefore, our interface between the two applications only handles arrays (containing only one basic data type like string or real) and dataframes. Dataframes in Excel are represented as arrays of columns. Each column has a name (of the variable) in the top row and consists of data of equal type (string, real, time, …). Different columns may have different types. The interface makes it possible to transfer rectangular areas of cells (called “ranges”) of one underlying data type to R as an array, and to transfer ranges following the “dataframe convention” to R as a dataframe. Similarly, scalars, vectors, and arrays in R can be transferred to Excel ranges.

The current version of the interface will even handle date, time and complex numbers reasonably. Both data types are defined in Excel and in R, but they are not implemented in the same way. Therefore, great care must be taken when transferring these data types.

Incidentally, most parts of RExcel are implemented in VBA, which is an interpreted language. To speed up data transfer for large arrays and dataframes, some routines had to be implemented in compiled Visual Basic.

An additional problem is the handling of missing values. Excel treats empty cells differently under different conditions. For arithmetic functions, in many cases empty cells are treated the same as cells containing the value 0. For statistical projects, this is a serious issue which has been extensively documented in McCullough and Wilson (2002). Therefore, our interface makes it possible to specify different methods of handling empty cells, and furthermore allows for different conventions of indicating missing data in Excel.

To allow Excel to connect to R, the interface installs a new menu item RExcel in Excel’s main menu. This menu item opens a submenu containing, among other items, commands to transfer the currently selected data to R and to transfer an array or a dataframe from R to a range in Excel. This menu also has an item for connecting to R and for selecting the type of R server to be used (R(D)COM or rcom).

In addition to transferring data from Excel to R and back, a mechanism for executing R procedures and functions from Excel is needed. If the rcom mechanism is used, starting R brings up an R command line, so the user can run R commands from this interface in the same way he would interact with a standard R GUI. The advantage is that data can be transferred from Excel most easily, and results can be transferred directly into Excel. If the underlying R process is the R(D)COM server, however, no command line interface is available. Therefore, another way is needed to run R commands. RExcel makes it possible to enter R commands as text into Excel cells. Then, a range can be selected interactively and the text in these cells will be interpreted as a sequence of R commands and executed. This way, Excel ranges become the R command line. Additionally, the R code is saved as part of the worksheet and therefore one Excel file can contain all the data and the R code needed to perform complete statistical analyses.
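The “dataframe convention” and the choice of how to treat empty cells, both described earlier in this section, can be made concrete with a small sketch. The code below is illustrative Python, not part of RExcel: the function name, the empty_policy parameter and its two values are assumptions made for this example, and empty cells are represented by None.

```python
def range_to_dataframe(rows, empty_policy="missing"):
    """Convert a rectangular range (list of rows) into named columns.

    The first row holds the variable names. Empty cells are given as None.
    empty_policy is a hypothetical switch: "missing" keeps None as a
    missing-value marker, "zero" mimics Excel arithmetic by using 0.
    """
    if empty_policy not in ("missing", "zero"):
        raise ValueError("unknown empty_policy: " + repr(empty_policy))
    header, *data_rows = rows
    columns = {name: [] for name in header}
    for row in data_rows:
        if len(row) != len(header):
            raise ValueError("range is not rectangular")
        for name, cell in zip(header, row):
            if cell is None and empty_policy == "zero":
                cell = 0
            columns[name].append(cell)
    return columns


cell_range = [
    ["y", "x1", "group"],
    [1.5, 2.0, "a"],
    [None, 3.0, "b"],   # empty cell in column y
    [2.5, None, "a"],   # empty cell in column x1
]

as_missing = range_to_dataframe(cell_range)
as_zero = range_to_dataframe(cell_range, empty_policy="zero")
```

With the "zero" policy the mean of y would be computed from three values, one of them an artificial 0, while the "missing" policy leaves the decision to the statistical code on the receiving side. This difference is why the handling of empty cells has to be an explicit choice.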
When working with R code from within Excel, debugging can become rather tedious. Therefore, the interface offers tools to help with debugging. There is a special debug mode where all R commands executed are displayed in a special popup window. When an error occurs, this window will also display R’s error messages. So Excel can even be used as a mini development environment.

The interface also has a command for getting the output of the last command executed by R. This output can be put into a cell range in Excel as text. This is useful for commands producing output which cannot easily be represented as arrays or dataframes. This way, when programming R one can inspect the results of performing operations in R in an informal manner.

Using this mechanism (data transfer in both directions, execution of R commands initiated by Excel), it is possible to write Excel macros performing statistical tasks and start them from menus in Excel. This way, complete statistical applications can be written in Excel and the user only sees some additional menu items performing these tasks. RExcel enhances Microsoft Excel with statistical methods not part of Excel itself. These enhancements are integrated seamlessly into the GUI and Excel’s user interface paradigms, like, say, Excel’s solver for multivariate equation solving and optimization.

To be able to implement such embedded applications, the implementor has to know R, the spreadsheet part of Excel, and VBA. The hub for such applications is VBA. Macros written in this language take care of data transfer. R commands to be run are constructed as strings in VBA and then executed by calling an appropriate procedure in VBA.
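Here is a typical small example demonstrating the usual pattern of using R in Excel this way:

    Sub RegreDemo()
        ' start the R server
        Call RInterface.StartRServer
        ' transfer the range as a dataframe named mydf
        Call RInterface.PutDataframe("mydf", _
            Range("Regression!A1:C26"))
        ' run an R command given as a string
        Call RInterface.RRun("attach(mydf)")
        ' evaluate an R expression and put the result at F2
        Call RInterface.GetArray("lm(y~x1+x2)$coefficients", _
            Range("Regression!F2"))
        ' shut the R server down again
        Call RInterface.StopRServer
    End Sub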
Excel allows parameterless macros to be started directly from menu items or toolbar buttons. Therefore, an Excel spreadsheet can have a menu with an item performing such an operation. For a naive end user, performing such an operation looks identical to using one of Excel’s menu-based tools (e.g. sorting data).

The most important feature of spreadsheets is automatic recalculation. The R-Excel integration methods described so far have not linked R with this Excel feature, but there is a special mechanism integrating R into Excel’s automatic recalculation loop. There are some Excel functions (defined in VBA) which perform R computations. For example,

RApply("pchisq",C4,D4,E4)
computes the inverse (noncentral) χ²-distribution function for a probability value given in cell C4, degrees of freedom in cell D4, and noncentrality parameter in cell E4. Whenever the contents of one of these cells are changed, Excel will immediately call R and update the computed χ²-value. Therefore, our interface integrates R’s computational engine with Excel’s automatic recalculation features, producing an R spreadsheet program. The mechanism for creating formulas using R is Excel’s own mechanism for creating formulas: point-and-click can be used to indicate the position of the parameter values of function calls. This integration provides a radically different approach from the usual batch-oriented way of using R.

R not only offers powerful computational features, it also offers a wide range of statistical graphical representations. At the moment, R graphics are not integrated into Excel in the same way as native Excel graphics, but it is relatively easy to produce R graphics and get a snapshot of the image into Excel. RExcel can execute any R command, so graphics can be produced, too. Such graphics are displayed in a window belonging to the R process; they are not embedded in an Excel worksheet. R graphics can be copied to the clipboard either manually, with the menu commands available in graphics windows, or with the savePlot command. After copying the graphics (preferably in a vector format like WMF), pasting the clipboard contents into an Excel worksheet will embed the chart in the worksheet. This kind of chart, however, behaves differently from native Excel charts: changing the data will not automatically change the chart. The copy-paste cycle needs to be repeated manually to get an updated chart. This will be changed in future releases of RExcel. This technique can only be used when the R server is running on the same machine as Excel.
There is another way of combining Excel’s graphics features and R, and using this mechanism it is possible to produce animated graphics. In the worksheet in Fig. 2, the numbers partially covered by the graph are the numerical representation of a kernel density estimator. These numbers are computed by R. The slider at the top of the window controls the window width for the density estimator. Whenever the slider is moved, thereby changing the window width, the numbers are recomputed by R and the graph (an x-y chart produced by Excel) is updated. In this example, Excel initiates R’s computation whenever necessary to update the graph.

An important consideration when designing RExcel was that it should support different user interaction modes:

• scratchpad and data transfer mode: menu-controlled data transfer from R to Excel and back, immediate command execution either from Excel cells or from the R command line.
• macro mode: macros invisible to the user control data transfer and R command execution.
• spreadsheet mode: formulas in Excel cells control data transfer and command execution; automatic recalculation is controlled by Excel.

Our current implementation supports all three modes. Future releases will concentrate on more complete graphics integration.
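The kind of computation behind the slider example can be sketched in R as follows. This is an illustration only: the sample is simulated, the helper density_table and its arguments are hypothetical names introduced here, and the window width is mapped to the bw argument of R’s density function (the standard deviation of a Gaussian kernel), which may differ from the parametrization used in the worksheet of Fig. 2.

```r
# Made-up bimodal sample standing in for the data in the worksheet.
set.seed(42)
obs <- c(rnorm(60, mean = 0), rnorm(40, mean = 4))

# Hypothetical helper: return the (x, y) pairs of a kernel density
# estimate for a given window width. A table like this is what an
# x-y chart could be drawn from.
density_table <- function(obs, width, n_points = 50) {
  d <- density(obs, bw = width, n = n_points)
  data.frame(x = d$x, y = d$y)
}

# Two slider positions: a narrow and a wide window.
narrow <- density_table(obs, width = 0.2)
wide   <- density_table(obs, width = 1.5)

# The narrow window follows the data closely and gives higher peaks;
# the wide window smooths the two modes together.
print(c(narrow = max(narrow$y), wide = max(wide$y)))
```

Each change of the width produces a new table of the same shape, so a chart bound to a fixed cell range can simply be redrawn with the new values.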
Fig. 2 Excel graphics using results computed by R
7 Additional tools

So far, this article has focused on the “core components”, which are the coclass StatConnector (including the COM interface IStatConnector) and the Microsoft Excel Add-In RExcel. In other words, the missing link between R’s computational engine and the mathematical (or statistical) part of Microsoft’s Office suite has been discussed. Looking at R itself, it is obvious that some important parts of R’s features have been omitted so far: graphics and text output.

Achieving graphics output seems to be very simple: when one of R’s graphics commands (e.g. plot) is called, R (or more precisely, the R instance running in the COM server) will open an R graphics window and show the graphical output there. Although this approach looks like a suitable solution at first glance, on second thought it cannot be the right one. The graphics window is opened by the COM server and also “belongs” to the COM server process. If the COM server is run on a remote machine, the graphics window will be shown on the remote machine, too.²

² This is a very simplified explanation of the mechanism. In reality, the COM server tries to open the graphics window on the remote machine, but this will only succeed if launch and run permissions are set appropriately and the login state of the remote machine allows the window to be shown.

The correct solution is to show R’s display window on the local machine while it is controlled from the remote machine (the graphics are “drawn” by the COM server). This is achieved by providing a so-called Active X control (see Cluts 2001). Active X controls are user interface components which can be shown in a window or form. The “programmable” interface of the control is represented by a (custom) COM interface. The implementation uses the same mechanism to communicate between R and the Active X control as the Excel Add-In does to talk to R. The Active X control is a COM object, and the R COM server on the remote machine holds a reference to the control on the local machine. This “callback mechanism” is implemented in rproxy.dll.

To capture R’s text output (e.g., text appearing in the console window produced using cat), another Active X control is provided. In addition to the GUI representation (the Active X control StatConnectorCharacterDevice), a non-GUI object is also provided. The coclass StringLogDevice stores all text output in a string variable and provides a way to access R’s text output programmatically. By using the core component StatConnector and the output components StatConnectorGraphicsDevice and StringLogDevice, any COM client application can make full use of R both as a powerful computational component and as a high-quality graphics engine (Fig. 3).

Fig. 3 Graphics output in Microsoft Excel form (via Active X)
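The idea of collecting R’s text output in a string rather than in a console window can be illustrated within R itself. The sketch below uses capture.output, a standard R function; it is an analogy only and says nothing about how StringLogDevice is implemented. The variable names are introduced here for illustration.

```r
# Collect everything the enclosed code prints into a character vector,
# one element per line of output, instead of showing it on the console.
log_lines <- capture.output({
  cat("Fitting model\n")
  print(summary(c(1, 2, 3, 4)))
})

# Join the lines into a single string that a client program could store
# in a variable or write into a worksheet cell.
log_text <- paste(log_lines, collapse = "\n")

cat(log_text, "\n")
```

Output captured this way is unstructured text, which is why such a mechanism complements, rather than replaces, the transfer of arrays and dataframes.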
8 Excel/R communication modes

Communication between Excel and R takes place using COM or DCOM. In both cases, RExcel can use two different mechanisms to invoke functionality in R. Method calls (e.g., Evaluate or GetSymbol) can be made using the so-called custom interface (IStatConnector) or using a dispinterface (which uses IDispatch to access IStatConnector’s methods).

A call through the custom interface IStatConnector is comparable to a function call in C or C++. It provides strong typing (checking of arguments and data types) and is the fastest way to access a COM object. The COM client must have intimate knowledge of the COM interface it wants to use (both at compile time and at run time).

Alternatively, the dispinterface can be used. In this case, the methods are accessed through a generic COM interface, IDispatch. To issue a call to an IStatConnector method, IDispatch’s method Invoke is called (using IDispatch’s custom interface), and Invoke is told to call a method in IStatConnector. This additional level of indirection makes dispinterfaces a bit slower, but the COM client does not have to exactly
know the internals of IStatConnector. For example, when using IDispatch it is enough to know the method’s name and parameters; the client does not have to know whether the method is, e.g., the first or the second function in the interface. The drawback is that errors show up only at run time when a dispinterface is used, because of the lack of type safety.

To install RExcel in a way that it can use the custom (strongly typed) interface, the user performing the installation has to have administrator rights on the machine, even when the R DCOM server is running on a different machine. However, there is a way of installing RExcel for use with a remote server which does not need administrator rights. This method allows users in an environment with tight access restrictions to quickly install RExcel on machines where they do not have administrator privileges.

It is also possible to install RExcel on a client machine without installing R and still use strongly typed access. In this case, the type libraries containing information about the signatures of the functions supplied by R DCOM (i.e., the interface definitions) are required on the client machine, but not the server binaries (the R(D)COM program) themselves.

Installing the R(D)COM server (just as installing any other COM server) always requires administrative (or at least “power user”) privileges. It is possible to install R (and the R(D)COM server) on one central server. Installation on the client machines can then be done by a “normal” user and does not require administrative privileges. The client machines can then use the R installation on the centralized server machine. This is a reasonable setup for an environment with one powerful server and less powerful client machines. In this case, RExcel serves as the user interface to R.

Let us summarize the differences between the interfaces:

• advantages and disadvantages of typed use
  + type-safe
  + easy to find bugs during development
  + easier to find runtime errors
  + less complex: easier to find setup errors
  + fast
  + more flexible: can support more data types (e.g. structs, unsigned integers)
  − requires (registered) type library for local and remote use
  − Excel requires type library even to load Add-In
• advantages and disadvantages of dispinterface
  + can run R remotely without any local components of the COM server (not even a type library is required)
  + works with all COM clients (e.g. scripting languages)
  + when no R components are installed, Excel can still load the Add-In
  − cause of errors often hidden (e.g. hard to distinguish between errors in setup, programming, and communication (for remote R))
  − requires additional component (DLL) for running without type library
  − more complex way of calling functions (indirectly, via name/id) makes it slower and more error-prone
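The difference between calling a method directly and calling it by name through one generic entry point can be mimicked in a few lines of R. This is a loose analogy, not COM code: the list connector and the function invoke are hypothetical, and the element names Evaluate and GetSymbol are merely borrowed from the methods mentioned in the text. Since R has no compile-time checking, the sketch shows only the extra lookup step and the run-time nature of errors, not the type safety of a custom interface.

```r
# Hypothetical stand-in for an object offering two methods.
connector <- list(
  Evaluate  = function(expr) eval(parse(text = expr)),
  GetSymbol = function(name) get(name, envir = globalenv())
)

# Direct style: the caller refers to the method itself.
direct <- connector$Evaluate("1 + 2")

# Name-based style: a single generic entry point looks the method up
# by its name at run time and then forwards the arguments.
invoke <- function(obj, method, ...) {
  f <- obj[[method]]
  if (!is.function(f)) stop("unknown method: ", method)
  f(...)
}
by_name <- invoke(connector, "Evaluate", "1 + 2")

# A misspelled method name is detected only when the call is made.
err <- tryCatch(
  invoke(connector, "Evalute", "1 + 2"),
  error = function(e) conditionMessage(e)
)

print(c(direct = direct, by_name = by_name))
print(err)
```

Both calls return the same value; the name-based call pays for its flexibility with one additional lookup and with errors that surface late.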
References

ActiveState Tool Corporation (2000) Active Perl, 5.6.0.618 edn. ActiveState Tool Corporation. http://www.ActiveState.com/ActivePerl/
Baier T (2005) rcom: R COM Client Interface and internal COM Server. R package version 1.2.1
Baier T, Neuwirth E (2005) R (D)COM Server V2.00. http://www.cran.r-project.org/other/DCOM
Chambers JM (1998) Programming with Data. Springer, New York. ISBN 0-387-98503-4. http://www.cm.bell-labs.com/cm/ms/departments/sia/Sbook/
Cluts N (2001) Microsoft ActiveX controls overview. In: ‘MSDN Library’, vol. Backgrounders, Microsoft Corporation. http://www.msdn.microsoft.com/
DuBois P (2000) MySQL. New Riders
Eaton JW (2005) Octave: interactive language for numerical computations. University of Wisconsin, Department of Chemical Engineering. http://www.octave.org/doc/index.html
Flanagan D (2001) JavaScript: the definitive guide, 4th edn. O’Reilly Media, Inc. ISBN 0596000480
Fox J with contributions from Michael Ash, Grosjean P, Maechler M, Putler D, Wolf P (2005) Rcmdr: R Commander. R package version 1.1-1. http://www.r-project.org, http://www.socserv.socsci.mcmaster.ca/jfox/Misc/Rcmdr/
Free Software Foundation (1991) GNU GENERAL PUBLIC LICENSE. Version 2
Free Software Foundation (1999) GNU LESSER GENERAL PUBLIC LICENSE. Version 2.1
Hornik K (2005) The R FAQ. ISBN 3-900051-08-9. http://www.CRAN.R-project.org/doc/FAQ/
Insightful Corporation (2005) S-PLUS 7. http://www.insightful.com/products/splus/
James D, DebRoy S (2005) RMySQL
Lang DT (2005a) RDCOMClient: R-DCOM client. R package version 0.91-0. http://www.omegahat.org/RDCOMClient, http://www.omegahat.org, http://www.omegahat.org/bugs
Lang DT (2005b) RDCOMServer: R-DCOM object server. R package version 0.6-0. http://www.omegahat.org/RDCOMServer, http://www.omegahat.org, http://www.omegahat.org/bugs
Lang DT (2005c) XML: Tools for parsing and generating XML within R and S-Plus. R package version 0.99-1. http://www.omegahat.org/RSXML
Lapsley M, Ripley BD (2005) RODBC: ODBC database access. R package version 1.1-4
Martelli A (2003) Python in a Nutshell. O’Reilly Media, Inc. ISBN 0596001886
McCullough BD, Wilson B (2002) On the accuracy of statistical procedures in Microsoft Excel 2000 and Excel XP. Comput Stat Data Anal 40:713–721
McNab E, Swart RE, Hinks P, Horn D, Jansen A, Jewell D, Wako W, Winning C (1996) The Revolutionary Guide to Delphi 2. Peer Information Inc. ISBN 1874416672
Microsoft Corporation (2001a) Common language runtime. In: ‘MSDN Library’, vol. .NET Framework SDK, Microsoft Corporation. http://www.msdn.microsoft.com/
Microsoft Corporation (2001b) Microsoft office 2000/visual basic programmer’s guide. In: ‘MSDN Library’, vol. Office 2000 Documentation, Microsoft Corporation. http://www.msdn.microsoft.com/
Microsoft Corporation (2001c) Visual basic. In: ‘MSDN Library’, vol. Visual Studio 6.0 Documentation, Microsoft Corporation. http://msdn.microsoft.com/
Microsoft Corporation & Digital Equipment Corporation (1995) The component object model specification, Technical Report 0.9, Microsoft Corporation (Draft)
Mono Project (2006) The Mono Project. http://www.mono-project.com/
Nardi BA (1993) A Small Matter of Programming. MIT Press, Boston. ISBN 0-262-14053-5. http://mitpress.mit.edu/catalog/item/default.asp?ttype=2&tid=6799
Neuwirth E, Arganbright D (2003) Mathematical Modeling with Microsoft Excel. Thomson-Brooks/Cole. ISBN 0-534-42085-0. http://www.brookscole.com/cgi-wadsworth/course_products_wp.pl?fid=M2b&product_isbn_issn=0534420850&discipline_number=1
Object Management Group I (2002) Common object request broker architecture: Core specification, Technical report, Object Management Group, Inc. 3.0
OpenOffice.org (2006) OpenOffice. http://www.openoffice.org/
R-core members, DebRoy S, Bivand R, others: see COPYRIGHTS file in the sources (2005) foreign: Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase. R package version 0.8-10
R Development Core Team (2005a) An introduction to R. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-12-7
R Development Core Team (2005b) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-07-0. http://www.R-project.org
R Development Core Team (2005c) R Data Import/Export. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-10-0
R Development Core Team (2005d) R installation and administration. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-09-7
R Development Core Team (2005e) R language definition. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-13-5
R Development Core Team (2005f) Writing R extensions. R Foundation for Statistical Computing, Vienna. ISBN 3-900051-11-9
Stallman RM (2005) Using and porting GCC, 2.95 edn. Free Software Foundation. http://gcc.gnu.org/
Wall L, Christiansen T, Schwartz R (1996) Programming perl. O’Reilly & Associates. ISBN 1-56592-149-6