
European Journal of Operational Research 181 (2007) 425–435 www.elsevier.com/locate/ejor

Interfaces with Other Disciplines

Second stage DEA: Comparison of approaches for modelling the DEA score

Ayoe Hoff

Food and Resource Economics Institute, Fisheries Economics and Management Division, Rolighedsvej 25, 1958 Frederiksberg C, Denmark

Received 22 December 2003; accepted 8 May 2006
Available online 7 July 2006

Abstract

Tobit regression is often encountered in second stage data envelopment analysis (DEA), i.e. when the relationship between exogenous factors (non-physical inputs) and DEA efficiency scores is assessed. It is however not obvious that tobit is the only, or optimal, approach to modelling DEA scores. This paper presents two alternative approaches to second stage DEA, the results of which are compared with the tobit approach through a case study for the Danish fishery. Furthermore the three models are compared to OLS regression, representing a linear approximation to the models. It is firstly concluded that the tobit approach will in most cases be sufficient in representing second stage DEA models. Secondly it is shown that OLS may actually in many cases replace tobit as a sufficient second stage DEA model.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Data envelopment analysis; Second stage DEA; Tobit regression; Non-linear regression

1. Introduction

Economic theory offers numerous procedures for evaluating efficiency in industries. Within fisheries, data envelopment analysis (DEA) is gaining increased attention. DEA is a non-parametric method for measuring the relative efficiencies of individual decision making units (DMUs) within a group, given a set of produced outputs and effort variables (inputs). DEA has several advantages. Firstly, it is possible to include multiple outputs in DEA, as opposed to stochastic production frontier (SPF) analysis, for which only one output can be considered. This is advantageous in fisheries, as the catch often consists of several different species. Secondly, DEA is non-parametric, meaning that no functional form has to be assumed for the relationship between inputs and outputs, or for the distribution of efficiency scores, again as opposed to SPF, where these assumptions must be made.

In analyses of individual vessel efficiencies in a given fishery, inputs can be divided into ‘endogenous’ and ‘exogenous’ inputs. The former are physical inputs used directly to land the observed catch, such as fishing time, number of crew members and engine size. The latter are non-physical inputs that may nevertheless affect the catch and thus the efficiency of the vessel, such as vessel type, fishing ground and year. Exogenous factors can be

0377-2217/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.ejor.2006.05.019


continuous (time), categorical (education level of skipper) or classificatory (fishing ground, vessel type). Several methods exist for including all three types of exogenous factors directly in the DEA evaluation (Cooper et al., 2000; Coelli et al., 1999; Fried et al., 1999; Grosskopf, 1996), but especially in the case of categorical and classificatory factors it is an often-encountered practice to employ so-called second stage analyses, where the efficiency scores obtained by DEA are subsequently modelled against the exogenous variables.

The most often encountered approach to modelling the DEA scores against exogenous variables is tobit regression, which is suitable when the dependent variable is either censored or a corner solution outcome; DEA scores fall within the second category. A corner solution variable is continuous, limited from above or below or both, and takes on the value of one or both of the boundaries with a positive probability. As DEA efficiency scores are continuous on the interval ]0; 1], and take on the value 1 with positive probability, it seems obvious to use a two-limit tobit technique for modelling the scores as a function of the exogenous variables. Tobit has as such been adopted as the natural ‘choice’ for modelling DEA scores in second stage evaluations. The two-limit tobit technique is however mis-specified when applied to DEA scores, given that these only take on the value 1 with positive probability (and not the opposite limiting value 0). Even so, tobit yields sensible results in second-stage DEA, and it has as such never been questioned whether tobit is actually the most appropriate method with regard to the predictability of scores and the effects of the exogenous variables. The aim of this paper is to compare four different approaches to modelling DEA scores against exogenous influences, in order to investigate whether models other than tobit yield better predictions of the DEA scores.
DEA scores are limited to the interval ]0; 1], and the model used to reproduce the scores must thus also be limited to this interval, and accordingly non-linear. An introductory approach may however be to use an ordinary least squares (OLS) linear regression of the scores against the exogenous variables, as this represents a first order Taylor approximation to the more complex non-linear models. The OLS model will clearly predict scores outside the interval ]0; 1] but if the effects (the regression parameters) predicted by this model do not differ significantly from the effects predicted from non-linear

models, OLS is adequate for modelling these effects. The results of tobit modelling of the DEA scores will thus be compared with the corresponding OLS results, and furthermore with two alternative non-linear approaches, following Papke and Wooldridge (1996) and Cook et al. (submitted for publication).

The paper begins with a short description of data envelopment analysis, followed by a literature review of models that can be considered for predicting the conditional expectation of DEA scores given a number of exogenous variables. Then the two-limit tobit technique is presented, together with the two alternative non-linear models for second stage DEA considered in the present context. The paper concludes with an empirical example, covering Danish liners and gillnetters between 12 and 15 m operating in 2002, in which the different models are compared and discussed.

2. Data envelopment analysis

Data envelopment analysis (DEA) determines the efficiencies (or correspondingly capacities) of individual decision making units (DMUs) within a group relative to the other DMUs in the group. The most efficient DMUs constitute the efficient frontier of the group, relative to which the efficiencies of the remaining DMUs are measured. The frontier is non-parametric, i.e. no functional form needs to be specified, in contrast to stochastic production frontiers (SPF). The DEA method allows multiple outputs and inputs. Inputs may be variable or fixed, where the values of the variable inputs are allowed to change in the short run (in fisheries variable inputs are e.g. fishing days and number of crew members) while the values of the fixed inputs are only allowed to change in the long run (fixed inputs are e.g. vessel horsepower and tonnage).
The DEA method can be input or output orientated, of which the former determines the minimum input for which the observed production of the ith DMU is possible, while the latter determines the maximum output of the ith DMU given the observed inputs. The output-orientated approach will in the following be used to illustrate the DEA method (for more details refer to Cooper et al., 2000; Coelli et al., 1999). The basic assumption of output orientated DEA is that the output vector of the ith DMU is expanded radially until the combination of inputs and outputs for the DMU reaches the frontier of the production possibility set P for the group of observations. P is defined as the smallest convex set comprising (i) all


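observed input and output combinations and (ii) all possible convex combinations of these. Thus for a group of N DMUs that each have K outputs, M_1 fixed inputs and M_2 variable inputs, the maximum output of the ith DMU is determined by the following linear programming problem¹:

$$
\begin{aligned}
\max_{\lambda,\mu}\quad & \mu_i \\
\text{s.t.}\quad & \mu_i\, y_{k,i} \le \sum_{j=1}^{N} \lambda_{j,i}\, y_{k,j} && \forall k \in \{1,\ldots,K\}, \\
& x_{f,i} \ge \sum_{j=1}^{N} \lambda_{j,i}\, x_{f,j} && \forall f \in \{1,\ldots,M_1\}, \\
& \alpha_{v,i}\, x_{v,i} = \sum_{j=1}^{N} \lambda_{j,i}\, x_{v,j} && \forall v \in \{1,\ldots,M_2\}, \\
& 1 = \sum_{j=1}^{N} \lambda_{j,i}, \\
& \lambda,\ \alpha \ge 0.
\end{aligned}
\tag{1}
$$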

$y_{k,i}$ is the kth output of the ith DMU, $x_{f,i}$ denotes the fth fixed input of the ith DMU and $x_{v,i}$ denotes the vth variable input of the ith DMU. The factor $\alpha_{v,i}$ is the expansion factor for the vth variable input of the ith DMU. The unit sum of the DEA weights $\lambda$ ensures variable returns to scale. Finally the factor $\mu_i$ ($\ge 1$) is the maximum factor by which the output of the ith DMU can be increased, given the fixed inputs, i.e. the capacity of the DMU. The efficiency score of the ith DMU is defined as

$$
\theta_i = \frac{1}{\mu_i}. \tag{2}
$$
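The linear programme (1) and the score definition (2) can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the paper: the function name `dea_output_scores` and the three-vessel data set are hypothetical, and SciPy's `linprog` is assumed as the solver. The variable-input constraints are left out because the expansion factors are free non-negative variables, so (assuming strictly positive variable inputs) they never restrict the optimal expansion factor.

```python
import numpy as np
from scipy.optimize import linprog


def dea_output_scores(Y, X_fixed):
    """Output-orientated VRS scores 1/mu_i for every DMU.

    Y        : (N, K) array of outputs (hypothetical data layout).
    X_fixed  : (N, M1) array of fixed inputs.
    """
    N, K = Y.shape
    M1 = X_fixed.shape[1]
    scores = np.empty(N)
    for i in range(N):
        # Decision vector z = (mu, lambda_1, ..., lambda_N).
        # linprog minimises, so maximising mu means minimising -mu.
        c = np.zeros(N + 1)
        c[0] = -1.0
        # Output rows: mu * y_ki - sum_j lambda_j * y_kj <= 0.
        A_out = np.hstack([Y[i].reshape(K, 1), -Y.T])
        # Fixed-input rows: sum_j lambda_j * x_fj <= x_fi.
        A_in = np.hstack([np.zeros((M1, 1)), X_fixed.T])
        A_ub = np.vstack([A_out, A_in])
        b_ub = np.concatenate([np.zeros(K), X_fixed[i]])
        # Weights sum to one (variable returns to scale).
        A_eq = np.hstack([[0.0], np.ones(N)]).reshape(1, N + 1)
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                      bounds=[(0, None)] * (N + 1), method="highs")
        scores[i] = 1.0 / res.x[0]
    return scores


# Hypothetical example: one output, one fixed input, three DMUs.
Y_demo = np.array([[1.0], [3.0], [1.5]])
X_demo = np.array([[1.0], [2.0], [2.0]])
print(dea_output_scores(Y_demo, X_demo))
```

In this toy data set the first two DMUs lie on the frontier and obtain a score of 1, while the third could double its output with the same fixed input and therefore obtains a score of 0.5.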

Since $\mu_i \ge 1$, definition (2) implies that the efficiency score is by construction limited to the interval ]0; 1].

¹ Note that only weak efficiency is addressed in the present paper, i.e. input and output slacks are not evaluated. This is not a problem in the present context as the only aim is to assess the efficiency scores.

3. Modelling data with limited dependent variables

As shown above, DEA efficiency scores are limited to the interval ]0; 1]. Second stage DEA seeks to relate such efficiency scores for a given group of DMUs to a number of exogenous variables believed to influence the level of efficiency. A common approach to second stage DEA is to employ two-limit tobit regression to estimate this relationship, as DEA scores resemble corner solution variables (cf. Wooldridge, 2002). Tobit has been employed by a number of authors, among these Fethi et al. (2002), Vestergaard et al. (2002), Chilingerian (1995), Oum and Yu (1994), Bjurek et al. (1992) and Ruggiero and Vitaliano (1999).

It is important to notice (i) that although often applied, the two-limit tobit model is actually mis-specified when modelling DEA scores, as these only take on the one limiting value 1 with a positive probability, while the probability of obtaining the limiting value 0 is zero, and (ii) that when applied, the tobit regression parameters do not directly give the effect of the explanatory variables on the DEA scores, an often-neglected fact. Regarding (i) it could be asked why a one-limit tobit model, with upper limit equal to 1, has not been employed instead of the two-limit model. The one-limit model would however also be a mis-specification, as it implicitly assumes that the dependent variable is continuous on ]−∞; 1], whereas DEA scores are limited from below by 0. Regarding (ii) it will be shown below that the effects are on the contrary also determined by the values of the explanatory variables (cf. Maddala, 1986).

Several authors have however addressed the question of how to examine the relationship between continuous variables limited between 0 and 1 and selected exogenous variables, using other methods than tobit. Kieschnick and McCullough (2003) give an excellent discussion of this subject for the case where the dependent variable is in the open interval ]0; 1[ (i.e. when the probability of observing the end-points of the interval is zero), and continue by presenting and comparing several approaches met in the literature for estimating the expectation function for limited dependent variables.
Papke and Wooldridge (1996) extend the discussion to include the case where the dependent variable may be found in the closed interval [0; 1], and present a non-linear quasi-likelihood model with robust parameter variance estimates that are independent of the distribution of the dependent variable. Their approach thus avoids the mean and variance specification error, but assumes that the dependent variable is equally distributed over the entire closed interval. Finally Cook et al. (2000) develop the so-called zero-inflated beta model, which is non-linear and allows the probability of being at the interval ends to be different from the probability of being inside the interval, i.e. a flexible probability distribution. In the present context the Papke–Wooldridge approach and a revised version of the zero-inflated



beta model, the ‘unit-inflated beta model’, will be used to model DEA scores as a function of exogenous variables. Each model is described in detail below. Although mis-specified, as mentioned above, the two-limit tobit model is also included, as it has been the model most often used in second-stage DEA. For comparison the scores are also modelled using OLS regression, representing a first-order Taylor approximation to the non-linear models. The author of this paper takes full responsibility for the revised version of the beta model.
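To make the quasi-likelihood idea concrete, the following sketch fits a fractional response model by maximising a Bernoulli quasi-log-likelihood with a logistic mean function, which is one of the specifications proposed by Papke and Wooldridge (1996). It is a minimal illustration under stated assumptions: the function names, the plain gradient-ascent optimiser, the step size and the toy scores are all hypothetical, and the robust variance estimates mentioned above are not computed.

```python
import math


def logistic(z):
    """Logistic mean function G(z), mapping the real line into ]0; 1[."""
    return 1.0 / (1.0 + math.exp(-z))


def quasi_loglik(beta, X, y):
    """Bernoulli quasi-log-likelihood; y may be any value in [0; 1]."""
    ll = 0.0
    for x_i, y_i in zip(X, y):
        g = logistic(sum(b * x for b, x in zip(beta, x_i)))
        ll += y_i * math.log(g) + (1.0 - y_i) * math.log(1.0 - g)
    return ll


def fit_fractional_logit(X, y, steps=5000, lr=0.5):
    """Maximise the quasi-log-likelihood by simple gradient ascent."""
    beta = [0.0] * len(X[0])
    n = len(y)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x_i, y_i in zip(X, y):
            g = logistic(sum(b * x for b, x in zip(beta, x_i)))
            for k, x in enumerate(x_i):
                # Score contribution of observation i: (y_i - G(x_i'beta)) * x_ik.
                grad[k] += (y_i - g) * x
        beta = [b + lr * gk / n for b, gk in zip(beta, grad)]
    return beta


# Hypothetical scores with an intercept-only design; note the unit score.
X_demo = [[1.0], [1.0], [1.0], [1.0]]
y_demo = [0.6, 0.8, 0.8, 1.0]
beta_hat = fit_fractional_logit(X_demo, y_demo)
print(beta_hat, logistic(beta_hat[0]))
```

With an intercept only, the fitted mean reproduces the sample mean of the scores, and a score exactly equal to 1 causes no difficulty, which is the property that makes this type of model a candidate for DEA scores.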

4. Tobit regression

Tobit regression is an alternative to ordinary least squares (OLS) regression and is employed when the dependent variable is bounded from below or above or both, with positive probability pile-up at the interval ends, either by being censored or by being a corner solution (Wooldridge, 2002). In the former case (censored), observations outside the limiting interval are recorded as the border values. That is, if the range is given by the interval [a; b], an observed y < a is recorded as y = a, and likewise an observed y > b is recorded as y = b. In the latter case (corner solutions) the observations are by nature limited from below or above or both, with a positive probability at the ‘corners’ (interval ends).

DEA scores are limited to the interval ]0; 1] and accordingly only have a positive probability of attaining one of the two corner values. As such the two-limit tobit model, which is generally used to model censored or corner solution data limited both from below and above, is necessarily a mis-specification when applied to DEA scores, as this method requires a positive probability of attaining both corner values. It will however be shown below that the parameters of the model can still be identified despite the lack of observations at 0, and that the model still provides sensible estimates of the DEA scores. As two-limit tobit has been used regularly in modelling of DEA scores, it is thus still included in the comparison of possible models of DEA scores.

When the dependent variable y is limited to the interval [0; 1] it may be described by the model:

$$
y_i^* = \sum_{k=1}^{n} \beta_k x_k + u_i, \qquad
y_i = \frac{1 + \operatorname{sign}(y_i^*)}{2}\,\min(1, y_i^*), \qquad
\operatorname{sign}(y_i^*) =
\begin{cases}
1, & y_i^* \ge 0, \\
-1, & y_i^* < 0,
\end{cases}
\tag{3}
$$

where $y_i^*$ is the latent (unrecorded) variable, $u_i \sim N(0, \sigma)$ are independent and identically normally distributed (iid.) residuals of the observations and $x = (x_1, \ldots, x_n)$ is the vector of explanatory variables. Thus

$$
y_i =
\begin{cases}
0, & y_i^* < 0, \\
y_i^*, & 0 \le y_i^* \le 1, \\
1, & 1 < y_i^*.
\end{cases}
\tag{4}
$$

The probability that a recorded y is equal to 0 is given by

$$
P(y = 0) = F\!\left(-\sum \beta_k x_k \,\Big|\, 0, \sigma\right)
= \int_{-\infty}^{-\sum \beta_k x_k} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-t^2/(2\sigma^2)}\, dt
\tag{5}
$$

given the basic iid. assumption for the residuals u. $F(x \mid \mu, \sigma)$ is the $N(\mu, \sigma)$ cumulative distribution function. Likewise:

$$
P(y = 1) = F\!\left(-\Big(1 - \sum \beta_k x_k\Big) \,\Big|\, 0, \sigma\right)
= \int_{-\infty}^{-\left(1 - \sum \beta_k x_k\right)} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-t^2/(2\sigma^2)}\, dt.
\tag{6}
$$

Finally, when the recorded y is between 0 and 1 the probability of observing y is given by

$$
P(y_i \mid 0 < y_i < 1) = f\!\left(y_i - \sum \beta_k x_k \,\Big|\, 0, \sigma\right)
= \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\left(y_i - \sum \beta_k x_k\right)^2/(2\sigma^2)},
\tag{7}
$$

where $f(x \mid \mu, \sigma)$ is the $N(\mu, \sigma)$ density (frequency) function. Thus the combined likelihood function for the recorded censored dataset is given by

$$
L = \prod_{y_i = 0} P(y_i = 0) \prod_{y_i = 1} P(y_i = 1) \prod_{0 < y_i < 1} P(y_i \mid 0 < y_i < 1).
$$
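As a rough illustration of how the three probabilities (5), (6) and (7) combine into a likelihood, the sketch below evaluates the two-limit tobit log-likelihood for given parameter values using only the Python standard library. The function names and the small data set are hypothetical, the normal distribution function is written via the error function, and no optimisation over the parameters is attempted.

```python
import math


def norm_cdf(z, sigma):
    """Cumulative distribution function of N(0, sigma) evaluated at z."""
    return 0.5 * (1.0 + math.erf(z / (sigma * math.sqrt(2.0))))


def norm_pdf(z, sigma):
    """Density (frequency) function of N(0, sigma) evaluated at z."""
    return math.exp(-z * z / (2.0 * sigma ** 2)) / math.sqrt(2.0 * math.pi * sigma ** 2)


def tobit_loglik(beta, sigma, X, y):
    """Log of the combined two-limit tobit likelihood on [0; 1]."""
    ll = 0.0
    for x_i, y_i in zip(X, y):
        xb = sum(b * x for b, x in zip(beta, x_i))
        if y_i <= 0.0:
            p = norm_cdf(-xb, sigma)            # probability mass at 0, cf. (5)
        elif y_i >= 1.0:
            p = norm_cdf(-(1.0 - xb), sigma)    # probability mass at 1, cf. (6)
        else:
            p = norm_pdf(y_i - xb, sigma)       # interior density, cf. (7)
        ll += math.log(p)
    return ll


# Hypothetical data: one regressor, one score at the upper limit, one interior.
X_demo = [[1.0], [0.5]]
y_demo = [1.0, 0.5]
print(tobit_loglik([1.0], 1.0, X_demo, y_demo))
```

For DEA scores the branch for y equal to 0 is never used, since no score attains that value; the likelihood then consists only of the unit-score and interior contributions.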