International Regional Science Review, Vol. 10, No. 2, pp. 127-147, 1986
Matrix Comparison, Goodness-of-Fit, and Spatial Interaction Modeling
Daniel C. Knudsen
Department of Geography, Indiana University, Bloomington, Indiana 47405 USA
A. Stewart Fotheringham
Department of Geography, University of Florida, Gainesville, Florida 32511 USA

ABSTRACT
The usefulness of various statistics for comparing observed and predicted spatial interaction matrices is examined. Results indicate that some statistics may yield misleading information about error levels in predicted matrices. Other statistics are found to be unsuitable for significance testing. The concept of experimental distributions is discussed for several of the statistics. Although framed in the context of spatial interaction modeling, the discussion is relevant to most matrix comparison problems.
The second author would like to acknowledge the support of a grant (number SES8208339) from the National Science Foundation.

I. INTRODUCTION
An important component of model building is the assessment of a model's ability to replicate a known data set. This procedure aids in validating the theory on which the model is predicated. Stated generally, model evaluation consists of measuring the accuracy with which a set of predicted scores, $\hat{X} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_n]$, replicates a set of known scores, $X = [x_1, x_2, \ldots, x_n]$. Many "goodness-of-fit" statistics have been used for this purpose: all of them involve a quantitative description of some aspect of the difference between $X$ and $\hat{X}$. This paper examines goodness-of-fit statistics that can be used to evaluate the performance of aggregate spatial interaction models, although the discussion is relevant to other matrix comparison problems (see, for example, the discussion of goodness-of-fit in input-output modeling by Butterfield and Mules [1980] and in discrete choice modeling by Stopher [1975]).

Goodness-of-fit statistics serve two purposes. The first concerns either the examination of the accuracy with which two or more models replicate a known data set or the examination of the accuracy with which one model replicates two or more known data sets. Both of these uses are prevalent in the spatial interaction modeling literature. In either case, the relationship between error and the value of the goodness-of-fit measure employed must be known in order to draw conclusions regarding model performance. Often, this relationship is unknown but is assumed to be linear. Thus, if a statistic for which increasing values indicate increasing inaccuracy is employed and if the value of the statistic for model 1 is twice that of model 2, it is often assumed or inferred that the replication of the data by model 2 is twice as accurate as that by model 1.

The second purpose served by goodness-of-fit statistics concerns hypothesis testing and the determination of whether the difference between actual and predicted flow matrices is statistically significant. This demands knowledge of the sampling distribution of the statistic(s) used. Even when sampling distributions of goodness-of-fit statistics are known, however, goodness-of-fit tests may produce undesirable results. For example, a standard goodness-of-fit test employing a chi-square statistic might reject a null hypothesis that $X = \hat{X}$ (and, hence, we might conclude that the model is unsuitable) due to some trivial departure of the model from the data when the sample size is sufficiently large. Alternatively, if the sample size is small, this same test may be unable to reject the null hypothesis even when the departure of the model estimates from the observed data is quite large. This characteristic of standard significance testing produces a number of practical problems in model evaluation (Openshaw 1979).

Given these initial observations on the use of goodness-of-fit statistics, it might be expected that such a topic has been subjected to intensive investigation and that a standard set of goodness-of-fit measures has been established. Unfortunately, relatively few surveys or systematic examinations of goodness-of-fit statistics exist.
Black and Salter (1975) compare the performance of the correlation coefficient and the chi-square statistic, but they are primarily concerned with the behavior of alternative distance functions. Wilson (1976) criticizes the use of chi-square, the correlation coefficient, and the root mean square error, preferring likelihood measures instead, which require weaker distributional assumptions and are intuitively based in current spatial interaction model calibration techniques. Southworth (1977) reiterates Wilson's arguments and proposes the use of an index based on log-likelihoods. Smith and Hutchinson (1981) undertake a relatively comprehensive study of several widely used goodness-of-fit statistics, but they explicitly avoid the problem of significance testing. They also fail to examine the sensitivity of the statistics to variations in data magnitudes. The lack of investigation into the suitability of the numerous goodness-of-fit statistics available in spatial interaction modeling has led to a less-than-rigorous use of such statistics. There has been widespread use, for example, of ad hoc statistics with unknown sampling distributions and of statistics whose sensitivity to error is
unknown; there has also been little consistency in applications among different researchers (cf. Hathaway 1975; Thomas 1977; Southworth 1983; and Fotheringham and Williams 1983), which has hampered the comparison of results across studies. This paper provides further information on goodness-of-fit statistics. Particular emphasis is placed on identifying statistics that facilitate comparisons across data sets and/or models and on identifying the conditions under which standard significance tests may produce undesirable results given certain assumptions about the nature of error. This is accomplished by assessing the performance of alternative goodness-of-fit statistics and their associated tests with reference to experimentally generated levels of error (cf. Smith and Hutchinson 1981). For simplicity, only a percentage multiplicative error is considered because of its relevance to spatial interaction modeling, but the analysis can easily be extended to other forms of error (a discussion of additive error, for example, is given by Fotheringham and Knudsen 1985). Because of the limited nature of the analysis, the results must be viewed as suggestive rather than as definitive. We consider, in turn, representative goodness-of-fit statistics, their properties, and the properties of their associated significance tests.

II. A SURVEY OF GOODNESS-OF-FIT STATISTICS IN MODELING SPATIAL INTERACTION
A review of the spatial interaction literature reveals numerous statistics that have been used to assess model goodness-of-fit. These statistics can be classified into three types: information-based statistics, general distance statistics, and traditional statistics. Representative statistics within each of these groups are now identified.

INFORMATION-BASED STATISTICS
In addition to the classic formulation by Kullback and Leibler (1951), four other information-based statistics are discussed in this section; these include the phi statistic, the psi statistic, the absolute value formulation of the psi statistic, and the measure of absolute entropy difference. Information-based statistics have their origin in Kullback and Leibler's information gain statistic,
$$I(P:Q) = \sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \ln \frac{p_{ij}}{q_{ij}}, \tag{1}$$

where $m$ and $n$ are matrix dimensions, and $p_{ij}$ and $q_{ij}$ are elements of a posterior discrete probability distribution, $P$, and a prior discrete probability distribution, $Q$, respectively. It is customary to define $p_{ij} = t_{ij} / \sum_{i=1}^{m} \sum_{j=1}^{n} t_{ij}$ and $q_{ij} = \hat{t}_{ij} / \sum_{i=1}^{m} \sum_{j=1}^{n} \hat{t}_{ij}$, where $t_{ij}$ is the observed flow between $i$ and $j$ and $\hat{t}_{ij}$ is the estimated flow. (Henceforth, all summations over $i$ will be assumed to range from 1 to $m$ and summations over $j$ from 1 to $n$.) The information gain statistic has a minimum at zero when $P = Q$ and a maximum at positive infinity when $p_{ij} > 0$ and $q_{ij} = 0$ for any $i, j$ pair. The statistic is transitive in that,
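$$I(P:Q) = \sum_i \Big[ \sum_j p_{ij} \ln \frac{p_{ij}}{q_{ij}} \Big] = \sum_j \Big[ \sum_i p_{ij} \ln \frac{p_{ij}}{q_{ij}} \Big], \tag{2}$$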
which allows the statistic to be used to assess goodness-of-fit for individual rows and columns within a matrix (Thomas 1977). Smith and Hutchinson (1981) note that information gain and related statistics are useful in emphasizing errors in large flows. The significance of an information gain statistic can be tested through its relationship to the minimum discrimination information statistic, MDI,
$$\mathrm{MDI} = 2 \cdot N \cdot I(P:Q), \quad \text{where } N = \sum_i \sum_j t_{ij}, \tag{3}$$

which is asymptotically chi-square distributed
(Bishop, Feinberg, and Holland 1975). Previous studies have shown that three qualifications must be placed on the use of information gain as a goodness-of-fit statistic. One is that the value of the statistic depends on which matrix is defined as $P$ and which is defined as $Q$, since $I(P:Q) \neq I(Q:P)$ unless $P = Q$ (Tribus and Rossi 1973; Bishop, Feinberg, and Holland 1975; Knudsen 1982). A second qualification is that when $p_{ij} > 0$ and $q_{ij} = 0$ for corresponding elements in $P$ and $Q$, infinite values of information gain are produced. This problem can be alleviated by replacing zero values in $Q$ with some non-zero value, but this procedure is somewhat arbitrary. The third qualification is that, as Smith and Hutchinson (1981) note, information gain is sensitive to the distribution of over and under predictions, which give negative and positive values of $p_{ij} \ln(p_{ij}/q_{ij})$, respectively. The phi statistic (Smith and Hutchinson 1981) is defined as:

$$\Phi = \sum_i \sum_j p_{ij} \left| \ln \frac{p_{ij}}{q_{ij}} \right|, \tag{4}$$
and, like information gain, has limits of zero and positive infinity and is transitive (Smith and Hutchinson 1981). The qualifications pertinent to information gain apply here as well, except that phi is not sensitive to the distribution of over and under predictions due to the use of absolute values. The phi statistic has no known theoretical distribution, but its variance is estimated below by "bootstrap" methods (Diaconis and Efron 1983).¹

¹ Bootstrap methods are computer-intensive experimental methods that can be used to simulate distributions by calculating a value of a statistic for each of a large number of randomly generated predicted matrices and observing the proportion of values that are inferior to the value of the statistic computed for the actual predicted matrix.
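As a minimal sketch of equations (1) and (4), the following Python computes information gain and phi for a pair of small flow matrices. The matrices, the function names, and the convention that cells with $p_{ij} = 0$ contribute nothing are illustrative assumptions, not taken from the paper.

```python
import math

def normalize(flows):
    """Scale a flow matrix (list of rows) so that its cells sum to one."""
    total = sum(sum(row) for row in flows)
    return [[cell / total for cell in row] for row in flows]

def _log_ratio_sum(p, q, absolute):
    """Sum p_ij * ln(p_ij / q_ij) over all cells, optionally using |ln(.)|."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        for p_ij, q_ij in zip(p_row, q_row):
            if p_ij == 0.0:
                continue            # assumed convention: such cells add nothing
            if q_ij == 0.0:
                return math.inf     # p_ij > 0 with q_ij = 0 gives an infinite value
            log_ratio = math.log(p_ij / q_ij)
            total += p_ij * (abs(log_ratio) if absolute else log_ratio)
    return total

def information_gain(p, q):
    """Equation (1): I(P:Q)."""
    return _log_ratio_sum(p, q, absolute=False)

def phi(p, q):
    """Equation (4): the phi statistic."""
    return _log_ratio_sum(p, q, absolute=True)

observed = [[10, 20], [30, 40]]    # hypothetical observed flows
predicted = [[12, 18], [33, 37]]   # hypothetical predicted flows
P, Q = normalize(observed), normalize(predicted)

print(information_gain(P, Q))      # differs from the next line: I(P:Q) != I(Q:P)
print(information_gain(Q, P))
print(phi(P, Q))                   # never smaller than I(P:Q)
```

Swapping the arguments changes information gain, and a single zero in $Q$ opposite a positive cell in $P$ makes both statistics infinite, which are the first two qualifications noted in the text.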
The psi statistic (Kullback 1959) was introduced into the spatial interaction modeling literature by Ayeni (1982 and 1983). It is defined as,

$$\Psi = \sum_i \sum_j p_{ij} \ln \frac{p_{ij}}{s_{ij}} + \sum_i \sum_j q_{ij} \ln \frac{q_{ij}}{s_{ij}}, \tag{5}$$

where $s_{ij} = (p_{ij} + q_{ij})/2$. The psi statistic is transitive and has a lower limit of zero when $P = Q$ and an upper limit of $m \cdot n \ln 2$ when the non-zero elements of $P$ correspond to the zero elements of $Q$, and vice versa. (The values $m$ and $n$ refer to the dimensions of $P$.) Unlike information gain and phi, psi is insensitive to the designation of $P$ and $Q$, and it does not suffer from the necessity of substituting some arbitrary non-zero value when $q_{ij} = 0$. However, an equivalent substitution would be necessary in the more unlikely event that both $p_{ij}$ and $q_{ij}$ (and hence $s_{ij}$) were zero. Psi is also sensitive to the distribution of over and under predictions, but this property can be eliminated by taking absolute values:
$$|\Psi| = \sum_i \sum_j p_{ij} \left| \ln \frac{p_{ij}}{s_{ij}} \right| + \sum_i \sum_j q_{ij} \left| \ln \frac{q_{ij}}{s_{ij}} \right|. \tag{6}$$
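The psi statistic and its absolute value formulation can be sketched in the same way. The function name, the `absolute` flag, and the choice to skip terms whose leading probability is zero are assumptions made for illustration.

```python
import math

def psi(p, q, absolute=False):
    """Psi statistic; with absolute=True, the absolute value formulation."""
    total = 0.0
    for p_row, q_row in zip(p, q):
        for p_ij, q_ij in zip(p_row, q_row):
            s_ij = (p_ij + q_ij) / 2.0
            for x in (p_ij, q_ij):
                if x > 0.0:                      # s_ij > 0 whenever x > 0
                    term = x * math.log(x / s_ij)
                    total += abs(term) if absolute else term
    return total

P = [[0.10, 0.20], [0.30, 0.40]]   # hypothetical observed probabilities
Q = [[0.12, 0.18], [0.33, 0.37]]   # hypothetical predicted probabilities

print(psi(P, Q), psi(Q, P))        # equal up to rounding: psi is symmetric
print(psi(P, Q, absolute=True))    # never smaller than psi(P, Q)
print(psi([[1.0, 0.0]], [[0.0, 1.0]]))   # finite even though Q contains a zero
```

Because each logarithm is taken against the average $s_{ij}$, a zero in only one of the two matrices never produces an infinite value, which is the practical advantage over information gain and phi described above.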
Although significance levels for psi may be established because of its relationship to MDI, the absolute value formulation, $|\Psi|$, has no known theoretical distribution. As with phi, its sample variance is derived using bootstrap methods. The final information-based statistic considered here is the absolute entropy difference (AED), defined as the absolute value of the difference in the entropies of the observed and predicted probability values,
$$\mathrm{AED} = |H_p - H_q|. \tag{7}$$

$H$ denotes Shannon's entropy measure so that, for example, $H_p = -\sum_i \sum_j p_{ij} \ln p_{ij}$. The lower limit of AED is zero when $H_p = H_q$, and the upper limit is $\ln(m \times n)$ when $H_p = 0$ and $H_q = \ln(m \times n)$, or vice versa. The significance levels of AED can be taken from a t statistic under the assumption that the entropy values are the means of normal or near-normal distributions (Hutcheson 1970). If this assumption is invalid, experimental procedures may be employed to estimate a distribution for hypothesis testing.

GENERAL DISTANCE STATISTICS
General distance statistics are characterized by functions of $t_{ij} - \hat{t}_{ij}$, where $t_{ij}$ is an element of the matrix of observed flows, $T$, and $\hat{t}_{ij}$ is an element of the matrix of predicted flows, $\hat{T}$. The differences are either squared or made absolute to avoid summing positive and negative differences. Standardized root mean square error (SRMSE)
was chosen to represent the group of general distance statistics. It is defined by Pitfield (1978) as,

$$\mathrm{SRMSE} = \frac{\left\{ \sum_i \sum_j (t_{ij} - \hat{t}_{ij})^2 / (m \times n) \right\}^{1/2}}{\sum_i \sum_j t_{ij} / (m \times n)}. \tag{8}$$
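A short sketch of equation (8) follows. The function name is an assumption, and the two matrices reproduce the 2 x 2 example from the paper's footnote on the SRMSE upper limit, with $x = 5$ chosen arbitrarily.

```python
import math

def srmse(observed, predicted):
    """Root mean square error divided by the mean observed flow, as in equation (8)."""
    cells = len(observed) * len(observed[0])
    squared_error = sum(
        (t - t_hat) ** 2
        for t_row, p_row in zip(observed, predicted)
        for t, t_hat in zip(t_row, p_row)
    )
    mean_flow = sum(sum(row) for row in observed) / cells
    return math.sqrt(squared_error / cells) / mean_flow

x = 5.0                          # any positive value gives the same SRMSE
a = [[x, 0.0], [0.0, 0.0]]       # all flow in cell (1, 1)
b = [[0.0, x], [0.0, 0.0]]       # all flow in cell (1, 2)

print(round(srmse(a, b), 3))     # 2.828, i.e. above the usual upper limit of 1.0
```

Dividing by the mean observed flow is what makes the statistic comparable between systems with different total interaction volumes.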
This statistic has a lower limit of zero, indicating perfectly accurate predictions, and an upper limit that is variable and depends on the distribution of the $t_{ij}$'s, although in practice it is often 1.0.² SRMSE is preferred by some researchers to the root mean square error (Black 1973; Openshaw and Connolly 1977) since the latter is not standardized by mean flow and hence is not comparable between spatial systems. The unstandardized statistic is also sensitive to large deviations of trip frequencies from the mean (Wilson 1976; Southworth 1977; Pitfield 1978). SRMSE should only be used when $\sum_i \sum_j t_{ij} = \sum_i \sum_j \hat{t}_{ij}$ ... modeling. The formula for SRMSE in (8) suggests that the statistic may be asymptotically distributed as a weighted sum of chi-squares, although such a link has not, to the authors' knowledge, been demonstrated theoretically. Consequently, we derive a frequency distribution for SRMSE using bootstrap methods. Other general distance statistics, of which SRMSE is representative, include the index of dissimilarity (Duncan and Duncan 1955), mean percentage error (Hathaway 1975), and the total absolute deviation as a percentage of the total interaction volume (Fotheringham and Williams 1985).

TRADITIONAL STATISTICS
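The two most commonly employed goodness-of-fit statistics are $R^2$ (inter alia, Kariel 1968; Lewis 1975; Clark and Ballard 1980; and Fotheringham 1983) and chi-square (inter alia, Black and Salter 1975; Hathaway 1975; Openshaw 1976; Baxter and Ewing 1979). $R^2$ is defined as:

$$R^2 = \left\{ \frac{\sum_i \sum_j (t_{ij} - \bar{t})(\hat{t}_{ij} - \bar{\hat{t}})}{\left[ \sum_i \sum_j (t_{ij} - \bar{t})^2 \sum_i \sum_j (\hat{t}_{ij} - \bar{\hat{t}})^2 \right]^{1/2}} \right\}^2, \tag{9}$$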
where $\bar{t}$ represents the mean of the $t_{ij}$'s and $\bar{\hat{t}}$ represents the mean of the $\hat{t}_{ij}$'s. $R^2$ ranges between zero and one. Zero indicates no correspondence between $T$ and $\hat{T}$; one indicates perfect correspondence. Several authors have noted that $R^2$ is relatively insensitive to variations in model specification (inter alia, Black and Salter 1975; Wilson 1976). Smith and Hutchinson (1981) also provide evidence that $R^2$ may yield artificially high values in goodness-of-fit applications. They report values as high as 0.70 even when $T$ and $\hat{T}$ differ by 100 percent. $R^2$ is also an imperfect statistic to evaluate model performance across different data sets since, being a function of the variance of the observed data, its value is sample specific. Even variations in measurement error between data sets can result in different $R^2$ values for the same estimated model.

² While the upper limit of the SRMSE is normally assumed to be 1.0, values greater than 1.0 arise whenever the average error is greater than the mean. For example, consider two 2 x 2 matrices, A and B, that have the following elements: $a_{11} = x$, $a_{12} = a_{21} = a_{22} = 0$; and $b_{11} = b_{21} = b_{22} = 0$ and $b_{12} = x$. For any value of $x$, SRMSE = 2.828. However, such a situation is unlikely to occur within a modeling context since the "worst fit" case occurs when all parameter estimates are set to zero and the mean flow is used as the prediction for every actual flow.
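To make concrete why $R^2$ can stay high despite large errors, the sketch below computes the squared correlation of equation (9) for a hypothetical prediction that doubles every observed flow. The doubling example and the function name are illustrative assumptions and do not reproduce the error scheme of Smith and Hutchinson (1981).

```python
def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted flows."""
    t = [cell for row in observed for cell in row]
    t_hat = [cell for row in predicted for cell in row]
    count = len(t)
    mean_t = sum(t) / count
    mean_t_hat = sum(t_hat) / count
    cross = sum((a - mean_t) * (b - mean_t_hat) for a, b in zip(t, t_hat))
    var_t = sum((a - mean_t) ** 2 for a in t)
    var_t_hat = sum((b - mean_t_hat) ** 2 for b in t_hat)
    return cross * cross / (var_t * var_t_hat)

observed = [[10, 20], [30, 40]]                        # hypothetical flows
doubled = [[2 * cell for cell in row] for row in observed]

print(r_squared(observed, doubled))    # 1.0, although every flow is overpredicted by 100 percent
```

Any prediction that is a positive linear function of the observed flows yields a perfect $R^2$, so the statistic measures association rather than agreement in magnitude.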
We derive the variance of $R^2$ using bootstrap methods. The bootstrap approach is used because the distribution of $R^2$ is compressed to a single point as $T \to \hat{T}$ and standardized tables for the associated correlation coefficient provide only for a test under the null hypothesis that the population correlation coefficient is equal to zero; this would be an extremely weak goodness-of-fit test. The chi-square statistic,

$$\chi^2 = \sum_i \sum_j \frac{(t_{ij} - \hat{t}_{ij})^2}{\hat{t}_{ij}}, \tag{10}$$
has a lower limit of zero when $T = \hat{T}$ and an upper limit which tends to positive infinity as $\hat{t}_{ij}$ tends to zero. The statistic is sensitive to underpredictions in small flows due to the division by $\hat{t}_{ij}$, and it is very sensitive to flow magnitudes. The statistic also may be of limited use for sparse interaction matrices, since it is generally accepted that an aggregation of sparse elements should be undertaken prior to the calculation of chi-square (Wilson 1976; Pitfield 1978). Wilson (1976) also notes the confusion between the use of chi-square as a goodness-of-fit statistic and its traditional use as a measure of association. The use of chi-square as a goodness-of-fit statistic has been encouraged because its distribution is known (for example, Hathaway 1975; Snickers and Weibull 1977; Baxter and Ewing 1981). However, while the calculated value of chi-square is sensitive to flow magnitudes, the critical value is not. Hence, the definition of the $t_{ij}$'s can determine whether or not the null hypothesis is rejected.

III. SENSITIVITY ANALYSIS
In this section we examine the sensitivity of the representative goodness-of-fit statistics identified above to variations in error, which is defined by differences between $P$ and $Q$ or between $T$ and $\hat{T}$.³

³ Although the analysis in this section is similar to that given in Smith and Hutchinson (1981), we extend their analysis by investigating different goodness-of-fit statistics (those outlined in the previous section). In particular, we investigate information gain, $\Psi$, $|\Psi|$, and AED; these are not analyzed by Smith and Hutchinson.
Emphasis is placed on identifying measures of goodness-of-fit that are suitable for comparing the performance of different models in a single spatial system or for comparing the same model across different spatial systems. An "ideal" goodness-of-fit statistic in this respect would be one for which the relationship between the value of the statistic and the level of error is linear. This property implies that the statistic is equally responsive at all error levels and that metric properties can be applied to the statistic. For instance, if the performance of a spatial interaction model in two spatial systems, A and B, yields a value of the statistic that is twice as large in A as in B, it can be concluded that the model performs twice as accurately in B as in A. Conclusions of this kind cannot be made unless the relationship between values of the statistic and error levels is known to be linear or approximately linear. The ideal goodness-of-fit statistic should also be insensitive to variations in data magnitudes, but among the statistics investigated only chi-square does not meet this requirement (see below). The sensitivity of each of the representative goodness-of-fit statistics identified in Section II to variations in error levels was obtained through simulation. Using a 30 x 30 matrix of Poisson distributed flows, normalized to sum to one, values of each statistic were obtained for predicted flow matrices with error levels of 1 percent, 5 percent, and 10 through 100 percent in increments of ten. Error levels were introduced into the predicted matrices by,
$$q_{ij} = p_{ij} + \delta (p_{ij} \cdot \mathrm{RND} \cdot \mathrm{FACT}), \tag{11}$$
where $q_{ij}$ is an element of the predicted flow matrix, $p_{ij}$ is an element of the actual flow matrix, $\delta$ randomly takes the value of plus or minus one, RND is a random number between zero and one, and FACT is the percentage error divided by 100 (Smith and Hutchinson 1981). The resulting elements of $Q$ were then normalized to sum to one. By this construction, it is highly unlikely that any of the elements of $Q$ will be zero, because $q_{ij}$ can only be zero when $\delta$ is minus one, RND is one, and FACT corresponds to an error level of 100 percent. For those statistics in which it is necessary to compare flows rather than probabilities (e.g., chi-square), both the observed and predicted flow matrices were multiplied by a scalar so that the sum of the elements of each matrix corresponded to a particular level of total interaction volume. A value for each goodness-of-fit statistic was recorded at each error level. Since response surfaces derived in this manner may be, to some extent, artifacts of the random number seeds used in generating the predicted matrices, the entire procedure was repeated five times and the results averaged. In fact, the generated response surfaces appeared to be relatively insensitive to the choice of seed. The sensitivity of each of the representative goodness-of-fit statistics to variations in error levels is given in Figure 1, where the horizontal axis represents error levels and the vertical axis represents
[Figure 1: Error Sensitivities of Eight Goodness-of-Fit Statistics. Horizontal axis: introduced error (%). Vertical axis: relative value of each statistic, 0.0 to 1.0. Legend: information gain, phi, psi, absolute entropy difference, absolute psi, SRMSE, 1 - R-square, chi-square.]
the relative value of each goodness-of-fit statistic. Relative value is determined for each statistic on a scale ranging from zero (representing the simulated minimum value of the statistic) to one (representing the simulated maximum). Note that the values of $R^2$ have been given as $1 - R^2$, since the maximum value of $R^2$ corresponds to perfect fit. Also, the values of chi-square presented have been normalized to account for flow magnitude effects, which are negligible for the other statistics. Since the relationship between the value of an ideal goodness-of-fit statistic and error would be described by a ray connecting the origin to the point (100, 1.0), the ranking of the statistics in order of their usefulness for comparative purposes appears to be: SRMSE, $|\Psi|$, $\Phi$, AED, $R^2$, $\Psi$, information gain, and chi-square. This ranking suggests that generalized distance statistics such as SRMSE are superior comparative measures, which is not unexpected since the error levels are defined in terms of an average percentage error. The divergence of some of the statistics from the ideal is unexpected. Chi-square, information gain, $\Psi$, $R^2$, and AED have marked nonlinear relationships with error levels, and the assumption of a linear relationship for these statistics could lead to particularly misleading conclusions about model performance. In practical terms, this means that one gains little insight into comparative model performance by simply considering, for example, two chi-square values. As already mentioned, chi-square is also a particularly poor goodness-of-fit statistic for comparing model performance in different systems, since it is sensitive to the magnitude of the data, as is shown in Figure 2. The other traditional goodness-of-fit statistic, $R^2$, is logistically related to error levels. As error levels increase, the value of $R^2$ decreases quickly at low error levels, but above approximately a 40 percent error the decline in the value of the statistic slows.
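The simulation design described in this section can be sketched as follows, tracing SRMSE only. The Poisson mean `LAM`, the use of seeds 0 to 4, the helper names, and the choice to redraw the observed matrix for each seed are all assumptions; the paper does not report these details.

```python
import math
import random

def poisson_draw(rng, lam):
    """One Poisson(lam) draw by Knuth's multiplication method."""
    limit = math.exp(-lam)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def normalize(matrix):
    """Scale a matrix so that its cells sum to one."""
    total = sum(sum(row) for row in matrix)
    return [[cell / total for cell in row] for row in matrix]

def perturb(p, fact, rng):
    """Equation (11): q_ij = p_ij + delta * (p_ij * RND * FACT), then renormalize."""
    q = [[p_ij + rng.choice((-1.0, 1.0)) * p_ij * rng.random() * fact
          for p_ij in row] for row in p]
    return normalize(q)

def srmse(p, q):
    """Standardized root mean square error between two matrices."""
    cells = len(p) * len(p[0])
    squared = sum((a - b) ** 2
                  for p_row, q_row in zip(p, q) for a, b in zip(p_row, q_row))
    mean = sum(sum(row) for row in p) / cells
    return math.sqrt(squared / cells) / mean

SIZE, LAM, REPEATS = 30, 10.0, 5               # LAM is an assumed Poisson mean
LEVELS = [1, 5] + list(range(10, 101, 10))     # error levels in percent

results = {}
for level in LEVELS:
    values = []
    for seed in range(REPEATS):                # five repetitions, then averaged
        rng = random.Random(seed)
        flows = [[poisson_draw(rng, LAM) for _ in range(SIZE)] for _ in range(SIZE)]
        p = normalize(flows)
        values.append(srmse(p, perturb(p, level / 100.0, rng)))
    results[level] = sum(values) / REPEATS

for level in LEVELS:
    print(level, round(results[level], 4))
```

Rescaling each column of `results` to run from zero to one would give the relative values plotted against introduced error; the other seven statistics can be traced by substituting their functions for `srmse`.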
The use of $R^2$ to evaluate different models is thus not a particularly accurate procedure. Further, the lower bound of the statistic appears to depend on the distribution of the observed flows (cf. Wilson 1976). For example, when the observed distribution of flows was generated from a uniform, rather than a Poisson, distribution, the statistic's lower bound was approximately 0.46 at the 100 percent error level (cf. Smith and Hutchinson 1981). With a Poisson distribution, the value of the statistic approaches zero at the 100 percent error level. McLafferty (1982) notes a similar problem with the correlation coefficient in a nonspatial interaction modeling context.

IV. SIGNIFICANCE TESTING
The predicted flow matrix, $Q$ or $\hat{T}$, has a sampling distribution since it is the result of a calibrated model containing measurement errors and uncertainty regarding the true parameter values. It is thus useful to determine whether the observed differences in the actual
and predicted matrices are statistically significant. Hypotheses regarding the similarity of the two matrices can then be tested. When conducting significance tests it is standard procedure to set up a null hypothesis of no significant difference and to reject it if the calculated statistic exceeds a critical value associated with some confidence level, often 95 percent ($\alpha = 0.05$). Given a research hypothesis that significant differences actually exist, such a procedure usefully minimizes the chance of making a Type I error, that is, rejecting the null hypothesis when it is true. However, this procedure is poorly suited for assessing the accuracy of a model since, in minimizing Type I error, the procedure is extremely susceptible to Type II error, whereby it might be concluded falsely that an inaccurate model is accurate. Goodness-of-fit significance testing differs from conventional parameter testing in that we must strike a more even balance between Type I ($\alpha$) and Type II ($\beta$) error (Blalock 1972; Thomas 1979). In practice, it is important to strike the right balance between Type I and Type II error. For example, committing a Type I error in a planning context implies needless reformulation of public policy, often accompanied by additional and costly surveys, and committing a Type II error results in public policies that do not achieve desired goals, thereby wasting public funds. In identifying useful test statistics for model evaluation, it is also important to consider possible complications introduced by alternative definitions of $t_{ij}$. For example, the same set of commodity flows may be measured in kilograms, tons, or rail carloads; the same set of shopping expenditures may be measured in dollars or hundreds of dollars; the same set of migration flows may be measured in individual units or family units. The subjective definition of $t_{ij}$ makes the definition of sample size subjective.
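The effect of redefining units can be shown with a small sketch using the chi-square statistic of equation (10). The flows and the tons-to-kilograms conversion are hypothetical values chosen for illustration.

```python
import math

def chi_square(observed, predicted):
    """Sum over cells of (t_ij - t_hat_ij)^2 / t_hat_ij, as in equation (10)."""
    return sum(
        (t - t_hat) ** 2 / t_hat
        for t_row, p_row in zip(observed, predicted)
        for t, t_hat in zip(t_row, p_row)
    )

def rescale(matrix, factor):
    """Express the same flows in different units."""
    return [[cell * factor for cell in row] for row in matrix]

observed_tons = [[12.0, 30.0], [45.0, 13.0]]     # hypothetical commodity flows
predicted_tons = [[15.0, 25.0], [40.0, 20.0]]    # hypothetical model estimates

chi_tons = chi_square(observed_tons, predicted_tons)
chi_kg = chi_square(rescale(observed_tons, 1000.0), rescale(predicted_tons, 1000.0))

print(chi_tons, chi_kg)    # the second value is 1000 times the first
```

Multiplying every flow by a constant multiplies the calculated chi-square by the same constant, while the matrix dimensions, and therefore the degrees of freedom and the critical value, are unchanged.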
Consequently, the use of significance tests in which the calculated value is insensitive to sample size, but the critical value is sensitive to sample size, or vice versa, leads to situations in which the significance of a set of model estimates may be altered by a simple redefinition of units. With these potential problems in mind, the results of significance tests on the eight goodness-of-fit statistics identified in Section II are presented. Results using standard significance tests are given first, followed by those derived using experimental methods.

THEORETICAL SIGNIFICANCE TESTS
Four of the eight statistics identified in Section II, information gain, psi, AED, and chi-square, have known theoretical frequency distributions from which critical values can be obtained. Information gain, psi, and chi-square are all either chi-square or asymptotically chi-square distributed and behave similarly. Since we are only concerned with the general characteristics of each test, we arbitrarily choose psi to represent the behavior of this group. The test on AED utilizes a Student's t test of the differences between two means.
In order to compare the sensitivity of the various significance tests, it is again necessary to define a priori what a "correct" conclusion is when comparing actual and predicted matrices. The definition must be subjective, but mindful of the uses of spatial interaction modeling. With this in mind, we adopt the following rules: (1) a spatial interaction model should not be retained when error in the estimated matrix exceeds 50 percent; and (2) a spatial interaction model should not be rejected when error is less than 10 percent. In all other instances, the decision from the significance testing procedure is accepted. The significance of the psi statistic can be assessed through its relationship to the Minimum Discrimination Information (MDI) statistic as given in (3). MDI is asymptotically chi-square distributed with degrees of freedom equal to the number of $i, j$ pairs minus the number of independent constraints (Southworth 1977; Phillips 1981; Ayeni 1982). It is assumed throughout this section that the predicted flow matrix has been obtained from a doubly constrained interaction model so that, for the chi-square distribution, the degrees of freedom are $(m - 1)(n - 1)$. Values of the MDI statistic at particular error levels and total interaction volumes are given in Table 1. These values were derived from simulations similar to those described in Section III. In this instance, however, total interaction volume varied between 1000 and 5 million; error varied between 1 percent and 100 percent as before. In the table, the solid lines represent critical values of the MDI number at $\alpha = 0.10$ and $\alpha = 0.50$; the dashed lines represent values at a 10 percent error (rule [2]) and at a 50 percent error (rule [1]). Figure 3 depicts the sensitivity of the MDI statistic (and hence of all statistics having a chi-square distribution) to Type I and Type II error. Sensitivity to potential Type II error is clearly demonstrated in region D.
When the total interaction volume is low, less than 1,000 for example, the statistic fails to reject the null hypothesis that there is no difference between the actual and predicted flow matrices even at error levels of 100 percent. Ayeni (1982), assessing model goodness-of-fit with journey-to-work patterns of 918 households in Lagos, Nigeria, obtains a value of the MDI statistic equal to 636.84. In retaining the null hypothesis (at $\alpha = 0.01$) that there is no significant difference between the actual and predicted matrices, Ayeni is clearly ignoring potential Type II error problems. For an MDI value larger than 600 based on less than 1,000 interactions, the error of the estimated flow matrix would have to be well in excess of 100 percent in order to conclude that the matrices are significantly different. In Region B can be found combinations of error levels and flow magnitudes for which the use of the MDI statistic potentially leads to Type I error, whereby a null hypothesis that two matrices are not significantly different is rejected at error levels of less than 10 percent.
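The dependence of the MDI test on total interaction volume can be illustrated with a rough sketch. The information gain value of 0.01 is hypothetical, and the critical value uses the Wilson-Hilferty approximation to the chi-square quantile, which is a substitute chosen here to avoid external libraries and is not the authors' procedure.

```python
import math
from statistics import NormalDist

def chi_square_critical(df, alpha):
    """Wilson-Hilferty approximation to the upper-alpha chi-square quantile."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

def mdi(info_gain, total_volume):
    """Equation (3): MDI = 2 * N * I(P:Q)."""
    return 2.0 * total_volume * info_gain

m = n = 30                                   # matrix dimensions used in Section III
df = (m - 1) * (n - 1)                       # doubly constrained model
critical = chi_square_critical(df, alpha=0.05)

info_gain = 0.01                             # hypothetical value of I(P:Q), held fixed
for volume in (1_000, 10_000, 100_000, 1_000_000):
    value = mdi(info_gain, volume)
    decision = "reject" if value > critical else "retain"
    print(volume, round(value, 1), decision)
```

With the error structure held fixed, the calculated MDI grows in proportion to $N$ while the critical value does not move, so the same predicted matrix is retained at low volumes and rejected at high ones.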
V. CONCLUSIONS
Eight representative goodness-of-fit statistics have been evaluated with respect to error sensitivity and hypothesis testing. For analyzing the performance of two or more models in replicating the same data set, or for comparing a single model in different systems, the most accurate statistics appear to be SRMSE, $|\Psi|$, and phi. The chi-square statistic is particularly poor for these purposes. If the object of the analysis is to assess the accuracy of a model in replicating a single data set, significance tests need to be undertaken. It appears that testing significance with chi-square distributed statistics is prone to Type I and Type II errors and should be avoided. Use of the t statistic to test the significance of entropy differences is a more accurate procedure. In terms of the experimental distributions, those of SRMSE, phi, and $|\Psi|$ are all similarly accurate in reflecting the differences between two matrices, but the results are data specific and no general conclusions can be reached. The hypothesis testing results obtained for $R^2$ are unsatisfactory. In general, the use of experimental significance tests for any statistic of unknown distribution is recommended. The procedure is relatively free of potential Type I and Type II error problems; it is insensitive to sample size; and it is far superior to simply using statistics as indices. The procedure can also provide a useful check on statistics whose theoretical distributions appear overly sensitive to Type I and Type II error problems, or when asymptotic distributions are utilized. Although the major focus of this study has been on the statistics used to assess model goodness-of-fit, some initial conclusions can be made about the actual performance of spatial interaction models. It appears that spatial interaction models generally perform less accurately than has previously been thought. Thomas (1977), for example, reports values of information gain between 0.30 and 0.35 that are indicative of error levels over 100 percent.
Fotheringham and Williams (1985) utilize absolute entropy difference (AED) to compare the performance of several interaction models on four data sets. Their reported values for AED are between 0.0455 and 0.2305, and are indicative of errors in the 50 to 100 percent range. Knudsen (1982) reports values of phi between 0.369 and 1.270, which correspond to error levels from 70 percent to well over 100 percent. These results suggest that the search for improved spatial interaction models should continue.

References

Ayeni, B. 1982. The testing of hypotheses on interaction data matrices. Geographical Analysis 14: 79-84.
Ayeni, B. 1983. Algorithm 11: Information statistics for comparing predicted and observed trip matrices. Environment and Planning A 15: 1259-66.
Baxter, M. J. and Ewing, G. O. 1979. Calibration of production-constrained trip distribution models and the effects of intervening opportunities. Journal of Regional Science 19: 319-30.
Baxter, M. J. and Ewing, G. O. 1981. Models of recreational trip distribution. Regional Studies 15: 327-44.
Bishop, Y. M. M., Fienberg, S. E., and Holland, P. W. 1975. Discrete multivariate analysis: theory and practice. Cambridge: MIT Press.
Black, J. A. and Salter, R. T. 1975. A statistical evaluation of the accuracy of a family of gravity models. Proceedings, Institution of Civil Engineers, Part 2 59: 1-20.
Black, W. R. 1973. An analysis of gravity model distance exponents. Transportation 2: 299-312.
Blalock, Jr., H. M. 1972. Social Statistics. New York: McGraw-Hill.
Butterfield, M. and Mules, T. 1980. A testing routine for evaluating cell by cell accuracy in short-cut regional input-output tables. Journal of Regional Science 20: 293-310.
Clark, G. and Ballard, K. 1980. Modeling out-migration from depressed regions: the significance of origin and destination characteristics. Environment and Planning A 12: 799-812.
Costanzo, M. 1983. Statistical inference in geography: modern approaches spell better times ahead. The Professional Geographer 35: 158-65.
Diaconis, P. and Efron, B. 1983. Computer-intensive methods in statistics. Scientific American 5: 116-30.
Duncan, O. D. and Duncan, B. 1955. A methodological analysis of segregation indices. American Sociological Review 20: 210-17.
Fotheringham, A. S. 1983. A new set of spatial interaction models: the theory of competing destinations. Environment and Planning A 15: 15-36.
Fotheringham, A. S. and Knudsen, D. C. 1985. Goodness-of-fit statistics in geographic research. University of Florida: Department of Geography, manuscript.
Fotheringham, A. S. and Williams, P. A. 1983. Further discussion on the Poisson interaction model. Geographical Analysis 15: 343-47.
Fotheringham, A. S. and Williams, P. A. 1985. Destination choice and spatial competition in an urban hierarchy. University of Florida: Department of Geography, manuscript.
Garrison, C. B. and Paulson, A. S. 1973. An entropy measure of the geographic concentration of economic activity. Economic Geography 49: 319-24.
Hathaway, P. J. 1975. Trip distribution and disaggregation. Environment and Planning A 7: 71-97.
Hubert, L. J. and Golledge, R. G. 1981. A heuristic method for the comparison of related structures. Journal of Mathematical Psychology 23: 214-26.
Hubert, L. J., Golledge, R. G., and Costanzo, C. F. 1981. Generalized procedures for evaluating spatial autocorrelation. Geographical Analysis 13: 224-33.
Hutcheson, K. 1970. A test for comparing diversities based on the Shannon formula. Journal of Theoretical Biology 29: 151-54.
Kariel, H. G. 1968. Student enrollment and spatial interaction. Annals of Regional Science 2: 114-27.
Knudsen, D. C. 1982. An information theoretic approach for inferring individual cells of an interaction matrix from a knowledge of marginal totals. Indiana University: Department of Geography, manuscript.
Kullback, S. 1959. Information theory and statistics. New York: John Wiley and Sons.
Kullback, S. and Leibler, R. A. 1951. On information and sufficiency. Annals of Mathematical Statistics 22: 78-86.
Lewis, D. E. 1975. An empirical test of alternative theories of trade. Annals of Regional Science 9: 102-11.
McLafferty, S. 1982. Urban structure and geographical access to public services. Annals of the Association of American Geographers 72: 347-54.
Openshaw, S. 1976. An empirical study of some spatial interaction models. Environment and Planning A 8: 23-41.
Openshaw, S. 1979. A methodology for using models for planning purposes. Environment and Planning A 11: 879-96.
Openshaw, S. and Connolly, C. J. 1977. Empirically derived deterrence functions for maximum performance spatial interaction models. Environment and Planning A 9: 1067-80.
Phillips, F. Y. 1981. A guide to MDI statistics for planning and management model building. Austin, TX: Institute for Constructive Capitalism.
Pitfield, D. E. 1978. Sub-optimality in freight distribution. Transportation Research 12: 403-9.
Shannon, C. E. 1948. The mathematical theory of communication. Bell System Technical Journal 27: 379-423; 623-56.
Sheppard, E. S. 1976. Entropy, theory construction and spatial analysis. Environment and Planning A 7: 279-91.
Smith, D. P. and Hutchinson, B. G. 1981. Goodness-of-fit statistics for trip distribution models. Transportation Research 15A: 295-303.
Snickers, F. and Weibull, J. W. 1977. A minimum information principle: theory and practice. Regional Science and Urban Economics 7: 137-68.
Southworth, F. 1977. Problems and approaches to goodness-of-fit testing for trip distribution models. Leeds: Alastair Dick and Associates (RHTM Project).
Southworth, F. 1983. Temporal versus other effects on spatial interaction model parameter values. Regional Studies 17: 41-48.
Stopher, P. R. 1975. Goodness-of-fit measures for probabilistic travel demand models. Transportation 4: 67-83.
Thomas, R. W. 1977. An interpretation of the journey-to-work on Merseyside using entropy-maximizing methods. Environment and Planning A 9: 817-34.
Thomas, R. W. 1979. An introduction to quadrat analysis. Norwich: Geo Abstracts Ltd.
Tribus, M. and Rossi, R. 1973. On the Kullback information measure as a basis for information theory: comments on a proposal by Hobson and Cheng. Journal of Statistical Physics 9: 331-38.
Walsh, J. A. and Webber, M. J. 1977. Information theory: some concepts and measures. Environment and Planning A 9: 395-417.
Wilson, S. R. 1976. Statistical notes on the evaluation of calibrated gravity models. Transportation Research 10: 343-45.