OPTIMAL CONFIGURATION AND INFORMATION CONTENT OF SETS OF FREQUENCY DISTRIBUTIONS¹

WILLIAM E. FULL², ROBERT EHRLICH, AND STEPHEN K. KENNEDY

Department of Geology, University of South Carolina, Columbia, South Carolina 29208

ABSTRACT: Grain-size analysis increasingly consists of multivariate comparisons between samples involving use of class-frequencies. This approach is dictated by the realization that size-frequency distributions are spectra more akin to X-ray diffractograms than simple random phenomena. With the assumption that samples consist of mixtures of subdistributions comes the problem of the most efficient way to compare and contrast size-frequency data in order to enhance differences between samples without forcing contrasts that do not exist. Two problems exist with respect to doing this: 1) determination of the optimal number of class-intervals and 2) determination of class-interval widths. The first problem is unsolved, but this paper explains a way to determine class-interval widths (once the number of intervals is chosen) to maximize information content. Applying the basic concepts of information theory, a procedure is presented which evaluates the relative information content of a set of frequency data when subdivided in various ways. Maximum information is always preserved when "maximum entropy" spectra (unequal class intervals) are used. Evaluation of several schemes of histogram subdivision (phi-based arithmetic, log arithmetic, Z-score, maximum entropy) indicates, not surprisingly, that in some instances equal-interval, phi-based histograms contain the least information.

INTRODUCTION

One of the more frustrating aspects of sedimentological research concerns proper analysis of data in the form of frequency distributions. The frustration arises from the fact that the frequency distributions of the two basic sediment properties, size and shape, do not consistently conform to one or another "well-behaved" distribution (normal, lognormal, hypergeometric). If they did, they could be efficiently summarized by estimation of a few parameters. The normal distribution, for example, has only two parameters: the mean and the standard deviation. That is, knowing the values of the mean and standard deviation, the parent normal distribution can be regenerated exactly. Data analysis is then simplified in that the two parameters per sample can be used in place of hundreds of individual observations (or many class-interval frequencies). Data transformation is a time-honored way

Manuscript received 4 March 1983; revised 3 October 1983.
² Presently at Department of Geology, Wichita State University, Wichita, Kansas 67208.

to mold an empirical frequency distribution into a simpler configuration that can be represented by moments. Krumbein's (1934, 1936) introduction of the phi transformation was an attempt to transform size distributions to a normal configuration, thus permitting use of the mean and variance in place of the raw data itself. Krumbein was aware that the attempt was not entirely successful; that is, cumulative frequency curves often did not plot as straight lines on normal probability paper. However, it was felt that the normal curve could serve as a convenient benchmark or datum from which to measure sedimentary size properties. Because the phi transformation did not bring true lognormality in its wake, various stratagems were advanced to capture the essence of the sample distributions. All entailed the addition of extra variables: in some cases higher moments, in other cases points defining the ends of straight-line segments into which the cumulative frequency curve could be subdivided (Doeglas 1968; Visher 1969; Middleton 1976). As we lose faith in the simplicity of an underlying frequency distribution, we require progressively more data from

JOURNAL OF SEDIMENTARY PETROLOGY, VOL. 54, NO. 1, MARCH, 1984, P. 0117-0126
Copyright © 1984, The Society of Economic Paleontologists and Mineralogists 0022-4472/84/0054-0117/$03.00


each sample. The logical end to this is to use all the data (in the form of class frequencies or individual grain diameters or volumes) in a sample rather than parameters for evaluation of the data. There has been a distinct trend in this direction for the past 20 years in sedimentology (Doeglas 1968; Dowling 1977; Taira and Scholle 1979). The end product of this historical process is that we are threatened to be buried not only with large amounts of data (several variables per sample times the total number of samples) but also with a wide variety of sets of variables (Taira and Scholle (1979) used 65 variables) to describe each sample. This has partly arisen because of a growing consensus (Visher 1969; Middleton 1976) that size-frequency distributions are not simple probability distributions in a strict statistical sense but are composite, containing two or more "subpopulations." Controversy seems to exist over the nature of these subpopulations (Visher 1969; Middleton 1976; Kennedy et al. 1981) but not over their existence. If size-frequency distributions are composed of subpopulations, then such distributions are more accurately termed spectra (like X-ray diffractograms), where the location and intensity of a peak (subpopulation) is of interest (Dowling 1977). If size-frequency distributions are in fact spectra, then data transformations designed to "simplify" them may actually destroy important information. The equivalent of a phi transformation on X-ray diffractograms would be enhancement of peaks at low two-theta angles and loss of resolution of peaks at high values.

Spectra are not always cast as histograms. Conventional histograms are area plots: the proportion of a sample in a class-interval is represented by an area rather than by height measured on a vertical axis. This preserves the shape of the distribution.
When spectra are not viewed as histograms, class intervals are commonly termed bins or channels and are accompanied by terms such as bin width or channel width. The variable in the bin is reported as counts, which may later be converted to histogram form or may be used directly. The number of counts per bin generally decreases as bin width decreases; at some point the counts become so few that a signal may be indistinguishable from noise.

OPTIMAL CONFIGURATION OF SPECTRA FOR INTERSAMPLE COMPARISON

Generally, analysis of grain size or grain shape entails measurement and evaluation of a set of samples. That is, information is gained not through assessment of the characteristics of a single sample but from the patterns of similarities and differences between samples. Accordingly, many research objectives concern the comparison of size-frequency (or shape-frequency) spectra between samples. Whether or not the real contrasts between samples can be detected is tied to the configuration of the spectra, that is, the number of class-intervals and their widths. Extreme situations range from the case where all the observations in all samples fall into one and the same interval (that is, intervals too wide) to the case where the data are vanishingly sparse in all intervals (that is, intervals too narrow and numerous). Thus it can be seen that the widths of class-intervals affect the quality of information that we can glean from data cast in this way. Examples of the effect of the width of the intervals can be seen in Folk (1966), Jaquet and Vernet (1976), Swan et al. (1979), and Kennedy et al. (1981). A constant interval width might not be effective because some parts of a frequency distribution may have large frequencies (modes) and other portions may be relatively bare. Where class frequencies are low, the effect of random perturbations is strongly felt, whereas in high class-frequency portions of a spectrum the effect of these random perturbations is insignificant. Such high class-frequency portions of spectra might be profitably subdivided in order to test for the presence of polymodality in these intervals. If, therefore, one compares a set of samples, each in the form of a frequency histogram with equal-width class intervals, certain intervals may have low frequencies for all samples and so not contribute significantly to the comparison, whereas in other intervals information may be lost because high-frequency intervals are too broad.
One could remedy this by creating class intervals of unequal width. For example, percentile-based size measures (Inman 1952; Folk and Ward 1957; Mason and Folk 1958) calculated from cumulative frequency curves implicitly assume that variable interval widths are desirable in size analysis.
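Percentile-defined boundaries of the kind implied here can be sketched in code. This is our illustration, not a procedure from the paper; the grain diameters and the function name are hypothetical. Boundaries fall at chosen quantiles of the observations, so each class holds roughly equal frequency:

```python
# Unequal-width class intervals from percentiles of the observations.
# Hypothetical grain diameters in millimeters.
diameters = [0.08, 0.10, 0.11, 0.12, 0.15, 0.18, 0.20, 0.22,
             0.25, 0.30, 0.33, 0.40, 0.45, 0.60, 0.75, 0.90]

def percentile_boundaries(values, n_intervals):
    """Interior interval boundaries placing ~equal counts in each class."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[round(k * n / n_intervals) - 1]
            for k in range(1, n_intervals)]

# Four classes: narrow where data are dense, wide where they are sparse.
print(percentile_boundaries(diameters, 4))  # → [0.12, 0.22, 0.4]
```

The resulting intervals are narrow in the heavily populated fine fraction and wide in the sparse coarse tail, which is exactly the behavior argued for above.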


The objective of this paper is to demonstrate that such unequal-width class intervals can maximize the amount of information that can be extracted from frequency distributions. These conclusions draw heavily on the fundamental relations of information theory, which were originally derived for analysis of electronic signal propagation and detection. Based on these fundamental relations, a method of defining interval boundaries is presented. This new "maximum entropy method" defines interval boundaries based on the aggregate properties of the entire sample set and involves no assumption as to the nature of the distribution. Examples from size and shape analyses will be used to illustrate the usefulness of such a method to increase the quality of any analysis (discrimination, clustering, unmixing) performed on data cast in such a manner. But before this is done, a brief overview of entropy will be given.

THE CONCEPT OF ENTROPY

The concept of entropy has generally been shrouded in mystery, due partly to the fact that the term entropy is used in many disciplines with seemingly different connotations. There is a "chemical entropy," a "thermal entropy," and a "probabilistic entropy," to name a few. Shannon (1948), however, described a measure called "information entropy," which has been shown to be the primitive concept which could explain the other entropy measures (Jaynes 1978). Henceforth, the word entropy will refer to the information entropy as described by Shannon. Shannon defined entropy (E) by the following formula:

E = -\sum_{i=1}^{n} P_i \log_a P_i,

where P_i is the probability of an event occurring in interval i, a is any base of logarithm (base e in this paper), and n is the number of intervals. The probability of an event occurring in interval i is calculated by dividing the frequency within that interval by the total frequency count. This unitless form of entropy has been applied in geology to histograms and cumulative curves by Sharp and Fan (1963) in order to define a sorting index for size analysis. Later, Sharp (1973) applied information entropy to define a measure of parity between the mean and standard deviation of a distribution and how such information is expressed by these values. Sharp's earlier studies on entropy represent some of the first applications of entropy in sedimentary geology.

FIG. 1.--Examples of frequency spectra and resulting entropies (E).

Entropy, as applied to frequency plots, is a measure of the contrast between intervals. Low entropy values represent frequency plots with large differences between intervals (Fig. 1a), and high entropies characterize frequency plots with relatively slight contrast between intervals (Fig. 1b, c). Frequency plots exhibiting no contrast between intervals (Fig. 1c) are in a maximum entropy configuration. For a spectrum expressed over K intervals, maximum entropy is equal to the negative logarithm of the reciprocal of K. Inasmuch as single frequency plots (for example, samples) are rarely used in sedimentological analysis, such as unmixing, discrimination, or clustering, other properties of entropy can be used to measure the information contained within a system of frequency plots. A system of frequency plots in this case may, for instance, represent a set of samples from which size-frequency data have been obtained. One such property is that entropy values are additive; the total entropy contained within a system of frequency plots is equal


to the sum of the entropies of individual frequency plots. The total entropy of a set of frequency plots is a measure of the total amount of useful information contained within that system. The entropy of the total system is a direct measure of the potential of the frequency plots to provide reasonable solutions when further analysis (such as unmixing, discrimination, or clustering) is performed on that set of data. Entropy, as a measure of information contained within a particular set of data, can be used to measure the efficiency of several schemes for determining class intervals. The importance of such an evaluation is that some commonly used methods of defining the interval boundaries may be better than others in displaying differences or similarities between samples and hence can drastically affect the efficacy of further analysis performed on that data. Included in this comparison is a new approach that maximizes the entropy within a particular set of frequency plots. The approach that produces such plots is presented in the next section.

MAXIMUM ENTROPY METHOD

FIG. 2.--Example of two theoretical sample sets alike when pooled but expressing a) similarity between samples and b) differences between samples when the same interval definition is applied to the individual samples (relative frequency is defined by height (Y axis) rather than value of the intervals).

Maximum entropy is attained when any single event has equal probability of occurring in any interval (Shannon 1948). Relative to a particular set of samples, maximum entropy can be obtained by defining the interval boundaries using the following method:

1) Pool all observations in the sample set.
2) Order the observations from smallest to largest.
3) Let N be the total number of pooled observations and I the number of intervals. Then, starting at the smallest (or largest) observation, the (N/I)th value defines the end of the first interval.
4) Repeat step three for the 2 x (N/I), 3 x (N/I), ..., (I - 1) x (N/I) values and use these to define the remaining interval boundary values.

In size analysis, N may be weight or volume. The above method is easily programmable. However, because step two can be very time and space consuming for large data sets, modifications to the approach can be made wherein the procedure becomes practical for these large data sets. A computer program designed to handle large data sets is

currently being prepared for publication (Full and Ehrlich, in prep.). Obviously, such an approach assumes that the distributions are measured continuously (for size distributions, either grain-by-grain or via cumulative frequency curves derived from settling tubes or interpolated from sieving). Such data permit definition of class intervals of unequal width. The above approach (involving the pooling of the entire data set) allows the differences between samples to be optimally expressed without forcing the samples to be different. This can be seen by examining the two extreme examples shown in Figure 2. The first data set (Fig. 2a) represents four samples that are exactly alike, while the second data set (Fig. 2b) represents four samples that are completely different from each other. The sum of either sample set will yield identical class-interval frequencies based on maximum entropy (Fig. 2, pooled). Yet in Figure 2a all samples have identical frequencies, while Figure 2b represents samples containing data in mutually exclusive intervals. In this light, the maximum entropy approach represents the "maximally least-biased" method of interval

FIG. 3.--Example of the partition of a set of pooled distributions using the traditional phi, arithmetic, log (base 10) arithmetic, Z-score, and maximum entropy methods. Vertical bars indicate the interval widths of the ten intervals.

definition (Lee 1974). That is, the data have the greatest freedom to distribute themselves through the class intervals. Additionally, maximum-entropy frequency plots increase the potentially useful information that should be used in further analysis when available (Lee 1974). Conceivably, conventional schemes of determining class intervals (for example, quarter-phi intervals) in sedimentological studies may already approach a maximum entropy subdivision of the frequency distributions and so are in optimal configuration. However, this can only be determined by comparing the entropy generated by the conventional methods with the entropy generated by spectra whose intervals are formally determined by the maximum entropy method. The efficiency of any system of generating class intervals can be evaluated by comparing the total entropy for each of the methods. The larger this value, the greater is the efficiency in capturing the information contained within the system. The following section compares the efficiency of various ways of generating class intervals.

EFFICIENCIES OF FREQUENCY-DISTRIBUTION QUANTIFICATION APPROACHES

In addition to the previously defined maximum entropy scheme, four additional methods will be compared using grain-size data (Fig. 3). Figure 3 shows interval widths and Figure 4 shows the distribution of the entire data set using the five methods. The first three are similar in that some measure of size is divided into equal intervals. The first method considered is the traditional phi method (PHI), where the intervals are each a quarter-phi unit wide. The second is an arithmetic method


(ARITH) wherein the millimeter-based intervals are each a fixed distance wide. The third is the log arithmetic method (LOG ARITH), where the log (base 10) of the diameters is used to define equally spaced log millimeter intervals. The latter method, functionally related to the PHI method, has been used in spectral analysis and has recently been applied to the analysis of quartz particle shapes (Boon et al. 1982). The fourth method, the Z-score or standard score method, has been used in many shape studies (for example, see Porter et al. 1979; Brown et al. 1980; Mazzullo and Ehrlich 1980). Its underlying assumption is that the pooled data have a higher density closer to the mean. The intervals are defined in such a way as to create equal frequency counts in each interval by assuming the pooled data are normally distributed. The net effect of such a scheme is to create narrower intervals where the bulk of the data are situated (in the middle of the pooled data set) and progressively wider intervals toward the tails, where the data are sparser.

FIG. 4.--Graphs of the percent occurrence in each interval using different methods. The interval width is as shown in Figure 3. True shape distribution is expressed in the uppermost diagram. The other frequency diagrams represent what a computer "sees" when frequencies are used in multivariate analysis.

Phi data cast in maximum entropy form do not have to be evaluated because the log transformation will not result in a change of information relative to the untransformed data if both are in maximum entropy form (Kullback 1959). That is, for most common transformations (log, square root, square) the maximum entropy method is independent of transformation. Therefore, there is no a priori need to transform the data if the data are subsequently cast in maximum entropy format. A data set of eleven grain-size distributions from thickly bedded (> 30 cm) sands near the lower boundary of the Western Atlantic lower continental rise (and associated Hatteras abyssal plain) was analyzed in order to compare the various methods mentioned.
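The comparison just described can be sketched in code. The data below are synthetic stand-ins (the eleven Hatteras samples are not reproduced here) and the helper names are our own, not the paper's program: per-sample Shannon entropies (natural logs) are computed under equal-width intervals and under pooled-quantile (maximum entropy) intervals, then totaled per scheme.

```python
import math
import random

def entropy(freqs):
    """Shannon entropy E = -sum(P_i ln P_i) over one sample's class frequencies."""
    total = sum(freqs)
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)

def bin_counts(values, edges):
    """Count observations falling between successive interval boundaries."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v > e)] += 1
    return counts

random.seed(0)
# Synthetic stand-in data: 11 samples of 100 lognormal "diameters" each.
samples = [[random.lognormvariate(0.0, 0.5) for _ in range(100)] for _ in range(11)]
pooled = sorted(v for s in samples for v in s)

n_int = 10
lo, hi = pooled[0], pooled[-1]
# Equal-width boundaries over the pooled range.
equal_edges = [lo + (hi - lo) * k / n_int for k in range(1, n_int)]
# Pooled-quantile boundaries: every (N/I)-th ordered observation.
maxent_edges = [pooled[round(k * len(pooled) / n_int) - 1] for k in range(1, n_int)]

for name, edges in (("equal width", equal_edges), ("max entropy", maxent_edges)):
    total = sum(entropy(bin_counts(s, edges)) for s in samples)
    print(f"{name}: total entropy = {total:.2f}")
```

With skewed data such as these, the pooled-quantile scheme yields the higher total, mirroring the ordering reported in Table 1.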
The distributions were obtained on a grain-by-grain basis using a video-digitizing microprocessor system (ARTHUR II) described elsewhere (Fico 1980). The average diameter was calculated based on the maximum projection of each particle and converted to volume by assuming sphericity (Kennedy et al. 1981). Approximately 100 grains comprised each sample. Results of the entropy comparisons are shown in Table 1. The calculated entropy for each sample is listed by method; the total entropy is given for each method at the bottom of the table. The equal-interval, phi-based frequency plot can be seen, in this instance, to be the

TABLE 1.--Individual sample comparisons of entropy for different methods of interval definition. The total-system entropy is given at the bottom.

Sample    Phi     Log Arith   Arith    Z-Score   Max Ent
 1        1.42    1.48        1.88     2.11      2.12
 2        1.33    1.35        0.68     0.00      0.00
 3        1.36    1.50        1.46     0.2       0.89
 4        1.21    1.12        1.56     1.63      1.70
 5        1.47    1.49        1.84     1.99      1.91
 6        1.44    1.51        1.79     1.90      1.99
 7        1.40    1.45        1.99     2.17      2.26
 8        1.48    1.32        1.82     1.93      2.05
 9        1.38    1.43        2.03     2.07      2.21
10        1.30    1.33        1.96     2.23      2.26
11        1.17    1.21        1.87     2.20      2.17

Total    14.96   15.19       19.06    19.15     19.56

poorest carrier of information. Interestingly, even the arithmetical method produced a higher total entropy than either case of equal-width interval histograms of untransformed or log-transformed data (PHI and LOG ARITH), demonstrating that, at least in the present case, the log transformation can reduce intersample information. The intention behind the phi transformation was to suppress the tendencies toward nonnormality. (Perhaps in this instance we have thrown the baby out with the bath water.) This suggests that data transformations must be done with great caution unless a maximum-entropy class-interval format is used. Less surprising is the fact that the Z-score method produced relatively high entropy values. This is due to the fact that the distributions have similar mean values. Many size distributions are, however, not as mutually similar as the samples used here. Therefore, the Z-score method may not be universally better than the other methods discussed above, especially when the data consist of nonsymmetrical or polymodal distributions. The least surprising result is that the maximum entropy method produced the largest entropy, showing that this method conveys more information than the previously defined methods. Generally, the increased information content of the maximum entropy plots has produced higher quality solutions relative to quartz grain-shape analysis (Ehrlich et al. 1980) when used as input data into the Q-mode EXTENDED CABFAC/QMODEL family of algorithms (Klovan and Imbrie 1971; Klovan and Miesch 1976; Full et al. 1981, 1982).

ENTROPY AS A FEATURE EXTRACTOR

The previous sections introduced the idea of using the total-system entropy as a means to compare the information captured by various interval-definition schemes. The maximum entropy method was shown to capture the largest amount of information. Once the method of interval definition has been chosen, entropy can serve as a measure of the total information contained within a particular data set. This measure of information becomes important when two or more data sets, consisting of different observations, may contain similar geologic information. For example, a geologist may have access to separate data sets such as size, shape, chemical, and hydrologic analyses from the same sample array or a time series. There may be no practical way (or justification) to analyze all these data. A question to be answered is which variables hold the most potential for further analysis. Such a criterion is called a "feature extractor." In order to use entropy as a feature extractor, the concept of relative entropy must be introduced. Relative entropy is the ratio of the calculated sample entropy to the maximum possible entropy. The maximum possible entropy of an individual sample is equal to the natural logarithm of the number of intervals. Therefore, the relative entropy of an entire data set is the sum of the calculated entropies divided by the product of the maximum possible entropy and the number of samples. Given a particular method of interval definition (preferably the maximum entropy method), the data set that contains the lowest relative entropy among several data sets contains the highest potential for a clear, unambiguous solution in any subsequent analysis. In shape analysis, the maximum projection shapes of particles are quantified by means of a Fourier series (Ehrlich et al. 1978; Full and Ehrlich 1982).
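The relative-entropy computation just defined can be sketched as follows. The three-sample data set is an invented illustration, and natural logarithms are used as in the rest of the paper:

```python
import math

def entropy(freqs):
    """Shannon entropy (natural log) of one sample's class frequencies."""
    total = sum(freqs)
    return -sum((f / total) * math.log(f / total) for f in freqs if f > 0)

def relative_entropy(sample_freqs, n_intervals):
    """Sum of sample entropies divided by (max possible entropy x sample count).

    The maximum possible entropy per sample is ln(number of intervals)."""
    max_e = math.log(n_intervals)
    return sum(entropy(f) for f in sample_freqs) / (max_e * len(sample_freqs))

# Invented data set: three samples, four intervals each.
data_set = [
    [25, 25, 25, 25],  # flat spectrum: entropy = ln(4), the maximum
    [40, 30, 20, 10],
    [70, 10, 10, 10],
]
print(round(relative_entropy(data_set, 4), 3))  # → 0.867
```

A data set of perfectly flat spectra would score 1.0; the lower the value, the more intersample contrast the data set retains for subsequent analysis.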
Each shape is quantified by calculating 20 shape components (called harmonics); the greater the magnitude (amplitude) of the harmonic, the greater its contribution to the total shape. Each sample of

FIG. 5.--Plot of the relative entropy (squared to enhance separation) versus harmonic number. Harmonic 15 was selected for further analysis.

200 grains is represented by the distributions of 20 sets of harmonic amplitudes. A study may involve 100 or more samples; analysis of each variable (harmonic) separately would be cumbersome. Because each harmonic measures a different scale of shape variability, different amounts and types of information can occur at different harmonics. For instance, if one is concerned with the effects of diagenesis or abrasion, the high harmonics measuring small-scale perturbations on the grain outline are more efficient carriers of information than the lower harmonics, which measure the grosser aspects of shape (elongation, triangularity, etc.). In addition, among the higher harmonics, some may be more inherently sensitive to the system of interest than others. The problem then is to find one or a few harmonics which are most likely to carry the most information. The class frequencies of the amplitudes of these harmonics will be used in further analysis. An example is shown in Figure 5. The data set represents stream-sediment samples from drainage basins on the west flank of the Bighorn Mountains in Wyoming (Kennedy 1982). The streams traversed Precambrian through Mesozoic crystalline and sedimentary rocks. The relative entropy is plotted for Harmonics 2-20 (the first harmonic is an error measurement and is not used for further analysis). The plot in Figure 5 indicates that Harmonic 15 shows the lowest relative entropy, followed by Harmonics 14, 16, and 17. All of these harmonics, when individually analyzed using the EXTENDED CABFAC/FUZZY C-MEANS algorithms (Full et al. 1982; Bezdek et al. 1982), produce four similar cluster (four source) solutions which clearly reflect the four important source terranes in the region.

POTENTIAL PROBLEMS

There may be occasions when the maximum entropy technique may pose some difficulties. One potential difficulty is that the data determine the interval definition, thus making direct comparisons between data from two independent studies more difficult. Another problem is the choice of the optimal number of intervals. Using the maximum entropy approach, all samples are subdivided into the same number of intervals. Too few intervals will obscure the information present by submerging inherent variability, whereas too many intervals will obscure information by decreasing the signal-to-noise ratio. The subdivisions are based on the average properties of the data set. However, this fixed number of intervals cannot conceivably be optimal for every sample. That is, relatively simple frequency distributions might be adequately served by a few intervals, whereas complex ones might be better served with a larger number of intervals. However, a sample set may consist of samples containing both types of distributions. In such a case, fewer or more than the optimal number of intervals for any given distribution may be accepted in order to retain the potential definition of all the distributions. Therefore, what is needed, complementary to the maximum entropy approach for optimal interval-width definition, is a method for determining the optimal number of intervals. Unfortunately, to our knowledge, such a procedure has not yet been developed.

CONCLUSIONS

The concept of entropy, as defined by Shannon (1948), can be a powerful tool for geologists. With entropy measurements the geologist has an index of the information represented by a particular data set. In addition to measuring information, the principle of entropy can be used to define interval boundaries for histograms and frequency-distribution plots, via the maximum entropy

method, wherein the maximum amount of information contained within a specific data system can be assumed to have been captured. The maximum entropy method maximizes the possibility of extracting additional information in terms of unmixing, clustering, scaling, and discrimination, and minimizes any bias introduced by the a priori definition of interval values or by assuming a priori knowledge of the distributions present. The original reason for performing the above entropy analysis was to optimize the unmixing algorithms EXTENDED QMODEL (Full et al. 1981) and FUZZY QMODEL (Full et al. 1982) for shape analysis. However, applications of the principle of entropy in size and other types of analysis promise to increase the potential use of such data to solve more complex geologic problems. In terms of realized potential, the maximum entropy method has already resulted in higher-quality solutions in Fourier shape analysis (Ehrlich et al. 1980). In addition to optimizing the definition of intervals in frequency plots, entropy can be used as a feature extractor for a particular system of data sets. This has been applied to shape analysis to point out which harmonics carry the largest amount of information.
The example of the Hatteras Plain samples suggests that incorrectly subdivided frequency distributions can result in, and probably have resulted in, the inadvertent suppression of differences between samples, and so may imply a more homogeneous sedimentary system than actually exists.

ACKNOWLEDGMENTS

This project was sponsored by the Office of Naval Research under Contract No. N00014-78-C-0698. Carleen Sexton typed and helped edit the final draft.

REFERENCES

BEZDEK, J., EHRLICH, R., AND FULL, W. E., 1984, FCM: The FUZZY C-MEANS clustering algorithm: Computers and Geosci., in press.
BOON III, J. D., EVANS, D. A., AND HENNIGAR, H. F., 1982, Spectral information from Fourier analysis of digitized quartz grain profiles: Jour. Math. Geol., v. 14, p. 589-605.
BROWN, P. J., EHRLICH, R., AND COLQUHOUN, D., 1980, Origin of patterns of quartz sand types on the southeastern United States continental shelf and its implication on contemporary shelf sedimentation--Fourier grain shape analysis: Jour. Sed. Petrology, v. 50, p. 1095-1100.
DOEGLAS, D. J., 1968, Grain-size indices, classification and environment: Sedimentology, v. 10, p. 83-100.
DOWLING, J. J., 1977, A grain-size spectra map: Jour. Sed. Petrology, v. 47, p. 28.
EHRLICH, R., GRACE, J., AND GROTHAUS, B., 1978, Size frequency distributions taken within sand laminae: Jour. Sed. Petrology, v. 48, p. 1193-1202.
EHRLICH, R., BROWN, P. J., YARUS, J. M., AND PRZYGOCKI, R. S., 1980, The origin of shape frequency distributions and the relationship between size and shape: Jour. Sed. Petrology, v. 50, p. 475-484.
FICO, C., 1980, Automated particle shape analysis--Development of a microprocessor-controlled image analysis system [unpubl. master's thesis]: University of South Carolina, Columbia.
FOLK, R. L., 1966, A review of grain-size parameters: Sedimentology, v. 6, p. 73-93.
FOLK, R. L., AND WARD, W. C., 1957, Brazos River bar: A study in the significance of grain-size parameters: Jour. Sed. Petrology, v. 27, p. 3-26.
FULL, W. E., EHRLICH, R., AND KLOVAN, J. E., 1981, EXTENDED QMODEL--Objective definition of external end members in the analysis of mixtures: Jour. Math. Geol., v. 13, p. 331-344.
FULL, W. E., AND EHRLICH, R., 1982, Some approaches for location of centroids of quartz grain outlines to increase homology between Fourier amplitude spectra: Jour. Math. Geol., v. 14, p. 43-54.
FULL, W. E., EHRLICH, R., AND BEZDEK, J. C., 1982, FUZZY QMODEL: A new approach for linear unmixing: Jour. Math. Geol., v. 14, p. 259-270.
FULL, W. E., AND EHRLICH, R., Unbiased maximization of differences between spectra, in prep.
INMAN, D. L., 1952, Measures for describing the size distribution of sediments: Jour. Sed. Petrology, v. 22, p. 125-145.
JAQUET, J.-M., AND VERNET, J.-P., 1976, Moment and graphic size parameters in sediments of Lake Geneva (Switzerland): Jour. Sed. Petrology, v. 46, p. 305-312.
JAYNES, E. T., 1978, Electrodynamics today, in Mandel, L., and Wolf, E., eds., Proc. of the Fourth Rochester Conf. on Coherence and Quantum Optics: New York, Plenum Press.
KENNEDY, S. K., 1982, Provenance and dispersal of sand and silt in a high-gradient stream system on the west flank of the Bighorn Mountains, Wyoming--Fourier shape analysis [unpubl. Ph.D. dissert.]: University of South Carolina, Columbia.
KENNEDY, S. K., EHRLICH, R., AND KANA, T. W., 1981, The nonnormal distribution of intermittent suspension sediments below breaking waves: Jour. Sed. Petrology, v. 51, p. 1103-1108.
KLOVAN, J. E., AND IMBRIE, J., 1971, An algorithm and FORTRAN-IV program for large scale Q-mode factor analysis and calculation of factor scores: Jour. Math. Geol., v. 3, p. 61-76.
KLOVAN, J. E., AND MIESCH, A. T., 1976, EXTENDED CABFAC and QMODEL computer programs for Q-mode factor analysis of compositional data: Computers and Geosci., v. 1, p. 161-178.
KRUMBEIN, W. C., 1934, Size frequency distributions of sediments: Jour. Sed. Petrology, v. 4, p. 65-77.

KRUMBEIN, W. C., 1936, The application of logarithmic moments to size frequency distributions of sediments: Jour. Sed. Petrology, v. 6, p. 35-47.
KULLBACK, S., 1959, Information Theory and Statistics: New York, John Wiley & Sons.
LEE, R., 1974, Entropy models in spatial analysis: University of Toronto, Department of Geography, Discussion Paper Series, Toronto, Canada.
MASON, C. C., AND FOLK, R. L., 1958, Differentiation of beach, dune, and aeolian flat environments by size analysis, Mustang Island, Texas: Jour. Sed. Petrology, v. 28, p. 211-226.
MAZZULLO, J. M., AND EHRLICH, R., 1980, A vertical pattern of variation in the St. Peter Sandstone--Fourier grain shape analysis: Jour. Sed. Petrology, v. 50, p. 63-70.
MIDDLETON, G. V., 1976, Hydraulic interpretation of sand size distributions: Jour. Geol., v. 84, p. 405-426.
PORTER, G. A., EHRLICH, R., COMBELLICK, R. A., AND OSBORN, R. H., 1979, Sources and nonsources of beach sand along southern Monterey Bay, California--Fourier shape analysis: Jour. Sed. Petrology, v. 49, p. 727-732.
SHANNON, C. E., 1948, A mathematical theory of communication: Bell System Tech. Jour., v. 27, p. 379-423, 623-656. [Reprinted in Shannon, C. E., and Weaver, W., The mathematical theory of communication: Urbana, University of Illinois Press, 1963.]
SHARP, W. E., 1973, Entropy as a parity check: Earth Research, v. 1, p. 27-30.
SHARP, W. E., AND FAN, P., 1963, A sorting index: Jour. Geol., v. 71, p. 76-84.
SWAN, D., CLAGUE, J., AND LUTERNAUER, J. L., 1979, Grain-size statistics II: Evaluation of grouped moment measures: Jour. Sed. Petrology, v. 49, p. 487-500.
TAIRA, A., AND SCHOLLE, P. A., 1979, Discrimination of depositional environments using settling tube data: Jour. Sed. Petrology, v. 49, p. 787-800.
VISHER, G. S., 1969, Grain size distributions and depositional processes: Jour. Sed. Petrology, v. 39, p. 1074-1106.