WHAT NON-LINEARITY TO CHOOSE? MATHEMATICAL FOUNDATIONS OF FUZZY CONTROL

V. Kreinovich¹, C. Quintana¹, R. Lea², O. Fuentes³, A. Lokshin⁴, S. Kumar¹, I. Boricheva⁵ and L. Reznik⁶

¹ Computer Science Department, University of Texas at El Paso, El Paso TX 79968, USA, [email protected], [email protected], [email protected]
² NASA Johnson Space Center, Houston TX 77058, USA, [email protected]
³ Department of Computer Science, Texas A&M University, College Station TX 77840, USA, [email protected]
⁴ Applied Science Division, Perkin-Elmer, P.O. Box 2801, Pomona CA 91769, USA
⁵ Post-Graduate Studies, VNIIEP, Prosvescheniya 85, St. Petersburg 195267, Russia
⁶ Sikeirosa 15–1–46, St. Petersburg 194354, Russia

Abstract. Fuzzy control is a very successful way to transform the expert’s knowledge of the type “if the velocity is big and the distance from the object is small, hit the brakes and decelerate as fast as possible” into an actual control. To apply this transformation one must: 1) choose fuzzy variables corresponding to words like “small”, “big”; 2) choose operations corresponding to “and” and “or”; 3) choose a method that transforms the resulting fuzzy variable for u into a single value ū. The wrong choice can drastically affect the quality of the resulting control, so the problem of choosing the right procedure is very important. From a mathematical viewpoint these choice problems correspond to non-linear optimization and are therefore extremely difficult. We develop a new mathematical formalism (based on group theory) that allows us to solve the problem of optimal choice and thus: 1) explain why the existing choices are really the best (in some situations); 2) explain the rather mysterious fact that fuzzy control based on the experts’ knowledge is often better than the control by these same experts; 3) give choice recommendations for the cases when traditional choices do not work. Perspectives of space applications will also be discussed.

Keywords. Fuzzy control, optimization, stability, smoothness, space applications.

1. BRIEF INTRODUCTION: WHY MATHEMATICAL FOUNDATIONS?

1.1 Why is fuzzy control necessary: a real world example

A simple example: controlling a thermostat. The goal of a thermostat is to keep a temperature T equal to some fixed value T0, or, in other words, to keep the difference x = T − T0 equal to 0.
To achieve this goal, one can switch the heater or the cooler on and off and control the degree of cooling or heating. What we actually control is the rate at which the temperature changes, i.e., in mathematical terms, the derivative Ṫ of temperature with respect to time. So if we apply the control u, the behavior of the thermostat will be determined by the equation Ṫ = u. In order to automate this control we must come up with a function u(x) that describes what control to apply if the temperature difference x is known. Why can’t we extract u(x) from an expert? We are talking about a situation where traditional control theory does not help, so we must use the experience of an expert to determine the control function u(x). Why can’t we just ask the expert questions like “suppose that x is 5 degrees; what do you do?”, write down the answers, and thus plot u(x)? It sounds reasonable at first glance, until you try applying the same idea to a skill in which practically all adults consider themselves experts: driving a car. If you ask a driver a question like “you are driving at 55 mph when the car 30 ft in front of you slows down to 47 mph; for how many seconds do you hit the brakes?”, nobody will give a precise number. You might install measuring devices into a car or a simulator, and simulate this situation, but what will happen is that the amount of time to brake will be different for different simulations. The problem is not that the expert has some precise number (like 1.453 sec) in mind that he cannot express in words; the problem is that one time it will be 1.3, another time it may be 1.5, etc. An expert usually expresses his knowledge in words. An expert cannot express his knowledge in precise numeric terms (such as “hit the brakes for 1.43 sec”), but what he can say is “hit the brakes for a while”. So the rules that can be extracted from him are “if the velocity is a little bit smaller than maximum, hit the brakes for a while”.
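The closed-loop equation Ṫ = u can be made concrete with a few lines of code. The following is a minimal sketch (the function name and the proportional rule u(x) = −0.5x are our own illustrative choices, not the control that the fuzzy methodology below will produce):

```python
def simulate_thermostat(u, x0, dt=0.01, steps=1000):
    """Euler-integrate dx/dt = u(x), where x = T - T0 is the
    temperature difference and u is the applied control."""
    x = x0
    for _ in range(steps):
        x += dt * u(x)
    return x

# A stand-in proportional control: cool when too warm, heat when too cold.
final = simulate_thermostat(lambda x: -0.5 * x, x0=10.0)
# The initial 10-degree difference decays toward 0.
```

The point of fuzzy control, developed below, is to build a reasonable u(x) when no such explicit formula is available, only an expert's verbal rules.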
Let’s illustrate the rules on the thermostat example. If the temperature T is close to T0, i.e., if the difference x = T − T0 is negligible, then no control is needed, i.e., u is also negligible. If the room is slightly overheated, i.e., if x is positive and small, we must cool it a little bit (i.e., u = ẋ must be negative and small). If the temperature is a little lower, then we need to heat the room a little bit. In other terms, if x is small negative, then u must be small positive, etc. So we have the following rules: 1) if x is negligible, then u must be negligible; 2) if x is small positive, then u must be small negative; 3) if x is small negative, then u must be small positive; etc. Fuzzy control methodology is really necessary. We need some methodology to transform these rules into a precise value of control. Such a methodology was first outlined by L. Zadeh [Z71, CZ72, Z73] and experimentally tested by E. Mamdani [M74] in the framework of fuzzy set theory [Z65]; therefore the whole area of research is now called fuzzy control. For the current state of fuzzy control the reader is referred to the surveys [S85], [L90] and [B91]. 1.2 Why mathematical foundations are needed Fuzzy control is often semi-heuristic and hence not always reliable. Fuzzy control is a very successful way to transform an expert’s knowledge of the type “if the velocity is big and the distance from the object is small, hit the brakes and decelerate as fast as possible” into an actual control. In applying this transformation one must: 1) choose fuzzy variables corresponding to words like “small”, “big”; 2) choose operations corresponding to “and” and “or”; 3) choose a method that transforms the resulting fuzzy variable for u into a single value ū. The wrong choice of variables, operations or a transformation method can drastically affect the quality of the resulting control [KKS85].
For example, as we will see later, different choices can lead to a twofold difference in relaxation time (this is an important characteristic of stability).
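To make “relaxation time” concrete: for a linear control u = −kx the deviation decays as x(t) = x0·e^(−kt), so the relaxation time (the time for the deviation to decay by a factor of e) is 1/k, and halving the gain doubles it. A small numerical sketch (the linear controls and all names are our own illustrative assumptions, not the choices analyzed later in the paper):

```python
import math

def relaxation_time(u, x0=1.0, dt=0.001, t_max=100.0):
    """Estimate the time for x governed by dx/dt = u(x) to fall
    below x0/e, using Euler integration."""
    x, t = x0, 0.0
    while x > x0 / math.e and t < t_max:
        x += dt * u(x)
        t += dt
    return t

t_fast = relaxation_time(lambda x: -1.0 * x)  # about 1.0
t_slow = relaxation_time(lambda x: -0.5 * x)  # about 2.0: twice as slow
```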
These choices are usually made on a semi-empirical basis; if a resulting system works, that’s fine. This approach is acceptable for camcorders or dishwashers; even if something goes wrong with a picture or a water level for a moment or two, it is not a problem. However, this level of reliability is absolutely unacceptable for more important systems [Be88]. This is the main reason why, in spite of the very promising results of computer simulations ([L88, LJB90], etc.), fuzzy control techniques are not yet widely applied to space missions. Another reason why potential users are very cautious about applying fuzzy control is that it seems suspiciously efficient. Namely, the authors of fuzzy control applications claim that the performance of the resulting fuzzy control is often better than the performance of the same systems when they are controlled by the same experts whose experience was used to design this fuzzy control. This is kind of a mystery, because the automated fuzzy control uses no knowledge other than that extracted from the expert. Since it is impossible to write down all the rules that the expert uses, it would be no mystery for an automated system to behave a little bit worse than the human expert. But in reality the opposite occurs: the automated system is better! (see [MYI87], [KF88], [D91], [K92]). This increase in quality (and it can be up to 50%) is absolutely unexpected in the present theory of fuzzy control. If the results are 50% better than expected, then the precision of expectations is 50%. This is not very impressive precision, and it adds to the belief that fuzzy control methods are not very reliable. Hence part of the task of making fuzzy control more reliable is to explain why fuzzy control is often better. What we are planning to do.
So in order to make these promising methods more reliable we must undertake a mathematical analysis of fuzzy control techniques in order to: 1) theoretically explain why the existing semi-empiric choices are really appropriate in many situations; 2) explain why the fuzzy control that is based on the experts’ knowledge is often better than the control performed manually by these same experts; 3) give choice recommendations for the cases when traditional transformations do not work. We are going to solve these problems in the present paper. For whom this paper is written. This paper does not assume any preliminary knowledge of fuzzy logic or fuzzy control theory. We will give first informal motivations, and then precise definitions of all the notions that we will need. Our goal was to make it understandable and convincing both for those who are already acquainted with fuzzy control, and for those who may be doubtful about the entire approach. One of the main purposes of fuzzy control was to simplify the controllers, and we are introducing deep mathematics instead: isn’t there a contradiction? One of the main reasons why fuzzy control was invented was that [Z74] by the beginning of the 1970s control theory became so “over-mathematized” that complex real world control problems became intractable. Fuzzy control is much simpler and makes these problems tractable. We are now planning to add more mathematics to fuzzy control theory. It seems like this contradicts the original goal of simplicity. But there is no contradiction: the methodology of fuzzy control remains simple. We are planning to apply complicated mathematics only to improve the quality of the fuzzy control by comparing different alternatives and choosing the best one. And since we are choosing between tractable alternatives, the one that we choose will inevitably be tractable.
In order to apply fuzzy control methodology to real-life systems one does not need to know all the complicated mathematics that we used: it is sufficient to use the results of our analysis.
Contents of the paper. This paper consists of the following sections:
2. Brief Informal Description of All Four Stages of Fuzzy Control Methodology
3. Relationship Between Different Procedures of Assigning Numerical Values of Uncertainty
4. What Fuzzy Variables to Choose. Reasonable Choice
5. What Fuzzy Variables to Choose. Optimal Extrapolation of the Experts’ Choice
6. What Fuzzy Variables to Choose. How to Describe Modifiers (Like “Very”)
7. The Choice of & and ∨ Operations. The General Case
8. What Fuzzy Variable to Choose. How to Describe “Almost Equal to a”
9. What Defuzzification to Choose. General Case
10. What Defuzzification to Choose. Case of Prohibitions
11. Why Fuzzy Control is Often Better Than the Experts It Simulates: an Explanation
12. The Choice of & and ∨ Operations. Docking and Tracing
13. How to Combine Rules and Optimization When Designing a Fuzzy Control

Section 3 contains the basic formalism that will be used in all the other sections. Apart from that, each of the following sections is more or less self-sufficient and independent of all the others: it contains the formulation of a problem in informal terms, motivations of the proposed formalization, definitions, and the corresponding mathematical results. All the proofs are given in the Appendix. The results of this paper were partially published before in [KR86], [KKM88], [FK89], [K89], [K89a,b], [K90], [KK90], [KK90a], [KL90], [KFLL91]. This paper also subsumes [KQL91].

Who wrote what. Inna Boricheva co-authored Section 5, Olac Fuentes Sections 2, 11 and 12, Sundeep Kumar Sections 3 and 7, Anatole Lokshin Sections 4, 6, 7, 11 and 12, Leonid Reznik Section 8. Vladik Kreinovich, Robert Lea and Chris Quintana participated in the entire text. Section 10 contains the result that we obtained together with John Yen and Nathan Pfluger from Texas A&M University.

2. BRIEF INFORMAL DESCRIPTION OF ALL FOUR STAGES OF FUZZY CONTROL METHODOLOGY

2.1 Brief description of fuzzy methodology

Let’s first combine all the rules into one statement relating x and u. If we know x, what control u should we apply? u is a reasonable control if either:
• the first rule is applicable (i.e., x is negligible) and u is negligible; or
• the second rule is applicable (i.e., x is small positive), and u must be small negative; or
• the third rule is applicable (i.e., x is small negative), and u must be small positive; or
• one of the other rules is applicable.

Summarizing, we can say that u is an appropriate choice for a control if and only if either x is negligible and u is negligible, or x is small positive and u is small negative, etc. If we use the notations C(u) for “u is an appropriate control”, N(x) for “x is negligible”, SP for “small positive”, SN for “small negative” and use the standard mathematical notations & for “and”, ∨ for “or” and ≡ for “if and only if”, we come to the following informal “formula”:

C(u) ≡ (N(x) & N(u)) ∨ (SP(x) & SN(u)) ∨ (SN(x) & SP(u)) ∨ ... (1)

How do we formalize this combined statement: four stages of fuzzy control methodology. In order to formalize statements like the one we just wrote down, we first need to somehow interpret what notions like “negligible”, “small positive”, “small negative”, etc., mean. The main difference between these notions and mathematically precise (“crisp”) ones like “positive” is that any value is either positive or not, while for some values it is difficult to decide whether they are negligible or not. Some values are so small that practically everyone would agree that they are negligible, but
the bigger the value, the fewer experts will say that it is negligible, and the less confident they will be in that statement. For example, if someone is performing a complicated experiment that needs a fixed temperature, then for him 0.1 degree is negligible, but 1 degree is not. For another expert ±5 degrees is negligible. First stage: describing the degree of confidence. This degree of confidence (also called degree of belief, degree of certainty, truth value, certainty value) can take all possible values from “false” to “true”. Inside the computer, “false” is usually described by 0, “true” by 1. Therefore it is reasonable to use intermediate values from the interval (0, 1) to describe arbitrary degrees of certainty. This idea appeared in fuzzy logic [Z65], and that’s why the resulting control is called fuzzy control. So the first stage of a fuzzy control methodology is to somehow assign values from the interval [0, 1] to different statements like “0.3 is negligible” or “0.6 is small positive”. There are several ways to do that [DP80]. For example ([BW73], [B74], [DP80, IV.1.d], [KF88]), we can take several (N) experts and ask each of them whether he believes that a statement is true (for example, that 0.3 is negligible). If M of them answer “yes”, we take M/N as the desired certainty value. Another possibility is to ask one expert and express his degree of confidence in terms of the so-called subjective probabilities [S54]. Second stage: forming a membership function. The procedure described above allows us to get the truth values of, for example, N(x) for different values of x. But even if we spend a lot of time questioning experts, we can only ask a finite number of questions. Therefore, we will only get the values N(x) for finitely many different values of x: x1, x2, ..., xn. But in future applications, the variable x can take any value.
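The polling procedure just described, together with one simple way of extending its finitely many answers to all x (plain linear interpolation), can be sketched in code; the poll data and all function names here are our own illustrative assumptions:

```python
def certainty_value(answers):
    """First stage: certainty M/N, the fraction of N experts who
    answered 'yes' to a question like 'is 0.3 negligible?'."""
    return sum(answers) / len(answers)

def membership_from_poll(xs, ts):
    """Second stage (one possible extrapolation): extend elicited
    values ts[i] at points xs[i] to every x by linear interpolation,
    holding the boundary values outside the sampled range."""
    def mu(x):
        if x <= xs[0]:
            return ts[0]
        if x >= xs[-1]:
            return ts[-1]
        i = max(j for j in range(len(xs) - 1) if xs[j] <= x)
        frac = (x - xs[i]) / (xs[i + 1] - xs[i])
        return ts[i] + frac * (ts[i + 1] - ts[i])
    return mu

# Hypothetical poll: 9 of 10 experts call 0.1 negligible, 3 of 10 call 0.6 negligible.
t_01 = certainty_value([True] * 9 + [False])      # 0.9
t_06 = certainty_value([True] * 3 + [False] * 7)  # 0.3
mu_negligible = membership_from_poll([0.0, 0.1, 0.6, 1.0], [1.0, t_01, t_06, 0.0])
```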
We cannot get the truth value of N(x) for all x simply by asking, so we must somehow extrapolate the known truth values N(xi) to come up with a function that, for every possible x, gives a value from the interval [0, 1] that expresses our degree of confidence that this property is true for x. Such a function is called a membership function and is usually denoted by µ(x). A membership function of the property N is denoted by µN(x), a membership function of the property SP by µSP(x), etc. This extrapolation procedure is applicable if the expert whose control we simulate uses only finitely many possible words to describe the quantities, e.g., “negligible”, “small”, “medium”, “big”, etc. In this case, for each of these words we can determine a membership function. But in some cases, experts can be more specific. First, they can use “modifiers”, i.e., words like “very”, “almost”, etc. By adding several such modifiers we can come up with many (potentially infinitely many) different descriptions of magnitude: for example, “big”, “very big”, “very very big”, etc. Since there are potentially infinitely many such combinations, it is impossible to describe the membership functions that correspond to all of them one by one. We need a general procedure that would allow us to transform a membership function that corresponds to some notion (e.g., “big”) into a membership function that corresponds to, for example, “very big”. Second, in addition to words like “big” or “small”, experts can use numbers to describe the quantity, for example, “about 3.4”, or “about 3.4 ± 0.3”. This does not mean that the actual value lies strictly between 3.4 − 0.3 and 3.4 + 0.3; it just means that the expert considers it reasonable that the actual value belongs to the interval [3.1, 3.7], but, in principle, the value can be a little bit smaller than 3.1 or a little bit bigger than 3.7.
Here also we have a potentially infinite set of properties, because an expert can take arbitrary values as estimates (and not only 3.4), so here we also need a general method to describe all the corresponding membership functions. Third stage: & and ∨ operations. After the second stage we are able to assign truth values to the statements N(x), SP(x), etc. Our goal is to describe the possible values of control. In formula (1) control is represented by the statement C(u), meaning “u is an appropriate value of control”. To get the truth value of this statement for different values of u, we must somehow interpret the operations “and” and “or” that relate it to the values that we already know.
Suppose that we have already chosen some rules to process & and ∨. Namely, we have chosen a procedure that allows us, given the truth values a and b of some statements A and B, to compute the truth values of A&B and A ∨ B. Let’s denote the resulting truth value of A&B by f&(a, b), and the truth value of A ∨ B by f∨(a, b). Now we can compute the truth value of C(u) for every u, i.e., a membership function of the property C. In particular, for our thermostat example the resulting membership function is

µC(u) = f∨(f&(µN(x), µN(u)), f&(µSP(x), µSN(u)), f&(µSN(x), µSP(u)), ...).

Comment. Some rules that an expert formulates can be negative, like “if something, do not apply a big control”. When we write down such rules, then in addition to “and” and “or” we must somehow represent the negation “not”. The procedure that transforms the truth value a of a statement A into a truth value for ¬A will be denoted by f¬. Fourth stage: defuzzification. After the first three stages we have the “fuzzy” information about the possible controls: something like “with degree of certainty 0.9 the value u = 0.3 is reasonable, with degree of certainty 0.8 the value u = 0.35 is reasonable, etc.” We want to build an automatic system, so we must choose one value ū. So we must somehow transform a membership function µ(u) into a single value. Such a procedure is called defuzzification. Alternatives to this four-stage approach. These four stages correspond to the mainstream research in fuzzy control. All applications of fuzzy control that we know of are based on this methodology. Therefore in the present paper we will analyze only such four-stage control procedures. However, it is worth mentioning that a different approach has been proposed in [KA90], [DP91]. In this new approach we first preprocess the fuzzy rules, so that there is no need to first find a fuzzy variable and then apply defuzzification.
This new approach seems to save lots of computation time, and is therefore extremely promising.

2.2 What choices are made now for these four stages

What membership functions are actually used. The most frequently used are:

1) piecewise linear functions; the most frequent case is when we know for sure that x lies inside an interval [a − ∆, a + ∆]. Here a denotes the most probable value of x, and ∆ the maximal possible error. In this case we take µ(a) = 1, µ(a − ∆) = µ(a + ∆) = 0 and apply linear interpolation. As a result we get µA(x) = 0 if x < a − ∆ or x > a + ∆; µA(x) = 1 + (x − a)/∆ if a − ∆ ≤ x ≤ a; and µA(x) = 1 − (x − a)/∆ if a ≤ x ≤ a + ∆. Such functions are called triangular. If we have several consecutive words describing the same quantity, such as “small negative”, “negligible”, “small positive”, etc., then every value of x must satisfy one of these properties. Using ratios of experts or subjective probabilities to get the value of µ, we come to the conclusion that, for every x, the sum of the values of µA(x) over all A must be equal to 1. Therefore, when the membership function corresponding to one property starts decreasing from 1 to 0 (in the interval [a, a + ∆]), the membership function that corresponds to the next property must start increasing from 0 to 1. In view of that, the value of ∆ must be the same for all the properties, and the value of a is equal to 0 for “negligible”, ∆ for “small positive”, 2∆ for the next property, etc. This is to some extent an oversimplification in comparison with what is actually used: the “left ∆” and “right ∆” can be different, and there must be infinite intervals corresponding to “very very big” (positive and negative). With these corrections made, these simplest membership functions are efficiently used in fuzzy control [L88], [LTTJ89], [LJB90].
In addition to triangular functions, trapezoidal functions are often used (in applications they are, in general, the most frequent ones [DP80, DP88]); other examples are given in [BK88].

2) piecewise fractionally linear functions; this means that the set of real numbers is divided into finitely many regions, and in each region µ(x) is described by the fractionally linear formula
µ(x) = (ax + b)/(cx + d); the values of a, b, c, d can be different for different regions. These functions are used in the famous Japanese train control system [MYI87].

3) splines, i.e., piecewise polynomial functions;

4) Gaussian functions µ(x) = exp(−(x − a)^2/σ^2) for some a and σ.

What & and ∨ operations are most frequently used: 1) f& = min and f∨ = max; this is the original proposal of L. Zadeh; 2) f&(a, b) = ab (this operation was also proposed by Zadeh in [Z65] as the “algebraic product”), f∨(a, b) = a + b − ab (the “algebraic sum” [Z65]) and f∨(a, b) = min(a + b, 1) (the “bounded sum” or “bold union” [G76]). These operations turned out to be a good fit for human reasoning with uncertainties [HC76], [O77]. However, several other operations have been proposed both to describe human reasoning and fuzzy control (see, e.g., [GQ91], [GQ91a]):

&) Zadeh stressed that the above operations “are not the only operations in terms of which the union and intersection can be defined”, and “which of these ... definitions is more appropriate depends on the context” [Z75, p. 225–226]. Other operations proposed for & include the “bounded difference” (“bold intersection”) max(0, a + b − 1) [G76]; ab/(k + (1 − k)(a + b − ab)), where k ≥ 0 [H75], [H78]; log_s[1 + (s^a − 1)(s^b − 1)/(s − 1)], where s > 0 [F79]; ab/max(a, b, e), where 0 < e < 1 [DP80]; 1 − min(1, ((1 − a)^p + (1 − b)^p)^(1/p)), where p > 0 [Y80]; 1/(1 + [(a^(−1) − 1)^p + (b^(−1) − 1)^p]^(1/p)), where p > 0 [D82]; also the operation max(0, a^p + b^p − 1)^(1/p), originally proposed in [SS61] for non-fuzzy purposes, was proposed for & in [KF88].
∨) For ∨, alternative operations include the “bounded sum” (“bold union”) min(1, a + b) [G76]; ((1 − k)ab + k(a + b))/(ab + k), where k ≥ 0 [H75], [H78]; 1 − log_s[1 + (s^(1−a) − 1)(s^(1−b) − 1)/(s − 1)] [F79]; (a + b − ab − min(a, b, 1 − e))/max(1 − a, 1 − b, e) [DP80a]; min(1, (a^p + b^p)^(1/p)), where p > 0 [Y80]; 1/(1 + [(a^(−1) − 1)^p + (b^(−1) − 1)^p]^(1/p)), where p < 0 [D82]; and the operation from [SS61]: 1 − max(0, (1 − a)^(−p) + (1 − b)^(−p) − 1)^(1/p).

In particular, if we use min and max for our thermostat example, we get

µC(u) = max(min(µN(x), µN(u)), min(µSP(x), µSN(u)), min(µSN(x), µSP(u)), ...).

What defuzzification procedures are now most frequently used:

1) the centroid rule ū = (∫ u µC(u) du)/(∫ µC(u) du);

2) the mean of maximum rule: if the function µ(x) attains its maximum at only one point, we take this point as ū; if it attains the value equal to max µ(x) on a whole interval, we take the center of this interval for ū;

3) the centroid of largest area [PY91], [PY91a], [PYL91]. To apply this rule we must choose some threshold value p0; the set of all the values u for which µ(u) ≥ p0 is either connected or consists of several disconnected regions. We choose the region A for which the area ∫_A µ(u) du is the largest, and as ū we take the centroid value computed for the restriction of µ to that region (the motivation for this procedure is given in Section 10).

2.3 Brief summary of the fuzzy control methodology

We extract rules from an expert or experts, and transform these rules into an and-or statement. We then find the membership functions for all words like “negligible” or “small” in our rules. After that we choose functions f& and f∨. Now we can compute µC(u) for every x. Using a defuzzification rule, we compute the value ū. This value is what the fuzzy control algorithm recommends for this case.
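The four stages summarized above fit in a few lines of code. Here is a minimal sketch for the thermostat, using Zadeh's min/max for & and ∨ and the centroid rule for defuzzification; the membership width ∆ = 1, the control grid, and all function names are our own illustrative choices:

```python
def triangular(a, delta):
    """Triangular membership function centered at a with half-width delta."""
    return lambda x: max(0.0, 1.0 - abs(x - a) / delta)

# Stage 2: membership functions (illustrative width delta = 1).
mu_N = triangular(0.0, 1.0)    # "negligible"
mu_SP = triangular(1.0, 1.0)   # "small positive"
mu_SN = triangular(-1.0, 1.0)  # "small negative"

def mu_C(x, u):
    """Stage 3: truth value of 'u is an appropriate control' for a given x,
    combining the three thermostat rules with min (&) and max (or)."""
    return max(min(mu_N(x), mu_N(u)),
               min(mu_SP(x), mu_SN(u)),
               min(mu_SN(x), mu_SP(u)))

def centroid_control(x, lo=-2.0, hi=2.0, n=4001):
    """Stage 4: centroid defuzzification of mu_C(x, .) on a discrete grid."""
    us = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ws = [mu_C(x, u) for u in us]
    return sum(u * w for u, w in zip(us, ws)) / sum(ws)

u_bar = centroid_control(0.5)  # slightly overheated room -> mild cooling (u_bar < 0)
```

For x = 0.5 (a slightly overheated room) the resulting control is negative, i.e., mild cooling, exactly as the verbal rules demand.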
2.4 What we are going to do

The choices described above are mainly semi-empirical in the sense that for some reasonable problems, they turned out to be better than the other choices that were tried. In the following sections, we will:
1) formulate the choice problem as a mathematical problem;
2) solve this mathematical problem;
3) explain why the currently made choices are in some sense optimal,
for all the stages of fuzzy control methodology.

3. RELATIONSHIP BETWEEN DIFFERENT PROCEDURES OF ASSIGNING NUMERIC VALUES OF UNCERTAINTY
3.1 Why do we need to relate different procedures

Different assignment procedures are in use. Working intelligent systems use several different procedures for assigning numeric values that describe the uncertainty of experts’ statements. The same expert’s degree of uncertainty, expressed, for example, by the words “for sure”, can lead to 0.9 if we apply one procedure, and to 0.8 if another procedure is used. Just like 1 foot and 12 inches describe the same length in different scales, we can say that 0.9 and 0.8 represent the same degree of certainty in two different scales. Some scales even differ in that they use an interval other than [0, 1] to represent uncertainty. For example, the famous MYCIN system uses [−1, 1] [S76, BS84]. In some sense all scales are equal, but some are more reasonable than others. From a mathematical viewpoint, one can use any scale, but from the practical viewpoint some of them are more reasonable to use, and some less reasonable. We’ll consider only practically reasonable scales, and we’ll try to formalize what that means. We must describe transformations between the scales. Since we are not restricting ourselves to some specific procedure of assigning a numeric value to uncertainty, we can thus allow values from different scales. If we want to combine them, we must be able to transform them all to one scale. So we must be able to describe the transformations between reasonable scales (“rescalings”). 3.2 How to describe transformations between reasonable scales: motivation The idea of such a description appeared first in [KK91] and is as follows.
The class F of reasonable transformations of degrees of uncertainty must satisfy the following properties: 1) If a function x → f(x) is a reasonable transformation from a scale A to some scale B, and a function y → g(y) is a reasonable transformation from B into some other scale C, then it is reasonable to demand that the transformation x → g(f(x)) from A to C is also a reasonable transformation. In other words, the class F of all reasonable transformations must be closed under composition. 2) If x → f(x) is a reasonable transformation from a scale A to a scale B, then the inverse function is a reasonable transformation from B to A. Comment. Thus, the family F must contain the inverse of every function that belongs to it, and the composition of every two functions from F. In mathematical terms, this means that F must be a transformation group. 3) If the description of a rescaling is too long, it is unnatural to call it reasonable. Therefore, we will assume that the elements of F can be described by fixing the values of n parameters (for some small n). In mathematics, the notion of a group whose elements depend continuously on finitely many parameters is formalized as the notion of a (connected) Lie group. So we conclude that reasonable rescalings form a connected Lie group.
4) The last natural demand that we’ll use is as follows. Of course, in principle, it is possible that we assign 0.1 in one scale and it corresponds to 0.3 in another scale. It is also possible that we have 0.1 and 0.9 on one scale that comprises only the statements with low degrees of belief, and when we turn to some other scale that takes all possible degrees of belief into consideration, we get small numbers for both. But if in some scale we have the values 0.5, 0.51 and 0.99, meaning that our degrees of belief in the first two statements almost coincide, then it is difficult to imagine another reasonable scale in which the same three statements have equidistant truth values, say 0.1, 0.5 and 0.9. If this example is not convincing, take 0.501 or 0.5001 for the second value on the initial scale. We’ll formulate this idea in the maximally flexible form: there exist two triples of truth values that cannot be transformed into each other by any natural partial rescaling. This demand is somewhat less convincing than the previous ones, so we’ll prove two versions of our future results: with and without this demand. Examples of reasonable rescaling transformations. In addition to these general demands, we have some examples of rescalings that are evidently reasonable [KK90, KK91]. As we have already mentioned, one of the natural methods to assign a truth value t(S) to a statement S is to ask several experts and take t(S) = N(S)/N, where N is the number of all experts asked and N(S) is the number of those who believe in S. If all the experts believe in S, then this value is 1 (= 100%); if half of them believe in S, then t(S) = 0.5 (50%), etc. Knowledge engineers want the system to include the knowledge of the entire scientific community, so they ask as many experts as possible.
But asking too many experts leads to the following negative phenomenon: when the opinion of the most respected professors, Nobel prize winners, etc., is known, some less self-confident experts will not be brave enough to express their own opinions, so they will either say nothing or follow the opinion of the majority. How does their presence influence the resulting uncertainty value? Let N denote the initial number of experts, N(S) the number of those of them who believe in S, and M the number of shy experts added. Initially t(S) = N(S)/N. After we add M experts who do not answer anything when asked about S, the number of experts who believe in S is still N(S), but the total number of experts is bigger (M + N). So the new value of the uncertainty ratio is t′ = N(S)/(N + M) = ct, where c = N/(M + N). When we add experts who give the same answers as the majority of the N renowned experts, then, for the case when t(S) > 1/2, we get N(S) + M experts saying that S is true, so the new uncertainty value is t′ = (N(S) + M)/(N + M) = (N t(S) + M)/(N + M). If we add M “silent” experts and M′ “conformists” (who vote as the majority), then we get a transformation t → (N t + M′)/(N + M + M′). In all these cases the transformation from an old scale t(S) to a new scale t′(S) is a linear function t → at + b for some constants a and b; in the most general case a = N/(N + M + M′) and b = M′/(N + M + M′). Now we are ready to formulate a mathematical definition.

3.3 Definitions and the main result of this section

Definition 1. By a rescaling we mean a strictly increasing continuous function f that is defined on an interval [a, b] of real numbers.

Definition 2. Suppose that some set F of rescalings satisfies the following properties:
1) F is a connected Lie group;
2) if N, M, M′ are non-negative integers and N > 0, then the transformation
t → (Nt + M′)/(N + M + M′) belongs to F; 3) there exist two triples of real numbers x < y < z and x′ < y′ < z′ such that no rescaling from F can transform x into x′, y into y′ and at the same time z into z′. Elements of this set F are called reasonable transformations.
Comment. In view of the previous remark that demand 3) is not as convincing as demands 1) and 2), we'll consider two versions of several results: the main one, when we assume 3), and the additional one, when we do not assume 3). The main results will be denoted as Theorems 1, 2, 3, Definitions 2, etc., and the corresponding additional results (without assuming 3)) will be denoted as Theorems 1′, 2′, 3′, Definitions 2′, etc. In particular, without assuming demand 3), the above Definition 2 takes the following form:
Definition 2′. Suppose that some set F of rescalings is fixed, and this set F satisfies the following properties: 1) F is a connected Lie group; 2) if N, M, M′ are non-negative integers and N > 0, then the transformation t → (Nt + M′)/(N + M + M′) belongs to F. Elements of this set F are called reasonable transformations.
THEOREM 1 (assuming 1), 2), 3)). Every reasonable transformation f(x) is linear, i.e., f(x) = ax + b for some a, b, and every monotonic linear transformation ax + b with a > 0 is reasonable.
THEOREM 1′ (assuming 1), 2)). Every reasonable transformation f(x) is fractionally linear, i.e., f(x) = (ax + b)/(cx + d) for some a, b, c, d. (All the proofs are given in the Appendix.)
Historical comment. The problem of classifying all finite-dimensional transformation groups of an n-dimensional space Rn (where n = 1, 2, 3, ...) that include a sufficiently big family of linear transformations was formulated by N. Wiener (see, e.g., [W62]). Wiener also formulated a hypothesis that was confirmed in [GS64], [SS65].
It turned out that if n = 1, then only two groups are possible: the group of all linear transformations and the group of all fractionally linear transformations (a simplified proof for n = 1 is given in [K87]; for other applications of this result see [KK90], [CK91], [KQ92]).
Comment. A special case of a transformation is a so-called normalization. The idea is as follows. Suppose that we are interested in the possibility of different alternatives. We used two methods (or two groups of experts) to estimate the degree of their possibility and got two results: µ(a) is the result of the first method, and µ′(a) is the result of the second one. Since we could have used different procedures of representing uncertainty to get these two sets of estimates (i.e., different scales), it makes no sense to compare the values directly. Even if for some a we have µ(a) > µ′(a), it is still possible that they represent the same degree of certainty, but converted into different numbers by different procedures. What is scale-independent is the ordering of the values. If µ(a) > µ(b) in some scale, this means that the degree of belief in a is bigger than the degree of belief in b. So the choice of the most "probable" alternative a (i.e., the one for which µ(a) → max) does not depend on what scale we used. Therefore, in order to compare the two sets of values, we normalize them, i.e., reduce them both to a scale in which the maximum value max_{a∈A} µ(a) (where A is the set of all alternatives) has some prescribed value (usually 1). Usually, in such cases, there exists an alternative a about which we are absolutely sure that it is impossible in our situation (i.e., µ(a) = 0). It is natural to demand that this value 0 remain the same after the "normalization" transformation. So we arrive at the following definition.
Definition 3. By a normalization we mean a reasonable transformation f(x), for which f(0) = 0.
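The normalization just described can be illustrated with a small sketch (the data and the function name `normalize` are hypothetical):

```python
# Two groups of experts rate the same alternatives on different scales;
# dividing by the maximum reduces both to a common scale with max = 1,
# while preserving the (scale-independent) ordering of the values.
def normalize(mu):
    k = 1.0 / max(mu.values())       # Definition 3: f(x) = k*x, so f(0) = 0
    return {a: k * v for a, v in mu.items()}

mu1 = {"a": 0.2, "b": 0.5, "c": 0.8}       # first method's estimates
mu2 = {"a": 0.15, "b": 0.375, "c": 0.6}    # second method, another scale

# both print (up to rounding) {'a': 0.25, 'b': 0.625, 'c': 1.0}
print(normalize(mu1))
print(normalize(mu2))
```

Here the two sets of estimates turn out to be the same degrees of belief expressed in different scales: after normalization they coincide.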
THEOREM 2. Every normalization is equal to f(x) = kx for some constant k > 0.
THEOREM 2′. Every normalization has the form f(x) = kx/(c + dx).
Comments. 1. The normalization from Theorem 2 is precisely the normalization that was proposed by L. Zadeh [Z65]: if we want the resulting membership function to satisfy max µ(a) = 1, then we must transform µ(a) into kµ(a), where k = 1/(max_{a∈A} µ(a)).
2. Normalizations from Theorem 2′ are not used. A possible explanation for that will be given in Section 6.
4. WHAT FUZZY VARIABLES TO CHOOSE. REASONABLE CHOICE
4.1 Motivation
Suppose that we have a fuzzy notion like "small". Then for x = 0, and maybe for extremely small values of some physical quantity x, we are sure that it is a small value; for some sufficiently big x we are absolutely sure that it is not small. However, for intermediate values x we are uncertain whether x is small or not. The bigger the value x, the less we are certain that this value is small. In this case there are two ways to represent our uncertainty: first, we can use a general tool that translates our uncertainty into a number (the value of a membership function µ(x)), and second, we can use this very value x, because the bigger the x, the bigger our uncertainty. Of course, this is true not for all possible x, but only for those x that lie in the "gray zone", between the values that are definitely small (where µ(x) = 1) and those that are definitely not small (where µ(x) = 0). So on every such zone we have two different scales to express uncertainty: x and µ(x), and therefore the transformation between them (i.e., the function µ) must be a transformation between two reasonable scales, i.e., in our terms, a reasonable transformation.
4.2 Definitions and the main result of this section
Definition 4. By a membership function we mean a continuous function µ from the set R of all real numbers to the interval [0, 1].
We say that a membership function is reasonable if for every real number x, for which µ(x) ≠ 0 and µ(x) ≠ 1, there exists an interval I (finite or infinite) such that: 1) x belongs to I; 2) µ : I → [0, 1] is a reasonable transformation; and 3) if e is an endpoint of I, then µ(e) = 0 or µ(e) = 1.
THEOREM 3. Every reasonable membership function is piecewise linear.
THEOREM 3′. Every reasonable membership function is piecewise fractionally linear.
Comment. These theorems explain why piecewise linear (in particular, triangular and trapezoidal) and piecewise fractionally linear functions are very efficient in fuzzy control [L90], [B91], [YM85], [MYI87], and in other applications of fuzzy logic [KG85], [DP88].
5. WHAT FUZZY VARIABLES TO CHOOSE. OPTIMAL EXTRAPOLATION OF THE EXPERTS' CHOICE
5.1 Why it is not sufficient to use "reasonable" membership functions
It looks like there is a contradiction between the results of Section 4 and real-life experience. As we have already mentioned, in some real-life applications the membership functions that we considered reasonable in the previous section were not the best performers. Does this indicate that our conclusion was wrong?
There is no contradiction between the results of Section 4 and practical experience, but there is a problem: how to describe "non-reasonable" membership functions. The whole purpose of fuzzy control is to develop methods of control for situations where our knowledge of the controllee is not sufficient to describe a control, so we must rely on the experience and intuition of the experts who are good at controlling. When describing membership functions, we try to formalize their intuition. Sometimes the main difference between experts and inexperienced controllers is that expert controllers use additional non-trivial rules; sometimes they use the same rules, but they are more knowledgeable when they speak about "small" or "medium". In the first case, we could rely on reasonable membership functions that correspond to more or less standard notions of "small", "big", etc. In the second case, however, these notions are specially tailored by the expert to better represent this very case of control. Since our goal is to simulate the expert's control in the best possible way, in such situations the corresponding membership function should also be specially tailored. The resulting function may seem "non-reasonable" at first glance, but we must be able to describe it. So we have a problem: how do we describe "non-reasonable" membership functions? In short: it is only to be expected that something in the expert's control strategy will seem non-reasonable. If we could get good control from reason alone (i.e., from first principles), then we would have no need for the experts' experience and fuzzy control.
5.2 Motivations of the following formalization
We need a functional to describe the quality of different membership functions. Suppose that we have determined the values µ1, ..., µn of a membership function at several points x1, ..., xn, and we want to extrapolate from these values.
There are infinitely many possible functions for which µ(xi) = µi, 1 ≤ i ≤ n; we must choose among them the one that is best in some sense. What does "the best" mean? It means that we must be able to somehow describe to what extent different functions are good, and choose the one that is the best. Usually the quality of a function can be described by a functional, i.e., by assigning to every possible function µ(x) a numeric value J({µ(x)}) that characterizes its quality, and choosing a function for which this characteristic takes the biggest possible value: J({µ(x)}) → max. One of the possible criteria is smoothness. If we are interested in a smooth control, then we would like to choose the membership function to be as smooth as possible, because the discontinuity of a membership function can lead to non-smoothness of the resulting control. There can be other criteria, e.g., the average stability of the resulting control, etc. How can we describe the functionals J({µ(x)}) that correspond to reasonable criteria?
How to describe different functionals? Before we can answer the question "what functionals J({µ(x)}) correspond to reasonable criteria?", we need some way to represent all possible functionals. In order to describe the value of a functional J({µ(x)}) for a function µ(x), it is generally not enough to know the values µ(xi) of this function at finitely many points; we must know the values of µ at all points, i.e., at infinitely many points. So we can view a functional as a function of infinitely many variables (µ(x) for all possible x). There are standard ways to describe all reasonable functions f(x1, ..., xn) of finitely many variables x1, ..., xn. For example, it can usually be assumed that a function is analytic, and therefore we can expand it into a power series: f(x1, ..., xn) = a + Σi ai xi + Σij aij xi xj + ..., where aij = aji, ..., and describe a function by its coefficients a, ai, aij, ...
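A finite-dimensional sketch of this representation, with made-up coefficients and truncated at the quadratic term, looks as follows:

```python
import numpy as np

# Hypothetical coefficients for a function of n = 4 variables,
#   f(x) = a0 + sum_i ai*x_i + sum_ij aij*x_i*x_j,  with aij = aji,
# so the function is fully described by (a0, a1, a2).
rng = np.random.default_rng(0)
n = 4
a0 = 0.5
a1 = rng.normal(size=n)
A = rng.normal(size=(n, n))
a2 = (A + A.T) / 2.0                 # symmetrize: aij = aji

def J(mu):
    return a0 + a1 @ mu + mu @ a2 @ mu

mu = rng.uniform(size=n)             # the values mu(x_1), ..., mu(x_n)
print(J(mu))                         # a single number: the "quality" of mu
```

The next paragraph applies exactly this expansion to functionals, with the values µ(xi) playing the role of the variables.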
In case there are only finitely many variables µ(x1), ..., µ(xn), we could consider a similar representation for J: J({µ(x)}) = a + Σi ai µ(xi) + Σij aij µ(xi)µ(xj) + ... In reality, the number of variables is infinite, and the more variables we take into consideration, the better approximation of J we get. Therefore, the value of the functional
J({µ(x)}) can be considered as a limit of the previously given power series. Let's figure out what happens to all the terms in this power series in this limit. When we consider the values of µ at points that are more and more densely distributed, then in the limit the sum Σi ai µ(xi) turns into an integral (because that is how an integral is usually defined). So the linear terms tend to an expression of the type ∫ a(x)µ(x) dx for some function a(x). This a(x) is not necessarily a function in the usual sense of this word. For example, if we consider the limit process in which on step n we take the values xi = i/n, 0 ≤ i ≤ n, and ai = 0 for i > 0 and a0 = 1, then the limit value of this integral equals µ(0). This value can also be represented as an integral ∫ a(x)µ(x) dx, but this limit function a(x) is 0 for x > 0 and tends to ∞ for x = 0. Such "functions" are known in mathematics as generalized functions, or distributions (we will refer to the precise definitions later). So in general the linear term equals ∫ a(x)µ(x) dx, where a(x) is a generalized function. Likewise, quadratic terms in the limit should be equal to ∫ a(x, y)µ(x)µ(y) dx dy for some generalized function a(x, y) of two variables (such that a(x, y) = a(y, x)), etc.
A functional must be non-degenerate. This subsection will be a little bit mathematical, so readers who are unfamiliar with optimization problems can skip it. We are going to choose the best membership function as a function µ(x) that solves the following conditional optimization problem: J({µ(x)}) → max under the conditions µ(xi) = µi for finitely many values x1, ..., xn. The Lagrange multiplier method allows us to reduce this conditional optimization problem to a problem of unconditional optimization for a functional J′({µ(x)}) = J({µ(x)}) + Σi λi (µ(xi) − µi), where λi are Lagrange multipliers. Let us again use an analogy with a function. A function f attains its maximum at a point where all its derivatives equal 0.
Likewise, a functional attains its maximum at a point where all its derivatives with respect to all its variables µ(x) equal 0 (such derivatives are called variational derivatives). Since J is generally non-linear, the resulting equation can contain a linear part, a quadratic part, etc. If it contains only linear terms, then in order to compute the optimal values µ(x), we will have to solve a system of linear equations. If it contains linear terms and terms of higher order, we can proceed as follows: first neglect all non-linear terms, solve the corresponding system of linear equations, and use this solution as a first approximation to the desired one. There is only one case when such a procedure does not work: when there are no linear terms in the equation. We will call this case degenerate. Correspondingly, we will call the equations non-degenerate if the linear part is different from 0. But linear terms in the derivative come from quadratic terms in the functional that we differentiate (in our case J′({µ(x)})). So the demand that our equations are non-degenerate implies that this functional should contain a non-zero quadratic part. By definition, this functional is the sum of the original functional J({µ(x)}) and linear terms. So the fact that J′({µ(x)}) has non-zero quadratic terms means that J({µ(x)}) must also have non-zero quadratic terms. This conclusion will be used later on as a mathematical description of the non-degeneracy demand.
Locality demand. The next reasonable idea is that of locality. Suppose that we know that a value of x can be located only in one of two disjoint intervals I1 and I2 (we will return to these considerations in Section 10 and show that this is a quite reasonable situation). In terms of a membership function, this means that µ(x) = 0 for x outside both intervals Ii.
In this case, the problem of choosing a membership function µ(x) is clearly divided into two independent problems: describing the values of µ(x) for x ∈ I1, and describing these values for x ∈ I2. We call these problems independent, because in this case, it is reasonable to expect that the relative quality of different approximations on I1 should not depend on how we approximated a membership function on I2.
In other words, if µ1(x) and µ̃1(x) are located on I1 (i.e., different from 0 only at points from I1), and µ2(x) and µ̃2(x) are located on I2, then from the fact that µ1 + µ2 is better than µ̃1 + µ2 it must follow that µ1 + µ̃2 is better than µ̃1 + µ̃2.
Invariance. The relative quality of different membership functions µ(x) should not depend on what units we use to measure x, what starting point we use to measure x, and what scale we use to describe uncertainty values. In mathematical terms, changing units means that instead of the values of x we now have values λx, where λ is the ratio of the old and the new units. For example, if we use inches instead of feet, then in order to obtain new numeric values we have to multiply the old values by 12 (= 1 ft/1 in). Likewise, changing a starting point means that we use x + x0 instead of x, where x0 is the difference between the old and the new starting points. Finally, rescaling means applying some linear (or fractionally linear) transformation to all the values of µ(x).
5.3 Definitions and the main result of this section
Definition 5. Assume that a class of membership functions M is given. We say that an analytical functional is defined on M if there exists a number a0, and for every n there exists a distribution (generalized function) [GSh64] an(x, ..., y) of n variables such that for every function µ(x) from M the following series converges: a0 + ∫ a1(x)µ(x) dx + ∫ a2(x, y)µ(x)µ(y) dx dy + .... This sum is denoted by J(µ) or J({µ(x)}) and called the value of the functional J on µ.
Definition 6. Suppose that the values x1 < x2 < ... < xn are given, plus the values µi, 1 ≤ i ≤ n. We say that µ(x) is an optimal membership function if, of all functions µ from M that satisfy the conditions µ(xi) = µi for 1 ≤ i ≤ n, it has the biggest possible value of J (in other words, if it is a solution to the conditional optimization problem J(µ) → max under the conditions µ(xi) = µi).
Definition 7.
We say that the relation J(µ) < J(µ′) between membership functions corresponds to J; we say that the functionals J1 and J2 are equivalent if one and the same relation corresponds to both of them (i.e., if the inequality J1(µ) < J1(µ′) is true or false simultaneously with J2(µ) < J2(µ′)).
Definition 8. We say that an analytical functional J is non-degenerate if, first, its quadratic part is non-zero, i.e., a2 ≠ 0, and, second, it is not equivalent to a linear functional (i.e., a functional of the type J(µ) = ∫ a(x)µ(x) dx).
Definition 9. We say that a functional J is local if, whatever two disjoint intervals I1 and I2 we take, in case µ1(x) and µ̃1(x) are located on I1 (i.e., different from 0 only at points from I1), and µ2(x) and µ̃2(x) are located on I2, then from the fact that J(µ1 + µ2) > J(µ̃1 + µ2) it must follow that J(µ1 + µ̃2) > J(µ̃1 + µ̃2).
Definition 10. Suppose that a transformation T is defined on the set M of membership functions. We say that the ordering relation is invariant with respect to T if J(µ1) < J(µ2) is equivalent to J(T(µ1)) < J(T(µ2)).
Definition 11. The transformation µ(x) → µ(λx), where λ > 0, is called changing units of x. The transformation µ(x) → µ(x + x0), where x0 is a real number, is called changing the starting point for x. The transformation µ(x) → f(µ(x)), where f(µ) is a reasonable transformation, is called a rescaling of a membership function. We suppose that the set M is closed under these transformations, i.e., with any function µ(x) it contains all its transformations.
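As a numerical sanity check of these definitions (our own illustration, not part of the paper's proofs), take the smoothness functional J(µ) = −∫ (µ′(x))² dx from Section 5.2, negated so that bigger means better (J → max): the piecewise linear interpolant of given data beats a smooth competitor, and the ordering survives a change of units x → λx, as Definitions 10 and 11 require.

```python
import numpy as np

def J(grid, vals):
    d = np.diff(vals) / np.diff(grid)        # mu'(x) by finite differences
    return -np.sum(d**2 * np.diff(grid))     # -integral of (mu'(x))^2 dx

grid = np.linspace(0.0, 2.0, 2001)
# two interpolants of the data (0,0), (1,1), (2,0) -- a triangular shape:
linear = np.interp(grid, [0.0, 1.0, 2.0], [0.0, 1.0, 0.0])
smooth = np.sin(np.pi * grid / 2.0) ** 2     # also passes through the data

assert J(grid, linear) > J(grid, smooth)     # piecewise linear is better

lam = 12.0                                   # change of units, e.g. ft -> in
assert J(lam * grid, linear) > J(lam * grid, smooth)   # ordering invariant
print("piecewise linear wins, before and after rescaling x")
```

Both values of J scale by the same factor 1/λ under x → λx, so the ordering, and hence the optimal interpolant, is unchanged.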
THEOREM 4 (4′). Assume that a set of reasonable transformations is defined, J is an analytical non-degenerate local functional, and the corresponding ordering relation is invariant with respect to changing units of x, changing a starting point for x, and rescaling. Then: 1) this functional J is equivalent to J̃ = ∫ (µ^(n)(x))² dx, where µ^(n)(x) denotes the n-th derivative, and 2) the optimal membership function that corresponds to the data xi, µi is equal to a polynomial of order ≤ 2n − 1 on each of the intervals [xi, xi+1].
Comments. 1. This Theorem is marked 4 (4′) because it is true whether or not we add condition 3) to the definition of the reasonable transformations.
2. This Theorem explains why piecewise polynomial functions are efficiently used in fuzzy control. In particular, for n = 1 we conclude that µ(x) must coincide with 1st-order polynomials on each of its intervals, i.e., it must be piecewise linear (thus we get a new derivation of piecewise linear membership functions). For n = 2 we get membership functions that are composed of polynomials of order ≤ 3. This explains why piecewise quadratic polynomials are often efficiently used. Our Theorem is in good accordance with a result of Dishkant [D81] (see also [K86, Section 2.1]), who proved that piecewise quadratic functions correspond to the case when the variable in which we are interested is the sum of many small components, and we have fuzzy statements about the size of all these small components. This situation is similar to the limit theorems of probability theory, where the Gaussian distribution is justified in this way. In control situations, it is appropriate for describing the possible deviations from the ideal trajectory, or other quantities that are the result of many independent factors.
6. WHAT FUZZY VARIABLES TO CHOOSE.
HOW TO DESCRIBE MODIFIERS (LIKE "VERY")
6.1 Motivations
The problem of assigning modifiers is difficult, because a transformation from "possible" to "very possible" is not a "reasonable" transformation. Let us first explain why the problem of assigning modifiers is difficult. The main difficulty is as follows. Up to this moment we dealt with transformations that transformed the value that corresponds to some word (like "probable") on one scale into the number that represents this same word on some other scale. Such transformations were called reasonable, and we already know how to describe them. However, when we go from "probable" to "very probable", we move to another degree of certainty, so this is a different type of transformation.
An idea. We have already mentioned that one of the reasonable transformations is normalization. Suppose that we initially knew the truth values ti of several statements S1, S2, ..., Sn. We could then apply a normalization procedure and thus make the biggest truth value equal 1. Now suppose that we suddenly became aware that the source of our knowledge is not as reliable as we thought, and therefore all the information is only true to some extent. This can be expressed by saying that instead of statements Si, we have statements Si′ meaning "Si is true to some extent". We can view "to some extent" as a modifier. The new truth values are equal to ti′ = f(ti), where f(t) is a function that corresponds to the modifier. We can again normalize these values. So one possibility is that we first normalize, then make the changes that correspond to the modifier, and then maybe normalize the changed values. Another possibility is that we discovered the flaw in
our knowledge before we started normalizing the initial values ti. In this case, it makes no sense to normalize them, as we already know that they are wrong. So we compute the new values ti′ and normalize them. It is natural to demand that the results of the two procedures that correspond to these two possibilities be the same.
6.2 Definitions and the main result of this section
Definition 12. Suppose that finitely many values t1, ..., tn from [0, 1] are given. We say that a normalization f is a normalization to 1 if maxi(f(ti)) = 1. The values f(ti) are said to be normalized to 1.
Definition 13. By a modifier we mean a monotone function m : [0, 1] → [0, 1] such that for every finite set of truth values t1, ..., tn, if we first apply a normalization to 1 and then apply m, the result will be the same as if we first apply m and then normalize to 1. In other words: if t1, ..., tn are different values, f is their normalization to 1, and fm is a normalization to 1 for the values m(t1), ..., m(tn), then m(f(ti)) = fm(m(ti)).
THEOREM 5. Every modifier has the form m(x) = x^d for some d > 0.
THEOREM 5′. If we consider all normalizations from Theorem 2′, then no modifiers are possible.
Comments. 1. The functions m(x) = x^d are precisely those that were proposed by Zadeh [Z73] to express modifiers like "very", "hardly". The experimental results of [KB76] turned out to be consistent with the formula m(x) = x^2 for "very".
2. The fact that if we admit fractionally linear normalizations, then no modifiers are possible, may explain why such normalizations are never used.
3. It deserves mentioning that a more detailed analysis of properties like "very large" revealed [HC76], [M78] that in these cases, the better approximation for the truth value t_very(A(x)) of a statement "very A(x)" is not t^2(A(x)), but t^2(A(x − c)) or even t^2(A(kx − c)) for some k, c > 0.
But among all transformations of the type t(A(x)) → m(t(A(x))), the function m(x) = x^2 is certainly the best fit.
4. Alternative formulas for modifiers were proposed in [B88]. These formulas contain arbitrary functions f′ and f′′; in the only specific case that is considered in [B88] these functions are trapezoidal (piecewise linear) functions, and the resulting modifiers are piecewise linear. So this approach is also consistent with our general methodology.
7. THE CHOICE OF & AND ∨ OPERATIONS. THE GENERAL CASE
The main results of this section were announced in [KK90], and the proofs appeared first in our Technical Reports [KK90a] and [KL90].
7.1 Definitions of &- and ∨-operations
Motivation. Suppose that we know the truth values t(A) and t(B) of the statements A and B, and we want to estimate the truth values t(A&B) and t(A ∨ B) of the statements A&B and A ∨ B. Since the only information available consists of the two numbers a = t(A) and b = t(B), the resulting estimates must be obtained from them by some appropriate computations. Let's denote the resulting estimate for t(A&B) by f&(a, b), and the resulting estimate for t(A ∨ B) by f∨(a, b). We will call these functions an &-operator and an ∨-operator, correspondingly. So the question is: what &- and ∨-operators should we choose?
Definition 14. By an &-operation we mean a function f& from [0, 1] × [0, 1] to [0, 1] that is continuous, non-decreasing in both variables, and satisfies the following equations: f&(a, 0) = 0, f&(a, 1) = a, f&(a, b) = f&(b, a) and f&(f&(a, b), c) = f&(a, f&(b, c)).
Motivation of this definition. 1) If A is false, then A&B is also false, so f&(0, a) = 0 for all a. 2) If A is true, then A&B is true if and only if B is true, so in this case t(A&B) = t(B); hence f&(a, 1) = a for all a. 3) When we say that A and B are both true or that B and A are both true, we mean the same thing. Therefore, t(A&B) must always be equal to t(B&A), or, in other words, f&(a, b) = f&(b, a) for all a, b. 4) When we say that A1, A2, A3, ..., An are true, we can change the order of the Ai and the result will still be the same. For example, if n = 3 and we enumerate them in the order A, B, C, then t(A&B&C) = t((A&B)&C), and if we order them another way we get t(A&(B&C)). From the equality of these expressions we conclude that f&(f&(a, b), c) = f&(a, f&(b, c)). 5) If our degree of belief in A increases, then our degree of belief in A&B becomes greater or stays the same (but cannot become smaller). So the function f& must be non-decreasing in both variables. 6) If our degrees of belief in A and B change a little bit, then our degree of belief in A&B cannot change essentially. The smaller the change in t(A), t(B), the smaller must be the change in t(A&B). In other words, the function f& must be continuous.
Denotation. In view of these properties the result of applying f& to several values does not depend on their order, so we can use the simplified notation f&(a, b, ..., c) for f&(a, f&(b, ..., c)...).
Similar arguments justify the following definition: Definition 15.
By an ∨-operation we mean a function f∨ from [0, 1] × [0, 1] to [0, 1] that is continuous, non-decreasing in both variables, and satisfies the following equations: f∨(a, 0) = a, f∨(a, 1) = 1, f∨(a, b) = f∨(b, a) and f∨(f∨(a, b), c) = f∨(a, f∨(b, c)). The last expression will be denoted by f∨(a, b, c).
7.2 Informal motivation of the relationship between &- and ∨-operations and reasonable transformations
When we communicate, we often do not explicitly pronounce all the assumptions that we make; these assumptions are implicit. E.g., when we ask a specialist in energetics about the prospects of nuclear and hydro power stations, we implicitly assume that present-day physics is correct, although we are quite aware of the fact that new physical theories can appear that will lead to new sources of energy. In view of that, there are two ways to ask questions of the experts: either we do not mention this implicit knowledge at all, or we explicitly tell the expert that we are interested in his opinion and we are not assuming that this implicit knowledge is true. In the first case he will most likely tell us his opinion on the assumption that the implicit knowledge is true, while in the second case he will try to take into consideration the possibility that some of these implicit statements can turn out to be false. For example, when we ask a doctor about the chances that a patient will quickly recover from a depression, and we know that he is treated by psychoanalysis, then the doctor may say something like 90%. This estimate can be different from 100% not because this doctor knows of some cases when this treatment did not help, but because he may have some doubts about the whole psychoanalysis methodology, and the missing 10% represent these doubts. If we then ask this same question differently,
stressing that we want to know his estimate of the patient's chances for recovery irrespective of whether the treatment that he receives now is appropriate, he may give a bigger number (for example, 99%), because he knows of no cases when a treatment did not help, either by really treating the disease or by a placebo effect. Thus we have two different procedures to assign truth values to the same uncertain knowledge of the same expert. The first procedure, when we do not mention the implicit knowledge B in our question, actually represents the expert's degree of belief in S&B. In the second procedure, when we specifically tell the expert that we want his estimate of S irrespective of whether B is true or not, we get the degree of belief in S itself. So if in both cases we use numbers to represent the degree of uncertainty, then for the first procedure we get t1(S) = t(S&B), and for the second procedure we get t2(S). The transformation from t2(S) to t1(S) is an evident example of a transformation between reasonable scales for representing uncertainty, so it should belong to the class of reasonable transformations. Since we estimate t(S&B) by f&(t(S), t(B)), this transformation from t2(S) to t1(S) takes the form a → f&(a, b), where by b we denoted t(B). The value t(B) can be arbitrary, so we would like to conclude that this transformation is reasonable for every b. However, we cannot formalize this demand precisely as stated: a reasonable transformation must be strictly monotone, but the transformation x → f&(x, b) is not necessarily strictly increasing. For example, a statement S can be so highly unreliable that although t(S) is positive but small, S&B is absolutely false (t(S&B) = 0). This is quite possible, but in this case 0 < t(S) and f&(0, b) = f&(t(S), b) = 0.
Another example: S can be so highly reliable that our degree of belief in S&B equals the degree of belief in B, i.e., t(S) < 1, but f&(t(S), b) = b = f&(1, b). So we can demand that the transformation x → f&(x, b) be reasonable only on the intervals where the value of this function is different from 0 and b. So far we considered implicit knowledge that can in principle turn out to be false (i.e., crudely speaking, the pessimistic part of the implicit knowledge). But the implicit knowledge can also include an optimistic part. For example, when we form a knowledge base about energetics, we have in mind that maybe scientists will find some ecologically pure and economically cheap way to use solar energy. Then every question that we ask the experts can have two interpretations: either we ask them whether this or that prediction is true assuming the existing technological level, or we implicitly admit the possibility of such an optimistic breakthrough. Here we also have two scales t1(S) and t2(S) = t1(S ∨ B) (= f∨(t(S), t(B))), where B is this implicit optimistic possibility. Similar arguments lead us to the conclusion that the function g(x) = f∨(x, b) must be a reasonable rescaling for all x, for which b < g(x) < 1. Summarizing, we come to the following definitions.
7.3 Definitions and the main result of this section
Definition 16. We say that an &-operation is reasonable if for every real number b from the interval (0, 1) the function g(x) = f&(x, b) is a reasonable rescaling for all x, for which 0 < f&(x, b) < b. We say that an ∨-operation is reasonable if the function g(x) = f∨(x, b) is a reasonable rescaling for all x, for which b < f∨(x, b) < 1.
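Definition 16 can be probed numerically. The sketch below (our own illustration) checks that for the product operation the map g(x) = f&(x, b) is the linear rescaling x → bx, while for the bold operation max(0, a + b − 1) it is linear with slope 1 wherever 0 < g(x) < b:

```python
# For f&(a, b) = a*b:              g(x) = b*x, a monotonic linear rescaling.
# For f&(a, b) = max(0, a+b-1):    g(x) = x + b - 1 wherever 0 < g(x) < b,
# i.e. linear with slope 1 on that region (reasonable by Theorem 1).
b = 0.7
g_prod = lambda x: x * b
g_bold = lambda x: max(0.0, x + b - 1.0)

xs = [0.31, 0.5, 0.72, 0.9]          # points where 0 < g_bold(x) < b
h = 1e-6                             # finite-difference step
slopes = [(g_bold(x + h) - g_bold(x)) / h for x in xs]
assert all(abs(s - 1.0) < 1e-6 for s in slopes)        # constant slope 1
assert all(abs(g_prod(x) - b * x) < 1e-12 for x in xs)
print("both maps are linear on the required region")
```

Outside that region g_bold is constant (0 below, b above), which is exactly why the definition restricts where reasonableness is demanded.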
THEOREM 6. If an &-operation is reasonable, then it coincides with one of the following operations: 1) f&(a, b) = min(a, b); 2) f&(a, b) = ab; 3) f&(a, b) = max(0, (k + 1)(a + b) − kab − (k + 1)), where k ≥ −1; 4) f&(a, b) = ab/a&, if both a ≤ a& and b ≤ a& for some number a&, else f&(a, b) = min(a, b); 5) f&(a, b) = max(0, (k + 1)(a + b) − kab/a& − (k + 1)a&) if a ≤ a& and b ≤ a&, else f&(a, b) = min(a, b). THEOREM 7. If an ∨-operation is reasonable, then it coincides with one of the following operations: 1) f∨(a, b) = max(a, b); 2) f∨(a, b) = a + b − ab; 3) f∨(a, b) = min(a + b + kab, 1), where k ≥ −1; 4) f∨(a, b) = max(a, b) if one of the values a, b is smaller than or equal to some fixed constant a∨, and f∨(a, b) = a∨ + (a − a∨) + (b − a∨) + (a − a∨)(b − a∨)/(1 − a∨) otherwise; 5) f∨(a, b) = max(a, b) if a ≤ a∨ or b ≤ a∨, else f∨(a, b) = a∨ + min((a − a∨) + (b − a∨) + k(a − a∨)(b − a∨), 1 − a∨), where k ≥ −1. THEOREM 6′. If an &-operation is reasonable, then it coincides with one of the following operations: 1) f&(a, b) = min(a, b); 2) f&(a, b) = ab/(k + (1 − k)(a + b − ab)), where k ≥ 0; 3) f&(a, b) = max(0, (lab + (l − 1)(a + b − 1))/(k + (k − 1)(a + b − ab))) for some constants k, l; 4), 5) f&(a, b) equals min(a, b) if at least one of the values a, b is ≥ a&, else it equals a& g(ã, b̃), where g denotes one of the functions 2), 3), ã = a/a&, and b̃ = b/a&. THEOREM 7′. If an ∨-operation is reasonable, then it coincides with one of the following operations: 1) f∨(a, b) = max(a, b); 2) f∨(a, b) = (a + b + (k − 1)ab)/(1 + kab), where k ≥ 0; 3) f∨(a, b) = min((a + b + kab)/(1 + lab), 1) for some k, l; 4), 5) f∨(a, b) = max(a, b) if one of the values a, b is smaller than or equal to some fixed constant a∨, and f∨(a, b) = a∨ + (1 − a∨)g(ã, b̃) for all other pairs a, b, where ã = (a − a∨)/(1 − a∨), b̃ = (b − a∨)/(1 − a∨), and g is one of the functions 2)–3).
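For concreteness, the first three &-families of Theorem 6 can be sketched in Python (a minimal illustration of the formulas above; the function names are ours):

```python
def and_min(a, b):
    """Family 1): Zadeh's min operation."""
    return min(a, b)

def and_product(a, b):
    """Family 2): algebraic (probabilistic) product."""
    return a * b

def and_bold_family(a, b, k=0.0):
    """Family 3): max(0, (k+1)(a+b) - k*a*b - (k+1)), k >= -1.
    For k = 0 this is the bold intersection max(0, a+b-1);
    for k = -1 it reduces to the product a*b."""
    return max(0.0, (k + 1) * (a + b) - k * a * b - (k + 1))
```

All three satisfy the extension principle mentioned below: f&(1, x) = x and f&(0, x) = 0 for every x in [0, 1].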
The resulting list of reasonable & and ∨ operations includes (directly or indirectly) practically all operations that have actually been used. This list includes the original operations of fuzzy logic, algebraic (probabilistic) operations, bold operations of [G76], Hamacher operations, and the operations of [DP80a] (formulas 4) of Theorems 6 and 7). It also includes the operations of MYCIN [S76], [BS84] for positive certainty factors (for negative factors MYCIN uses non-monotonic operations, so our results are not applicable; however, these operations are in good accordance with our results, because they are piecewise fractionally linear in each of the variables). These Theorems are also in good accordance with the experimental results of [HC76], [O77] and [Z78], who showed that among associative operations min, max and the probabilistic operations are the best fit for human reasoning. The operations from [Y80], [D82] and [SS61] do not follow directly from our theorems, but they can be explained indirectly in the following manner. If we initially have truth values a and b, apply a modifier to them (getting a^p and b^p), then apply the "bold" union, and after that apply the inverse modifier, we get the operations from [SS61]. If we first apply negation x → 1 − x (see below), then apply the dual operation and again apply negation, we get the formulas of [Y80]. If we apply the same idea to Hamacher's operation ab/(a + b − ab), we get Dombi's operations. If we
follow this pattern with other operations from our Theorems, we get other operations that therefore seem promising (e.g., one can try (a^p + b^p − a^p b^p)^{1/p} or the result of applying this procedure to an arbitrary Hamacher operation). The only operations from the list given above that do not follow from our Theorems, either directly or indirectly, are the operations of [F79]. However, these operations were invented on purely theoretical grounds; as far as we know, they were never applied to real intelligent systems, so the fact that they do not follow from the natural demands can mean that they will not be successful in real applications. Possible interpretation of the new reasonable operations. In addition to the well-known operations we obtained some new ones that generalize the operation of [DP80a]. Their interpretation is straightforward. Suppose we have many highly unreliable pieces of evidence E1, E2, ..., En in favor of some hypothesis H. This means that E1 ∨ E2 ∨ E3 ∨ ... ∨ En implies H, and the truth values t(Ei) are small. For simplicity let's consider the case when the truth values t(Ei) are all equal: t(Ei) = a for some a that is much smaller than 1 (denoted a ≪ 1). In this case, the truth value of H must be greater than or equal to the truth value of E1 ∨ E2 ∨ E3 ∨ ... ∨ En. If we use f∨(a, b) = a + b − ab, then for sufficiently big n the truth value of H can be arbitrarily close to 1, which contradicts our intuition (because our confidence in an experimental confirmation is greater than in the case when we have arbitrarily many confirming guesses). If we use f∨ = max, then we have another contradiction, with the intuition according to which two independent confirmations of some hypothesis are always better than one (i.e., formally, f∨(a, b) > a and f∨(a, b) > b for sufficiently big a, b).
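Both contradictions, and the different behavior of the thresholded operation 5) of Theorem 7, can be checked numerically; the sketch below is our own illustration, with an assumed threshold a∨ = 0.5:

```python
def or_prob(a, b):
    """Algebraic sum a + b - ab."""
    return a + b - a * b

def or_threshold(a, b, a_or=0.5, k=0.0):
    """Operation 5) of Theorem 7: max below the threshold a_or,
    a shifted bold sum capped at 1 - a_or above it."""
    if a <= a_or or b <= a_or:
        return max(a, b)
    d_a, d_b = a - a_or, b - a_or
    return a_or + min(d_a + d_b + k * d_a * d_b, 1.0 - a_or)

weak = 0.1
t_prob = t_max = t_thr = weak
for _ in range(100):
    t_prob = or_prob(t_prob, weak)     # drifts toward certainty 1
    t_max = max(t_max, weak)           # never grows at all
    t_thr = or_threshold(t_thr, weak)  # stays at the evidence level
```

After 100 combinations t_prob exceeds 0.99, while t_thr is still 0.1: many weak confirmations never add up to near-certainty, yet for a, b above the threshold the operation does exceed both arguments (e.g., or_threshold(0.6, 0.7) > 0.7).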
The new operations help in avoiding both contradictions: for sufficiently big a, b we have f∨(a, b) greater than both a and b, and if a is sufficiently small (< a∨), then regardless of how many confirmations with this truth value we have, the resulting truth value can never be close to absolute certainty (i.e., to the value 1). The same interpretation can be given to the new &-operations. For example, although physicists understand that their formulations of physical laws can, in principle, be wrong, they are sufficiently confident in them, and even when they make long sequences of arguments, they still believe in all the results with the same confidence. This means that if a and b are greater than some crucial value a&, then our degree of belief in a&b is still greater than a&. Historical comment 1. The general idea of using groups to describe &- and ∨-operations appeared first in [H85] and [HV87]; see also [K83], [KR86], [K87a], [KKM88], [FK89], [K89], [K89a], [K89b], [K90], [KK90]. Historical comment 2. Several explanations of why these operations are used have been proposed. None of these explanations narrows the class of all possible operations down to precisely the operations that are actually in use (and we have done precisely that). However, for several operations, these explanations give a pretty good understanding of why these operations are used. Practically all of them are based on the natural assumptions that the functions f∨ and f& are continuous, non-decreasing in each argument, commutative (f(a, b) = f(b, a)), associative (f(a, f(b, c)) = f(f(a, b), c)), and satisfy the extension principle (their values for the arguments 0 and 1 coincide with the values of the standard logical operations & and ∨, so that f∨(0, x) = f&(1, x) = x). The operations min and max are uniquely determined either by the demand that the operations f∨ and f& are distributive with respect to each other [BG73], or by the demand that f∨(a, a) = a and f&(a, a) = a [FF74].
If we assume that the functions f∨ and f& are strictly increasing, then we come to the conclusion that f∨(a, b) = f^{-1}(f(a) + f(b)) for some strictly increasing function f from [0,1) onto the set R+ of all non-negative real numbers, and f& is described by a similar expression, but with a strictly decreasing function f [H75], [H78], [W83].
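This additive-generator representation is easy to illustrate; with the (assumed, not prescribed here) generator f(x) = −ln(1 − x), the construction f∨(a, b) = f^{-1}(f(a) + f(b)) reproduces the algebraic sum:

```python
import math

def f(x):
    """A strictly increasing generator from [0,1) onto [0, +infinity)."""
    return -math.log(1.0 - x)

def f_inv(y):
    """Inverse of the generator."""
    return 1.0 - math.exp(-y)

def or_from_generator(a, b):
    """f_or(a, b) = f^{-1}(f(a) + f(b)); with this f it equals a + b - ab."""
    return f_inv(f(a) + f(b))
```

Indeed, f(a) + f(b) = −ln((1 − a)(1 − b)), so the inverse gives 1 − (1 − a)(1 − b) = a + b − ab.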
The additional demand that f∨ and f& are rational functions leads to the formulas of [H75], [H78]. This result is very close to what we are doing; the main difference is that, in our view, this demand is purely mathematical and has to be justified. Our justification (which uses the connection with reasonable transformations) actually leads to the conclusion that these operations are only piecewise rational, so we could not use Hamacher's results directly and had to prove new theorems. If we assume that for each a and all sufficiently large n, f∨(a, a, ..., a) (n times) = 1 and f&(a, a, ..., a) (n times) = 0, then f∨(a, b) = f^{-1}(min(f(a) + f(b), 1)) for a strictly increasing function f from [0,1] onto [0,1] (and a similar representation is possible for f&) [DP88]. The function f&(a, b) = max(0, a + b − 1) from this class is uniquely determined by the additional demand that it must have a linear compensation effect between membership values: f&(a, b) = f&(a − k, b + k) for all k [Y79]. The assumption that for every two values p, q there must be a transformation preserving both operations and transforming p into q is analyzed in [K83], [K87]. It also leads to a rather general class of transformations. The operations of [F79] are uniquely determined by the demand that f∨(a, b) + f&(a, b) = a + b. This equation, however, is not intuitively evident. 7.4 How to handle negation Definition 17. By a negation we mean a reasonable transformation g(x) such that g(0) = 1 and g(1) = 0. Comment. Negation is a transformation between reasonable scales, because we can express our uncertainty in a statement A either by our degree of belief t(A) in A, or by our degree of belief t(¬A) in its negation ¬A. So the operation g(x) that transforms t(A) into our estimate for t(¬A) is a reasonable transformation. If A is false (t(A) = 0), then ¬A must be true (t(¬A) = 1); therefore g(0) = 1, and likewise g(1) = 0. THEOREM 8. A negation is g(x) = 1 − x. THEOREM 8′.
A negation is described by the formula g(x) = (1 − x)/(1 + kx) for some k > −1. Comment. The operation 1 − x was originally proposed by Zadeh [Z65] and experimentally confirmed in [HC76]. The operations from Theorem 8′ were originally proposed by Sugeno [S74], [S77]; these operations are also sometimes a good fit for human reasoning [S77]. So these Theorems explain the present choice of negation operations. A more general expression was proposed in [Y80]: g(x) = (1 − x^p)^{1/p}. Although we cannot get it directly from our Theorems, we can get it indirectly (in the same way as we explained some & and ∨ operations): if we start with a value a, apply a modifier, Zadeh's negation and the inverse modifier, we get precisely this formula. Historical comment: what was previously proved about the possible negation operations. Natural conditions imposed on the corresponding function g(x) [BG73] lead to the class of all possible operations consisting of strictly decreasing functions such that g(g(x)) = x, g(0) = 1 and g(1) = 0. In [T79] it is shown that these operations can be described as g(x) = f^{-1}(1 − f(x)), where f(x) is an arbitrary strictly increasing function from [0,1] onto [0,1]. An axiomatization proposed in [O83] on the basis of optimality demands also leads to the same large class of functions. If we additionally assume that g(x) has a "compensation" property in the sense that g(a + b) = g(a) − b [BG73], or that g(1 − a) = 1 − g(a), then we arrive at g(a) = 1 − a. However, no axiomatization has been proposed that would allow Zadeh's and Sugeno's negations without including the whole broad class of all possible operations.
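The two negations singled out by Theorems 8 and 8′ can be sketched directly (the value k = 0.5 below is an arbitrary illustrative choice):

```python
def neg_zadeh(x):
    """Theorem 8: g(x) = 1 - x."""
    return 1.0 - x

def neg_sugeno(x, k=0.5):
    """Theorem 8': g(x) = (1 - x)/(1 + k*x), k > -1."""
    return (1.0 - x) / (1.0 + k * x)
```

Both satisfy g(0) = 1 and g(1) = 0, and both are involutions, g(g(x)) = x, the property singled out in the historical comment above.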
7.5 Fuzzy implication and other operations Implication and other logical connectives are sometimes used by experts. In addition to "and", "or" and "not", experts can use implication ("A implies B") and other logical connectives while expressing their knowledge. For such operations, we must be able to estimate the truth value of the resulting statement (like "A implies B") from the truth values of the component statements. Fuzzy implication. The same arguments that we gave for "and" and "or" can be applied here to prove that if we fix one of the variables, then the resulting function must be a reasonable transformation (and hence it must be fractionally linear). Indeed, the list of most frequently used implication operators [KB88] consists only of functions that are piecewise fractionally linear: min(1, b/a) (proposed by Gaines), min(1, b/a, (1 − a)/(1 − b)) (a modification of Gaines' proposal), min(1, 1 − a + b) (Łukasiewicz), 1 − a + ab (Kleene–Dienes–Łukasiewicz), max(1 − a, b) (Kleene–Dienes), max(1 − a, min(a, b)) (Zadeh), min(max(1 − a, b), max(1 − b, a, min(1 − a, b))) (Willmott). This fact is easily explainable, because a fuzzy implication a → b is usually defined either directly in terms of &, ∨ and ¬ (e.g., as b ∨ ¬a), or as a solution of the equation a&(a → b) = b. In the first case, the corresponding function is a composition of the functions that correspond to &, ∨ and ¬; these functions are piecewise fractionally linear, and therefore their composition is also piecewise fractionally linear. In the second case, the function corresponding to → is the inverse of the function corresponding to &, and the inverse of a piecewise fractionally linear function is itself piecewise fractionally linear (in mathematical terms, the conclusion that implication operations thus defined are reasonable follows directly from the fact that fractionally linear transformations form a group). Other logical connectives.
It has been experimentally shown [Z78] that sometimes non-associative operations better represent the human usage of the words "and" and "or". These operations are called aggregation operations; the first such operation, f(a, b) = w1 a + w2 b, was proposed in [BZ70]; for a current survey see [DP88, Section 3.1.2.3]. Experimental analysis (Czagola, 1988) revealed that the operations that are the best fit for human reasoning are piecewise bilinear (e.g., pab + (1 − p)(a + b − ab) or p min(a, b) + (1 − p) max(a, b)). Some operations that are not piecewise linear or piecewise fractionally linear can be explained in a way similar to the above explanation of the formulas from [Y80]: namely, these operations can be described as first applying a modifier to a, b, then some linear operation, and then the inverse modifier. For example, in this way one can represent the generalized means [KF88]: ((a^p + b^p + ... + c^p)/n)^{1/p}. 8. WHAT FUZZY VARIABLE TO CHOOSE. HOW TO DESCRIBE "ALMOST EQUAL TO a" The main results of this section were announced in [KR86], and the proofs appeared first in a technical report [K89]. 8.1. Motivations In this section we consider only smooth membership functions. There are several reasons for doing that: 1) we are going to solve an optimization problem, and the smooth case is well known to be easier; 2) in fuzzy control applications we are often interested in the smoothest control, and since the shape of the membership functions directly affects the control strategy, it seems reasonable to consider only smooth membership functions;
3) any continuous function can be approximated, with arbitrary precision, by smooth functions; so if we restrict ourselves to smooth functions only, we do not lose much generality. In this section we also consider only membership functions whose values are always positive. This is done for essentially the same reasons: 1) since we want to get smooth controls, it is reasonable to consider smooth membership functions. The maximal smoothness demand that one can impose is that a function be analytic; but analytic functions are uniquely determined by their local behavior. This means that if an analytic function is equal to 0 on some interval, it must be identically equal to 0. Therefore, for such functions it is reasonable to demand that they are always positive. 2) any continuous function µ(x) can be approximated, with arbitrary precision, by positive functions: it is sufficient to add a small positive constant to the values of µ(x); so if we restrict ourselves to positive functions only, we do not lose much generality. The main difference between this situation and the case when only finitely many words are possible. We want to describe membership functions that correspond to statements of the type "x is approximately equal to a", or "x is approximately equal to a, with precision around σ", where a and σ are explicitly given, and x is the unknown value that is estimated by this statement. As we have already noticed, since different experts can use different values of a (and maybe σ) to express their idea of what control to apply and under what conditions, we can have many different statements. And here comes the big difference between this case and the case when we had only finitely many possibilities (like "negligible", "small positive", "medium positive", etc.): in that case, if 50 different experts give their opinions, each of them names one of these finitely many words.
If, for example, 40 out of 50 experts said that some value x is "small positive", then we can say that it is small positive with certainty 40/50 = 80%. If we ask the opinion of one more expert, we do not increase the information that we have: the statement will still be that the value is small positive; only the certainty value will change. It will either increase to 41/51 (if this new expert agrees with the majority) or decrease to 40/51 (if the new expert disagrees). In our case, every additional expert adds some information. Indeed, suppose that we are asking several experts what value of control u they would apply in a certain situation. The first expert can say: "In my opinion, you should apply u approximately equal to a1, with precision of the order σ1", where a1 is his estimate of the optimal control, and σ1 is his estimate of the uncertainty of his own judgement. Suppose that we already know how to express such statements in terms of membership functions, and µ1(u) is the membership function that corresponds to his statement. Then the second expert makes a slightly different statement that corresponds to a different membership function µ2(u), then comes the third expert, from whom we extract µ3(u), etc. What is the resulting membership function? If we take reliable experts, then we can say that a value u is a reasonable value of control if all the experts consider it reasonable. In other words, the statement C(u), meaning that "u is a reasonable value of the control", can be expressed as E1(u)&E2(u)&..., where Ei(u) stands for "the i-th expert considers u to be a reasonable value". In this section we consider only f&(a, b) = ab. Since we consider the case when & corresponds to the multiplication of truth values, the resulting membership function µC(u) for the control is equal to the product of the membership functions that correspond to the Ei, i.e., µC(u) = f&(µ1(u), µ2(u), ...).
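A minimal sketch of this product combination (the bell-shaped membership functions below are our illustrative assumption, not prescribed by the text):

```python
import math

def approx_equal(a, sigma):
    """Membership function for 'u is approximately a, precision sigma'."""
    return lambda u: math.exp(-((u - a) / sigma) ** 2)

def combine(mus):
    """Product combination mu_C(u) = mu_1(u) * mu_2(u) * ..."""
    return lambda u: math.prod(mu(u) for mu in mus)

# Two experts with slightly different estimates of the optimal control:
mu_c = combine([approx_equal(1.0, 0.5), approx_equal(1.4, 0.5)])
```

Each additional expert multiplies in one more factor, narrowing the combined membership function, which is exactly the sense in which every additional expert adds information.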
We have already mentioned that the most widely spread operations for & are min and product. The problem with min is that it is not differentiable, so even if we start with two smooth membership functions, we can end up with a function that is not differentiable at some points. Since we have already
decided to consider only smooth membership functions, we have no choice but to use the product for &. Therefore, the membership function for C(u) will be equal to µC(u) = µ1(u)µ2(u)... A set of all possible membership functions must be a semigroup. We are looking for a class F of all possible membership functions that represent this type of expert knowledge. If two functions µ1 and µ2 are possible, then their product is also possible, because it corresponds to the case when two experts independently express statements that correspond to µ1 and µ2. So the class F must contain the product of every two of its elements. Such sets are called semigroups in mathematics. So we come to the conclusion that F is a semigroup. The set F must be finite-dimensional. We are interested in having a family of functions that would enable us to describe any possible knowledge. Since our final goal is to use this information for automated control, these functions must be representable in a computer, i.e., we must have a function with several adjustable parameters that would enable us to represent any function from F. Since in finite time we can estimate the values of only finitely many parameters, we need a function with finitely many parameters. So the family F must be finite-dimensional in the sense that fixing the values of finitely many (m) parameters is sufficient to pick any function from that family. This argument is very close to the one that we used to explain why a family of reasonable transformations should be finite-dimensional. So we already know what kind of family F we want to choose: a finite-dimensional semigroup. But there are many families with this property, and we want to choose the best family with respect to some reasonable criterion.
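The semigroup property is easy to verify for a concrete finite-dimensional family; the family exp(c0 + c1·x + c2·x²) used below is our own example (chosen with hindsight, since such families reappear in Theorem 9). A member is described by its coefficient triple, and the pointwise product of two members corresponds to adding coefficients, so the family is closed under products:

```python
import math

def member(c):
    """The function exp(c0 + c1*x + c2*x^2) for coefficients c."""
    c0, c1, c2 = c
    return lambda x: math.exp(c0 + c1 * x + c2 * x * x)

def product_coeffs(c, d):
    """Coefficients of the pointwise product of two members."""
    return tuple(ci + di for ci, di in zip(c, d))

f, g = (0.0, 1.0, -1.0), (-0.5, 0.0, -2.0)
h = product_coeffs(f, g)
# member(f)(x) * member(g)(x) == member(h)(x) for every x
```

Closure under the product here is just vector addition of the three parameters, which is what makes the family a 3-dimensional semigroup.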
It can be the best in the sense that it is the best fit for expert knowledge, or the best in the sense that the resulting control is most stable, or the best in the sense that the average computation time necessary to process the membership functions is the smallest possible, etc. We want to find the best family: why is this problem difficult? The main difficulty stems from the fact that at present, no one knows how to estimate the value of any reasonable criterion for any family of membership functions. Since we do not know how to do it, we cannot compare the advantages and disadvantages of using different families F. How can we find a family for which some characteristic is optimal if we cannot compute this characteristic even for a single family? At first glance, this seems impossible. However, we will show that this problem is solvable (and give the solution). The basic idea of our solution is that we consider all possible optimization criteria on the set of all families, impose some reasonable invariance demands, and from them deduce the precise formulas for the optimal family. This approach has been applied to various problems in [K90], [KK90], [KQ91]. What family is the best? Among all m-dimensional families of functions, we want to choose the best one. In formalizing what "the best" means we follow the general idea outlined in [K90] and applied to various areas of computer science (expert systems in [KK90], neural networks in [KQ91] and genetic algorithms in [KQ92a]). The criterion may be computational simplicity, minimal average approximation error, or something else. In mathematical optimization problems, numeric criteria are most frequently used: to every family we assign some value expressing its performance, and choose the family for which this value is maximal. However, it is not necessary to restrict ourselves to such numeric criteria only.
For example, if we have several different families that have the same average approximation error E, we can choose among them the one for which the average running time T of the approximation algorithm is the smallest. In this case, the actual criterion that we use to compare two families is not numeric, but more complicated: a family F1 is better than a family F2 if and only if either
E(F1) < E(F2), or E(F1) = E(F2) and T(F1) < T(F2). A criterion can be even more complicated. What a criterion must do is allow us, for every pair of families, to tell whether the first family is better with respect to this criterion (we will denote this by F1 > F2), or the second is better (F1 < F2), or these families have the same quality in the sense of this criterion (we will denote this by F1 ∼ F2). The criterion for choosing the best family must be consistent. Of course, it is necessary to demand that these choices be consistent, e.g., if F1 > F2 and F2 > F3, then F1 > F3. The criterion must be final. Another natural demand is that this criterion must be final in the sense that it must choose a unique optimal family (i.e., a family that is better with respect to this criterion than any other family). The reason for this demand is very simple. If a criterion does not choose any family at all, then it is of no use. If several different families are "the best" according to this criterion, then we still have a problem choosing the absolute "best", and therefore we need some additional criterion for that choice. For example, if several families turn out to have the same average approximation error, we can choose among them a family with minimal computational complexity. So what we actually do in this case is abandon the criterion for which there were several "best" families, and consider a new "composite" criterion instead: F1 is better than F2 according to this new criterion if either it was better according to the old criterion, or they had the same quality according to the old criterion and F1 is better than F2 according to the additional criterion. In other words, if a criterion does not allow us to choose a unique best family, it means that this criterion is not ultimate; we have to modify it until we come to a final criterion that has this property. The criterion must be reasonably invariant.
We are looking for a general family that would represent the membership functions for all possible physical quantities. The numerical value of a quantity x depends on what unit we use to represent it. For example, if x is the duration of a time interval (e.g., the time during which we press on the accelerator), then we can describe it in seconds or in milliseconds. If x is the numeric value of the time in seconds, then its value in milliseconds will be 1000x. We can also use any other unit. In general, if x is the numerical value of the quantity in one unit, then the value of the same quantity in the new unit will be cx, where c is the ratio of these two units. Now suppose that we first used one unit, compared two different families F = {f(x)} and F̄ = {f̄(x)}, and it turned out that F is better. It is reasonable to expect that the relative quality of two families should not depend on what unit we used. So we expect that when we apply the same methods, but to data in which quantities are expressed in the new units (in which we have cx instead of x), the results of using functions from F will still be better than the results of applying functions from F̄. But if we use a function µ(x) to describe the degree of the expert's belief that x (in the new units) is a possible value of our variable, then to figure out his belief that x (in the old units) is a possible value, we must first translate x into the new units, x → cx, and then apply µ to the resulting value cx. As a result, we conclude that the uncertainty expressed by the function µ(x) in the new units will be expressed by the membership function µc(x) = µ(cx) in the old units. So we conclude that if a family {f(x)} is better than a family {f̄(x)}, then the family {f c(x)} must be better than {f̄ c(x)}, where f c(x) denotes f(cx) and f̄ c(x) = f̄(cx). This must be true for every c, because we can use arbitrary units.
Another reasonable demand is related to the possibility of choosing different starting points for x. A change in the starting point leads to a transformation x → x + x0. Similar arguments lead to the conclusion that if a family {f(x)} is better than a family {f̄(x)}, then for every a the family {Ta f(x)} must be better than {Ta f̄(x)}, where Ta f(x) denotes the function f(x + a).
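The two transformations of membership functions used in these invariance demands can be sketched directly (a minimal illustration; the helper names are ours):

```python
# Change of unit: f -> f_c with f_c(x) = f(c*x).
# Change of starting point: f -> T_a f with (T_a f)(x) = f(x + a).

def rescale(f, c):
    """Return the function x -> f(c*x)."""
    return lambda x: f(c * x)

def shift(f, a):
    """Return the function x -> f(x + a)."""
    return lambda x: f(x + a)
```

An invariant criterion must compare the transformed families {rescale(f, c) for f in F} and {shift(f, a) for f in F} in the same way it compares the original families, for every c > 0 and every a.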
Now we are ready for the formal definitions. 8.2 Definitions and the main result of this section Definition 18. By a family we will understand a set of smooth membership functions, positive everywhere. A family F is called a semigroup if for every f, g ∈ F, their product fg also belongs to F. We say that a family is m-dimensional, where m is an integer, if there exists a connected open region U in an m-dimensional space Rm and a continuous mapping f : U × R → [0, 1] such that F coincides with the set of all functions x → f(c, x) for different parameter vectors c ∈ U. Comment. This definition expresses in mathematical terms that we need to fix m parameters to describe a function from F. Denotation. The set of all m-dimensional families will be denoted by Sm. Definition 19. By an optimality criterion on the set Sm we mean a pair of consistent relations (<, ∼) on Sm (so that, e.g., if F1 < F2 and F2 < F3, then F1 < F3). We say that a family F is optimal with respect to this criterion if for every other family F′ we have either F′ < F or F′ ∼ F. Definition 20. We say that an optimality criterion is final if there exists precisely one optimal family. Denotation. For a family F = {f(x)} and a number c > 0, by cF we denote the family of all functions f(cx), f ∈ F; likewise, Ta F denotes the family of all functions f(x + a), f ∈ F. Definition 21. We say that an optimality criterion is invariant with respect to changing units of x if for every number c > 0 the following two conditions are true: i′) if F is better than F′ in the sense of this criterion (i.e., F′ < F), then cF′ < cF; ii′) if F is equivalent to F′ in the sense of this criterion (i.e., F ∼ F′), then cF ∼ cF′. Invariance with respect to changing a starting point for x is defined in the same way, with Ta F in place of cF. Comment. As we have already remarked, the demands that the optimality criterion be final and invariant with respect to changing units of x and changing a starting point for x are quite reasonable. The only problem with them is that at first glance they may seem rather weak. However, they are not, as the following Theorem shows: THEOREM 9. If an m-dimensional family F is optimal in the sense of some optimality criterion that is final and invariant with respect to changing units of x and changing a starting point for x, then there exists an integer n such that each element of F is equal to µ(x) = exp(a0 + a1 x + a2 x^2 + ... + a_{n−1} x^{n−1}) for some numbers ai. Definition 22. A membership function will be called trivial if it is a constant function. Comment. If µ(x) = const for all x, this means that for all possible values of x an expert has precisely the same degree of belief that this value x is possible.
This means that he actually has no information about the value of x, and the corresponding membership function is trivial in the sense that it carries no information at all. COROLLARY. The smallest value of n for which an optimal family contains non-trivial functions is n = 3. For n = 3 the optimal family consists of Gaussian membership functions µ(x) = A exp(−(x − a)^2/σ^2). Comments. 1. This corollary explains why Gaussian membership functions are widely used ([K75], [KM87, Ch. 5]; for fuzzy control [BCDMMM85], [YIS85], etc.): for all reasonable criteria they are the best among all families with three parameters. The Theorem also explains what family of membership functions to choose if three parameters are not sufficient. 2. We consider families that are closed under the "and" operation. But sometimes part of the knowledge comes in terms of an "or" statement. For example, if a robot is approaching an obstacle, then a reasonable expert's suggestion is either to turn to the left, or to turn to the right. We will discuss this situation in more detail in Section 10, but for now we just want to mention that this is a typical "or" situation. It is not reasonable not to turn at all, so the membership function µ(u) that describes an appropriate rotation u must be equal to 0 for u = 0 and must be positive for sufficiently big positive and negative values of u. Both positive and negative turns can be expressed by Gaussian-type membership functions µR and µL, but we cannot represent the entire membership function µ(u) = f∨(µR(u), µL(u)) by a single Gaussian formula. So in order to describe "or" statements, it is not sufficient to use functions from the optimal family F; we must also use their f∨-combinations. In particular, if we use f∨ = max, then we arrive at the necessity to use the functions µ(u) = max(µ1(u), ..., µn(u)).
Since the functions µi from the optimal family are of the type µi(u) = exp(−Pi(u)) for some polynomials Pi(u), we conclude that µ(u) = exp(−P(u)), where P(u) = min(P1(u), ..., Pn(u)) is
a piecewise polynomial function. This explains why functions µ(u) = exp(−P(u)) with piecewise polynomial P are often used: e.g., µ(u) = exp(−|u − u0|) [Z75], [K86, Section 7.7]. 9. WHAT DEFUZZIFICATION TO CHOOSE. GENERAL CASE 9.1. Heuristic explanation of the centroid formula, and why it is not sufficient An explanation. Informally speaking, we want a value that is on average closest to the optimal control. "Closest" means that the square (u − ū)^2 must be minimal. "On average" means that we have to take into account how often different values of control are appropriate. We do not know the frequencies; what we know are degrees of confidence. But let's recall that one of the natural interpretations of the degrees of confidence is that they are proportional to the number N(u) of experts who believe that this very value u is the best: µC(u) = kN(u) for some constant k. The more experts say that u is the best, the greater is the probability p(u) that this u will really be the best. In view of that, we can estimate this probability as p(u) = KµC(u) for some constant K. Therefore, the average deviation of ū from u equals ∫ p(u)(u − ū)^2 du = K ∫ µC(u)(u − ū)^2 du. We must choose ū so that this deviation is the smallest possible. Differentiating with respect to ū and equating the derivative to 0 gives the explicit formula ū = (∫ uµC(u) du)/(∫ µC(u) du), which is exactly the centroid. Why we are not satisfied with this explanation. For many reasons. Although looking for a method that minimizes the average square of the deviation u − ū is often appropriate, in many problems other, more sophisticated, criteria are more appropriate. The bold usage of probabilities instead of truth values is also not a very satisfactory step, because truth values are not probabilities, and not all the manipulations that are justified for probabilities are justified in the fuzzy case.
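The centroid formula ū = (∫ uµC(u) du)/(∫ µC(u) du) can be approximated numerically; the triangular membership function below is an illustrative assumption of ours:

```python
def centroid(mu, lo, hi, n=10001):
    """Approximate (int u*mu(u) du)/(int mu(u) du) on [lo, hi]
    by Riemann sums over a uniform grid of n points."""
    du = (hi - lo) / (n - 1)
    grid = [lo + i * du for i in range(n)]
    num = sum(u * mu(u) for u in grid)
    den = sum(mu(u) for u in grid)
    return num / den

tri = lambda u: max(0.0, 1.0 - abs(u - 2.0))  # symmetric peak at u = 2
```

For this symmetric µC the centroid coincides with its peak, ū = 2, which matches the intuition that the average-best control sits where the experts' confidence is concentrated.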
To summarize: we need a more reliable (more mathematical) solution to the problem of choosing a defuzzification.

9.2. How to formalize the problem of choosing a defuzzification in mathematical terms: motivations of the following definitions

Let us start with the simplest case, when only finitely many different values of x are possible. We will call such membership functions finite, or finite fuzzy functions, or finite fuzzy sets. In other words, the value µ(x) of a membership function is different from 0 only in finitely many points x1, x2, ..., xn. In order to describe such a function, it is sufficient to describe the values xi and the corresponding values µi = µ(xi). A defuzzification procedure must be defined for all the cases when this set is non-empty, i.e., when not all µi are equal to 0. Let's list reasonable demands for such a procedure.

D1: The result of defuzzification must lie between the smallest and the biggest of the values xi. The first natural property of the desired result x̄ of a defuzzification procedure is that it must lie between the smallest and the biggest of all possible values xi of the quantity x. The reason is as follows: when we say that µ(xi) > 0, it means that there is some reason to believe that the actual value of x is equal to xi (or is at least close to xi). The fact that µ(x) is different from 0 only for x = x1, ..., xn means that all possible reasons lead to values from the interval [min xi, max xi], and there are no reasons to believe that x is smaller than min xi or bigger than max xi. Hence it seems reasonable to conclude that a single value, chosen by this procedure, must also belong to this same interval.

D2: Symmetry. A finite fuzzy set is a finite set of pairs (xi, µi). The word "set" means that it does not matter in what order we list these pairs; so evidently the result of defuzzification must not depend on the order in which we list them.
D3: If µi = 0 for some i, then the result of defuzzification must not depend on this xi. This demand is quite natural: if µi = 0, this means that xi is impossible, so we can omit it.

D4: µi, xi and x̄ can be interpreted as degrees of uncertainty, and the transformation from µi or xi to x̄ is a transformation of degrees of uncertainty. That µi is a degree of uncertainty is evident: that's what the values of membership functions describe. Let's show that xi and x̄ can also be interpreted as degrees of uncertainty. To do that, let's recall that to describe a finite fuzzy function with n values we must describe 2n different parameters x1, x2, ..., xn, µ1, ..., µn. A defuzzification method f takes all these parameters as an input and computes x̄ as an output: x̄ = f(x1, ..., xn, µ1, ..., µn). Suppose that we fix somehow the values of all these parameters, except for µi for some i. The remaining parameter µi can take any value from 0 to 1. The bigger the value of µi, the more uncertain we are about the actual value of x: if µi = 0, then only the values x1, ..., xi−1, xi+1, ..., xn are possible. When we increase µi, we add one more possibility: that the actual value of x equals xi. The bigger the µi, the more possible is this additional possibility, and so the value µi = 1 corresponds to the greatest possible uncertainty (greatest possible under the condition that the values of all the other parameters are fixed). So in this case µi describes our degree of uncertainty. But these different degrees of uncertainty correspond, in general, to different values of x̄. So in principle we can express the degree of uncertainty by the value of x̄, and not by the value of µi. This is not a purely mathematical trick. For example, in the simplest case, when n = 2 and i = 2: when µ2 = 0, it means that with certainty x = x1, so it's natural to conclude that x̄ = x1.
When µ2 increases, our degree of belief that x2 is possible increases as well, and therefore it is natural to "shift" the overall estimate x̄ closer to x2. In general, the same arguments work, so the value x̄ really describes our degree of belief in xi: the closer x̄ is to xi, the more we believe in xi. So if all the values of xj and µj are fixed, except for µi for some i, we can express our degree of uncertainty (or degree of belief in the possibility of xi) in two different ways: by the value of µi (a bigger µi would mean a higher possibility for xi to be the actual value of x) and by the value of x̄ (the closer x̄ is to xi, the bigger the possibility that the actual value of x is xi). In other words, we have two different scales to represent the same degrees of belief: the scale of possible values of µi and the scale of possible values of x̄. In these terms the transformation from µi to x̄, defined by the formula x̄ = f(x1, ..., xn, µ1, ..., µn) with fixed x1, ..., xn, µ1, ..., µi−1, µi+1, ..., µn, can be viewed as a transformation that "translates" the value of uncertainty in one scale into the representation of the same degree of uncertainty in some other scale. In short, the transformation from µi to x̄ is a transformation of degrees of certainty.

Let's now fix the values of all the parameters, except for xi for some i. In this case, xi can take any value from −∞ to +∞. Let's denote m = min(x1, x2, ..., xi−1, xi+1, ..., xn) and M = max(x1, x2, ..., xi−1, xi+1, ..., xn). Then, in particular, xi can take any value from M to ∞. For every choice of xi, the possible values of x lie between the minimum and the maximum of all the values xj. If xi > M, then the minimum of all the xj equals m, and the maximum of all the xj equals xi. Therefore, the bigger the xi, the bigger the interval of possible values for x, and the bigger our uncertainty in x.
So in this case, xi can be viewed as describing our uncertainty in x: the bigger the xi, the greater our uncertainty. On the other hand, if we increase xi, we keep n − 1 possible values at the same place and shift the remaining value xi to the right. So it is natural to expect that the resulting "overall" value x̄ increases as xi increases. So the bigger the x̄, the bigger our uncertainty. So in this case we also have
two different scales to represent the same degrees of belief: the scale of possible values of xi and the scale of possible values of x̄. In these terms the transformation from xi to x̄, defined by the formula x̄ = f(x1, ..., xn, µ1, ..., µn) with fixed x1, ..., xi−1, xi+1, ..., xn, µ1, ..., µn, can be viewed as a transformation that "translates" the value of uncertainty in one scale into the representation of the same degree of uncertainty in some other scale. In short, the transformation from xi to x̄ is also a transformation of degrees of uncertainty. Transformations of degrees of uncertainty have already been described in Section 3, so we are ready to give formal definitions.

9.3 Definitions and the main result of this section

Definition 23. Assume that some positive integer n is fixed. A function f(x1, ..., xn, µ1, ..., µn) of 2n real variables is called a defuzzification of finite membership functions if it is defined whenever not all µi are equal to 0. A defuzzification is called reasonable if it satisfies the following demands:

1) The value of f always lies between the smallest min xi and the biggest max xi of all the values.

2) (Symmetry) The value of f must not change after any permutation, i.e.:
f(x1, ..., xi−1, xi, xi+1, ..., xj−1, xj, xj+1, ..., xn, µ1, ..., µi−1, µi, µi+1, ..., µj−1, µj, ..., µn) =
f(x1, ..., xi−1, xj, xi+1, ..., xj−1, xi, xj+1, ..., xn, µ1, ..., µi−1, µj, µi+1, ..., µj−1, µi, ..., µn)

3) If µi = 0 for some i, then the result of defuzzification must not depend on xi, i.e.,
f(..., xi−1, xi, xi+1, ..., µi−1, 0, µi+1, ...) = f(..., xi−1, x′i, xi+1, ..., µi−1, 0, µi+1, ...) for all xi and x′i.

4) If we fix the values of all its variables except one, then the resulting functions
xi → f(x1, ..., xi−1, xi, xi+1, ..., xn, µ1, ..., µn) and µi → f(x1, ..., xn, µ1, ..., µi−1, µi, µi+1, ..., µn)
are reasonable transformations.
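Demands 1)–3) are easy to check numerically for a concrete candidate. The sketch below uses the plain weighted centroid (Σµixi)/(Σµi) with invented sample data; demand 4) involves the reasonable transformations of Section 3 and is not checked here.

```python
# Numerically checking demands 1)-3) of Definition 23 for the weighted
# centroid f = (sum mu_i * x_i) / (sum mu_i), one concrete candidate.
# The sample data below are illustrative, not from the paper.

def f(xs, mus):
    return sum(m * x for x, m in zip(xs, mus)) / sum(mus)

xs, mus = [1.0, 4.0, 2.5], [0.3, 0.9, 0.0]

# Demand 1: the result lies between min(x_i) and max(x_i).
assert min(xs) <= f(xs, mus) <= max(xs)

# Demand 2 (symmetry): permuting the pairs (x_i, mu_i) changes nothing.
perm = [2, 0, 1]
assert abs(f([xs[i] for i in perm], [mus[i] for i in perm]) - f(xs, mus)) < 1e-12

# Demand 3: if mu_i = 0, the result does not depend on that x_i.
assert abs(f([1.0, 4.0, 99.0], mus) - f(xs, mus)) < 1e-12
```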
From these demands we can conclude the following:

LEMMA 1. Every reasonable defuzzification of finite membership functions has the form f(x1, ..., xn, µ1, ..., µn) = (α1x1 + ... + αnxn)/(α1 + ... + αn), where αi = µi·gi, and gi is a multi-linear symmetric function of µ1, µ2, ..., µi−1, µi+1, ..., µn, i.e., αi = µi(c0 + c1 Σ_{j≠i} µj + c2 Σ_{j≠i} Σ_{k≠j,i} µjµk + ...).

Comment. We now know what a defuzzification can be for a finite fuzzy function. How can we apply it to real-life cases, when an expert usually names a whole interval of possible values? A natural idea is to approximate an arbitrary membership function µ by a sequence of finite membership functions µn, apply the defuzzification to µn, and then take the limit of the resulting values f(µn). A natural way to approximate a continuous function µ by a set of finite pairs is to take its values on a grid. For example, if a fuzzy function µ is located on an interval [0, 1], we can take for µn the set of pairs (0, µ(0)), (1/n, µ(1/n)), ..., (i/n, µ(i/n)), ..., (1, µ(1)). As n → ∞, the sums α1x1 + ... + αnxn and α1 + ... + αn in the description of a reasonable defuzzification tend correspondingly to ∫xα(x) dx and ∫α(x) dx. The sum Σ_{j≠i} µj in the description of αi tends to an integral ∫_{x≠xi} µ(x) dx. Since we consider only continuous membership functions, this integral over a region with one missing point is equal to the integral over the entire region, i.e., to ∫µ(x) dx. Likewise, the sum Σ_{j≠i} Σ_{k≠j,i} µjµk tends to a double integral ∫∫µ(x)µ(y) dx dy. So we arrive at the following definitions:

Definition 24. By a defuzzification procedure (or defuzzification, for short) we mean a mapping that transforms membership functions into numbers. Suppose that some reasonable defuzzification of finite membership functions f(x1, ..., xn, µ1, ..., µn) = (α1x1 + ... + αnxn)/(α1 + ... + αn) is given, where αi = µi(c0 + c1 Σ_{j≠i} µj + c2 Σ_{j≠i} Σ_{k≠j,i} µjµk + ...).
We say that a defuzzification µ(x) → x̄
is equal to a limit of the results x̄n of applying a reasonable defuzzification of finite membership functions to a sequence µn of finite membership functions that approximate µ(x), if x̄ = (∫xα(x) dx)/(∫α(x) dx), where α(x) = µ(x)(a0 + a1∫µ(y) dy + a2∫∫µ(y)µ(z) dy dz + ...) for some ai.

Definition 25. We say that a defuzzification is reasonable and continuous if it is equal to a limit of the results x̄n of applying a reasonable defuzzification of finite membership functions to a sequence µn of finite membership functions that approximate µ(x).

THEOREM 10. The only reasonable and continuous defuzzification procedure is the centroid µ(x) → (∫xµ(x) dx)/(∫µ(x) dx).

Comment. The authors of [YSB91] propose a linear defuzzification procedure. The main advantage is that we get a linear controller as the result, and for a linear controller we can apply the methods of control theory to check its stability, smoothness, etc. The main disadvantage is that one has to adjust the coefficients of the linear defuzzification formula, because otherwise there is no guarantee that the resulting control x̄ will fit into the interval [min xi, max xi]. The resulting control is pretty good, and this fact is in good accordance with our approach: if we drop the corresponding condition 1) from our Definition 23, we get a more general class of defuzzifications that includes the one from [YSB91].

10. WHAT DEFUZZIFICATION TO CHOOSE. CASE OF PROHIBITIONS

10.1 Why it is not sufficient to use a centroid defuzzification [PYL91]

A real-life situation when a centroid does not work. Suppose that a robot is approaching an obstacle. It can either turn to the left or to the right. Some experts would recommend turning to the left, some of them would prefer turning to the right.
If we place both rules into the initial set of rules that is used to develop a fuzzy control, and apply the previously described procedure, then we end up with a membership function µ(x) for the controlled angle. We assume that this membership function reflects the experts' knowledge correctly. In particular, since it is senseless not to turn, the value µ(0) is either equal to 0 or close to 0. For big positive or big negative angles (corresponding to the turns that are reasonable to apply) the values µ(x) are positive and sufficiently big. Since turns to the right (x > 0) and to the left (x < 0) seem equally reasonable, it is reasonable to expect a symmetric membership function, i.e., a function µ(x) such that µ(x) = µ(−x). However, in this case a centroid defuzzification procedure leads to the value x̄ = 0. So if we follow it, the robot will run directly into the obstacle.

What was wrong with our axioms? Our demands lead to a centroid, but the centroid is not appropriate in this real-life situation. This means that one of these demands is not applicable here. It is easy to figure out which one: the demand that a defuzzification must be continuous. Let's give an informal explanation (a formal one will be given in the next subsection). If there is only one rule, recommending turning to the right, then x̄ > 0, and there is no paradox. If there is only one rule, recommending turning to the left, then again there is no paradox, and x̄ < 0. We can also consider the continuum of intermediate cases, in which the degrees of belief in these two rules change continuously from 0 and 1 (we believe only in the right-turn rule) to 1 and 0 (we believe only in the left-turn rule). If we use a continuous defuzzification procedure, then its result must pass from a negative to a positive value, and so for some intermediate degrees of belief it must inevitably pass through 0.
So we must consider discontinuous defuzzification procedures.
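The obstacle paradox is easy to reproduce numerically: for a symmetric two-bump membership function (turn left or turn right, but never go straight), the centroid lands exactly on the prohibited value 0. The two-bump function below is an invented illustration.

```python
# The obstacle paradox: a symmetric membership function mu(x) = mu(-x)
# that vanishes at x = 0 ("do not go straight") still has centroid 0.

def centroid(xs, mus):
    return sum(x * m for x, m in zip(xs, mus)) / sum(mus)

xs = [i * 0.01 - 2.0 for i in range(401)]               # grid on [-2, 2]
mus = [max(0.0, 1.0 - abs(abs(x) - 1.0)) for x in xs]   # bumps at x = -1 and x = +1

assert mus[200] == 0.0    # mu(0) = 0: going straight is prohibited
print(centroid(xs, mus))  # ~0.0 -- the centroid picks exactly the prohibited value
```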
How to design a discontinuous defuzzification procedure. The general idea was proposed in [YPL92]. We want to avoid the cases when the value µ(x) for the resulting control x is too small. So we must establish some threshold value p, and consider only the values x for which µ(x) ≥ p. The paradox occurs when the values x for which µ(x) ≥ p form several disjoint regions. In this case we choose a region that is most "probable" (in some reasonable sense), then restrict the function µ(x) to this region, and apply a continuous defuzzification to the result of this restriction. The authors of [YPL92] propose to choose the region with the biggest area ∫µ(x) dx; therefore they call their method Centroid of Largest Area (CLA) defuzzification. The main reason for choosing area, and not any other characteristic, is that using area really helps the robot to avoid obstacles. Again a natural question arises: can we choose a better characteristic, or is area already the best one? In order to answer this question, let us formulate the problem in mathematical terms.

10.2 Definitions and negative results

Definition 26. According to Definition 24, by a defuzzification we mean a mapping f that transforms a membership function µ(x) into a number x̄ = f(µ). We say that a membership function has a finite support if it is equal to 0 outside some interval. We will consider only defuzzifications that are applicable to all continuous membership functions with finite support that are not identically equal to 0. We'll say that a value x is prohibited by a membership function µ if µ(x) = 0, and allowed if it is not prohibited.

1) A defuzzification is called consistent with prohibitions if for every function µ(x) the result of the defuzzification x̄ = f(µ) is an allowed value. If for some function µ(x) this result is a prohibited value, we say that the defuzzification is inconsistent with prohibitions.
2) A defuzzification is called symmetric if for every membership function µ(x) the results of applying the defuzzification to this function and to its mirror image µ̃(x) = µ(−x) are opposite, i.e., f(µ̃) = −f(µ).

3) A defuzzification is called continuous if the mapping f is continuous, i.e., if the functions µn(x) uniformly converge to µ(x), then f(µn) → f(µ).

THEOREM 11. Every symmetric defuzzification is inconsistent with prohibitions.

THEOREM 12. Every continuous defuzzification is inconsistent with prohibitions.

Comment. These results are not deep mathematical theorems; they just express accurately what we said before: that we need a discontinuous defuzzification procedure.

10.3 How to define defuzzification in case of prohibitions: general idea

Definition 27. Suppose that a number p from the interval (0, 1) is fixed. This number will be called a threshold. Suppose also that a continuous defuzzification f is defined, as is a mapping g that transforms membership functions into numbers; this mapping will be called a characteristic. For every membership function µ(x), the set of all values x for which µ(x) > p can be divided into one or several disjoint connected intervals I1, I2, ..., that will be called regions. By a restriction of a function µ to an interval I, we mean the function µ|I defined by the formula µ|I(x) = µ(x) if x ∈ I and µ|I(x) = 0 for x outside I. We say that a region I is most probable if the restriction of µ to I has the biggest characteristic, i.e., g(µ|I) ≥ g(µ|J) for any other region J. By a defuzzification of the most probable region we mean the result of applying f to the restriction µ|I of µ to the most probable region I.

Comments.

1. In particular, if we use a centroid for f and the area g(µ) = ∫µ(x) dx for g, our definition turns into the centroid of largest area (CLA) defuzzification proposed in [YPL92].
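The threshold-region-centroid recipe of Definition 27 can be sketched directly on a sampled membership function. Everything below is an illustrative discretization (invented grid and membership function), not the authors' code.

```python
# Centroid of Largest Area (CLA), sketched on a sampled membership function:
# 1) keep grid points where mu(x) > p, 2) split them into connected regions,
# 3) pick the region with the largest area, 4) take the centroid there.
# Assumes at least one point lies above the threshold.

def cla(xs, mus, p):
    regions, run = [], []
    for i, m in enumerate(mus):
        if m > p:
            run.append(i)          # extend the current connected region
        elif run:
            regions.append(run)    # close the region at a below-threshold gap
            run = []
    if run:
        regions.append(run)
    # On a uniform grid, the largest area is the largest sum of mu over a region.
    best = max(regions, key=lambda r: sum(mus[i] for i in r))
    num = sum(xs[i] * mus[i] for i in best)
    den = sum(mus[i] for i in best)
    return num / den

# Asymmetric two-bump function: the right bump is wider, so CLA commits
# to a right turn instead of averaging into the prohibited value 0.
xs = [i * 0.01 - 2.0 for i in range(401)]
mus = [max(0.0, 1.0 - 2.0 * abs(x + 1.0), 1.0 - abs(x - 1.0)) for x in xs]
print(cla(xs, mus, 0.2))   # ~1.0: centroid of the larger (right) bump
```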
2. Instead of fixing the value of a threshold, we could as well choose a threshold that depends on the membership function. For example, the authors of [YPL92] propose to take p = p0·maxx µ(x) for some fixed value p0 (they call it a relative threshold strategy).

3. We have already argued that the only reasonable choice of a continuous defuzzification is a centroid. Let's now figure out what is a reasonable choice of a characteristic g.

10.4 Motivations of the following definitions

We are looking for a functional that is defined on the set of all membership functions. In the present paper we have already encountered such a situation: in Section 5, we were looking for a functional that would describe the relative quality of different membership functions. So we can use the definitions from there. Some reasonable demands from Section 5 are also applicable to our problem. The same arguments as in Section 5 show that it is reasonable to demand that g is an analytical functional, and that the corresponding ordering relation is invariant with respect to changing the units of x, changing the starting point for x, and rescaling. To get rid of the degenerate cases, in which it is extremely difficult to solve the optimization problem, it is reasonable to demand that either the linear or the quadratic part of this functional is different from 0.

However, some of the demands from Section 5 are not justified here. One of them is locality. We want to apply g only to "connected" membership functions, so the assumption that µ can consist of two parts located on two disjoint intervals is not applicable to our problem. Another demand from Section 5 that is not justified in this case is the demand that the functional is not equivalent to a linear functional ∫a(x)µ(x) dx.
The reason why we introduced this demand in Section 5 was that there we used the desired functional J(µ) to choose the optimal membership function out of all possible functions. If we admit linear functionals, then the solution of the corresponding conditional optimization problem ∫a(x)µ(x) dx → max, under the conditions that µ(xi) = µi in finitely many given points x1, ..., xn and 0 ≤ µ(x) ≤ 1 in all other points, is easy to obtain: the optimal function µ(x) is equal to µi for x = xi, to 1 for those x for which a(x) > 0, and to 0 for those x for which a(x) < 0. The value of µ(x) at the points where a(x) = 0 does not influence the functional J at all, and can therefore be arbitrary. Such a membership function, which almost everywhere takes only "crisp" values 1 and 0 (corresponding to "true" and "false"), does not look like a good description of any fuzzy statement. In our case, however, we are choosing only between finitely many functions that are restrictions of some initial membership function to its possible regions. In this case, the previous argument against using linear functionals is not applicable.

Instead of the demands that are not applicable to our problem, we have a new demand: the functional should be applicable to all membership functions, at least to all that are piecewise continuous and located on an interval. In Section 5, we did not formulate this demand, and, moreover, the functionals that we came up with were applicable only to smooth functions. For non-smooth functions, the value of each of these functionals would be −∞. This is not a drawback for the problem that we solved in Section 5, namely, the problem of finding the best membership function. Indeed, if we are looking for a function µ(x) with the biggest possible value of J(µ), then the fact that J(µ) = −∞ for some µ simply means that this function µ is an inappropriate choice.
In our case, however, the whole purpose of the functional g is to compare different membership functions. If we take a functional that is applicable only to smooth functions, then
for every non-smooth membership function µ we would get g(µ|I) = g(µ|J) = ... = −∞ for all its regions I, J, ..., and there would be no way to make a choice. Now we are ready for the formal definitions.

10.5 Final definitions and the choice of the defuzzification in case of prohibitions

Definition 28. We say that an analytical functional (in the sense of Definition 5) is everywhere defined if it is applicable to every piecewise-continuous function µ(x) with a finite support (i.e., whose values outside some interval equal 0).

Definition 29. We say that an analytical functional g(µ) = a0 + ∫a1(x)µ(x) dx + ∫∫a2(x, y)µ(x)µ(y) dx dy + ... is weakly non-degenerate if either its linear part or its quadratic part is different from 0, i.e., either a1 ≠ 0 or a2 ≠ 0.

Comment. We need g only for one thing: to choose the most probable region I; we choose I as the region for which the value g(µ|I) is the biggest. If two functionals g and g̃ are equivalent in the sense of Definition 7 (i.e., g(µ1) < g(µ2) if and only if g̃(µ1) < g̃(µ2)), then they lead to the same choice of I. Therefore, as in Section 5, we will be satisfied if, instead of describing the functional g itself, we are able to describe a functional that is equivalent to g. This is the purpose of the following theorem.

THEOREM 13 (13′). Assume that a set of reasonable transformations is defined, g is an everywhere defined, analytical, weakly non-degenerate functional, and the corresponding ordering relation is invariant with respect to changing the units of x, changing the starting point for x, and rescaling. Then g is equivalent either to g̃(µ) = ∫µ(x) dx, or to g̃(µ) = ∫∫|x − y|^α µ(x)µ(y) dx dy for some α.

Comment. If we take g̃(µ) = ∫µ(x) dx, then we arrive at the centroid of largest area (CLA) method that was proposed in [YPL92]. If we take g̃(µ) = ∫∫|x − y|^α µ(x)µ(y) dx dy, then we choose the region for which the α-th central moment is the biggest. This method may be worth trying.
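The second functional in Theorem 13 is a double integral, but for α = 2 it collapses into single integrals via the identity ∫∫(x − y)²µ(x)µ(y) dx dy = 2[(∫x²µ dx)(∫µ dx) − (∫xµ dx)²]. The sketch below checks this numerically on an invented membership function.

```python
# For alpha = 2, the double-integral characteristic
#   g(mu) = double integral of (x - y)^2 * mu(x) * mu(y) dx dy
# reduces to single integrals:
#   g(mu) = 2 * [ (int x^2 mu dx)(int mu dx) - (int x mu dx)^2 ].
# Checked on a uniform grid with an invented triangular membership function.

dx = 0.01
xs = [i * dx for i in range(301)]                       # grid on [0, 3]
mus = [max(0.0, 1.0 - abs(x - 1.5)) for x in xs]        # triangular bump at 1.5

direct = sum((x - y) ** 2 * mx * my
             for x, mx in zip(xs, mus)
             for y, my in zip(xs, mus)) * dx * dx

m0 = sum(mus) * dx                                      # int mu dx
m1 = sum(x * m for x, m in zip(xs, mus)) * dx           # int x mu dx
m2 = sum(x * x * m for x, m in zip(xs, mus)) * dx       # int x^2 mu dx
simplified = 2.0 * (m2 * m0 - m1 * m1)

print(direct, simplified)   # the two values agree
```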
In particular, for α = 2 the computations can be simplified, because we can use the formula g(µ) = 2[(∫x²µ(x) dx)(∫µ(x) dx) − (∫xµ(x) dx)²].

10.6 Why largest area and not largest central moment

Comment. Since computer experiments show that it is precisely area that is a good characteristic to use, let's figure out why. The arguments that we used previously did not allow us to single out area. So let us try additional arguments, similar to the ones that we used in deducing the continuous defuzzification. Namely, we will start with finite membership functions. Similar arguments show that the transformations xi → g(x1, ..., xn, µ1, ..., µn) and µi → g(x1, ..., xn, µ1, ..., µn) can be viewed as transformations between reasonable scales for representing uncertainty. The only condition from Definition 23 that is not applicable here is that the value of the functional should always lie between min xi and max xi. With this in mind, we arrive at the following definition.

Definition 30. Assume that some positive integer n is fixed. A function g(x1, ..., xn, µ1, ..., µn) of 2n real variables is called a characteristic of finite membership functions if it is defined whenever not all µi are equal to 0. A characteristic is called reasonable if it satisfies the following demands:
1) (Symmetry) The value of g must not change after any permutation, i.e.:
g(x1, ..., xi−1, xi, xi+1, ..., xj−1, xj, xj+1, ..., xn, µ1, ..., µi−1, µi, µi+1, ..., µj−1, µj, ..., µn) =
g(x1, ..., xi−1, xj, xi+1, ..., xj−1, xi, xj+1, ..., xn, µ1, ..., µi−1, µj, µi+1, ..., µj−1, µi, ..., µn)

2) If µi = 0 for some i, then the characteristic must not depend on xi, i.e.,
g(..., xi−1, xi, xi+1, ..., µi−1, 0, µi+1, ...) = g(..., xi−1, x′i, xi+1, ..., µi−1, 0, µi+1, ...) for all xi and x′i.

3) If we fix the values of all its variables except one, then the resulting functions
xi → g(x1, ..., xi−1, xi, xi+1, ..., xn, µ1, ..., µn) and µi → g(x1, ..., xn, µ1, ..., µi−1, µi, µi+1, ..., µn)
are reasonable transformations.

From these demands we can conclude the following:

LEMMA 2. Every reasonable characteristic of finite membership functions is multi-linear in the xi, i.e., it has the form g(x1, ..., xn, µ1, ..., µn) = a0 + Σi ai xi + Σ_{i≠j} aij xi xj + ..., where a0, ai, aij, ... are fractionally linear functions of µ1, ..., µn.

Comment. We now know what a characteristic can be for a finite fuzzy function. In order to compute g(µ) for a continuous µ, we will use the same approach as for defuzzification: we approximate µ by a sequence of finite membership functions µn, apply the characteristic to µn, and then take the limit of the resulting values g(µn) as g(µ). A natural way to approximate a continuous function µ by a set of finite pairs is to take its values on a grid. For example, if a fuzzy function µ is located on an interval [0, 1], we can take for µn the set of pairs (0, µ(0)), (1/n, µ(1/n)), ..., (i/n, µ(i/n)), ..., (1, µ(1)). As n → ∞, the sums Σi ai xi, Σ_{i≠j} aij xi xj, ..., from the expression given by Lemma 2 tend correspondingly to ∫x a1(x) dx, ∫∫xy a2(x, y) dx dy, ..., where a1(x), a2(x, y), ... depend only on the values of µ and not explicitly on x, y, ....
The fact that the sum Σ_{i≠j} aij xi xj was limited only to the terms with i ≠ j (in other words, that aii = 0 for all i) leads to the demand that the limit function a2(x, y) satisfies the equality a2(x, x) = 0 for all x. So we arrive at the following definitions:

Definition 31. Suppose that some reasonable characteristic of finite membership functions g(x1, ..., xn, µ1, ..., µn) = a0 + Σi ai xi + Σ_{i≠j} aij xi xj + ... is given, where a0, ai, aij, ... are fractionally linear functions of µ1, ..., µn. We say that an analytical characteristic µ(x) → g(µ) is equal to a limit of the results gn of applying a reasonable characteristic of finite membership functions to a sequence µn of finite membership functions that approximate µ(x), if g(µ) = a0 + ∫x a1(x) dx + ∫∫xy a2(x, y) dx dy + ..., where the expressions a1(x), a2(x, y), ... can contain µ(x), µ(y), ..., but do not contain x, y, ... explicitly.

Definition 32. We say that a characteristic is reasonable and continuous if it is an everywhere defined, analytical, weakly non-degenerate functional, the corresponding ordering relation is invariant with respect to changing the units of x, changing the starting point for x, and rescaling, and g is equal to a limit of the results gn of applying a reasonable characteristic of finite membership functions to a sequence µn of finite membership functions that approximate µ(x).

THEOREM 14. Every reasonable and continuous characteristic is equivalent to the area µ(x) → ∫µ(x) dx.

Comment. If in the defuzzification of the most probable region we use a reasonable and continuous defuzzification and a reasonable and continuous characteristic, then we arrive at the centroid of largest area (CLA) method from [YPL92].
11. WHY FUZZY CONTROL IS OFTEN BETTER THAN THE EXPERTS IT SIMULATES: AN EXPLANATION

The results of this section appeared first in a technical report [KFLL91].

11.1 How to compare traditional and fuzzy control

Fuzzy control is nothing but an extrapolation. Fuzzy control does not design a control from nothing: in order to have a fuzzy control, one needs to start with the expert's knowledge. And here the general rule of using computers applies: garbage in, garbage out. If we start with bad rules, we get bad control. So if we want to make a fair comparison, we must take experts who already know how to control, extract the knowledge from them, and compare the resulting fuzzy control with the original one. Since we can ask only finitely many questions, the experts can only explain what they do in finitely many cases. The expert system must give instructions in all cases, so we need some kind of extrapolation. In these terms, fuzzy control is nothing else but a very special extrapolation procedure [SC91].

Why can't we use well-known extrapolation methods? In principle, we could use any of the extrapolation procedures from numerical mathematics: spline approximation, for example. These procedures have been analyzed for decades (some of them even for centuries), and they are optimal in the sense that the average or maximal difference between the original function and the extrapolated one is the smallest possible. Our point is as follows: if we take the quality of the resulting control as a criterion for choosing an extrapolation method, then fuzzy control is much better than the extrapolation methods from numerical mathematics. In order to prove this, we must first explain what "better" means for a control.

When is a control better: main criteria. If you look through the textbooks on control, you can easily notice that the buzzword there is stability.
There is theoretical stability, which means that every initial perturbation disappears sooner or later (as time t → ∞). This notion is often of little practical importance, because if "sooner or later" really means "in a year", then this control is no good. So in practice stability is expressed as follows: we fix some level (e.g., 30 dB, i.e., 1000 times), and measure the stability of a system by the amount of time during which the initial perturbation diminishes by this factor.

Another important practical criterion is smoothness. The reason is that in many theoretical control models, the optimal control is discontinuous (for a general theorem see, e.g., [M91]). Such a control is called a bang-bang control, because it includes sharp turns and changes. For example, a space flight is most fuel-efficient when we accelerate as fast as we can at the beginning and decelerate as fast as we can at the end. For sure this kind of control is not the most comfortable one. The reason why bang-bang controls appear is that we did not take smoothness into consideration as one of the goals when we formulated the optimality problem. There are two possible ways out: first, to formulate the smoothness restriction explicitly. This, however, makes the optimization problems practically impossible to solve. Another (practical) way out is to start with a bang-bang control and somehow make it smoother. Let's analyze fuzzy control with respect to these characteristics.

11.2 Motivations: why should we consider only continuous membership functions

One of the main differences between "crisp" (non-fuzzy) and fuzzy properties is as follows. If we have a crisp property, then the truth value changes abruptly from 0 to 1 when we move from a
value at which it is true to a value at which it is false. For fuzzy properties, there is no such abruptness: degrees of confidence change gradually. Therefore, it is reasonable to consider only continuous membership functions. All membership functions that are used in fuzzy control are continuous.

11.3 Definitions and the first result: fuzzy control is continuous

Definition 33. Let's fix a set of continuous membership functions that are different from 0. The elements of this set will be called fuzzy properties. By an elementary formula we mean an expression of the type P(z), where P is a fuzzy property and z is either x, or ẋ, or ẍ, ..., or ∫x, or ∫∫x, etc. By a rule, we mean an expression of the type E1, ..., Em → P(u), where the Ei are elementary formulas, P is a fuzzy property, and u is a special variable reserved for the control. The formulas Ei are called conditions, and P(u) is called the conclusion of the rule. By a knowledge base we mean a pair of a positive number U and a finite set of rules, such that if P is a conclusion of one of them and |u| > U, then P(u) = 0.

Comment. This restriction means that all possible values of the control are limited to some interval [−U, U]. This restriction holds in every real-life control situation: there are limits on accelerations, etc.

Definition 34. By a fuzzy control that corresponds to a given knowledge base, we mean the expression ū = (∫ u µC(u) du)/(∫ µC(u) du), where for every x: µC(u) = maxR (µCR(u)), maxR denotes the maximum over all the rules R from the knowledge base, and for every rule P1(z1), P2(z2), ..., Pn(zn) → P(u) the value µCR(u) is equal to min(P1(z1), P2(z2), ..., Pn(zn), P(u)). We say that a fuzzy control is everywhere defined if ∫ µC(u) du is always positive. We say that a rule P1(z1), P2(z2), ..., Pn(zn) → P(u) is applicable to some set of values x, ẋ, etc., if Pi(zi) > 0 for all its conditions.

LEMMA 3.
If for every set of values x, ẋ, etc., at least one rule is applicable, then the resulting fuzzy control is everywhere defined.

THEOREM 15. If a fuzzy control is everywhere defined, then the resulting value of the control ū depends continuously on the input parameters x, ẋ, ...

Comments. 1. This result means that even if we start with a bang-bang control, we come out with a continuous one. So if we control the first time derivative, we get a trajectory with a continuous derivative; if we control the acceleration (second derivative), we get a trajectory x(t) that has a continuous second derivative, etc.

2. The same result is true if we use arbitrary continuous functions f& and f∨ instead of min and max:

Definition 34A. Suppose that f& and f∨ are &- and ∨-operations (in the sense of Definition 14) such that if a, b > 0, then f&(a, b) > 0. By a fuzzy control that corresponds to a given knowledge base, we understand the expression ū = (∫ u µC(u) du)/(∫ µC(u) du), where for every x: µC(u) = f∨(µCR(u)), f∨ is applied to the values that correspond to all rules R from the knowledge base, and for every rule P1(z1), P2(z2), ..., Pn(zn) → P(u) the value µCR(u) is equal to f&(P1(z1), P2(z2), ..., Pn(zn), P(u)).

THEOREM 15A. If a fuzzy control is everywhere defined, then the resulting value of the control ū depends continuously on the input parameters x, ẋ, ...
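Definition 34 is easy to try out numerically. The sketch below is an illustration only: the two rules and the triangular membership functions are made-up examples, and the integration grid is an arbitrary choice. It computes ū by the min/max combination and centroid defuzzification.

```python
# A minimal numerical sketch of the fuzzy control of Definition 34
# (min for "and", max over rules, centroid defuzzification).
# The rules and membership functions below are illustrative, not from the paper.

def tri(a, left, right):
    """Triangular membership function with peak at a and support [left, right]."""
    def mu(z):
        if z <= left or z >= right:
            return 0.0
        return (z - left) / (a - left) if z <= a else (right - z) / (right - a)
    return mu

# Two hypothetical rules: "if x is negligible, u is negligible",
#                         "if x is small positive, u is small negative".
rules = [
    (tri(0.0, -1.0, 1.0), tri(0.0, -1.0, 1.0)),   # (condition on x, conclusion on u)
    (tri(1.0,  0.0, 2.0), tri(-1.0, -2.0, 0.0)),
]

def fuzzy_control(x, u_lo=-2.0, u_hi=2.0, n=4001):
    """Centroid of mu_C(u) = max over rules of min(condition(x), conclusion(u))."""
    du = (u_hi - u_lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        u = u_lo + i * du
        mu_C = max(min(cond(x), concl(u)) for cond, concl in rules)
        num += u * mu_C * du
        den += mu_C * du
    return num / den

print(fuzzy_control(0.0))   # zero deviation: the control stays near 0
print(fuzzy_control(0.5))   # intermediate deviation: a negative control value
```

Note that ū varies continuously with x here, as Theorem 15 asserts, even though each individual rule has a kink at its endpoints.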
Comment. Informally, we can summarize these results as follows: fuzzy control is always continuous.

11.4 Preparing to analyze stability

As we have already mentioned, fuzzy control is a kind of extrapolation. In reality there exists some control u(x, ...) that an expert actually applies. However, he cannot precisely explain what function u he uses. So we ask him lots of questions, extract several rules, and form a fuzzy control from them. What we are interested in is whether this fuzzy control is better (in this subsection, more stable) than the original one. Of course, it cannot always be better, because we could start from the very beginning with the best possible control u(x, ...). What we can analyze is whether the chain

real control → discrete approximation → fuzzy control

really improves reasonable, but not optimal, controls. Let's describe this procedure formally for the simplest case, when u is a monotonic function of x (and does not depend on ẋ). The fact that the control must be monotone is prompted by common sense: the more we deviate from the desired position x = 0, the faster we need to move. For simplicity we also consider only the case when centroid defuzzification is used.

11.5 Formal description of a fuzzy control that stems from a monotonic control u(x)

Definition 35. By a triangular fuzzy property with midpoint a and endpoints a − ∆1 and a + ∆2 we mean a fuzzy set with a membership function µ(x) = 0 if x < a − ∆1 or x > a + ∆2; µ(x) = (x − (a − ∆1))/∆1 if a − ∆1 ≤ x ≤ a; and µ(x) = 1 − (x − a)/∆2 if a ≤ x ≤ a + ∆2.

Definition 36. Let's fix some ∆ > 0. For every integer j, by Nj we denote the triangular fuzzy property with midpoint j∆ and endpoints (j − 1)∆ and (j + 1)∆. We will call N0 negligible (N for short), N1 small positive (or SP), and N−1 small negative (or SN). Suppose that u(x) is a monotone function.
By the rules generated by u(x), we mean the following set of rules: "if Nj(x), then Mj(u)" for all j, where Mj is the triangular fuzzy property with midpoint u(j∆) and endpoints u((j − 1)∆) and u((j + 1)∆). By ū(x), we denote the fuzzy control that is obtained from this set of rules.

Comment: what these rules look like in the linear case. For example, let's take linear control, in which u depends linearly on x and the other input parameters. This control is one of the most often used [DH88]. In the simplest case (like a thermostat), when we control the first time derivative of the desired parameter x, i.e., when u = u(x), linear control means that u = −kx for some k > 0 (k > 0, because otherwise such a "control" would only increase deviations, and not stabilize them at all). For linear control, Mj resembles N−j, with the only difference being that instead of ∆ we use k∆. So we can reformulate the corresponding rules as follows: if x is negligible, then u must be negligible; if x is small positive, then u must be small negative; etc. Here we mean ∆ when we talk about x, and we mean k∆ when we talk about u.

The following definition will be limited to the case when we control the first time derivative.

Definition 37. Suppose that a control u(x) is given. By a trajectory of the controlled system we understand a solution of the differential equation ẋ = u(x). Let's fix some positive number M (e.g., M = 1000). By a relaxation time t(δ) for a control u(x) and initial deviation δ, we understand the smallest time with the following property: if by x(t) we denote the trajectory of the controlled system with the initial condition x(0) = δ, then for all t ≥ t(δ) the values x(t) satisfy the inequality |x(t)| ≤ x(0)/M.
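Definition 37 can be checked by direct simulation: for the linear control u(x) = −kx the measured relaxation time should match ln M/k. In the sketch below the gain k, the step size, and M = 1000 are illustrative choices.

```python
# A small numerical check of the relaxation time of Definition 37
# (a sketch; the Euler step size and M = 1000 are arbitrary choices).
import math

def relaxation_time(u, delta, M=1000.0, dt=1e-4, t_max=100.0):
    """First time at which the trajectory of x' = u(x), x(0) = delta,
    falls to |x| <= delta / M (simple Euler integration)."""
    x, t = delta, 0.0
    while abs(x) > abs(delta) / M:
        x += u(x) * dt   # Euler step
        t += dt
        if t > t_max:
            raise RuntimeError("no relaxation within t_max")
    return t

k = 2.0
t_sim = relaxation_time(lambda x: -k * x, delta=1.0)
t_theory = math.log(1000.0) / k   # for u(x) = -k*x: t = ln M / k
print(t_sim, t_theory)            # the two values are close to each other
```

For a non-linear control u, the same function measures t(δ) for each initial deviation δ, whose δ → 0 limit is the relaxation time T of Definition 38.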
Comment. For linear control u(x) = −kx we have x(t) = x(0) exp(−kt), and therefore the relaxation time t is easily determined by the equation exp(−kt) = 1/M, i.e., t = ln M/k. For non-linear control (and fuzzy control is non-linear) t depends on δ, so to estimate how successfully the system overcomes small perturbations, we can consider the limit value when δ → 0.

Definition 38. By the relaxation time T of a control u(x) we mean the limit of t(δ) as δ → 0.

LEMMA 4. If the control u(x) is a smooth function of x, then the relaxation time is equal to ln M/(−u′(0)), where u′ denotes the derivative of u.

Comment. So the bigger this derivative, the smaller the relaxation time, and therefore the better (more stable) the control. One can ask: if we are for more stable control, why don't we simply take −kx for some very big k? This sounds fine for small x, but for big perturbations it would lead to uncomfortably quick changes (just like in the case of bang-bang control). So ideally we would combine a moderate k for big x with a bigger k for small perturbations x. Fuzzy control does not change the behavior of the system for big x: for x = j∆, as one can easily see, ū(x) = u(x) = −kx. In this sense it is really an extrapolation. Let's see what happens for small x. Let's first consider the case when we use min and max:

THEOREM 16 (min, max). If we start with a linear control u(x) with a relaxation time T, then the relaxation time T̄ of the resulting fuzzy control is equal to T̄ = (2/3)T.

Comment. So fuzzy control is really smarter! The new relaxation time is 1.5 times smaller than the old one. This 50% improvement is in good accordance with the results of an experimental comparison of fuzzy control and regular control [D91]. If the original control u(x) is smooth but non-linear, then the following theorem is true:

THEOREM 17 (min, max).
Suppose that a smooth control u(x) with a relaxation time T is given, and let T̄∆ be the relaxation time of the fuzzy control that is obtained from it using the parameter ∆ > 0. When ∆ → 0, this relaxation time T̄∆ tends to (2/3)T.

Corollary. For sufficiently small ∆, the relaxation time of the fuzzy control is smaller than the relaxation time of the original control.

Comment. The same results are true for more complicated control situations, when the control depends on x, ẋ, etc. [KFLL91].

12. THE CHOICE OF & AND ∨ OPERATIONS. DOCKING AND TRACING

12.1 Two main types of control problems in Space Shuttle dynamics: tracing and docking

Tracing and docking are real-life examples of control problems. In tracing (e.g., tracing a star or a moving object) the main objective is to keep tracing and not to lose the object. If we lose the object that we are tracing, then we must return to it as quickly as possible. So for such problems the reasonable criterion for choosing a control is maximal stability (i.e., the smallest possible relaxation time). In docking problems, this criterion makes no sense: if we unnecessarily speed up, we'll crash into the space station instead of smoothly approaching it. So here the reasonable criterion is maximal smoothness. Here we encounter an additional problem: how to describe the smoothness of a trajectory? We will show that in these two cases different & and ∨ operations are the best.
12.2 What & and ∨ operations will be considered

Not all & and ∨ operations make sense for fuzzy control. For example, if f&(a, b) = 0 for some a > 0, b > 0, then for some x, ẋ, ... the resulting membership function for the control µC(u) can be identically 0, and there is no way to extract a value of the control ū from such a function. So we must somehow restrict the class of possible operations.

Motivation. We have already mentioned in Section 3 that one of the most frequently used methods of assigning truth values t(A) to uncertain statements is based on using the ratio P(A) = N(A)/N, where N(A) is the number of experts who believe in A, and N is the total number of experts that were questioned. In this interpretation, the following inequalities are true: N(A ∨ B) ≤ N(A) + N(B), N(A ∨ B) ≤ N, N(A ∨ B) ≥ N(A) and N(A ∨ B) ≥ N(B). If we divide both sides of these inequalities by N and combine them into one, we get the following inequality: max(P(A), P(B)) ≤ P(A ∨ B) ≤ min(P(A) + P(B), 1). Likewise, from N(A&B) ≤ N(A) and N(A&B) ≤ N(B) we conclude that P(A&B) ≤ min(P(A), P(B)).

If belief in A and belief in B were independent events, then we would have P(A&B) = P(A)P(B). In real life, beliefs are not independent: if an expert has strong beliefs in several statements that later turn out to be true, then this means that he is really a good expert, and therefore it is reasonable to expect that his degree of belief in other statements that are true is bigger. If A and B are complicated statements, then many of those experts who believe in A are really good experts, and therefore they believe in B as well (and hence in A&B). Therefore, the total number N(A&B) of experts who believe in A&B must be bigger than the same number in the case when beliefs in A and B were uncorrelated random events.
So we come to the conclusion that the following inequality sounds reasonable: P(A&B) ≥ P(A)P(B). In statistical terms, we can express this inequality by saying that A and B are non-negatively correlated. In this case, we are guaranteed that if a > 0 and b > 0, then f&(a, b) > 0, i.e., we do not have the problem that we discussed in the beginning of this subsection. So we arrive at the following definitions:

Definition 39. By an and-or pair we will understand a pair of & and ∨ operations f&(a, b) and f∨(a, b) such that max(a, b) ≤ f∨(a, b) ≤ min(a + b, 1) and f&(a, b) ≤ min(a, b). An and-or pair is called correlated if f&(a, b) ≥ ab for all a and b.

12.3 Tracing: main result

THEOREM 18. If we start with a linear control u(x) = −kx with a relaxation time T, then the relaxation time of the fuzzy control for min(a, b) and min(a + b, 1) is T̄ = (1/2)T, and for any other and-or pair T̄ ≥ (1/2)T.

Comments. 1. We have proved that f& = min and f∨(a, b) = min(a + b, 1) are the best choice for tracing. This result is in good accordance with the general result of Section 7, where the operations mentioned in Theorem 18 appeared as a particular case of the general family of reasonable operations. This result is also in good accordance with the experimental fact that control systems that use min and + are really better than the manual control of the experts whose knowledge they use [MYI87]. Several other advantages of + over max are described by B. Kosko [K92].

2. Our proof of Theorem 18 does not use the triangular form of the membership function; it uses only the fact that this membership function is located on [a − ∆, a + ∆]. So the fact that min(a, b) and min(a + b, 1) form the best pair is true for other membership functions as well (however, the value of the relaxation time can change essentially).
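Theorems 16 and 18 can both be illustrated numerically. The sketch below (an illustration, not the paper's proof) builds the fuzzy control of Definitions 35-36 for the linear control u(x) = −k·x and measures its effective gain −ū(x)/x near x = 0: with min/max the gain should approach 1.5·k (so T̄ = (2/3)T, Theorem 16), and with min and the bounded sum min(a + b, 1) it should approach 2·k (so T̄ = (1/2)T, Theorem 18). The grid sizes and the truncation to rules j = −3..3 are arbitrary choices.

```python
# Numerical illustration of Theorems 16 and 18: effective gain of the fuzzy
# control near x = 0 for two and-or pairs.  Grid sizes are arbitrary choices.

def tri(mid, lo, hi):
    """Triangular membership function (Definition 35)."""
    def mu(z):
        if z <= lo or z >= hi:
            return 0.0
        return (z - lo) / (mid - lo) if z <= mid else (hi - z) / (hi - mid)
    return mu

k, delta = 1.0, 1.0
js = range(-3, 4)   # finite window of rules N_j -> M_j (Definition 36)
N = {j: tri(j * delta, (j - 1) * delta, (j + 1) * delta) for j in js}
M = {j: tri(-k * j * delta, -k * (j + 1) * delta, -k * (j - 1) * delta) for j in js}

def u_bar(x, combine, n=20001):
    """Centroid of mu_C(u) = combine over rules of min(N_j(x), M_j(u))."""
    lo, hi = -4 * k * delta, 4 * k * delta
    du = (hi - lo) / (n - 1)
    num = den = 0.0
    for i in range(n):
        u = lo + i * du
        mu_C = combine(min(N[j](x), M[j](u)) for j in js)
        num += u * mu_C
        den += mu_C
    return num / den

x = 0.01
gain_max = -u_bar(x, max) / x                            # Theorem 16: -> 1.5*k
gain_sum = -u_bar(x, lambda vs: min(sum(vs), 1.0)) / x   # Theorem 18: -> 2*k
print(gain_max, gain_sum)
```

By Lemma 4, a gain of 1.5·k (respectively 2·k) at the origin translates exactly into the relaxation times (2/3)T and (1/2)T stated in the theorems.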
For another frequently used pair of operations, product and sum, the results are not that impressive:

THEOREM 19. If we start with a linear control u(x) = −kx with a relaxation time T, then the relaxation time of the fuzzy control for f&(a, b) = ab and f∨(a, b) = min(a + b, 1) is T̄ = T.

Comment. This result is in good accordance with the experimental fact that fuzzy control that uses the product and + is approximately of the same quality as the traditional control [K92].

12.4 Docking: how to describe the smoothness of a trajectory x(t)?

Motivation. Before we can decide what operations lead to smoother trajectories, we must be able to describe what "smoother" means. This problem is similar to the one that we solved in Section 5: to describe what "better" means for a membership function µ(x). So we will look for a functional J(x). It is reasonable to assume that it is an analytical, non-degenerate, local functional, and that the corresponding ordering relation is invariant with respect to changing the units of time t and changing the starting point for time t. Invariance with respect to reasonable rescaling of certainty values is not applicable here, but we can apply invariance with respect to a physical rescaling x(t) → kx(t), where k is the ratio of the old and the new units. So we arrive at the following definitions:

Definition 40. By a physical rescaling, we understand a transformation x(t) → kx(t).

Definition 41. By a non-smoothness functional, we understand an analytical, non-degenerate, local functional x(t) → J(x(t)), for which the corresponding ordering relation is invariant with respect to changing the units of t, changing the starting point for t, and physical rescaling.

THEOREM 20. Every non-smoothness functional is equivalent to J(x) = ∫ (x^(n)(t))² dt for some integer n.

Comment.
So the problem of choosing the optimal control is reduced to the problem of choosing a control for whose trajectories some non-smoothness functional takes the smallest possible values. The choice of n depends on the problem.

What n to choose for docking? If we take n ≥ 2, then for all linear trajectories x(t) the value of the non-smoothness functional is 0, irrespective of whether the slope of x(t) is small or big. For docking, however, a control with a big slope would be a disaster. So n ≥ 2 is inappropriate, and hence for the docking problem the only reasonable choice is n = 1. As with relaxation time, we can now define smoothness:

Definition 42. By the non-smoothness J(δ) of a control u(x) and initial deviation δ, we understand the value J(δ) = ∫₀^∞ ẋ(t)² dt, where x(t) is the trajectory of the controlled system with the initial condition x(0) = δ.

Definition 43. By the non-smoothness J of a control u(x), we mean the limit of J(δ)/δ² for δ → 0.

Comment. The division by δ² is added because for linear control J(δ) ∼ δ² when δ → 0, so we must divide by δ² to get a finite limit.

THEOREM 21. If we start with a linear control u(x) = −kx, then among all possible correlated and-or pairs the non-smoothness of the resulting fuzzy control is the smallest for ab and max(a, b).

Comment. This result is in good accordance with the experiments on a Space Shuttle simulator.
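Definitions 42-43 are easy to verify for the linear control u(x) = −kx: there x(t) = δ exp(−kt), so J(δ) = ∫₀^∞ (δk e^(−kt))² dt = δ²k/2, and the non-smoothness J = J(δ)/δ² equals k/2. The sketch below checks this by simulation; the step size and truncation horizon are arbitrary choices.

```python
# A quick numerical check of Definitions 42-43 for the linear control
# u(x) = -k*x: the non-smoothness J(d)/d^2 should equal k/2.

def non_smoothness(k, d, dt=1e-4, t_max=10.0):
    """Approximate J(d) = integral of xdot(t)^2 dt for xdot = -k*x, x(0) = d."""
    x, J, t = d, 0.0, 0.0
    while t < t_max:           # the tail beyond t_max is negligible here
        xdot = -k * x
        J += xdot * xdot * dt  # accumulate xdot^2 dt
        x += xdot * dt         # Euler step
        t += dt
    return J / (d * d)         # Definition 43: divide by the squared deviation

k = 2.0
print(non_smoothness(k, 0.5), k / 2)   # both close to 1.0
```

This also shows why the limit in Definition 43 is well defined: the simulated value is independent of the initial deviation d for a linear control.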
13. HOW TO COMBINE RULES AND OPTIMIZATION WHEN DESIGNING A FUZZY CONTROL

13.1 Formulation of the problem

A general problem and a real-life example. All the methods of designing a fuzzy control that we have considered so far are based on experts' rules only. Sometimes, in addition to the rules, we also know some characteristic of the control that we would like to optimize. For example, when we design a control for a Space Shuttle, one of the objectives is to save fuel. The bigger the acceleration, the more fuel we use. So we want to minimize the absolute value of the control (i.e., in this example, of the acceleration). In other situations, some other function J(u) of the control u has to be optimized. Can we somehow add the demand that J(u) should be optimal to the initial set of rules, so that the method of translating rules into a control would automatically take care of that demand as well?

A methodology for formalizing optimization as one of the demands was proposed by Zadeh even before the first application of fuzzy control [Z72] (for a current survey see [Z85]). We are given a fuzzy property P and an objective function J(u). Among all the points that satisfy the property P, we want to find a value u for which J(u) is the biggest possible. Since the set of points in which P is true is not a crisp set, we are willing to accept as an answer the whole set of points where J(u) is sufficiently large and where P(u) is true. How to describe this in mathematical terms?

Zadeh's solution to this problem is as follows. Let us denote by µP(u) the membership function that describes the fuzzy property P. If we can somehow find a membership function µm(u) that corresponds to the phrase "J(u) is sufficiently large", then we can assign to the phrase "J(u) is sufficiently large and P(u) is true" the membership function µ(u) = f&(µP(u), µm(u)). If we are then interested in a single value ū, we can apply a defuzzification procedure to the resulting membership function µ(u).
The main problem here is choosing a membership function µm(u) that corresponds to the maximization problem. Zadeh, in [Z72], proposed using µm(u) = (J(u) − m)/(M − m), where M is the maximum and m is the minimum of J(u). This proposal was semi-empirical, so it is no wonder that several other choices were proposed and sometimes turned out to be better [Z85]. So the main problem is: what to choose?

13.2 Motivations of the following definition

Suppose that we have already somehow defined a membership function µm(u) that corresponds to optimization. Then for every value u we have two different reasonable scales that both express our degree of belief in the statement that J(u) is sufficiently large. On one hand, we have the function µm(u), which expresses precisely our degree of belief in that statement. On the other hand, the bigger the value J(u), the stronger our belief that J(u) is sufficiently large, so we can use the value J(u) itself as a different scale to express our degree of belief. Since we have two reasonable scales, the corresponding rescaling J(u) → µm(u) must be a reasonable transformation in the sense of Section 3. So µm(u) = f(J(u)), where f(x) is a reasonable transformation.

If J(u) attains its unconditional maximum at some point u, then for this point u we are absolutely sure that J(u) is sufficiently large. Therefore for this u we must have µm(u) = 1. On the other hand, if u is a point where J(u) attains its minimum, then there is absolutely no reason to call this value sufficiently large, so we must have µm(u) = 0. We arrive at the following definition.
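Zadeh's scheme as described above is easy to sketch in code. In this illustration the fuel-like objective J(u) = −|u| and the triangular fuzzy property are made-up examples; µm is the linear rescaling proposed in [Z72], and f& = min with centroid defuzzification.

```python
# An illustrative sketch of Zadeh's scheme [Z72]: mu_m(u) = (J(u) - m)/(M - m),
# combined with the fuzzy constraint via f& = min, then centroid-defuzzified.
# The objective J and the property P below are hypothetical examples.

us = [i / 1000.0 for i in range(-2000, 2001)]   # grid on [-2, 2]

def J(u):
    return -abs(u)              # "save fuel": the smaller |u|, the better

def mu_P(u):
    return max(0.0, 1.0 - abs(u - 1.0))   # hypothetical property "u is about 1"

m = min(J(u) for u in us)       # minimum of J on the grid
M = max(J(u) for u in us)       # maximum of J on the grid

def mu_m(u):
    """Membership of "J(u) is sufficiently large" (Zadeh's linear choice)."""
    return (J(u) - m) / (M - m)

def centroid(mu):
    num = sum(u * mu(u) for u in us)
    den = sum(mu(u) for u in us)
    return num / den

u_bar = centroid(lambda u: min(mu_P(u), mu_m(u)))
print(u_bar)   # a compromise between "u is about 1" and "|u| is small"
```

The resulting ū lies between 0 (the fuel optimum) and 1 (the peak of the constraint), which is exactly the intended trade-off between the rules and the objective.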
13.3 Definition and the main result

Definition 44. Suppose that a function J(u) is given. By a membership function that corresponds to the statement "J(u) is sufficiently large" we understand a function µm(u) = f(J(u)), where f(x) is a reasonable transformation such that f(m) = 0 and f(M) = 1, where by M we denote the unconditional maximum of J(u), and by m its minimum value.

THEOREM 22. A membership function that corresponds to the statement "J(u) is sufficiently large" is equal to µm(u) = (J(u) − m)/(M − m).

THEOREM 22′. A membership function that corresponds to the statement "J(u) is sufficiently large" is equal to µm(u) = (1 + t)(J(u) − m)/(M − m + t(J(u) − m)).

Comment. The membership function from Theorem 22 is precisely the function that was proposed by L. Zadeh [Z72]. The formula from Theorem 22′ gives us an additional parameter t that we can use for adjusting the resulting control.

APPENDIX: PROOFS

Proof of Theorems 1 and 1′

1. First of all, let's prove that all linear functions f(x) = ax with 0 < a < 1 are reasonable transformations. Indeed, suppose that 0 < a < 1. Let's take any N and choose M and M′ in such a way that the transformation x → (Nx + M′)/(N + M + M′) is close to x → ax, i.e., that a is close to N/(N + M + M′) and 0 to M′/(N + M + M′). The last value can be made precisely equal to 0 if we take M′ = 0. In this case, we must take M so that N/(N + M) is close to a. We can easily find the value M̃ for which N/(N + M̃) is precisely equal to a (this value will not necessarily be an integer). Namely, from N/(N + M̃) = a we conclude that (N + M̃)/N = 1/a, hence 1 + M̃/N = 1/a, M̃/N = 1/a − 1, and finally M̃ = N(1/a − 1). Since we assumed that a < 1, the value 1/a − 1 is positive. The value N(1/a − 1) is not necessarily an integer, but we can take the integer MN that is closest to M̃, for example, the integer part of M̃: MN = ⌊N(1/a − 1)⌋. According to the definition of the class F of reasonable transformations, the transformation x → Nx/(N + MN), which corresponds to M′ = 0, belongs to F. This transformation has the form x → aN x, where aN = N/(N + MN) = 1/(1 + MN/N). From the definition of MN it follows that |MN − M̃| ≤ 1, therefore |MN/N − M̃/N| ≤ 1/N, and since M̃/N = 1/a − 1, we conclude that MN/N → 1/a − 1 as N → ∞. Hence aN → 1/(1 + (1/a − 1)) = a. So F contains the functions aN x with limN→∞ aN = a, and since F is a connected Lie group, it contains the limit function ax as well.

2. Let's now prove that all linear functions f(x) = ax with a > 0 are reasonable transformations. Let's consider 3 cases: a < 1, a = 1 and a > 1. We have already proved the statement for a < 1. If a = 1, then the transformation is x → x, and according to the definition of a group, F must contain it. If a > 1, then the inverse transformation x → (1/a)x satisfies the inequality 1/a < 1 and therefore (according to 1.) belongs to F. Therefore, since F is a group, it must also contain the transformation that is inverse to x → (1/a)x, i.e., the function ax. The statement is proved.

3. Let's now prove that for every b > 0 the function f(x) = x + b belongs to F. Indeed, if we take M = 0 and M′ = N, then from condition 2) of the definition of F we conclude that the transformation x → (x + 1)/2 belongs to F. On the other hand, according to 2., the function ax belongs to F for all a > 0. Since F is a group, a composition of its elements always belongs to F; therefore for every k > 0 and l > 0 the composition of the three transformations x → kx, x → (x + 1)/2 and x → lx also belongs to F. This composition equals l((kx + 1)/2). We want
to choose k and l in such a way that this function is equal to the desired one, x + b. For that we must choose k and l so that lk/2 = 1 and l/2 = b. From the second equation we conclude that l = 2b, and then from the first equation we conclude that k = 2/l = 1/b. For these k and l we have thus proved that the function x + b belongs to F.

4. Now we can prove that the function x → x + b belongs to F for every b. Indeed, we have already proved this for b > 0; for b = 0 it follows from the fact that F is a group and hence contains the function x → x; for b < 0 it follows from the fact that a function x → x + b with b < 0 is the inverse of the function x → x + |b|, which belongs to F according to 3., and therefore this function x → x + b belongs to F as the inverse of a function from F.

5. Now let's prove that any linear function ax + b with a > 0 belongs to F. Indeed, any such function is a composition of the functions ax and x + b, about which we have already proved that they belong to F. Since F is a group, it must contain their composition as well. So we have proved the second statement of Theorem 1.

6. Now F is a connected Lie group that contains all increasing linear transformations. So we can use the results of [GS64] and [SS65], who proved, in particular, that if a connected (finite-dimensional) Lie group G contains all linear transformations ax + b with a > 0, then it coincides either with the group of all monotone linear transformations, or with the group of all fractionally linear transformations x → (ax + b)/(cx + d). In both cases all transformations are fractionally linear, so we have proved Theorem 1′.

7. In order to complete the proof of Theorem 1, we must prove that if we assume 3), then the case when F coincides with the set of all fractionally linear transformations is impossible.
Let's show that if F contains all fractionally linear transformations, we get a contradiction with the assumption 3) that there exist two triples that cannot be transformed into each other. Namely, we will show that for any two triples x < y < z and x′ < y′ < z′ there exists a fractionally linear transformation that transforms x < y < z into x′ < y′ < z′. In order to prove that, let's prove that for any x < y < z there exists a transformation f that transforms 0, 1 and ∞ into x, y and z. Then similarly we would be able to prove that some other transformation f′ transforms 0, 1 and ∞ into x′, y′ and z′. Then the composition of f⁻¹ (inverse to f) and f′ is a fractionally linear function that transforms x < y < z into x′ < y′ < z′ (it is a fractionally linear function, because such functions form a group). So let's find such an f. The desired conditions f(0) = x, f(1) = y and f(∞) = z, after substituting f(x) = (ax + b)/(cx + d) and taking d = 1, turn into b = x, (a + b)/(c + 1) = y and a/c = z. The third equation leads to a = cz. We already know b. Substituting a = cz and b = x into the second equation, we conclude that (cz + x)/(c + 1) = y. Multiplying both sides of this equation by the denominator, we conclude that cz + x = cy + y, hence c = (y − x)/(z − y). So a = cz = z(y − x)/(z − y). The statement is proved. So in the case of fractionally linear functions we get a contradiction with condition 3). This contradiction shows that F can contain only linear functions. So the proof of Theorem 1 is complete. Q.E.D.

Theorems 2, 3, and 3′ easily follow from Theorems 1 and 1′.
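The coefficients derived in step 7 can be verified directly: with b = x, c = (y − x)/(z − y) and a = cz, the map f(t) = (at + b)/(ct + 1) sends 0 → x, 1 → y, and t → ∞ to a/c = z. The triple below is an arbitrary example, not one from the paper.

```python
# A sanity check of the fractionally linear transformation constructed in
# step 7 of the proof: f maps 0 -> x, 1 -> y, and infinity -> z.
x, y, z = 1.0, 2.0, 5.0          # an arbitrary triple with x < y < z

b = x                            # from f(0) = x
c = (y - x) / (z - y)            # from (cz + x)/(c + 1) = y
a = c * z                        # from f(infinity) = a/c = z

def f(t):
    return (a * t + b) / (c * t + 1.0)   # d = 1, as in the proof

print(f(0.0))        # x
print(f(1.0))        # y
print(f(1e9))        # approaches z = a/c as t -> infinity
```

Applying the analogous map f′ for a second triple and composing f′ with f⁻¹ then carries any triple x < y < z onto any other, which is exactly the contradiction with condition 3) used in the proof.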
Proof of Theorem 4

1. We are interested only in what the functional J is equivalent to. So we will try to find a simpler functional to which J is equivalent. First of all, we can notice that the relation J(µ1) < J(µ2) is equivalent to the relation J(µ1) − a0 < J(µ2) − a0. Therefore, any functional J is equivalent to the functional J̃(µ) = J(µ) − a0, in the expansion of which there is no constant term at all. Therefore, without losing generality, we can assume that there is no such term in J, i.e., that a0 = 0.

2. Let's now use the demand that the ordering relation is invariant with respect to rescalings (i.e., reasonable transformations). According to Theorem 1, for every a > 0, the transformation x → ax is a reasonable transformation, and therefore the transformation µ(x) → aµ(x) is a rescaling. For this transformation, the invariance demand means that if J(µ1) < J(µ2), then J(aµ1) < J(aµ2). According to the definition of an analytical functional, any such functional can be represented as a sum J(µ) = J0(µ) + J1(µ) + J2(µ) + ..., where by Jk(µ) we mean the sum of all k-th order terms in J (so J0(µ) is the constant term, J1(µ) includes all linear terms, J2(µ) all quadratic terms, etc.). We have already noticed in 1. that we can always take J0(µ) = 0. On the other hand, the demand that the functional J is non-degenerate means that J2(µ) ≠ 0. So the first non-zero term in this expansion is either J1(µ) or J2(µ). Let's denote by k the number of this first non-zero term (so k = 1 or k = 2). Then the expansion turns into J(µ) = Jk(µ) + Jk+1(µ) + Jk+2(µ) + ... Since Jl(µ) by definition is a sum of terms of l-th order, we can conclude that Jl(aµ) = a^l Jl(µ). Therefore J(aµ) = a^k Jk(µ) + a^(k+1) Jk+1(µ) + a^(k+2) Jk+2(µ) + ..., and the invariance demand means that if J(µ1) < J(µ2), then a^k Jk(µ1) + a^(k+1) Jk+1(µ1) + ... < a^k Jk(µ2) + a^(k+1) Jk+1(µ2) + ... Dividing both sides by a^k, we conclude that Jk(µ1) + aJk+1(µ1) + ...
< Jk(µ2) + aJk+1(µ2) + ... This must be true for all a > 0. In particular, in the limit a → 0 we conclude that Jk(µ1) ≤ Jk(µ2). Therefore, J(µ1) < J(µ2) implies that Jk(µ1) ≤ Jk(µ2). Likewise, we can prove that if J(µ1) = J(µ2), then Jk(µ1) = Jk(µ2). So if for two membership functions the value of J(µ) is the same, then the values of Jk(µ) are also equal. If by f(a) we denote the value of Jk(µ) for all the functions µ for which J(µ) = a, we conclude that Jk(µ) = f(J(µ)). Then from the first statement of this paragraph we conclude that the function f is monotone, i.e., if a < b, then f(a) ≤ f(b).

Let's prove that this function f(a) is strictly monotone. Indeed, suppose that it is not, i.e., that there exist a < b for which f(a) = f(b). Since f is monotone, for every c from the interval [a, b] we have f(a) ≤ f(c) ≤ f(b) = f(a), hence f(c) = f(a). So the function f is constant on the whole interval [a, b]. But since both J and Jk are analytical, the function f that relates them is also an analytical function. An analytical function of one real variable is uniquely determined by its values in the neighborhood of any point. Therefore, since f(a) is constant on an interval, it must coincide with the analytical function that is everywhere equal to this constant. So f(a) = const for all a, and therefore Jk(µ) equals the same constant, and this contradicts the fact that Jk is a functional of first or second order. This contradiction proves that the function f(a) is strictly monotone, i.e., if a < b, then f(a) < f(b).

By the definition of f we have Jk(µ) = f(J(µ)). Therefore, if J(µ1) < J(µ2), then Jk(µ1) < Jk(µ2). Likewise, J(µ1) = J(µ2) implies that Jk(µ1) = Jk(µ2), and J(µ1) > J(µ2) implies that Jk(µ1) > Jk(µ2). From these three statements we can conclude that J(µ1) < J(µ2) is equivalent to Jk(µ1) < Jk(µ2). This means that J is equivalent to Jk.

3.
There are only two possibilities: k = 1 and k = 2. Since we demanded that J is a non-degenerate functional, J cannot be equivalent to a linear functional J1 , therefore, k = 2 and J is equivalent to 45
J2 . In view ofR that, without losing generality, we can assume that a functional J is quadratic, i.e., that J(µ) = a2 (x, y)µ(x)µ(y) dx dy, where a(x, y) = a(y, x). Since we have only one function ai , we can for simplicity omit the index 2. 4. Let’s now use the invariance of the ordering relation with respect to changing the starting point for x, i.e., the condition that if J(µ1 (x)) = J(µ2 (x)) then J(µ1 (x + x0 )) = J(µ2 (x + x0 )). Like in 3., we can conclude that J(µ(x + x0 )) = f (J(µ(x)) for some analytical function f (a) = f0 + f1 a + f2 a2 + ... So J(T µ) = f0 + f1 J(µ) + f2 J 2 (µ) + ..., where T µ(x) = µ(x + x0 ). We know that J is 2-nd order homogenous, so f0 is 0-th order, f1 J is 2-nd order, f2 J 2 is 4-th order, etc. But J(T µ) is also 2-nd order, so f0 = f2 = f3 + ... = 0. Therefore, the function f (a) is linear f (a) = ca, where a constant c can depend on x0 . So J(µ(x + x0 )) = c(x0 )J(µ(x)). From this formula we can conclude that J(µ(x + x1 + x2 )) = c(x1 + x2 )J(µ(x)) and at the same time J(µ(x + x1 + x2 )) = c(x1 )J(µ(x + x2 )) = c(x1 )c(x2 )J(µ(x)), so c(x1 + x2 ) = c(x1 )c(x2 ), hence [A66] c(x) = exp(Ax) for some A. Substituting the formula for J from 3., we conclude that R R a(x, y)µ(x + x0 )µ(y + x0 ) dx dy = exp(Ax0 ) a(x, y)µ(x)µ(y) dt ds. Changing variables in the first integral to x′ = x +Rx0 , y ′ = y + x0 , we conclude that R a(x − x0 , y − x0 )µ(x)µ(y) dx dy = exp(Ax0 )a(x, y)µ(x)µ(y) dx dy. These two quadratic functionals coincide for all µ(x), hence their kernels coincide, i.e. a(x − x0 , y − x0 ) = exp(Ax0 )a(x, y) for all x, y and x0 . Substituting x0 = y, we conclude that a(x − y, 0) = exp(Ay)a(x, y), so a(x, y) = exp(Ay)a(x − y), where we denoted a(x, 0) by a(x). 5. Let’s now use invariance with respect to changing units for x, i.e., the demand that J(µ1 (x)) = J(µ2 (x)) implies that J(mu1 (λx)) = J(µ2 (λx)). 
Arguments similar to the ones used in 4. lead to the conclusion that J(µ(λx)) = c(λ)J(µ(x)), where c(λ1λ2) = c(λ1)c(λ2). Therefore [A66], c(λ) = λ^α, so J(µ(λx)) = λ^α J(µ(x)), and ∫∫ a(x, y)µ(λx)µ(λy) dx dy = λ^α ∫∫ a(x, y)µ(x)µ(y) dx dy. Substituting the new variables x′ = λx and y′ = λy into the first integral, we conclude that ∫∫ a(λ⁻¹x, λ⁻¹y)µ(x)µ(y) dx dy = λ^α ∫∫ a(x, y)µ(x)µ(y) dx dy, hence a(λ⁻¹x, λ⁻¹y) = λ^α a(x, y). Substituting the expression for a(x, y) from 4., we conclude that exp(−Aλ⁻¹y)a(λ⁻¹(x − y)) = λ^α exp(−Ay)a(x − y). For x = y we conclude that exp(−Aλ⁻¹y) = λ^α exp(−Ay) for all y and λ, hence A = 0. So a(x, y) = a(x − y). The symmetry of the function a(x, y) (i.e., the fact that a(x, y) = a(y, x)) leads to the conclusion that the function a(x) is even: a(−x) = a(x). Substituting the value A = 0 into the above formula, we get for the case y = 0 the equation a(λ⁻¹x) = λ^α a(x). This means that a(x) is a homogeneous generalized function in the sense of [GSh64, Ch. I, 3.11]. Such functions have been classified in [GSh64], and the most general even solutions are a(x) = C|x|^(−α) (if α is not an even positive integer) and a(x) = C|x|^(−2n) + C1 δ^(2n)(x) if α equals an even positive integer 2n, where δ(x) is the delta-function (defined by the property ∫ δ(x)f(x) dx = f(0)), and δ^(2n) means the 2n-th derivative of δ.

6. Now we will use the locality demand. It means that if I1 and I2 are disjoint intervals, µ1(x) and µ̃1(x) are located on I1, and µ2(x) and µ̃2(x) are located on I2, then from J(µ1 + µ2) > J(µ̃1 + µ2) it follows that J(µ1 + µ̃2) > J(µ̃1 + µ̃2). Similarly to the arguments from 4., we conclude that for every two functions µ2(x) and µ̃2(x) that are located on I2, there exists a function f(a) such that for every function µ(x) that is located on I1, the following is true: J(µ(x) + µ̃2(x)) = f(J(µ(x) + µ2(x))), and this function f(a) is linear: f(a) = f0 + f1 a. If we substitute this expression for f(a) and the expression for J(µ(x)) (= ∫∫ a(x − y)µ(x)µ(y) dx dy), we conclude that

∫∫ a(x − y)µ(x)µ(y) dx dy + ∫∫ a(x − y)(µ(x)µ̃2(y) + µ(y)µ̃2(x)) dx dy + ∫∫ a(x − y)µ̃2(x)µ̃2(y) dx dy =
f0 + f1 ∫∫ a(x − y)µ(x)µ(y) dx dy + f1 ∫∫ a(x − y)(µ(x)µ2(y) + µ(y)µ2(x)) dx dy + f1 ∫∫ a(x − y)µ2(x)µ2(y) dx dy.
This must be true for all functions µ(x) that are located on I1. So the two quadratic expressions must coincide, and therefore all their coefficients must coincide. The equality of the coefficients at the quadratic terms leads to f1 = 1. After we substitute f1 = 1 into this equality, the equality of the linear terms leads to the conclusion that if x ∈ I1, then ∫ a(x − y)µ2(y) dy = ∫ a(x − y)µ̃2(y) dy. This must be true for all pairs of membership functions that are located on I2, in particular, for pairs with µ2(y) ≠ µ̃2(y). Therefore, if y ∈ I2 and x ∈ I1, we have a(x − y) = 0. But for every two different real numbers x ≠ y, one can find disjoint intervals I1 and I2 such that x ∈ I1 and y ∈ I2. Therefore, if x ≠ y, we have a(x − y) = 0. So the function a(x) is located only at 0, hence C = 0, a(x) = C1 δ^(2n)(x), and J(µ) = C1 ∫∫ δ^(2n)(x − y)µ(x)µ(y) dx dy. The value C1 cannot be equal to 0, because then J would not be non-degenerate. So C1 ≠ 0, and, by dividing by C1, we get an equivalent functional J(µ) = ∫∫ δ^(2n)(x − y)µ(x)µ(y) dx dy. Integrating by parts, we conclude that J(µ) = ∫ (µ^(n)(x))² dx, so the original functional J is equivalent to this one. Variational equations for this functional (see, e.g., [P83, Ch. 20]) lead to the equation µ^(2n)(x) = 0 for all the points x that are different from x1, ..., xn. So on every interval [xi, xi+1], the function µ(x) satisfies the equation µ^(2n)(x) = 0 and is, therefore, equal to a polynomial of order ≤ 2n − 1. Q.E.D.

Proof of Theorem 5

Let's consider the case n = 2, when we have only two truth values t1 and t2. Without losing generality, we can assume that t1 < t2. Then, if we first apply the normalization to 1 and then the modifier, we get first t1/t2 and 1, and then m(t1/t2) and m(1). If we first apply the modifier and then normalize to 1, we first get m(t1) and m(t2), and then m(t1)/m(t2) and 1.
From the definition of a modifier it follows that m(1) = 1 and m(t1/t2) = m(t1)/m(t2), i.e., m(t1) = m(t1/t2)m(t2). If we denote a = t2 and b = t1/t2, then we can conclude that m(ab) = m(a)m(b) for all numbers a, b from the interval (0, 1). All monotone solutions of this functional equation are well known [A66]: they coincide with m(x) = x^d for some real d. The fact that d > 0 follows from the demand that m is monotone. Q.E.D.

Proof of Theorem 5′

The condition of Theorem 5′ means that the equality is true for all normalizations, in particular, for the linear ones. Since it is true for all linear normalizations, we can apply Theorem 5 and thus conclude that m(x) = x^d for some d > 0. But if we then take a non-linear normalization to 1, e.g., f(x) = kx/(1 + x) with k = (1 + t2)/t2, then, as one can easily check, this equality will no longer be true. So no modifiers are possible in this case. Q.E.D.

Proofs of Theorems 6, 7, 6′ and 7′

In this proof we will use the general expression for ∨- and &-operations from [SS88]. Namely, according to our definitions these operations are t-norms and t-conorms in the sense of [SS88], so we can apply the classification theorem for such operations:

“THEOREM &”. For every &-operation there exists:
1) a set of intervals (a_n^−, a_n^+) such that different intervals from this set have no common points;
2) for every interval, a continuous strictly decreasing function s_n that is either a function from [a_n^−, a_n^+] onto the set R^+ of all non-negative real numbers, or a function from [a_n^−, a_n^+] onto [0, 1].
Then f&(a, b) equals:
1) min(a, b), if a, b do not belong to one interval;
2) s_n^(−1)(s_n(a) + s_n(b)), if they belong to one interval whose s_n is onto R^+ (here f^(−1) denotes the function inverse to f);
3) s_n^(−1)(min(s_n(a) + s_n(b), 1)), if they belong to one interval whose s_n is onto [0, 1].

“THEOREM ∨”. For every ∨-operation there exists:
1) a set of intervals (a_n^−, a_n^+) such that different intervals from this set have no common points;
2) for every interval, a continuous strictly increasing function s_n that is either a function from [a_n^−, a_n^+) onto the set R^+ of all non-negative real numbers, or a function from [a_n^−, a_n^+] onto [0, 1].
Then f∨(a, b) equals:
1) max(a, b), if a, b do not belong to one interval;
2) s_n^(−1)(s_n(a) + s_n(b)), if they belong to one interval whose s_n is onto R^+;
3) s_n^(−1)(min(s_n(a) + s_n(b), 1)), if they belong to one interval whose s_n is onto [0, 1].

We will describe the proof in detail only for ∨-operations; for &-operations we can either repeat a similar proof, or use the fact that if f∨(a, b) is a ∨-operation, then 1 − f∨(1 − a, 1 − b) is a &-operation, and vice versa. For convenience we will denote f∨(a, b) by a ∗ b.

1. If there are no intervals [a_n^−, a_n^+] at all, then according to “Theorem ∨” we have a ∗ b = max(a, b) for all a, b.
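Case 2) of “Theorem ∨” can be illustrated concretely (our example, not from [SS88]): the strictly increasing generator s(x) = −ln(1 − x) maps [0, 1) onto R^+, and s^(−1)(s(a) + s(b)) yields the familiar probabilistic sum a + b − ab.

```python
# Illustration (ours) of case 2) of "Theorem ∨": an additive generator
# s(x) = -ln(1 - x) produces the probabilistic sum t-conorm.
import math

def s(x):      # additive generator, s: [0, 1) -> [0, +inf)
    return -math.log(1.0 - x)

def s_inv(u):  # inverse generator
    return 1.0 - math.exp(-u)

def f_or(a, b):
    return s_inv(s(a) + s(b))

# f_or coincides with a + b - a*b, and is symmetric and associative:
for a in (0.1, 0.4, 0.8):
    for b in (0.2, 0.5, 0.9):
        assert abs(f_or(a, b) - (a + b - a * b)) < 1e-12
        assert abs(f_or(a, b) - f_or(b, a)) < 1e-12
        assert abs(f_or(f_or(a, b), 0.3) - f_or(a, f_or(b, 0.3))) < 1e-9
```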
2. Let's prove that for every interval, a+ = 1. We will prove this by contradiction. Suppose that a+ < 1. Take an arbitrary value a from that interval. Then, in view of “Theorem ∨”, a ∗ x > a for all x from this interval (a−, a+). For x > a+, we have a ∗ x = x > a+ > a; so a ∗ x > a for all x > a−. Then, according to the demand that the ∨-operation is reasonable, the function g(x) = a ∗ x is fractionally linear for x > a−. According to “Theorem ∨”, for x > a+ this function is described by the formula g(x) = a ∗ x = x; therefore, g(x) = x for all x > a−. In particular, for x = a we conclude that a ∗ a = a, while for all points from this interval a ∗ a is always greater than a. The contradiction shows that our assumption was wrong, and, therefore, a+ = 1.

3. From 2. and the fact that the intervals have no common points, it follows that there can be only one such interval. So either a− = 0, and this interval comprises the whole of (0, 1), or a− > 0, in which case a ∗ a = a for all a < a−.

4–5. Let's find the general formula for a ∗ b on the interval (a−, 1). According to “Theorem ∨”, a ∗ b equals either s^(−1)(s(a) + s(b)) or s^(−1)(min(s(a) + s(b), 1)) for some strictly increasing continuous function s from (a−, a+) to R. In this case, g(x) = a ∗ x > a for x > a−; therefore, due to the definition of a reasonable ∨-operation and Theorem 1′, it is fractionally linear whenever g(x) < 1.

4. If we assume condition 3) in the definition of a reasonable transformation, then according to Theorem 1 the function (a, b) → a ∗ b is linear in each of the variables, hence it is bilinear on (a−, 1). The general form of a bilinear function is well known: a ∗ b = a0 + a1 a + a2 b + a3 ab. It is convenient to pass to the new variables a − a− and b − a−, so that the initial point of the interval [a−, 1] turns into 0. If a function is bilinear in a and b, then it is bilinear in a − a− and b − a− as well.
So in these new variables a ∗ b = b0 + b1(a − a−) + b2(b − a−) + b3(a − a−)(b − a−) for some constants bi. By definition, a ∨-operation is symmetric; therefore, b1 = b2 and a ∗ b = b0 + b1(a − a−) + b1(b − a−) + b3(a − a−)(b − a−). Substituting this expression into the formula a− ∗ a− = a−, we conclude that a− = b0; substituting it into a ∗ a− = a, we conclude that a = a− + b1(a − a−) for all a; therefore, b1 = 1. If the interval is of type 3), then we have already found the general formula. If the interval is of type 2), then from the condition that a+ ∗ a+ = a+ we conclude that a+ = a− + 2(a+ − a−) + b3(a+ − a−)², therefore b3 = −1/(a+ − a−) and a ∗ b = a− + (a − a−) + (b − a−) − (a − a−)(b − a−)/(a+ − a−). Substituting a+ = 1 and denoting a− by a∨, we get the desired formula. Theorem 7 is proved.

5. Let's now consider the case when we do not assume condition 3) in the definition of a reasonable rescaling; then the function f∨(a, b) is fractionally linear in each variable (while its value is smaller than 1). What information can we gain from the fact that the desired function f∨ becomes fractionally linear whenever we fix the value of one of its two parameters? This situation is similar to computerized tomography, where we have to reconstruct a 2-D or 3-D picture by observing 1-D projections. Here we have a similar problem: reconstructing a function from its 1-dimensional projections. The only difference is that we now know the analytical expressions for these projections, not their numerical values. So it is natural to call such problems analytical tomography [K89]. The solution to this problem is given in the auxiliary theorem that follows, but first we must present some definitions.

Definition A1. Assume that a function f(x1, ..., xn) of n real variables is given. By its projection, we mean a function of one real variable that is obtained from f by fixing the values of all its variables but one.

Example. If n = 2, then the functions x → f(x, y0) and y → f(x0, y) are projections of f for arbitrary real numbers x0 and y0.
For example, for f(x1, x2) = sin(x1)/x2, the projections include the functions c·sin(x1) for arbitrary c, which appear if we fix x2, and k/x2 for |k| ≤ 1, which appear if we fix x1.

Comment. We want to describe all functions with the property that all of their projections are fractionally linear. Is there any similar problem whose solution we can use? The natural answer to this question is “yes”: there is a description of all functions with the property that all their projections are linear, namely the so-called multi-linear functions. Since a fractionally linear function is a fraction of two linear functions, it is natural to guess that a function whose projections are fractionally linear must be a fraction of two functions whose projections are linear, i.e., a fraction of two multi-linear functions. As we shall see, this guess is correct. As a reminder, we repeat the definition of a multi-linear function.

Definition A2. By a multi-linear function of n real variables x1, ..., xn we mean a function that is linear in each of the variables, i.e., an expression of the type a0 + a1x1 + a2x2 + ... + anxn + a12x1x2 + a23x2x3 + ... + a12...n x1x2...xn, where a0, a1, ..., a12, ... are constants.

AUXILIARY THEOREM [KQ92]. If all of the projections of a function f(x1, x2, ..., xn) of n real variables are fractionally linear, then f can be represented as a fraction of two multi-linear functions.

Comment. We present this statement as a separate theorem because we will use it once again while discussing defuzzification. We will give the proof of this auxiliary theorem right after the current proof; for now we assume that it is true and continue to prove Theorem 7.

Let's finally find the possible values of the coefficients of these expressions. For simplicity, let's express the value of f∨(a, b) in terms of a′ = a − a− and b′ = b − a− (the transformation to
these new variables is linear, so a fractionally linear function will still be fractionally linear in the new variables). The general form of the ratio of two symmetric bilinear functions is f∨(a, b) = (a0 + a1(a′ + b′) + a2 a′b′)/(b0 + b1(a′ + b′) + b2 a′b′). When a′ = b′ = 0 (i.e., when a = b = a−), the denominator turns into b0. But the expression for f∨(a, b) must be defined everywhere; therefore, b0 is not equal to 0. Therefore, we can divide both the numerator and the denominator by b0, thus getting a new expression with b0 = 1. In view of this possibility, let's assume that we already have b0 = 1 in the above expression. Since a = a− corresponds to a−′ = a− − a− = 0, the condition f∨(a, a−) = a turns into (a0 + a1 a′)/(1 + b1 a′) = a− + a′ for all a′. Multiplying both sides of this equation by 1 + b1 a′ and transferring all the terms to one side, we conclude that the following polynomial is identically 0: b1 a′² + (linear terms) = 0. So all its coefficients are 0; in particular, b1 = 0. Therefore, from the above equality we conclude that a0 = a− and a1 = 1. Substituting these values into the above formula, we get a new expression for f∨(a, b) with only two unknown coefficients: (a− + a′ + b′ + a2 a′b′)/(1 + b2 a′b′). Let's find the possible range of values of these coefficients. If we have an interval of type 2), then we can also substitute this expression into the formula f∨(1, 1) = 1 and get the desired relationship between a2 and b2.

6. The values of a ∗ b for the case when either a or b is outside the interval are given by “Theorem ∨”. Q.E.D.

Proof of the Auxiliary Theorem

We shall prove this theorem by mathematical induction on the number n of real variables of the function. For n = 1, the fact is obvious: f is itself fractionally linear, hence it can be represented as a fraction of two linear (thus multi-linear) functions. Assume that the statement has been proven for n variables.
We now want to prove it for n + 1 variables.

1. Let us fix a value of xn+1. We get a function of n real variables g(x1, x2, ..., xn) = f(x1, x2, ..., xn, c), where c is this fixed value. We want to apply the inductive assumption to this function g of n variables. For that, we need to know what its projections are. In order to get a projection of g, we have to fix the values of n − 1 variables. But g is itself obtained from f by fixing the value of one of the variables of f: namely, we fixed the value of xn+1 to be equal to c. If we combine these two “fixings”, we conclude that every projection of g can be obtained by fixing all but one variable of f, so that every projection of g is a projection of f. But all of the projections of f are fractionally linear (this we assume in the formulation of our theorem); therefore, all projections of g are also fractionally linear. We assumed that the theorem has already been proved for the case of n variables. So we can apply its conclusion to the function g and thus conclude that g can be represented as a ratio of two multi-linear functions. In other words, for every c, there exist real numbers ai, bi such that f(x1, x2, ..., xn, c) = (a0 + a1x1 + ... + a12x1x2 + ...)/(b0 + b1x1 + ... + b12x1x2 + ...).

2. These values ai, bi are not uniquely determined, because we can multiply all of these coefficients by a constant and the resulting expression will still be the same. In order to remove this uncertainty, let us divide all of these coefficients by the first non-zero coefficient in the sequence a0, a1, ..., a12, ..., b0, ... After this transformation, the first non-zero coefficient will be equal to 1. So if a0 ≠ 0, after this transformation we shall have a0 = 1. If initially a0 = 0 and a1 ≠ 0, we get a0 = 0 and a1 = 1; etc.

3. For different values c of xn+1, the values of ai, bi can, of course, be different, so these values can be viewed as functions of xn+1.
In order to make this dependency explicit, let us write xn+1 as a parameter: for example, we write a2(xn+1), so that a2(5.3) means the value of a2 when xn+1 is 5.3. After this change, the above expression for f turns into: f(x1, x2, ..., xn, xn+1) = (a0(xn+1) + a1(xn+1)x1 + ... + a12(xn+1)x1x2 + ...)/(b0(xn+1) + b1(xn+1)x1 + ... + b12(xn+1)x1x2 + ...). In order to complete the proof, we need to figure out what these functions ai and bi are.

4. How can we determine the values of ai, bi for a given xn+1? If we fix the values of x1, ..., xn, then the expression resulting from 3. becomes an equation for the unknowns ai, bi. This equation is not linear in its unknowns, but it can easily be made linear if we multiply both sides by the denominator of the right-hand side: f(x1, x2, ..., xn, xn+1)(b0(xn+1) + b1(xn+1)x1 + ... + b12(xn+1)x1x2 + ...) = a0(xn+1) + a1(xn+1)x1 + ... + a12(xn+1)x1x2 + ... We now have a linear equation. If we fix another set of values for x1, ..., xn, we get another linear equation, and we can get as many linear equations as we wish. In other words, the equation from 3. can be equivalently represented as a system of infinitely many linear equations, corresponding to different sets of x1, ..., xn. If we take sufficiently many sets, we get sufficiently many equations, and that, together with the above condition that a0 = 1 (or a1 = 1, etc.), leaves us with a unique solution. Hence we have a system: f(x1^k, x2^k, ..., xn^k, xn+1)(b0(xn+1) + b1(xn+1)x1^k + ... + b12(xn+1)x1^k x2^k + ...) = a0(xn+1) + a1(xn+1)x1^k + ... + a12(xn+1)x1^k x2^k + ..., where k is the number of the set that we use in the k-th equation and xi^k are the values from that set.

5. According to Cramer's rule, every variable from the solution of a system of linear equations can be expressed as a ratio of two determinants, i.e., the ratio of two expressions that are polynomial in the coefficients of this system.
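The linearization in 4. can be sketched numerically in the one-variable case (our illustration; the target function and sample points are hypothetical): sampling a fractionally linear f(x) = (a0 + a1x)/(1 + b1x) at three points yields a small linear system whose solution recovers the coefficients.

```python
# Sketch (ours, not from the paper): each sample x_k of a fractionally linear
# f(x) = (a0 + a1*x)/(1 + b1*x) gives the LINEAR equation
#   a0 + a1*x_k - f(x_k)*x_k*b1 = f(x_k)
# in the unknowns (a0, a1, b1).

def f(x):
    # hypothetical target; the solver below does not "know" 3.0, 2.0, 0.5
    return (3.0 + 2.0 * x) / (1.0 + 0.5 * x)

def solve(M, rhs):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(M)
    A = [row[:] + [rhs[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        for i in range(c + 1, n):
            t = A[i][c] / A[c][c]
            for j in range(c, n + 1):
                A[i][j] -= t * A[c][j]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (A[i][n] - sum(A[i][j] * x[j] for j in range(i + 1, n))) / A[i][i]
    return x

samples = [0.0, 1.0, 2.0]
M = [[1.0, xk, -f(xk) * xk] for xk in samples]   # unknowns: a0, a1, b1
a0, a1, b1 = solve(M, [f(xk) for xk in samples])
assert abs(a0 - 3.0) < 1e-9 and abs(a1 - 2.0) < 1e-9 and abs(b1 - 0.5) < 1e-9
```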
In our case, this means that ai and bi are expressed as ratios of expressions that are polynomial in the coefficients of the above linear system. In order to figure out how they depend on xn+1, we need to figure out how the coefficients depend on xn+1. The coefficients on the right-hand side are just constants, because they do not depend on xn+1 at all. The coefficients on the left-hand side are proportional to f(x1^k, x2^k, ..., xn^k, xn+1). So we have to find out how this expression depends on xn+1. For that, we can use the fact that for every k, the function f(x1^k, x2^k, ..., xn^k, xn+1) of one variable xn+1 is a projection of f. Therefore, according to the assumption of our theorem, this function is fractionally linear, i.e., f(x1^k, x2^k, ..., xn^k, xn+1) = (ak + bk xn+1)/(ck + dk xn+1). Thus, each of the functions ai, bi can be expressed as a ratio of two expressions that are polynomial in this fractionally linear function. A fractionally linear function is an example of a rational function, i.e., a ratio of two polynomials. It is well known that if we add, subtract, multiply or divide two rational functions, the result is again a rational function (to prove this, we can use the usual rules for adding, subtracting and multiplying fractions). Therefore, any polynomial of a rational function (in particular, any polynomial of a fractionally linear function) is a rational function itself. Therefore, all of the coefficients ai, bi are rational functions: ai = a′i/a′′i and bi = b′i/b′′i, where a′i, a′′i, b′i, b′′i are polynomials in xn+1.

6. Substituting these expressions into the expression for f from 3., we get the following formula: f(x1, x2, ..., xn, xn+1) = N/D, where N = (a′0(xn+1)/a′′0(xn+1)) + (a′1(xn+1)/a′′1(xn+1))x1 + ... + (a′12(xn+1)/a′′12(xn+1))x1x2 + ..., D = (b′0(xn+1)/b′′0(xn+1)) + (b′1(xn+1)/b′′1(xn+1))x1 + ... + (b′12(xn+1)/b′′12(xn+1))x1x2 + ..., and a′i, a′′i, b′i, b′′i are polynomials.
In order to simplify this expression, let us multiply all the terms in the numerator and denominator by the product of all the polynomials a′′i and b′′i. The dependency of both the numerator and the denominator on xn+1 then becomes polynomial. The dependency of both on the variables x1, ..., xn is still the same: multi-linear. So we get the following formula: f(x1, x2, ..., xn, xn+1) = Q(x1, x2, ..., xn, xn+1)/R(x1, x2, ..., xn, xn+1), where Q and R are polynomials that are linear in each of the variables x1, x2, ..., xn.
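A quick numeric check (our example) of the statement being proved: a ratio of two multi-linear (here bilinear) functions has fractionally linear projections. Fixing x2 = c in the function below leaves ((1 + 3c) + (2 + 4c)x1)/((5 + 7c) + (6 + 8c)x1), which is fractionally linear in x1.

```python
# Illustration (ours): projections of a ratio of bilinear functions are
# fractionally linear.

def f(x1, x2):
    return (1 + 2*x1 + 3*x2 + 4*x1*x2) / (5 + 6*x1 + 7*x2 + 8*x1*x2)

def projection(x1, c):
    # the fractionally linear form predicted for the projection x1 -> f(x1, c)
    return ((1 + 3*c) + (2 + 4*c)*x1) / ((5 + 7*c) + (6 + 8*c)*x1)

for c in (0.0, 1.0, 2.5):
    for x1 in (0.0, 0.5, 3.0):
        assert abs(f(x1, c) - projection(x1, c)) < 1e-12
```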
7. If the polynomials Q and R have common divisors, we can divide both the numerator and the denominator by the greatest common divisor, thus getting a representation in which there are no common divisors. So we can assume that Q and R have no common divisors.

8. We chose xn+1 and arrived at this expression for f: a ratio of two expressions that are multi-linear in xi, 1 ≤ i ≤ n, and polynomial in xn+1. We could have also chosen any other variable, for example x1, and deduced an expression f = Q′/R′, where Q′ and R′ are polynomials with no common divisors that are multi-linear in x2, x3, ..., xn+1 and polynomial in x1. From the fact that both representations represent the same function f, we conclude that Q/R = Q′/R′, hence RQ′ = QR′. The left-hand side of this equation is divisible by R, so the polynomial on the right-hand side must be divisible by R as well. But the right-hand side is a product of Q and R′, and Q has no common divisors with R, so R′ must be divisible by R. Likewise, one can prove that R must be divisible by R′; therefore, R and R′ can differ only by a multiplicative constant: R′ = cR for some constant c. But R′ is linear in xn+1; therefore, R is also linear in xn+1. We have already proven that R is linear in all the other variables x1, ..., xn. Therefore, R is multi-linear in all of its variables. Likewise, we can prove that Q is multi-linear in all of its variables. So f is a ratio of two multi-linear functions. Q.E.D.

Proofs of Theorems 8 and 8′

Theorem 8. If we assume 3) in the definition of a reasonable transformation, then according to Theorem 1, g(x) is linear: g(x) = ax + b. Substituting x = 0 and x = 1, we get the desired formula.

Theorem 8′. In this case, according to Theorem 1′, g(x) = (ax + b)/(cx + d). Let's first show that b ≠ 0. Indeed, if b = 0, then the condition g(1) = 0 turns into a/(c + d) = 0, therefore a = 0 and g(x) ≡ 0, but we assumed that g(0) = 1 ≠ 0. So b ≠ 0, and we can divide both ax + b and cx + d by b without changing g(x). Thus we get a new formula g(x) = (ãx + 1)/(c̃x + d̃) with b = 1. Substituting this formula into the equations g(1) = 0 and g(0) = 1, we conclude that ã + 1 = 0 (so ã = −1) and d̃ = 1; hence, denoting c̃ by k, we get g(x) = (1 − x)/(1 + kx). From monotonicity we conclude that k > −1. Q.E.D.

Proof of Theorem 9

1. Let us first prove that the optimal family Fopt exists and is shift-invariant in the sense that Fopt = Fopt + a for all real numbers a. Indeed, we assumed that the optimality criterion is final; therefore, there exists a unique optimal family Fopt. Let's now prove that this optimal family is shift-invariant. The fact that Fopt is optimal means that for every other F, either F < Fopt or Fopt ∼ F. If Fopt ∼ F for some F ≠ Fopt, then from the definition of the optimality criterion we can easily deduce that F is also optimal, which contradicts the fact that there is only one optimal family. So for every F, either F < Fopt or F = Fopt. Take an arbitrary a and let F = Fopt + a. If F = Fopt + a < Fopt, then from the invariance of the optimality criterion (condition ii)) we conclude that Fopt < Fopt − a, and that conclusion contradicts the choice of Fopt as the optimal family. So F = Fopt + a < Fopt is impossible, and therefore Fopt = F = Fopt + a, i.e., the optimal family is indeed shift-invariant.

2. Likewise, we can prove that the optimal family is invariant with respect to changing the units of x, i.e., cF = F for every c > 0. So the optimal family is invariant with respect to changing the units of x and changing the starting point for x. In the remaining part of this proof, let's denote this optimal family simply by F.
3. Let's first simplify the problem a little. The family F is closed under multiplication, i.e., it contains the product of every two of its elements. There is a well-known way to pass from multiplication to a simpler operation (addition): namely, if we consider logarithms instead of the original membership functions, then the set of these logarithms will be closed under addition (i.e., it contains the sum of any two of its elements), because log(ab) = log(a) + log(b). We can always consider logarithms, because we consider only positive membership functions.

4. So let us consider the set L of all functions of the type log(µ(x)), where µ(x) ∈ F. Let's first prove that the set L is closed under addition, i.e., if l1, l2 ∈ L, then l1 + l2 ∈ L. In mathematical terms, L is an additive semigroup. Indeed, if l1 ∈ L, then by the definition of L, l1 = log(µ1) for some µ1 ∈ F; here µ1 = exp(l1), so this is equivalent to saying that exp(l1) ∈ F. Likewise, from l2 ∈ L we conclude that exp(l2) ∈ F. Since F is closed under multiplication, the product exp(l1) exp(l2) also belongs to F. Therefore, log(exp(l1) exp(l2)) ∈ L, but this logarithm is precisely l1 + l2.

5. Likewise, from the fact that the family F is invariant we conclude that if a function l(x) belongs to L, then for every a and every c > 0 the functions l(cx) and l(x + a) also belong to L.

6. F is an m-dimensional family, i.e., it is obtained by choosing the values of the parameters ~u of some function f(~u, x) in a connected open region. All functions from L are just logarithms of functions from F, so they are of the form log(f(~u, x)) for different values of ~u. Therefore, L is also an m-dimensional family.

7. Let us now consider the set D of all functions p(x) that can be represented as differences of two functions from L, i.e., the set of all functions of the type p(x) = l1(x) − l2(x) for some li ∈ L.
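For intuition (our example, not part of the proof): in the Gaussian-type family F = {exp(a0 + a1x + a2x²)} that the theorem eventually singles out, logarithms are quadratic polynomials, so closure of F under multiplication amounts to coefficient-wise addition in L, and shifts of x stay inside the family.

```python
# Illustration (ours): for membership functions exp(a0 + a1*x + a2*x^2),
# products correspond to adding log-coefficients, and shifts x -> x + a
# just transform the coefficients.
import math

def mu(coeffs, x):
    a0, a1, a2 = coeffs
    return math.exp(a0 + a1*x + a2*x*x)

p, q = (0.1, -0.3, -1.0), (-0.2, 0.5, -2.0)   # hypothetical family members
prod_coeffs = tuple(u + v for u, v in zip(p, q))   # addition in L
for x in (-1.0, 0.0, 0.7, 2.0):
    assert abs(mu(p, x) * mu(q, x) - mu(prod_coeffs, x)) < 1e-12

# Shift x -> x + a stays inside the family, with new coefficients obtained
# by expanding a0 + a1*(x + a) + a2*(x + a)^2:
a = 0.4
shifted = (p[0] + p[1]*a + p[2]*a*a, p[1] + 2*p[2]*a, p[2])
for x in (-1.0, 0.3, 1.5):
    assert abs(mu(p, x + a) - mu(shifted, x)) < 1e-12
```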
Since we need m parameters to describe l1 and m parameters to describe l2, we need at most 2m parameters to describe any function from D, so D is an (at most) 2m-dimensional family.

8. It is easy to show that if p(x) ∈ D, then p(x + a) ∈ D and p(cx) ∈ D for every a and every c > 0. Indeed, if p(x) ∈ D, then p(x) = l1(x) − l2(x) for some li(x) ∈ L. Since li(x) ∈ L, we have l1(x + a) ∈ L and l2(x + a) ∈ L, and, therefore, their difference l1(x + a) − l2(x + a) also belongs to D; but this difference is equal to p(x + a). For p(cx) the proof is similar.

9. Let us prove that the set D is a group under addition. We must prove that 0 ∈ D, that if p ∈ D, then −p ∈ D, and that if p and q belong to D, then p + q ∈ D. First, 0 can be represented as l − l and, therefore, belongs to D. If p belongs to D, this means that p = l1 − l2, where li ∈ L; but in this case −p = l2 − l1 and, therefore, −p also belongs to D. Finally, if p ∈ D and q ∈ D, this means that p = l1 − l2 and q = m1 − m2, where li ∈ L and mi ∈ L. In this case, p + q = (l1 + m1) − (l2 + m2), where both sums l1 + m1 and l2 + m2 belong to L, because L is an additive semigroup.

10. So D is a continuous finite-dimensional additive subgroup of the group of all functions. All such subgroups are known: they are linear subspaces. So we conclude that D is a finite-dimensional linear space. This means that there exists a finite set of functions p1(x), p2(x), ..., pr(x) from D (called a base) that are linearly independent and such that any other function from D can be represented as a linear combination of the functions from this base, i.e., as Σ Ci pi(x) for some coefficients Ci.

11. We have proved that if p(x) ∈ D, then p(x + a) ∈ D as well. In particular, this is true for the functions pi(x). So for every a and every i, the function pi(x + a) belongs to D and can,
therefore, be represented as pi(x + a) = Ci1(a)p1(x) + Ci2(a)p2(x) + ... + Cir(a)pr(x) for some constants Cij, depending on a. Let us prove that these functions Cij(a) are differentiable. Indeed, if we take r different values xk, 1 ≤ k ≤ r, we get r linear equations for the Cij(a): pi(xk + a) = Ci1(a)p1(xk) + Ci2(a)p2(xk) + ... + Cir(a)pr(xk), from which we can determine the Cij using Cramer's rule. Cramer's rule expresses every unknown as a ratio of two determinants, and these determinants depend polynomially on the coefficients. The coefficients either do not depend on a at all (pj(xk)) or depend smoothly on a (pi(xk + a)), because the pi are smooth functions (as differences of logarithms of smooth functions). Therefore, these polynomials are also smooth functions, and so is their ratio Cij(a). We have an explicit expression for pi(x + a) in terms of pj(x) and Cij. So, when a = 0, the derivative of pi(x + a) with respect to a equals the derivative of this expression. If we differentiate it, we get the following formula: pi′(x) = ci1 p1(x) + ci2 p2(x) + ... + cir pr(x), where by cij = Cij′(0) we denoted the derivative of Cij(a) at a = 0. So the set of functions pi(x) satisfies a system of linear differential equations with constant coefficients. The general solution of such a system is well known [B70]: each of the functions pi is a linear combination of functions of the type x^p exp(αx) sin(βx + φ), where p is a non-negative integer and α, β and φ are real numbers.

12. For every i and every c, the result pi(cx) of changing the unit of x must belong to the same set D, i.e., pi(cx) = Ci1(c)p1(x) + Ci2(c)p2(x) + ... + Cir(c)pr(x) for some constants Cij, depending on c. This functional equation is almost the same as the one for shift-invariance; the only difference is that we have a product instead of a sum. In order to reduce it to the previous case, let us use the same trick as before.
If we turn to logarithms, then the product turns into a sum: ln(ab) = ln a + ln b for all a and b. So let's introduce a new variable Z = ln x (so that x = exp(Z)) and new functions Fi(Z) = pi(exp(Z)) (so that pi(x) = Fi(ln x)). For these new functions, the functional equation takes the form Fi(Z + A) = C̄i1(A)F1(Z) + ... + C̄ir(A)Fr(Z). This is precisely the system of functional equations that we already know how to solve. So we can conclude that Fi(Z) is a linear combination of the functions Z^p exp(αZ) sin(βZ + φ) from 11. When we substitute Z = ln x, we conclude that pi(x) = Fi(Z) = Fi(ln x) is a linear combination of the functions (ln x)^p exp(α ln x) sin(β ln x + φ). Since exp(α ln x) = (exp(ln x))^α = x^α, we finally get the expression for pi(x) as a linear combination of functions of the type (ln x)^p x^α sin(β ln x + φ), where p is a non-negative integer and α, β and φ are real numbers.

13. As a result of 11. and 12., for each of the functions pi(x) we have two different expressions, obtained from the demands of shift-invariance and unit-invariance. When can a function pi(x) satisfy both conclusions, i.e., belong to both classes? If it contains terms with logarithms, it cannot be a linear combination of the functions from 11., because there are no logarithms among them. The same is true if it contains sines of logarithms. So the only case when a linear combination of the functions (ln x)^p x^α sin(β ln x + φ) is at the same time a linear combination of the functions x^p̄ exp(ᾱx) sin(β̄x + φ̄) is when p = β = 0. In this case, the above expression turns into x^α, and from the equality of these expressions we conclude that α = p̄. But p̄ is necessarily a non-negative integer, and, therefore, α is a non-negative integer as well. So pi(x), which is equal to a linear combination of such terms, is equal to a linear combination of terms x^α for non-negative integers α, i.e., each of the functions pi(x) is a polynomial.
Therefore, every function p(x) from D, being a linear combination of the polynomials pi(x), is a polynomial itself. The order of each of these polynomials is not greater than the largest order of the polynomials pi(x). 14. Let us now prove that the functions from L are also polynomials. Indeed, suppose that l(x) ∈ L. As we have proved in 5., for every a the function l(x + a) also belongs to L. So according to the
definition of D, their difference l(x + a) − l(x) belongs to D. Since D is a linear space, it also contains the function (l(x + a) − l(x))/a. Since D is a finite-dimensional space, it is closed, and, therefore, it must contain the limit of these functions as a → 0, that is, the function l′(x). So the derivative of l(x) is an element of D, and hence a polynomial. So l(x) equals an integral of a polynomial, i.e., is also a polynomial. 15. So all elements of L are polynomials. By definition of L, elements of F are of the form exp(l), where l ∈ L, so we get the desired result. Q.E.D. Proof of Corollary to Theorem 9 If n = 1, then this class consists only of constants. If n = 2, then we also have the functions µ(x) = exp(a0 + a1 x). If a1 > 0, then µ(x) → ∞ as x → ∞, which contradicts the assumption that µ(x) is a membership function and, therefore, µ(x) ≤ 1. If a1 < 0, then µ(x) → ∞ as x → −∞: also a contradiction. So the only possible case is a1 = 0, in which case µ(x) is a constant. For n = 3, the demand that µ(x) does not tend to ∞ as x → ∞ leads to a2 < 0. Then the desired form of the function can be obtained by the same transformation that is used in solving quadratic equations (completing the square): a2 x^2 + a1 x + a0 = a2 (x + a1/(2a2))^2 + (a0 − a1^2/(4a2)). Q.E.D. Proof of Lemma 1 According to 4), the dependency of f on xi (if we fix all other variables) is fractionally linear, i.e., f = (ai xi + bi)/(ci xi + di). If ci ≠ 0, then this function is not defined for xi = −di/ci, and we assumed that it is everywhere defined. So ci = 0, and f is a linear function of each xi. Therefore, for fixed µi, the function f is a multi-linear function of the variables x1, ..., xn, i.e., f = a0 + a1 x1 + ... + an xn + a12 x1 x2 + .... If a term proportional to xi xj (or to some other product) is different from 0, then for big xi these terms prevail, so f increases at least as k^2 if we multiply all xi by k. For sufficiently big k we will then have f > max xi, which contradicts 1).
So f can contain only linear terms: f = a0 + a1 x1 + ... + an xn. The same assumption 1) allows us to prove that a0 = 0 (take all xi = 0) and ai = 1 (take x1 = x2 = ... = xn). As for the dependency on µi, we can use the previously proved Auxiliary Theorem to prove that it is a ratio of multi-linear functions, and the assumptions of symmetry and of independence of xi for µi = 0 to deduce the precise form of this function. Q.E.D. Proof of Theorem 10 According to the definition of a reasonable defuzzification procedure and Lemma 1, a reasonable defuzzification has the form x̄ = (∫ x α(x) dx)/(∫ α(x) dx), where α(x) = Cµ(x) and C = a0 + a1 ∫ µ(y) dy + a2 ∫∫ µ(y)µ(z) dy dz + ... This value C does not depend on x; therefore, we can move C outside both integrals and conclude that x̄ = (C ∫ xµ(x) dx)/(C ∫ µ(x) dx). The values C in the numerator and the denominator cancel each other, and, therefore, we arrive at the centroid formula. Q.E.D. Proof of Theorem 11 Suppose that f is a symmetric defuzzification, and let us prove that it is inconsistent with the prohibitions. To prove this, let us consider an even function µ(x) for which µ(0) = 0. For example, we can take the piecewise linear function µ(x) defined as follows: µ(x) = |x| for |x| ≤ 1, µ(x) = 2 − |x| for 1 ≤ |x| ≤ 2, and µ(x) = 0 for |x| > 2. Then µ(−x) = µ(x); therefore, µ̃(x) = µ(x). If f is a symmetric defuzzification, then according to the definition f(µ̃) = −f(µ). But µ̃ = µ; therefore, f(µ̃) = f(µ). So f(µ) = −f(µ), hence f(µ) = 0. But µ(0) = 0, i.e., 0 is a prohibited point for µ. Q.E.D.
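Both results are easy to check numerically. The following Python sketch (the grid quadrature is an implementation choice, not part of the proof) computes the centroid x̄ = (∫ xµ(x) dx)/(∫ µ(x) dx) of the even membership function from the proof of Theorem 11 and confirms that it lands at the prohibited point 0.

```python
import numpy as np

def centroid(mu, lo=-3.0, hi=3.0, n=20001):
    """Centroid defuzzification (int x mu(x) dx) / (int mu(x) dx), by grid quadrature."""
    x = np.linspace(lo, hi, n)
    m = mu(x)
    dx = x[1] - x[0]
    return (x * m).sum() * dx / (m.sum() * dx)

def mu_even(x):
    """The even function from the proof of Theorem 11: mu(0) = 0 and mu(-x) = mu(x)."""
    ax = np.abs(x)
    return np.where(ax <= 1, ax, np.where(ax <= 2, 2 - ax, 0.0))

c = centroid(mu_even)
# For an even membership function the centroid is 0; since mu(0) = 0,
# the centroid lands exactly on a prohibited point.
print(c)
```

This is the concrete failure the theorem predicts: any symmetric defuzzification, the centroid included, must return 0 for this µ, yet 0 is forbidden.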
Proof of Theorem 12 Let us prove it by reduction to a contradiction. Suppose that f is a continuous defuzzification and f is consistent with the prohibitions. Consider the following family of piecewise linear membership functions: µα = αµ−(x) + (1 − α)µ+(x), where α ∈ [0, 1], µ+ is defined as µ+(x) = x for 0 ≤ x ≤ 1, µ+(x) = 2 − x for 1 ≤ x ≤ 2, and µ+(x) = 0 for x outside [0, 2], and µ−(x) = µ+(−x) (i.e., µ− is the mirror image of µ+). Since f is continuous, we can conclude that f(µα) is a continuous function of α. Consistency with the prohibitions means that µ(f(µ)) > 0 for every µ. In particular, for µ = µα it means that µα(f(µα)) > 0. For α = 0 the function µα coincides with µ+. The function µ+ is different from 0 only for x ∈ (0, 2), so f(µ0) belongs to the interval (0, 2) and is, therefore, positive. For α = 1 the function µα coincides with µ−. The function µ− is different from 0 only for x ∈ (−2, 0), so f(µ1) belongs to the interval (−2, 0) and is, therefore, negative. So f(µ0) > 0, f(µ1) < 0, and, since f(µα) is a continuous function of α, the intermediate value theorem proves that there exists an intermediate value α ∈ (0, 1) for which f(µα) = 0. But µα(0) = 0, so 0 is a prohibited value. This contradicts our assumption that f is consistent with the prohibitions; this contradiction shows that we cannot have a continuous defuzzification that is consistent with the prohibitions. Q.E.D. Proof of Theorem 13 (13′) 1. Just like in the proof of Theorem 4 (1. and 2.), we can prove that the functional g is equivalent either to a linear functional g1 = ∫ a1(x)µ(x) dx, or to a quadratic functional g2 = ∫∫ a2(x, y)µ(x)µ(y) dx dy. 2. In the first case, from the invariance of the ordering relation with respect to changing the starting point for x, we conclude (as in part 4. of the proof of Theorem 4) that g(µ(x + x0)) = c(x0)g(µ(x)), where c(x) = exp(Ax) for some A, and, therefore, a1(x) = exp(Ax)a1(0).
Now, using invariance with respect to changing the units of x, we conclude (as in part 5. of the proof of Theorem 4) that a1(λx) = λ^α a1(x). Substituting here a1(x) = C exp(Ax), where C = a1(0) is a constant, we conclude that A = α = 0, and, therefore, a1 = const. So in this case g is equivalent to ∫ µ(x) dx. 3. In the second case, when a2 ≠ 0, applying the arguments from parts 4. and 5. of the proof of Theorem 4, we conclude that g̃(µ) = ∫∫ a(x − y)µ(x)µ(y) dx dy, where a(x) = C|x|^α + C1 δ^(2n)(x), i.e., g̃(µ) = C ∫∫ |x − y|^α µ(x)µ(y) dx dy + C1 ∫ (µ^(n)(x))^2 dx. If C1 ≠ 0, then this functional is applicable only to smooth functions µ(x), and we assumed that it is everywhere defined. Therefore, C1 = 0, and we arrive at the desired expression for g. Q.E.D. Proof of Lemma 2 According to 4), the dependency of g on xi (if we fix all other variables) is fractionally linear, i.e., g = (ai xi + bi)/(ci xi + di). If ci ≠ 0, then this function is not defined for xi = −di/ci, and we assumed that it is everywhere defined. So ci = 0, and g is a linear function of each xi. Therefore, for fixed µi, g is a multi-linear function of the variables x1, ..., xn, i.e., g = a0 + a1 x1 + ... + an xn + a12 x1 x2 + .... Q.E.D. Proof of Theorem 14 In view of Theorem 13, it is sufficient to prove that any characteristic g that is equivalent to g̃(µ) = ∫∫ |x − y|^α µ(x)µ(y) dx dy is not continuous in the sense of Definition 32. Indeed, from the proof of Theorem 13 it follows that in this case g(µ) = f(g̃(µ)) for some analytical function f(a), i.e., g(µ) = f0 + f1 ∫∫ |x − y|^α µ(x)µ(y) dx dy + f2 (∫∫ |x − y|^α µ(x)µ(y) dx dy)^2 + ... If α ≠ 2n for some integer n, then the resulting dependency on x is not even analytical; therefore, these values of α
are impossible. If α = 2n, then for n = 1 we get quadratic terms, for n = 2 fourth-order terms, etc. In all these cases we violate the condition that g should be equal to a limit of multi-linear functions of xi. So all these characteristics are really not continuous in the sense of Definition 32, and the only continuous characteristics are those equivalent to the area ∫ µ(x) dx. Q.E.D. Proof of Lemma 3 Since the possible values of the control are limited by an interval [−U, U], all the integrals are over a range that is contained in [−U, U]. The values of the integrated functions are limited by 1; therefore, both integrals are finite. For each set of input parameters, at least one rule is applicable; therefore, for this rule all the values Pi(zi) that correspond to its conditions are positive. The fuzzy property P that is the conclusion of that rule corresponds (by the definition of a fuzzy property) to a non-zero membership function. This means that there exists a value u for which P(u) > 0. For this value u we have µCR(u) = min(P1(z1), P2(z2), ..., Pn(zn), P(u)) > 0 and, therefore, µC(u) = max(µCR(u)) > 0. The function µC(u) is a composition of the continuous functions min, max and P(u); therefore, it is also continuous itself. Since it is continuous and positive at some point u, it is greater than ε/2 (where ε denotes µC(u) > 0) for all values from some interval [u − δ, u + δ]. Therefore, the integral ∫ µC(u) du is greater than 2δ · (ε/2) > 0, i.e., positive, and this means that the fuzzy control is defined. Q.E.D. Proof of Theorem 15 By definition of the fuzzy control, the value µC(u) depends on u, x, ẋ and, maybe, several higher derivatives of x. Let us consider this value as a function of all these parameters: µC(u) = F(u, x, ẋ, ...). This function F is a composition of the continuous functions min, max, Pi and P; therefore, F is also continuous.
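The min-max composition used in Lemma 3 and Theorem 15 can be sketched in code. In the following Python fragment the two triangular rules are illustrative assumptions (the paper's actual rule base is introduced elsewhere); the fragment builds µC(u) as the max over rules of the min of condition and conclusion degrees, and checks that µC is positive and has a positive integral, so that the centroid ū is defined.

```python
import numpy as np

def tri(u, center, width):
    """Triangular membership function: peak 1 at `center`, support of half-width `width`."""
    return np.maximum(0.0, 1.0 - np.abs(u - center) / width)

# Illustrative rule base (assumed for this example), with one input x:
#   rule 1: if x is N  then u is N   -- conclusion tri(u, 0, 1)
#   rule 2: if x is SP then u is SN  -- conclusion tri(u, 1, 1)
def mu_C(u, x, delta=1.0):
    fire1 = tri(x, 0.0, delta)                  # degree to which rule 1 applies
    fire2 = tri(x, delta, delta)                # degree to which rule 2 applies
    r1 = np.minimum(fire1, tri(u, 0.0, 1.0))    # min over conditions and conclusion
    r2 = np.minimum(fire2, tri(u, 1.0, 1.0))
    return np.maximum(r1, r2)                   # max over rules

u = np.linspace(-2.0, 3.0, 5001)
m = mu_C(u, x=0.2)
du = u[1] - u[0]
area = m.sum() * du                  # the denominator integral: positive (Lemma 3)
ubar = (u * m).sum() * du / area     # centroid defuzzification
assert m.max() > 0 and area > 0      # the fuzzy control is defined
print(round(ubar, 3))
```

Since µC is built from min, max and continuous membership functions only, it is continuous in u, x jointly, which is the observation the proof of Theorem 15 rests on.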
It is well known from calculus that if we integrate a continuous function over one of its variables, the resulting function is also continuous. So both integrals are continuous, and, therefore, their ratio is continuous as well. Q.E.D. Proof of Theorem 15A is similar. Proof of Lemma 4 is simple, because for small δ the control is approximately linear: u(x) ≈ u′(0)x. Proof of Theorem 16 In view of Lemma 4, we must compute the derivative u′(0) = lim_{x→0} (u(x) − u(0))/x. Evidently u(0) = 0, so let us take a small positive x and estimate ū(x). For x ∈ [0, ∆/2] only two of the Nj are different from 0: N(x) = N0(x) = 1 − x/∆ and SP(x) = N1(x) = x/∆. Therefore, only two rules are fired for such x, namely, those that correspond to N(u) and SN(u). Computing the necessary maxima and minima, we arrive at the following expression for µC(u): 0 if u < −k∆; 1 + u/(k∆) if −k∆ ≤ u ≤ −kx; 1 − x/∆ for |u| ≤ kx; 1 − u/(k∆) for kx ≤ u ≤ k∆ − kx; x/∆ when k∆ − kx ≤ u ≤ 2k∆ − kx; and linearly decreasing to 0 on the interval [2k∆ − kx, 2k∆]. We can explicitly integrate this piecewise linear function and get the desired expression. To make it more transparent: when x = 0, the denominator ∫ µC(u) du equals k∆ (the area of the triangle that is the graph of the membership function), so in the limit x → 0 we get k∆. So everything depends on the numerator. For a symmetric function f the integral ∫ uf(u) du equals 0. The above function µC(u) is almost symmetric, with the exception of the domain from k∆ to 2k∆, where it is the constant x/∆, and of small areas (of size ≈ x) where it differs by ≈ x. So the integrals over the symmetric part cancel each other, and it is sufficient to integrate only over this additional domain, which gives ū(x) ≈ (∫ from k∆ to 2k∆ of (x/∆)u du)/(k∆) = (3/2)kx. Q.E.D. Proof of Theorem 17 follows from the fact that the relaxation time is uniquely determined by the behavior of the system near x = 0. The smaller ∆ we take, the closer u(x) is to a linear function on an
interval [−∆, ∆] that determines the derivative of ū(x), and, therefore, the closer the corresponding relaxation time is to the relaxation time of the system that originates from the linear control. This is the basic idea, and routine computation easily confirms this result. Proof of Theorem 18 The fact that for min and + we have ū(x) ≈ 2kx for small x can be confirmed by direct computations, as in Theorem 17. Let us prove that for any other and-or pair the derivative of ū is not greater than 2k. Just as in Theorem 17, the denominator tends to the integral of the triangle as x → 0; therefore, we can consider only the numerator integral ∫ uµC(u) du, where µC(u) equals 0 for u < −k∆, equals f&(1 − x/∆, 1 + u/(k∆)) for −k∆ ≤ u ≤ 0, equals f∨(f&(1 − x/∆, 1 − u/(k∆)), f&(x/∆, u/(k∆))) for 0 ≤ u ≤ k∆, and takes some other values for greater u. Let us first fix f& and consider different functions f∨. If we increase the values of f∨, then the values of µC(u) for u < 0 are unaffected, but the values for u > 0 increase. Therefore, the total value of the numerator integral increases. So, if we change f∨ to the maximum possible function min(a + b, 1), we increase this integral, and, therefore, we arrive at a new pair of functions for which ū is not smaller for small x, and hence the derivative of ū at 0 is not smaller. So in order to prove the inequality, it is sufficient to prove it for the case when "or" is represented by min(a + b, 1). In this case we can represent µC as a sum of two functions: a symmetric part µ1 that is equal to µC(u) for u < 0 and to µC(−u) for u > 0, and whose integral ∫ uµ1(u) du = 0, and an additional part µ2 that is equal to µC for u > k∆ and to f∨(f&(1 − x/∆, 1 − u/(k∆)), f&(x/∆, u/(k∆))) − f&(1 − x/∆, 1 − u/(k∆)) = f&(x/∆, u/(k∆)) for 0 ≤ u ≤ k∆ (except where the sum a + b is close to 1, where we have to use the 1 part of the min(a + b, 1) formula; these terms lead to O(x^2) terms in ū and are therefore negligible).
This function µ2 is non-zero only for positive u, so the bigger f&, the bigger its values, and the bigger the numerator of the ratio that defines ū. So the maximum is attained when f& attains its maximal possible value min(a, b). Q.E.D. Proof of Theorem 19 is, as in Theorem 17, by direct computations. Proof of Theorem 20: we have actually proved this statement when we proved Theorem 4. Proof of Theorem 21 For linear systems u(x) = −kx we have x(t) = δ exp(−kt), so ẋ(t) = −kδ exp(−kt), and the non-smoothness functional equals J(δ) = δ^2 ∫ from 0 to ∞ of k^2 exp(−2kt) dt = (k/2)δ^2. Therefore, J = k/2. For non-linear systems with a smooth control u(x) we can likewise prove that J = −(1/2)u′(0). Therefore, the problem of choosing a control with the smallest value of non-smoothness is equivalent to the problem of finding a control with the smallest value of |k| = |u′(0)|. This problem is directly opposite to the problem that we solved in Theorem 18, where our main goal was to maximize |k|. Similar arguments show that the smallest value of |k| is attained when we take the smallest possible function for ∨ and the smallest possible operation for &. Q.E.D. Proof of Theorems 22 and 22′ is simple: we take the general formulas for reasonable transformations from Theorems 1 and 1′ and substitute the conditions f(m) = 0 and f(M) = 1. CONCLUSIONS In the present paper, we analyzed the process of designing a fuzzy control. In order to design a specific control procedure, one must make three choices: choose membership functions that correspond to fuzzy words, choose operations corresponding to & and ∨, and choose a defuzzification procedure. For each of these stages, we formulate reasonable restrictions on the set of possible
choices. In all three cases, these restrictions are naturally formalized in a special mathematical formalism (group theory). This formalization allows us to apply known deep results of group theory and: 1) conclude that the reasonable choice of a membership function is linear, fractionally linear, piecewise polynomial or Gaussian; 2) show that a reasonable defuzzification is a centroid or a centroid of largest area; and 3) enumerate all possible & and ∨ operations. Thus: 1) we give theoretical explanations for the existing semi-empirical choices; these explanations are based on a single formalism and thus form a unifying theory for all three stages of control design; 2) we formulate the class of possible choices, so that for every specific situation the optimal choice can be found by analyzing only these choices; we also actually find the best choices for several typical situations (tracing and docking); 3) we show that fuzzy control is not a semi-empirical craft but can be based on the same theoretical foundations as theoretical physics and some parts of computer science; we hope that this formalism will be helpful in solving other problems of fuzzy control theory. ACKNOWLEDGEMENTS This work was supported by NSF Grant No. CDA-9015006, NASA Research Grant No. 9-482 and a grant from the Institute for Manufacturing and Materials Management. The authors are greatly thankful to A. Bernat, R. Bell, C.-C. Chang, M. Gelfond and D. Tolbert (El Paso, TX), I. Frenkel, A. Mostow and G. Zuckerman (Yale), P. Hajek (Prague), J. Halpern (IBM Almaden Research Center), Yu. Gurevich (Ann Arbor, Michigan), Y. Jani (NASA Johnson Space Center), D. Knuth, J. McCarthy, P. Sarnak, and P. Suppes (Stanford), H. Landau and L. Shepp (AT&T Bell Laboratories), V. Lifschitz (Austin, TX), C. McDonald (White Sands), W. Madych (Storrs, CT), H. Przymusinska (CalPoly), T. Przymusinski (Riverside, CA), S. Sternberg (Harvard) and V. Zheludev (St.
Petersburg) for discussing our preliminary results, and to all the participants of the 1st International Workshop on Industrial Applications of Fuzzy Control and Intelligent Systems (College Station, TX, 1991), especially to B. Kosko (Los Angeles), R. Langari (College Station, TX), T. Lotan (MIT), N. Pfluger (College Station, TX), M. Jamshidi (Albuquerque, NM), J. Yen (College Station, TX), and L. Zadeh (Berkeley) for valuable discussions. REFERENCES [A66] J. Aczel. Lectures on functional equations and their applications. Academic Press, N.Y. and London, 1966. [BK88] R. Baekeland and E. E. Kerre. Piecewise linear fuzzy quantities: a way to implement fuzzy information into expert systems and fuzzy databases. In: Uncertainty and Intelligent Systems (B. Bouchon, L. Saitta, R. R. Yager, eds.), Lecture Notes in Computer Science, Vol. 313, Springer-Verlag, Berlin-Heidelberg-New York, 1988, pp. 119–126. [BCDMMM85] G. Bartolini, G. Casalino, F. Davoli, M. Mastretta, R. Minciardi, and E. Morten. Development of performance adaptive fuzzy controllers with applications to continuous casting plants, in [S85], pp. 73–86. [B70] R. E. Bellman. Introduction to Matrix Analysis, McGraw-Hill, New York, 1970. [BG73] R. E. Bellman and M. Giertz. On the analytic formalism of the theory of fuzzy sets. Inform. Science, 1973, Vol. 5, pp. 149–157. [BZ70] R. Bellman, L. A. Zadeh. Decision-making in a fuzzy environment, Management Science, 1970, Vol. 17, No. 4, pp. 141–164. [B91] H. R. Berenji. Fuzzy logic controllers. In: An Introduction to Fuzzy Logic Applications in Intelligent Systems (R. R. Yager, L. A. Zadeh, eds.), Kluwer Academic Publ., 1991.
[Be88] J. Bernard. Use of a rule-based system for process control. IEEE Control Systems Magazine, 1988, pp. 3–13. [B74] J. M. Blin. Fuzzy relations in group decision theory, J. of Cybernetics, 1974, Vol. 4, pp. 17–22. [BW73] J. M. Blin and A. B. Whinston. Fuzzy sets and social choice. J. of Cybernetics, 1973, Vol. 3, pp. 28–36. [B88] B. Bouchon. Stability of linguistic modifiers compatible with a fuzzy logic. In: Uncertainty and Intelligent Systems (B. Bouchon, L. Saitta, R. R. Yager, eds.), Lecture Notes in Computer Science, Vol. 313, Springer-Verlag, Berlin-Heidelberg-New York, 1988, pp. 119–126. [BS84] B. G. Buchanan and E. H. Shortliffe. Rule-based expert systems. The MYCIN experiments of the Stanford Heuristic Programming Project. Addison-Wesley, Reading, MA, Menlo Park, CA, 1984. [CZ72] S. S. L. Chang and L. A. Zadeh. On fuzzy mapping and control, IEEE Transactions on Systems, Man and Cybernetics, 1972, Vol. SMC-2, pp. 30–34. [CK91] J. Corbin and V. Kreinovich. Dynamic tuning of communication network parameters: why fractionally linear formulas work well. University of Texas at El Paso, Computer Science Department, Technical Report UTEP-CS-91-4, June 1991. [D91] B. Daviss. Laid-back computers. Discover, January 1991, pp. 60–61. [DH88] J. J. D'Azzo and C. H. Houpis. Linear control system analysis and design: conventional and modern. McGraw-Hill, New York, 1988. [D81] H. Dishkant. About membership-function estimation, Fuzzy Sets and Systems, 1981, Vol. 5, No. 2, pp. 141–148. [D82] J. Dombi. A general class of fuzzy operators, the De Morgan class of fuzzy operators and fuzziness measures induced by fuzzy operators, Fuzzy Sets and Systems, 1982, Vol. 8, pp. 149–163. [DP80] D. Dubois and H. Prade. Fuzzy sets and systems: theory and applications. Academic Press, N.Y., London, 1980. [DP80a] D. Dubois and H. Prade. New results about properties and semantics of fuzzy set-theoretic operators.
In: Fuzzy sets: theory and applications to policy analysis and information systems (P. P. Wang, S. K. Chang, eds.). Plenum Press, New York, 1980, pp. 59–75. [DP88] D. Dubois and H. Prade. Possibility theory. An approach to computerized processing of uncertainty. Plenum Press, N.Y. and London, 1988. [DP91] D. Dubois and H. Prade. Basic issues on fuzzy rules and their application to fuzzy control. Proceedings of the Fuzzy Control Workshop, Sydney, Australia, August 1991. [FK89] A. M. Finkelstein and V. Kreinovich. Formal models of non-formalizable reasoning: applications to computer science. Universita di Bari (Italy), Dipartamento di Scienze Filosofiche, Rapporto Scientifico No. 7, 1989. [F79] M. J. Frank. On the simultaneous associativity of F(x, y) and x + y − F(x, y). Aequationes Mathematicae, 1979, Vol. 19, pp. 194–226. [FF74] L. W. Fung and K. S. Fu. An axiomatic approach to rational decision-making in a fuzzy environment. In: Fuzzy Sets and Their Application to Cognitive and Decision Processes (L. A. Zadeh, K. S. Fu, K. Tanaka, M. Shimura, eds.), Academic Press, New York, 1975, pp. 227–256. [GSh64] I. M. Gelfand and G. E. Shilov. Generalized functions. Vol. 1, Academic Press, N.Y. and London, 1964. [G76] R. Giles. Lukasiewicz logic and fuzzy set theory, Intern. J. Man-Machine Studies, 1976, Vol. 8, pp. 313–327. [GS64] V. M. Guillemin and S. Sternberg. An algebraic model of transitive differential geometry, Bull. Amer. Math. Soc., 1964, Vol. 70, No. 1, pp. 16–47. [GQ91] M. M. Gupta and J. Qi. Theory of t-norms and fuzzy inference methods, Fuzzy Sets and Systems, 1991, Vol. 40, pp. 431–450.
[GQ91a] M. M. Gupta and J. Qi. Connectives (and, or, not) and T-operators in fuzzy reasoning, in: Conditional Logic in Expert Systems, I. R. Goodman et al. (eds.), Elsevier, Amsterdam, N.Y., 1991, pp. 211–233. [H85] P. Hajek. Combining functions for certainty degrees in consulting systems. Intl. J. of Man-Machine Studies, 1985, Vol. 22, pp. 59–65. [HV87] P. Hajek and J. J. Valdes. Algebraic foundations of uncertainty processing in rule-based expert systems. Technical Report, Academy of Sciences Institute of Mathematics, Prague, 1987. [H75] H. Hamacher. Über logische Verknüpfungen unscharfer Aussagen und deren zugehörige Bewertungsfunktionen. In: Progress in Cybernetics and Systems Research, 1975, Vol. 3 (R. Trappl, G. J. Klir and L. Ricciardi, eds.), Hemisphere, N.Y., pp. 276–287. [H78] H. Hamacher. Über logische Aggregationen nicht-binär explizierter Entscheidungskriterien. Frankfurt/Main, 1978. [HC76] H. M. Hersch and A. Caramazza. A fuzzy-set approach to modifiers and vagueness in natural languages. J. Exp. Psychol.: General, 1976, Vol. 105, pp. 254–276. [KB88] M. Kallala and W. Bellin. A study of Arab computer users: a special case of a general HCI methodology. In: Uncertainty and Intelligent Systems (B. Bouchon, L. Saitta, R. R. Yager, eds.), Lecture Notes in Computer Science, Vol. 313, Springer-Verlag, Berlin-Heidelberg-New York, 1988, pp. 338–350. [K86] A. Kandel. Fuzzy mathematical techniques with applications, Addison-Wesley, Reading, MA, 1986. [K75] A. Kaufmann. Introduction to the theory of fuzzy subsets. Vol. 1. Fundamental theoretical elements, Academic Press, N.Y., 1975. [KG85] A. Kaufmann, M. M. Gupta. Introduction to fuzzy arithmetic. Theory and applications. Van Nostrand, N.Y., 1985. [KKS85] J. B. Kiszka, M. E. Kochanska, and D. L. Sliwinska. The influence of some parameters on the accuracy of a fuzzy model. In [S85], pp. 187–230. [KF88] G. J. Klir and T. A. Folger. Fuzzy sets, uncertainty and information. Prentice Hall, Englewood Cliffs, NJ, 1988. [KB76] M.
Kochen and A. N. Badre. On the precision of adjectives which denote fuzzy sets. J. Cybern., 1976, Vol. 4, No. 1, pp. 49–59. [K92] B. Kosko. Neural networks and fuzzy systems, Prentice-Hall, Englewood Cliffs, NJ, 1992. [KKM88] V. Kozlenko, V. Kreinovich, and M. G. Mirimanishvili. The optimal method of describing the expert information. In: Applied problems of systems analysis. Proceedings Georgian Polytechnic Institute, 1988, No. 8, pp. 64–67 (in Russian). [K83] V. Kreinovich. Foundations of the Maslov's operator. In: Proceedings of the 3rd National Conference on the Applications of Mathematical Logic, Tallinn, 1983, pp. 80–81 (in Russian). [K87] V. Kreinovich. A mathematical supplement to the paper: I. N. Krotkov, V. Kreinovich and V. D. Mazin. A general formula for the measurement transformations, allowing the numerical methods of analyzing the measuring and computational systems, Measurement Techniques, 1987, No. 10, pp. 8–10. [K87a] V. Kreinovich. Semantics of Maslov's iterative method. In: Problems of Cybernetics, Vol. 131, Moscow, 1987, pp. 30–63 (in Russian). [K89] V. Kreinovich. The optimal choice of formulas of fuzzy logic. Center for New Informational Technology "Informatika", Leningrad, Technical Report, 1989 (in Russian). [K89a] V. Kreinovich. Optimization in case of uncertain optimality criteria: group-theoretic approach. Center for New Informational Technology "Informatika", Leningrad, Technical Report, 1989 (in Russian). [K89b] V. Kreinovich. Zadeh or Piasecki formula for fuzzy probabilities? Center for New Informational Technology "Informatika", Leningrad, Technical Report, 1989.
[K90] V. Kreinovich. Group-theoretic approach to intractable problems. Lecture Notes in Computer Science, Springer, Berlin, Vol. 417, 1990, pp. 112–121. [KFLL91] V. Kreinovich, O. Fuentes, R. Lea, and A. Lokshin. Expert systems become suspiciously smart. University of Texas at El Paso, Computer Science Department, Technical Report UTEP-CS-91-12, 1991. [KK90] V. Kreinovich and S. Kumar. Optimal choice of &- and ∨-operations for expert values. In: Proceedings of the 3rd University of New Brunswick Artificial Intelligence Workshop, Fredericton, N.B., Canada, 1990, pp. 169–178. [KK90a] V. Kreinovich and S. Kumar. Optimal choice of &- and ∨-operations for expert values. University of Texas at El Paso, Computer Science Dept., Technical Report UTEP-CS-90-12, June 1990. [KK91] V. Kreinovich and S. Kumar. How to help intelligent systems with different uncertainty representations communicate with each other. Cybernetics and Systems: International Journal, 1991, Vol. 22, pp. 217–222. [KL90] V. Kreinovich and A. M. Lokshin. On the foundations of fuzzy formalism: explaining formulas for union, intersection, negation and modifiers. University of Texas at El Paso, Computer Science Dept., Technical Report UTEP-CS-90-28, October 1990. [KQ91] V. Kreinovich and C. Quintana. Neural networks: what non-linearity to choose?, Proceedings of the 4th University of New Brunswick Artificial Intelligence Workshop, Fredericton, N.B., Canada, 1991, pp. 627–637. [KQ92] V. Kreinovich and C. Quintana. How does new evidence change our estimates of probabilities: Carnap's formula revisited. Cybernetics and Systems: International Journal, 1992 (to appear). [KQ92a] V. Kreinovich, C. Quintana and O. Fuentes. Genetic algorithms: what fitness scaling is optimal? Technical Report, University of Texas at El Paso, Computer Science Department, 1992. [KQL91] V. Kreinovich, C. Quintana, and R. Lea. What procedure to choose while designing a fuzzy control?
Towards mathematical foundations of fuzzy control, Working Notes of the 1st International Workshop on Industrial Applications of Fuzzy Control and Intelligent Systems, College Station, TX, 1991, pp. 123–130. [KR86] V. Kreinovich and L. K. Reznik. Methods and models of formalizing prior information (on the example of processing measurement results). In: Analysis and formalization of computer experiments, Proceedings of Mendeleev Metrology Institute, Leningrad, 1986, pp. 37–41 (in Russian). [KM87] R. Kruse and K. D. Meyer. Statistics with vague data. D. Reidel, Dordrecht, 1987. [KA91] B. Kuipers, K. Astrom. The composition of heterogeneous control laws, Proceedings of the 1991 American Control Conference, 1991. [L88] R. N. Lea. Automated space vehicle control for rendezvous proximity operations. Telematics and Informatics, 1988, Vol. 5, pp. 179–185. [LJB90] R. N. Lea, Y. K. Jani and H. Berenji. Fuzzy logic controller with reinforcement learning for proximity operations and docking. Proceedings of the 5th IEEE International Symposium on Intelligent Control, 1990, Vol. 2, pp. 903–906. [LTTJ89] R. N. Lea, M. Togai, J. Teichrow and Y. Jani. Fuzzy logic approach to combined translational and rotational control of a spacecraft in proximity of the Space Station. Proceedings of the 3rd International Fuzzy Systems Association Congress, 1989, pp. 23–29. [L90] C. C. Lee. Fuzzy logic in control systems: fuzzy logic controller. IEEE Transactions on Systems, Man and Cybernetics, 1990, Vol. 20, No. 2, pp. 404–435. [M78] P. J. MacVicar-Whelan. Fuzzy sets, the concept of height and the edge "very". IEEE Transactions on Systems, Man, and Cybernetics, 1978, Vol. 8, pp. 507–511.
[M74] E. H. Mamdani. Application of fuzzy algorithms for control of simple dynamic plant, Proceedings of the IEE, 1974, Vol. 121, No. 12, pp. 1585–1588. [MYI87] S. Miyamoto, S. Yasunobu and H. Ikara. Predictive fuzzy control and its applications to automatic train operation systems. In: J. C. Bezdek (ed.), Analysis of Fuzzy Information. Vol. 2. Artificial Intelligence and Decision Systems, CRC Press, Boca Raton, FL, 1987, pp. 59–72. [M91] R. Mohler. Nonlinear systems. Vol. 1. Dynamics and control. Prentice Hall, Englewood Cliffs, NJ, 1991. [O77] G. C. Oden. Integration of fuzzy logical information, Journal of Experimental Psychology: Human Perception and Performance, 1977, Vol. 3, No. 4, pp. 565–575. [O83] S. V. Ovchinnikov. General negations in fuzzy set theory. Journal of Math. Analysis and Applications, 1983, Vol. 92, pp. 234–239. [P83] C. E. Pearson (ed.) Handbook of applied mathematics, Van Nostrand, N.Y., 1983. [S54] L. J. Savage. The foundations of statistics. Wiley, New York, 1954. [SS61] B. Schweizer and A. Sklar. Associative functions and statistical triangle inequalities. Publicationes Mathematicae Debrecen, 1961, Vol. 8, pp. 169–186. [SS88] B. Schweizer and A. Sklar. Probabilistic metric spaces, North Holland, N.Y., 1988. [S76] E. H. Shortliffe. Computer-based medical consultation: MYCIN. Elsevier, New York, 1976. [SS65] I. M. Singer and S. Sternberg. Infinite groups of Lie and Cartan, Part 1, Journal d'Analyse Mathematique, 1965, Vol. XV, pp. 1–113. [SC91] S. M. Smith and D. J. Comer. Automated calibration of a fuzzy logic controller using a cell state space algorithm, IEEE Control Systems, August 1991, pp. 18–28. [S74] M. Sugeno. Theory of fuzzy integrals and its applications. Ph.D. Thesis, Tokyo Institute of Technology, Tokyo, 1974. [S77] M. Sugeno. Fuzzy measures and fuzzy integrals. In: Fuzzy automata and design processes (M. M. Gupta, G. N. Saridis, B. R. Gaines, eds.), Amsterdam, New York, 1977, pp. 89–102. [S85] M. Sugeno (editor).
Industrial applications of fuzzy control, North Holland, Amsterdam, 1985. [T79] E. Trillas. Sobre funciones de negacion en la teoria de conjuntos difusos. Stochastica, 1979, Vol. III, No. 1, pp. 47–59. [W83] S. Weber. A general concept of fuzzy connectives, negations and implications based on t-norms and t-conorms. Fuzzy Sets and Systems, 1983, Vol. 11, pp. 115–134. [W62] N. Wiener. Cybernetics, or Control and Communication in the Animal and the Machine, MIT Press, Cambridge, MA, 1962. [Y79] R. R. Yager. On the measure of fuzziness and negation, Part 1. Membership in the unit interval. Intl. J. General Systems, 1979, Vol. 5, pp. 221–229. [Y80] R. R. Yager. On a general class of fuzzy connectives, Intl. Journal of General Systems, 1980, Vol. 4, pp. 235–242. [YIS85] O. Yagishita, O. Itoh, and M. Sugeno. Application of fuzzy reasoning to the water purification process, [S85], pp. 19–40. [YM85] S. Yasunobu and S. Miyamoto. Automatic train operation systems by predictive fuzzy control, [S85], pp. 1–18. [YP91] J. Yen and N. Pfluger. Path planning and execution using fuzzy logic. In: AIAA Guidance, Navigation and Control Conference, New Orleans, LA, 1991, Vol. 3, pp. 1691–1698. [YP91a] J. Yen and N. Pfluger. Designing an adaptive path execution system. In: IEEE International Conference on Systems, Man and Cybernetics, Charlottesville, VA, 1991. [YPL92] J. Yen, N. Pfluger, and R. Langari. A defuzzification strategy for a fuzzy logic controller employing prohibitive information in command formulation, Proceedings of the IEEE International Conference on Fuzzy Systems, San Diego, CA, March 1992. [YSB90] H. Ying, W. Siler, and J. Buckley. Fuzzy control theory: A nonlinear case. Automatica,
1990, Vol. 26, No. 4, pp. 513–520. [Z65] L. Zadeh. Fuzzy sets. Information and Control, 1965, Vol. 8, pp. 338–353. [Z71] L. A. Zadeh. Towards a theory of fuzzy systems. In: Aspects of network and systems theory (R. E. Kalman, N. DeClaris, eds.), Holt, Rinehart and Winston, 1971. [Z72] L. A. Zadeh. On fuzzy algorithms. Memorandum, University of California at Berkeley, ERL M325, Berkeley, CA, 1972. [Z73] L. A. Zadeh. Outline of a new approach to the analysis of complex systems and decision processes, IEEE Transactions on Systems, Man and Cybernetics, 1973, Vol. 3, pp. 28–44. [Z74] L. A. Zadeh. A rationale for fuzzy control, Transactions ASME, Ser. G, Journal of Dynamic Systems, Measurement and Control, 1974, Vol. 94, pp. 3–4. [Z75] L. A. Zadeh. The concept of a linguistic variable and its application to approximate reasoning, Part 1, Information Sciences, 1975, Vol. 8, pp. 199–249. [Z78] H. J. Zimmermann. Results of empirical studies in fuzzy set theory. In: Applied General System Research (G. J. Klir, ed.), Plenum, New York, 1978, pp. 303–312. [Z85] H. J. Zimmermann. Fuzzy set theory and its applications, Kluwer-Nijhoff Publ., Boston-Dordrecht, 1985.