The Use of Fuzzy Logic in Creating a Visual Data Summary of a Telecom Operator’s Customer Base
Julia Sidorova1, Lars Sköld2, Håkan Lennerstad1 and Lars Lundberg1
1 Department of Computer Science, Blekinge Institute of Technology, Karlskrona, Sweden
2 Telenor AB, Stockholm, Sweden
[email protected]
Abstract. As pointed out by Zadeh, the mission of fuzzy logic in the era of big data is to create a relevant summary of huge amounts of data and to facilitate decision-making. In this study, elements of fuzzy set theory are used to create a visual summary of telecom data that indicates how desirable it is to boost an operator's presence in different neighborhoods and regions. The data used for validation cover historical mobility in a region of Sweden during one week. Fuzzy logic allows us to model inherently relative characteristics, such as "a tall man" or "a beautiful woman", and, importantly, it also defines "anchors": the situations (characterized by the value of the membership function for the characteristic) under which the relative notion receives a unique crisp interpretation. We propose color coding of the membership value for relative notions such as "the desirability of boosting the operator's presence in the neighborhood" and "how well the operator is doing in the region". The corresponding regions on the map (e.g., postcode zones or larger groupings) are colored in shades passing from green (1) through yellow (0.5) to red (0). The color hues convey a clear, intuitive message, making the summary easy to grasp.
Keywords: Mobility Data, Call Detail Records, Fuzzy Membership Function, Color.
1 Introduction
There is a current need for technology capable of creating a useful summary of large volumes of data. Ideally, such a summary is expressed in natural language while preserving mathematical precision [1]. Fuzzy logic has the potential to become a component of such technology, since it bridges mathematics and the way humans reason and the human environment operates. Fuzzy logic deals with concepts that are not precisely defined in a mathematical sense. The classical examples are the "class of all real numbers which are much greater than 1," or "the class of beautiful women," or "the class of tall men". Such notions do not constitute classes or sets defined with the usual mathematical rigor. Yet, “the fact remains that such imprecisely defined notions play an important role in human thinking, particularly in the domains of decision-making, abstraction and communication of information” [2]. Application of fuzzy logic in business intelligence has not been studied
extensively due to certain inherent difficulties in the practical realization of such systems; yet, in spite of these difficulties, such applications are possible and very useful. A comprehensive review of these applications is given in [3]. The challenges of the practical application of fuzzy logic in business intelligence systems include at least the following. First, not every problem permits trial-and-error calibration of the threshold values (called the “anchors”) needed in these applications. (Any relativity and ambiguity disappears at the anchors, and the situation is crisply clear.) Second, membership functions and inference methods must have tangible meanings, which can be very context-dependent. Third, fuzzy theory borders on psycholinguistics, and as such it is less objective than formal sciences such as logic or set theory. There is a large body of experimental cognitive research validating the work of Zadeh and colleagues (see [4] for a comprehensive summary); still, applying the theory may require knowledge about human cognition that is not yet available. The notion of fuzziness is understood in distinct ways, and these discrepancies have important consequences.
Fuzzy logic enables formulating a natural language interface between big data, numeric analytics, and a human being by hiding the complexity of data and methods and generating a comprehensive summary. In previous work we summarized data in natural language (with linguistic hedges) and formulated appropriate queries [5]. Examples of such queries for telecommunication system environments are: “Which neighborhoods are highly desirable?” and “Is the infrastructure rather loaded or highly loaded in this region?”
The contribution of this paper is to complement such data summaries with visual summaries (a map is better seen than read about) that use intuitively clear color symbolism, giving a voice to the human cognitive System I (the term for the cognitively untaxing mode of thinking). The System I response is quick and intuitive and complements the careful System II, which is logical, slow and cautious [6]. Human judgments arise from the interplay of logic, memory and intuition, and which type of thinking prevails depends on the task and on the person.
The rest of the paper is organized as follows. Section 2 covers the related literature concerning fuzzy logic and our previous work on resource allocation based on historical mobility data. Section 3 describes the data. Section 4 explains the proposed methodology of color hues. The results are presented in Section 5. The paper concludes with our aspirations for future research in Section 6.
2 Related Literature
2.1 Background on Fuzzy Logic, Formal and Cognitive
For a review of insights into fuzzy logic modeling, the reader is referred to [7], and a comprehensive guide to good practices in fuzzy logic analysis in the social sciences can be found in [8]. Since fuzzy logic has many applications in different fields of science, the basic terminology differs slightly even in the cited sources. This paper uses the terminology of the social sciences [8] and cognitive research. This does not hamper its explanatory capacity, because the use of fuzzy set theory is limited to the basic notions, and their descriptive and summarization power is investigated.
Definition [2]: A fuzzy set A in X is characterized by a membership function $f_A(x)$,
which associates with each point in X a real number in the interval [0, 1], with the value of $f_A(x)$ at x representing the "grade of membership" of x in A. For the opposite quality the membership value is defined as $f_{\text{not}A}(x) = 1 - f_A(x)$. Fuzzy membership scores reflect the varying degree to which different cases belong to a set, and they combine qualitative and quantitative assessment. A tangible event called an “anchor” must occur at the values of a state switch, and such values are application-dependent [8]. A fuzzy membership score of 1 indicates full membership in a set; scores close to 1 (e.g. 0.8 or 0.9) indicate strong but not quite full membership; 0.5 is the point of maximal ambiguity regarding the quality; scores greater than 0.5 but less than 0.8 indicate weaker but still notable membership; scores less than 0.5 but greater than 0 (e.g. 0.2 and 0.3) indicate that the objects are more “out” than “in” the set, but are still weak members of it; and finally a score of 0 indicates full non-membership. Recall that the exact number of intervals in the range of the membership function and the anchor values that define those intervals are context-dependent and have a crisp meaning, for example: no gain, no loss (0.5 for "profitable"), or the optimal value (x* in LP) denoting “fully successful” (1 for "successful") [9-10].
Beyond the formal work of Zadeh and colleagues, cognitive scientists experimentally verify the principles of fuzzy logic in order to investigate how truthfully they describe cognitive processes in the brain. There are competing views (paradigms) on where fuzziness arises, and there is a body of experimental cognitive research on the subject; for example, [4] is a review. Here, the question that interests us is what it means, from the cognitive viewpoint, for a membership function to take a value x. Let x be 0.7.
In the likelihood paradigm, the membership function has a value of 0.7 if 70% of the population declares that the sample (a woman) belongs to the category (beautiful). Under this model there are the following sources of fuzziness: errors in measurement, incomplete information, and interpersonal contradictions. The advocates of the likelihood paradigm adhere to the philosophical viewpoint that meaning is essentially objective and is a convention among the users of the language, while its measurement is essentially a vague process.
Within the similarity paradigm, fuzziness arises from the insufficient cognitive abilities of the person, who is faced with the task of “comparing the object with a certain prototype or imaginable ideal”. It arises naturally from prototype theory (by Lakoff and colleagues), where membership is a notion of being similar to a representative of the category. The membership function measures the degree of similarity of an element to the set in question. It is assumed that there exists a perfect example of the set (or the category) that belongs to the set to the full degree. Others belong to the set to a degree measured by their relative distance to the perfect sample.
2.2 Queries on Geodemographic Data
The postcode of the client’s home address determines his/her geodemographic category. For marketing campaigns, geodemographic segmentations such as ACORN or MOSAIC are used, since it is known how the segments can be targeted to achieve a desired goal, for example the promotion of a new mobile service in certain neighborhoods. It is known that people of similar social status and lifestyle tend to live close to each other [11-12]. In our previous research [5], [13] we proposed two types of queries on geodemographic data: 1) the desirability of different geodemographic segments, and 2) the operator’s current success compared to the best theoretically possible. In both cases the queries return the value of the membership function, which is then interpreted. Both types of queries rely on the outcome of the resource allocation module, which operates in the following manner. The problem is that of finding an optimal combination of user segments, given that we want to maximize the overall number of users, who consume finite resources. This problem belongs to a classical family of resource allocation problems for which solutions are found with linear programming (LP) [15-17]. The LP is defined by the decision variables, the objective function and the restrictions. The individual mobility patterns of different user segments sum up into the collective footprint, which the whole customer base produces on the infrastructure in a time-continuous manner. The desired property of such a collective footprint is that it does not exhibit skinny peaks and gaps in time. The closer to the optimal “even load” scenario, the better the infrastructure is exploited. The variables used further in the text are the following¹.
• The vector x with the decision variables: $x = \{x_{CC}, x_{CA}, x_{MJM}, x_{QA}, x_{T}, x_{VA}\}$.
The decision variables represent the scaling coefficients for each geodemographic segment. In case of segmentation at Telenor (a Scandinavian operator for whom the study was carried out) they are: cost-aware (CA), modern John/Mary (MJM), quality aware (QA), traditional (T), value aware (VA), and corporate clients (CC). A scaling coefficient $x_i$ is greater than 1, if the number of clients of a given geo-demographic segment is desired to be increased. For example, for the category in the customer base that is to be doubled $x_i = 2$. Similarly, if $x_i < 1$ for a geo-demographic segment, it means that the number of clients is to be reduced. The $x_i = 0$ value indicates that the segment is absolutely unwanted in the clientele. By formulation x is nonnegative.
• The objective function seeks to maximize the number of subscribers:
$$\text{Maximize} \sum_{i \in \{CC, CA, MJM, QA, T, VA\}} S_i\, x_i \tag{1}$$
• The restrictions: for all j, t,
$$\sum_{i \in \{CC, CA, MJM, QA, T, VA\}} S_{i,t,j}\, x_i \le C_j, \qquad x \ge 0. \tag{2}$$
¹ Notation:
• I: the set with defined user segments {segment1, …, segmentk};
• D: the mobility data for a region that for each user contain client’s geodemographic segment, time stamps, when the client generated traffic, and which antenna served the client;
• $S_i$: the footprint by segment i, i.e. the number of subscribers that belong to a geodemographic segment i;
• $S_{i,t,j}$: the number of subscribers that belong to a geodemographic segment i, at time moment t, who are registered with a particular cell j;
• $C_j$: the capacity of cell j in terms of how many persons it can safely handle simultaneously;
• x: the vector with the scaling coefficients for the geodemographic segments or other groups such as IS clients;
• $N_{t,j}$: number of users at cell j at time t.
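The objective (1) and the restrictions (2) map onto simple data structures. The following is a minimal sketch, assuming the footprint is stored as a dictionary keyed by (segment, time slot, cell); the function names and the toy numbers are hypothetical and are not taken from the operator's data.

```python
def objective(segment_totals, x):
    """Eq. (1): total number of subscribers after scaling each segment."""
    return sum(segment_totals[i] * x[i] for i in segment_totals)


def overloaded_cells(footprint, capacity, x):
    """Return the (t, j) pairs at which Eq. (2) is violated for scaling x.

    footprint maps (i, t, j) -> S_{i,t,j}; capacity maps j -> C_j.
    """
    if any(value < 0 for value in x.values()):
        raise ValueError("scaling coefficients must be nonnegative")
    load = {}
    for (i, t, j), count in footprint.items():
        load[(t, j)] = load.get((t, j), 0.0) + count * x[i]
    return sorted(key for key, total in load.items() if total > capacity[key[1]])


# Hypothetical toy numbers: two segments, two time slots, one cell.
footprint = {
    ("CA", 0, "cell_a"): 30, ("QA", 0, "cell_a"): 10,
    ("CA", 1, "cell_a"): 5, ("QA", 1, "cell_a"): 40,
}
capacity = {"cell_a": 100}
segment_totals = {"CA": 35, "QA": 50}

print(overloaded_cells(footprint, capacity, {"CA": 2.0, "QA": 1.0}))  # feasible scaling
print(overloaded_cells(footprint, capacity, {"CA": 1.0, "QA": 3.0}))  # overloads slot 1
print(objective(segment_totals, {"CA": 2.0, "QA": 1.0}))
```

A scaling vector is admissible for the LP exactly when the returned list is empty; the solver then searches among admissible vectors for the one with the largest objective.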
The left-hand side of restriction (2) is the observed number of persons in each user group at a particular time and served by a particular cell, multiplied by the scaling coefficient of the group and summed over the groups. This value is required not to exceed the capacity of the cell $C_j$ in terms of how many persons it can handle at a time. In other words, the restriction says: if the historical numbers of users are scaled with the coefficient for their geodemographic category, the cells should not be overloaded. A consensus reached in the literature [18-20] is that the mobility pattern of the subscribers is predictable due to the strong spatio-temporal regularity of human activities. Consequently, increasing the number of subscribers in a given segment by a factor x will increase the load generated by the segment by a factor x for each time and cell. The LP model is solved (with the Gurobi software) for the input data D and the set of segments I:
$$(x_I, \text{max\_obj}_{I,D}) = \text{combinatorial\_optimization}(D, I). \tag{3}$$
The output is the vector with the optimal scaling coefficients $x_I$ and the maximum value of the objective function. For further details of the LP formulation the reader is referred to [9].
Consider a small example with two cells, two subscriber segments and three time slots. The footprint values are shown in Table 1. The total number of subscribers in segment 1 is 60, and the total number of subscribers in segment 2 is 40 ($s = (60, 40)^T$). The capacity of both radio cells is 200, i.e., $c = (200, 200)^T$. The optimization problem becomes:
Maximize $60x_1 + 40x_2$
subject to:
for t1, cell 1: $40x_1 \le 200$,
for t1, cell 2: $20x_1 + 20x_2 \le 200$,
for t2, cell 1: $40x_1 \le 200$,
for t2, cell 2: $40x_2 \le 200$,
for t3, cell 1: $25x_1 + 25x_2 \le 200$,
for t3, cell 2: $10x_1 + 20x_2 \le 200$,
$x \ge 0$.
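With only two variables, the small example can be checked without an LP solver. The sketch below swaps in brute-force vertex enumeration for the Gurobi solver used in the study: it intersects every pair of constraint boundaries, keeps the feasible intersection points, and returns the one with the largest objective. The function name and the tolerance are illustrative assumptions, and the approach is only sensible for tiny problems like this one.

```python
from itertools import combinations

# Rows (a1, a2, b) encode a1*x1 + a2*x2 <= b, copied from the small example.
CONSTRAINTS = [
    (40, 0, 200),   # t1, cell 1
    (20, 20, 200),  # t1, cell 2
    (40, 0, 200),   # t2, cell 1
    (0, 40, 200),   # t2, cell 2
    (25, 25, 200),  # t3, cell 1
    (10, 20, 200),  # t3, cell 2
    (-1, 0, 0),     # x1 >= 0
    (0, -1, 0),     # x2 >= 0
]
OBJECTIVE = (60, 40)  # maximize 60*x1 + 40*x2


def solve_by_vertex_enumeration(constraints, objective, tol=1e-9):
    """Return (best_x, best_value) for a bounded two-variable LP."""
    best_x, best_value = None, float("-inf")
    for (a1, a2, b), (c1, c2, d) in combinations(constraints, 2):
        det = a1 * c2 - a2 * c1
        if det == 0:
            continue  # parallel or identical boundaries define no vertex
        # Cramer's rule for the intersection of the two boundary lines.
        x1 = (b * c2 - a2 * d) / det
        x2 = (a1 * d - b * c1) / det
        feasible = all(r1 * x1 + r2 * x2 <= rb + tol for r1, r2, rb in constraints)
        value = objective[0] * x1 + objective[1] * x2
        if feasible and value > best_value:
            best_x, best_value = (x1, x2), value
    return best_x, best_value


x_opt, max_obj = solve_by_vertex_enumeration(CONSTRAINTS, OBJECTIVE)
print(x_opt, max_obj)
```

The binding restrictions at the optimum are those for t1 (or t2), cell 1, and t3, cell 1, i.e. $x_1 \le 5$ and $x_1 + x_2 \le 8$.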
Solving this LP problem yields the optimal $x = (5, 3)^T$, corresponding to an objective value of 420.
Table 1. The number of subscribers in each segment for all time slots and cells for the small example.
Time slot   Cell 1                  Cell 2
            Segment 1   Segment 2   Segment 1   Segment 2
t1          40          0           20          20
t2          40          0           0           40
t3          25          25          10          20
3 Geospatial and Geo-demographic Data
The study has been conducted on anonymized geospatial and geo-demographic data provided by a Scandinavian telecommunication operator. The data consist of CDRs (Call Detail Records) containing historical location data and calls made during one week in a midsized region in Sweden with more than 1000 radio cells. Several cells can be located on the same antenna. The cell density varies between areas and is higher in city centers than in rural areas. The locations of 27010 clients are registered together with the identification of the cells that served them. The client’s location is registered every 5 minutes. In the periods when a client does not generate any traffic, she does not make any impact on the infrastructure, and such periods of inactivity are not relevant to the resource allocation analysis. These periods are not used in the present study. Every client in the database is labeled with her geodemographic segment. The fields of the database used in this study are:
• the cell IDs with the information about the users they served at different time instants,
• the location coordinates of the cells,
• the time stamp of every event that generated traffic, and the ID of the user that originated the event, and
• the MOSAIC geo-demographic segment for each client.
There are 14 MOSAIC segments present in the database; for their detailed description, the reader is referred to [21].
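The fields listed above are sufficient to build the footprint counts used by the optimization. The following is a minimal sketch, assuming each traffic event is available as a tuple (user ID, time stamp, cell ID, segment) and that time is discretized into 5-minute slots; the record layout, the segment labels "A" and "B", and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone

SLOT_SECONDS = 5 * 60  # locations are registered every 5 minutes


def footprint_from_records(records):
    """Count distinct users per (segment, time slot, cell).

    Each record is (user_id, timestamp, cell_id, segment). Only events that
    generated traffic are expected, so periods of inactivity never appear.
    """
    users = defaultdict(set)
    for user_id, timestamp, cell_id, segment in records:
        slot = int(timestamp.timestamp()) // SLOT_SECONDS
        users[(segment, slot, cell_id)].add(user_id)
    return {key: len(members) for key, members in users.items()}


def at(hour, minute):
    """Helper for the toy records: a UTC time stamp on an arbitrary day."""
    return datetime(2018, 1, 1, hour, minute, tzinfo=timezone.utc)


records = [
    ("u1", at(8, 2), "c1", "A"),
    ("u1", at(8, 4), "c1", "A"),  # same user, same slot and cell: counted once
    ("u2", at(8, 3), "c1", "A"),
    ("u1", at(8, 7), "c2", "A"),  # next 5-minute slot, different cell
    ("u3", at(8, 7), "c2", "B"),
]
footprint = footprint_from_records(records)
print(sorted(footprint.values()))
```

Counting distinct users rather than events keeps a client who generates several events within one slot from inflating the load on a cell.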
4 Color Coding
The color wheel (Fig. 1) is based on a circle showing the colors of the spectrum, originally fashioned by Sir Isaac Newton in 1666, and it serves many purposes today. Painters use it to identify colors to mix, and designers use it to choose colors that go well together. The classic color wheel shows hues arranged in a circle, connected by lines or shapes. This is not the spectrum, but rather its naïve representation. Its advantage over the “true” spectrum is that yellow has a wide zone, angle-wise proportional to red and green.
Fig. 1. Moses Harris, in his book The Natural System of Colours (1776).
We propose to use color hues to visualize the fuzzy concepts. On the color wheel we define the angle α (Fig. 2). Please note that the angle α is defined from π, not from 0. Green corresponds to α = 0, yellow to α = π/2, and red to α = π. Such uses of color are traditional in everyday symbolism (and therefore System I recognizes them immediately). For example, an everyday use is that of the traffic lights: green symbolizes a good decision/outcome, yellow is indecisive, and red is prohibitive. In chemoinformatics it is not uncommon to color different parts of molecules, highlighting toxic parts in red, and in general to highlight the chemical activity spectrum from green (for a desirable property of materials) to red (for undesirable properties). In a Linux terminal, “OK” for a process is highlighted in green, and error messages are in red. The values of the membership function f(x), with 0 ≤ f(x) ≤ 1, are mapped into colors on the color wheel following the mapping rule:
$$\alpha = \arccos(2f(x) - 1), \tag{4}$$
where the angle α uniquely defines the color hue; let the corresponding method be denoted as Return_color(α). For example, f(x) = 1 corresponds to green, f(x) = 0.5 corresponds to yellow, and f(x) = 0 corresponds to red. The colors corresponding to α > π (the cold shades) are not used in the present study, but they are indispensable for avoiding certain cognitive pitfalls, as outlined in the Future Work.
Fig. 2. A color wheel and the angle α from the horizontal plane.
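The mapping from a membership value to a hue can be sketched in a few lines. The example uses α = arccos(2f(x) − 1), the form that appears in Algorithms 1 and 2 and that matches the anchors green at α = 0, yellow at π/2 and red at π. The paper does not specify how Return_color is implemented, so the version below is an assumption: it interpolates the HSV hue linearly from 120° (green) down to 0° (red) and returns an RGB hex string.

```python
import colorsys
import math


def membership_to_angle(f):
    """Angle on the color wheel for a membership value f in [0, 1]."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("membership value must lie in [0, 1]")
    return math.acos(2.0 * f - 1.0)  # result lies in [0, pi]


def return_color(alpha):
    """Hypothetical Return_color: map alpha in [0, pi] to an RGB hex string.

    Assumption: alpha = 0 is HSV hue 120 degrees (green), alpha = pi/2 is
    60 degrees (yellow) and alpha = pi is 0 degrees (red).
    """
    if not 0.0 <= alpha <= math.pi:
        raise ValueError("only the warm half of the wheel (0 to pi) is used")
    hue = (1.0 - alpha / math.pi) / 3.0  # fraction of the full HSV circle
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


for f in (1.0, 0.5, 0.0):
    print(f, return_color(membership_to_angle(f)))
```

Because arccos is nonlinear, membership values near 0 and 1 are spread over a wider range of angles than values near 0.5, so small differences close to the extremes remain visible as distinct hues.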
4.1 Algorithm to Calculate the Color Hue for a Neighborhood: Where to Go?
Firstly, the linear optimization is solved for the postcodes in place of the geodemographic segments. This can be viewed as every postcode becoming a separate geodemographic segment. The coefficients are returned in normalized form in the range [0, 1], to match the range of the membership function. Secondly, the set of desirability coefficients is mapped into the set of color hues, according to Equation 4. This procedure is formalized in Algorithm 1.
Algorithm 1: calculating color hues for postcodes.
Input: data set D.
{x_code_1, x_code_2, …, x_code_last} = combinatorial_optimization(D, I);
for each i in 1 … last {
    α_code_i = arccos(2 x_code_i − 1);
    c_code_i = Return_color(α_code_i)
}
Output: setColorsPostcode: the vector with a color hue for each neighborhood, c = {c_code_1, c_code_2, …, c_code_last}.
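The loop of Algorithm 1 can be sketched as follows, assuming the normalized desirability coefficients are already available as a dictionary keyed by postcode. The optimization step is not reproduced here; the postcodes, the coefficient values and the coarse three-way labeling used in place of Return_color are all hypothetical.

```python
import math


def color_hues_for_postcodes(desirability, return_color):
    """Loop of Algorithm 1 for already normalized coefficients.

    desirability maps postcode -> coefficient in [0, 1];
    return_color is any callable turning an angle in [0, pi] into a color.
    """
    colors = {}
    for postcode, value in desirability.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"coefficient for {postcode} is not normalized")
        alpha = math.acos(2.0 * value - 1.0)
        colors[postcode] = return_color(alpha)
    return colors


def three_way_label(alpha):
    """Toy stand-in for Return_color: a coarse traffic-light label."""
    if alpha < math.pi / 3:
        return "green"
    if alpha <= 2 * math.pi / 3:
        return "yellow"
    return "red"


# Hypothetical normalized coefficients for three postcodes.
desirability = {"50330": 0.9, "41101": 0.5, "44130": 0.1}
print(color_hues_for_postcodes(desirability, three_way_label))
```

Passing the color function as an argument keeps the loop independent of how the angle is finally rendered, whether by binning or by a continuous hue scale.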
4.2 Algorithm to Calculate the Color Hue for a Neighborhood: How Efficiently Is the Infrastructure Exploited in the Region?
Strictly analogously to [13], we formulate the problem of how to measure and interpret the success of the infrastructure exploitation in a given geographic zone.
Query 1: How successfully is the infrastructure currently exploited?
$$f_{\text{efficiently exploited}} = \frac{\text{current\_obj}}{\text{max\_obj}},$$
where current_obj is the maximum number of persons that the infrastructure can serve, given that the present proportion of the segments is kept (x = 1), and max_obj is the theoretically largest possible number of clients that can be served given the ideal proportions of the segments from Equation 3. The segments are understood as geodemographic segments. The region is defined as a concatenation of the zones corresponding to postcodes. Algorithm 2 formalizes this procedure.
Algorithm 2: calculating color hues for the success of infrastructure exploitation for a predefined zone.
Input: data set D. [Dcodes = array containing postcodes of the zones of interest.]
(x_I, max_obj) = combinatorial_optimization(Dcodes, I);
[to get current_obj, every x_i = 1 in Eq. 1 and the maximization operator is omitted]
current_obj = Σ_{i ∈ {CC, CA, MJM, QA, T, VA}} S_i;
f_efficiently exploited = current_obj (max_obj)^(-1);
α_code_n = arccos(2 f_efficiently exploited − 1);
c_region_exploitation = Return_color(α_code_n);
Output: a color hue identifier for a predefined zone, c_region_exploitation.
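As an illustration of the arithmetic in Algorithm 2, the sketch below reuses the numbers of the small example from Section 2.2 (segment totals 60 and 40, optimal objective 420) in place of real regional data. The function name and the validity check are assumptions added for the example.

```python
import math


def exploitation_membership(segment_totals, max_obj):
    """f_efficiently_exploited = current_obj / max_obj.

    current_obj is Eq. (1) with every x_i = 1, i.e. the sum of the S_i.
    """
    current_obj = sum(segment_totals)
    if max_obj <= 0 or current_obj > max_obj:
        # Assumption: the present customer base fits the infrastructure,
        # so the ratio is a valid membership value in [0, 1].
        raise ValueError("expected 0 < current_obj <= max_obj")
    return current_obj / max_obj


f = exploitation_membership([60, 40], 420)
alpha = math.acos(2.0 * f - 1.0)
print(round(f, 3), round(alpha, 3))
```

Here f is about 0.24, so the angle falls past π/2 on the red side of yellow: the infrastructure in the toy example could serve far more clients than it currently does.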
5 Results
With the method formalized in Algorithm 2, color hues reflecting the success of the infrastructure exploitation were obtained for the postcodes; they convey an intuitively clear message. In Fig. 3, the color of the round tag on the geographic zone symbolizes the operator’s marketing success: the closer to pure green, the better. On a similar-looking map for Algorithm 1, the closer to red, the less customer base expansion is wanted in the zone (unless the physical infrastructure is upgraded to safely serve more clients). The actual coefficients have been masked via a rotation by a random angle, which does not hamper the illustrative purpose and only hides the actual efficiency of the operator in the region, which is irrelevant to the proposed method.
Fig. 3. Different zones of the Borås and Gothenburg region (the resulting color is masked).
6 Conclusions and Future Work
Instead of tables with fractions in the range [0, 1] representing the scoring of “success regarding infrastructure exploitation” or “desirability of boosting presence in different neighborhoods”, the analyst is presented with a colored map. Transmitting information about geographic zones is much handier with colored maps than with tables or verbal descriptions. The use of such a representation can vary. It creates a visual summary of the operator’s global position regarding the relationship between infrastructure and customer base. It also provides the analyst’s System I with information for cognitive fusion: a good decision is expected to hold true from the viewpoint of both System I (quick and intuitive) and System II (slow, numeric, ideally serving to double-check quick conclusions). It also teaches the computer to communicate with an analyst by appealing to his/her System I. In future work, we plan to distinguish positive characteristics (e.g. efficient) from negative ones (e.g. risky), relying on a semantic analysis. Currently, if the score for “risky” is high, it can be matched to the green hue, which is misleading. We will divide the spectrum into "paradise" (0 to π in Fig. 2) and the "underworld" (π to 2π). Thus, the color symbolism will rely on idiomatic color usage for positive semantics (red to green, as on the traffic lights) and negative semantics (blue for “under control” to purple for an alarming state).
As an implementation note, some parts are done semi-manually. Currently the angle is mapped into color via binning the angle (in the style of Fig. 1).
Acknowledgements The experiments were run on the servers of the Future SOC Lab, Hasso Plattner Institute in Potsdam. This work is part of the research project "Scalable resource efficient systems for big data analytics" funded by the Knowledge Foundation (grant: 20140032) in Sweden.
References
1. Zadeh, L., “Fuzzy Logic and Beyond – A New Look”, in (Eds.) Zadeh, L., King-Sun F., Konichi T, Fuzzy Sets and their applications to cognitive and decision processes: Proceedings of the US-Japan seminar on fuzzy sets and their applications, held at university of California, Berkeley, California, July 1-4, Academic Press, 2014.
2. Zadeh, L., Fuzzy sets. Information and control, 8(3), pp. 338-353, 1965.
3. Meyer, A. and Zimmermann, H. J., Applications of fuzzy technology in business intelligence. International Journal of Computers Communications & Control, 6(3), pp. 428-441, 2011.
4. Bilgiç, T. and Türkşen, I. B., Measurement of membership functions: theoretical and empirical work. In Fundamentals of fuzzy sets, D. Dubois, H. Prade (ed.), Boston, MA, USA, Springer, ch. 3, pp. 195-227, 2000.
5. Podapati, S., Lundberg, L., Sköld, L., Rosander, O. and Sidorova, J., Fuzzy Recommendations in Marketing Campaigns, the 1st International Workshop on Data Science: Methodologies and Use-Cases (DaS 2017) at 21st European Conference on Advances in Databases and Information Systems (ADBIS 2017). Nicosia, Cyprus, September 24, 2017, LNCS, 2830 August, Larnaca, Cyprus.
6. Kahneman, D. and Egan, P., Thinking, fast and slow (Vol. 1). New York: Farrar, Straus and Giroux, 2011.
7. Novak, V., Perfilieva, L. and Drovak, A, Insight into Fuzzy Modelling. Wiley, Hoboken, 2016.
8. Ragin, C. C., Qualitative comparative analysis using fuzzy sets (fsQCA). Rihoux, B, 2009.
9. Sidorova, J., Rosander, O., Sköld, L. and Lundberg, L., Data-driven solution to intelligent network updates for a telecom operator, Optim Eng. 2018. https://rdcu.be/PkFM Accessed on Aug. 06, 2018.
10. Sidorova, J., Rosander, O., Sköld, L., Grahn, H. and Lundberg, L., Finding a healthy equilibrium of geodemographic segments for a telecom business, Machine Learning Paradigms – Advances in Data Analytics, Springer, Intelligent Systems Reference Library book series, Eds. G. Tsihrintzis et al. In press.
11. Haenlein, M. and Kaplan, A. M., Unprofitable customers and their management. Business Horizons, 52(1), pp. 89-97, 2009.
12. Debenham, J., Clarke, G. and Stillwell, J., Extending geodemographic classification: a new regional prototype. Environment and Planning A, 35(6), pp. 1025-1050, 2003.
13. Sidorova, J., Sköld, L., Rosander, O. and Lundberg, L., Recommendations for Marketing Campaigns in Telecommunication Business based on the footprint analysis, The 8th IEEE International Conference on Information, Intelligence, Systems and Applications, IISA 2017, 27-31 August, Cyprus.
14. Sagar, S., Lundberg, L., Sköld, L. and Sidorova, J., Trajectory Segmentation for a Recommendation Module of a Customer Relationship Management System, International Symposium on Advances in Smart Big Data Processing (SBDP-2017) in conjunction with the 3rd IEEE International Conference on Smart Data (SmartData-2017), Jun 21-23, Exeter, UK, 2017.
15. Kantorovich, L. V., "Об одном эффективном методе решения некоторых классов экстремальных проблем" [A new method of solving some classes of extremal problems]. Doklady Akad Sci SSSR. 28: 211–214, 1940.
16. Dantzig, G. and Thapa, M., Linear programming 1: Introduction. Springer-Verlag, 1997.
17. Linear programming, Wikipedia, accessed 05/07/2018.
18. Naboulsi, D., Fiore, M., Ribot, S. and Stanica, R., Large-scale mobile traffic analysis: a survey. IEEE Communications Surveys & Tutorials, 18(1), pp. 124-161, 2016.
19. Song, C., Qu, Z., Blumm, N. and Barabási, A. L., Limits of predictability in human mobility. Science, 327(5968), pp. 1018-1021, 2010.
20. Lu, X., Wetter, E., Bharti, N., Tatem, A. J. and Bengtsson, L., Approaching the limit of predictability in human mobility. Scientific reports, 3, 2013.
21. InsightOne MOSAIC lifestyle classification for Sweden, http://insightone.se/en/mosaiclifestyle/, Retrieved on Jul. 03, 2018.