Knowledge Management Research & Practice (2014) 12, 29–35 © 2014 Operational Research Society Ltd. All rights reserved 1477–8238/14 www.palgrave-journals.com/kmrp/

Complex adaptive information flow and search transfer analysis

Szabolcs Feczak1, Liaquat Hossain1,2 and Sven Carlsson2

1 Complex Systems and Project Management Programme, Faculty of Engineering and IT, The University of Sydney, Australia; 2 Department of Informatics, Lund University, Sweden

Correspondence: Liaquat Hossain, Complex Systems and Project Management Programme, Faculty of Engineering and IT, The University of Sydney, Darlington Campus, Sydney, NSW 2006, Australia. Tel: +61 2 9036 9110; Fax: +61 2 9351 8642; E-mail: [email protected]

Abstract

Studying information flow between node clusters can be conceptualised as an important challenge for the knowledge management research and practice community. We are confronted with issues related to establishing links between nodes and/or clusters during the process of information flow and search transfer in large distributed networks. In the case of missing socio-technical links, social networks can be instrumental in supporting the communities of practice interested in sharing and transferring knowledge across informal links. A comprehensive review of methodology for detecting missing links is provided. The proportion of common neighbours was selected as best practice to elicit missing links from a large health insurance data set. Weights were based on geographical arrangements of providers and the dollar value of transactions. The core network was elicited based on statistical thresholds. Suspicious, possibly fraudulent, behaviour is highlighted based on social network measures of the core. Our findings are supported by a health insurance industry expert panel.

Knowledge Management Research & Practice (2014) 12, 29–35. doi:10.1057/kmrp.2012.47; published online 14 January 2013

Keywords: socio-technical systems; organisational learning; networks; case study

Received: 27 January 2011; Revised: 19 December 2011; Accepted: 21 September 2012

Introduction

Organisations engaged in measures against fraudulent activities face particular pressure to meet their goals of maintaining a positive company image and improving competitiveness and market share relative to their direct and indirect competitors. Knowledge management is increasingly seen as a driver of new insights, which can be used to improve processes of managing information within organisations (Lloria, 2008). Organisations vulnerable to error or fraud must take protective steps to prevent significant loss. Increased security measures and tighter regulations on financial transactions can help to restrain malicious activities; however, these actions usually result in additional delays in payments and, ultimately, customer inconvenience. In most sectors, it is very important for companies to maintain a positive image while they carry out these counter-measures. The aim of error or fraud prevention is to maximise crime or error reduction without much overhead on the core business activity (Cahill et al, 2002). In Figure 1, a conceptual overview is presented highlighting the different boundaries of anti-fraud operations in insurance claims processing. To save on resources, an automated screening facility is fundamental to improving the efficiency of the process. Automation is possible because decisions on the acceptance of a claim are governed by insurance and medical rules.


Figure 1 Boundaries of anti-fraud operation in an insurance company.

The bulk of these rules can be described with algorithms. Since a human workforce is more expensive in the long run than computational processing power and system development, reducing the involvement of people in the decision-making process improves efficiency in processing health insurance claims. In principle, these systems do not require interaction with the customer unless the claim is suspicious and put on hold for investigation, which helps to maintain a better image of the company (Major & Riedinger, 2002). In most cases, the requirement for human intervention depends on the complexity of the claims and of the rules associated with them. Anti-fraud operations can consume a great deal of human resources, because in most cases it is difficult to draw the distinction between fraud and error. Claims that have been deliberately falsified or altered in order to gain financial benefit are unlawful and labelled as fraud, while genuine mistakes are considered errors (Sparrow, 1996). Therefore, additional clues are required to make a reliable decision as to whether a transaction is fraudulent, which can be a lengthy process involving a legal team. To keep claim payments timely, complex analytical processes are performed on historical data. This, however, implies that recovering money from a fraudulent claim identified retrospectively is uncertain, since the claim has already been paid before the investigation is finished. It is therefore desirable to address problems upfront or minimise the possibility of problems occurring (Sparrow, 1996). The aim of this paper is to provide a methodological framework, based on social network methods and related theories, for examining possible fraudulent behaviour in health insurance claims processing by conducting information flow and search transfer analysis. We describe the relationship between network structure and criminal activity for flagging potential boundary cross-over. The model presented in this paper builds on classical theories of error detection and social networks to present a framework that increases the reliability and effectiveness of anti-fraud operations.

Information Flow and Search Transfer Approaches

Traditional methods of error detection suggest the use of a threshold decision based on profiling and tracking.


Profiling has become best practice over simple thresholds, and there has also been a shift from detection towards prediction and prevention (Cahill et al, 2002). Adaptive capacity, in combination with advanced technologies such as neural networks, fuzzy logic and genetic algorithms, is required to predict infringements reliably (Phua et al, 2010). Adaptive systems are usually based on one of two types of machine learning: (i) unsupervised and (ii) supervised learning. Unsupervised methods do not perform optimally under conditions of uneven class sizes and uncertain class membership. They share an issue with traditional threshold methods: they allow for a substantial number of false positives and false negatives. Supervised methods are seen as more precise; however, they can only be used if pre-classified training sets are available, reliably identified manually or by other methods. Supervised learning algorithms can help to satisfy the requirement for adaptive capacity. Common algorithms used are Bayes, FOIL, Ripper, CART and C4 (Bolton & Hand, 2002). Observations for detecting irregularities specifically in the health insurance sector focus on behavioural heuristics, the flow of dollars, medical logic, frequency and volume of treatments, geographic origin, time and sequence of activities, granularity, and the identification process (Fawcett & Provost, 1997; Major & Riedinger, 2002). A case is marked fraudulent once the accused is convicted by court order; however, in a broad sense, even in the scientific literature, fraud is treated as analogous to potential fraud and to unusual and suspicious behaviour. Differential association theory (Chambliss, 1984) links fraud detection to the study of social networks. The theory claims that when the balance between rationalisations for law violation and for law-abiding behaviour breaks down, a person turns away from lawful life if this shift is supported by social affiliation. Criminal behaviour is therefore observed to be learned through socialising with a closed group of criminals. Not only the behaviour but also the motives, techniques and legal loopholes are learned through this association. The relation to a criminal society can vary in priority, duration and intensity. Furthermore, learning by association has been seen as an emerging theory for explaining the link between socialisation and learned behaviour. The statement '… fraudsters seldom work in isolation …' (Bolton & Hand, 2002) suggests that social network analysis could support prevention and detection methods. There is clearly a need to support the existing techniques, since machine learning algorithms do not perform reliably on their own. A combination of different algorithms was proposed to improve reliability, even though a combination of Probabilistic, Best match, Density selection and Negative selection algorithms only reaches 50% reliability (see the last column of Figure 2). Therefore, social networks and their analysis are introduced to develop a better understanding of the detection of error and/or fraud.
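As an illustration of the supervised route, the sketch below trains a shallow CART-style decision tree with scikit-learn on a tiny labelled claim set; the feature names and values are invented for demonstration and are not drawn from the study data.

from sklearn.tree import DecisionTreeClassifier

# Hypothetical features per claim: [benefit_amount, claims_per_month, provider_distance_km]
X_train = [
    [120.0, 2, 3.5],
    [980.0, 14, 42.0],
    [75.0, 1, 1.2],
    [1500.0, 20, 60.0],
]
y_train = [0, 1, 0, 1]  # 1 = previously classified as suspicious

clf = DecisionTreeClassifier(max_depth=3)  # CART implementation, kept shallow
clf.fit(X_train, y_train)

# A positive prediction would route the claim to manual investigation.
print(clf.predict([[850.0, 11, 38.0]]))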


Social Behavioural Modelling through Social Networks

Social network measures quantify the structural properties of a network and provide models to understand information flow, the evolution of the network, the role of nodes and links with certain properties, and the effect of changes in the network. The networks are built to represent interactions of nodes, usually human beings but also information systems, which makes the research area overlap with human–computer interaction. The principal measures are size, centrality, density and link weights. In social networks, centrality denotes the structural power position of a node in a given network. Centrality has three basic measures: Freeman degree centrality, the number of adjacent nodes; closeness, the reciprocal of the total number of hops along the shortest possible paths to every other node; and betweenness, the number of times the node appears on the shortest path between other nodes. The higher the value, the more influence a particular node can have on the entire network. Competitive advantage, for example, can be explained by control over information flow: entities (nodes) bridging network segments that would otherwise not be connected are in an advantageous position, in which they can benefit from using the information flowing through them or from controlling its dissemination for their own good (Burt, 1995). Centrality does not only apply to nodes; it can be calculated for the total network as well, using any of the above measures. Network density is the number of links divided by the number of all theoretically possible links (Freeman, 1979; Hanneman et al, 2001). Tie strength, the weight of the links between nodes, can also be used to explain information flow. For example, too densely connected actors provide mostly redundant, already known information; this noise causes delay, which has a negative effect on cooperative work. It is also argued that new and innovative information is usually received through weak ties, because strong ties share common information (Granovetter, 1973, 1983). In this study, we seek an answer to the question: how does link analysis of customers and providers assist in providing clues related to fraud? We explore collective fraudulent activities, focusing on collusion between providers, because it carries higher risk than collusion between health service providers and insurance fund members. Furthermore, discovering provider-to-provider agreements entails identifying the members associated with the questionable transactions.
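To make these measures concrete, the following sketch computes them with the networkx library on a toy weighted provider network; the node names and weights are illustrative only, not taken from the study data.

import networkx as nx

# Toy undirected provider network; names and weights are illustrative.
G = nx.Graph()
G.add_weighted_edges_from([
    ("P1", "P2", 3.0), ("P2", "P3", 1.0),
    ("P2", "P4", 2.0), ("P3", "P4", 1.5),
])

print(nx.degree_centrality(G))       # normalised count of adjacent nodes
print(nx.closeness_centrality(G))    # based on shortest-path hop counts
print(nx.betweenness_centrality(G))  # share of shortest paths through a node
print(nx.density(G))                 # links divided by all possible links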

Figure 2 Reliability of different algorithms and combinations of them. Source: Wheeler & Aitken (2000).

Data Collection and Analytics


Description and scope

We were given access to claims-processing data of a major Australian health insurance fund. Four years of data contained over 65,000 providers, 2 million customers and 28 million claim entries. Clustering is crucial to derive reliable results and to process the data in a reasonable amount of time. On the basis of the recommendations of an industry expert in health insurance fraud, a specific geographical area, denoted by a single postcode, was targeted first. In order to avoid cutting off links of interest at the strict borders of a postcode, the scope of the data set was extended using the following steps (a code sketch follows the list):

• selected all providers from the given postcode,
• found all customers related through claims to these providers,
• extracted all claims of these customers, and
• based on the resulting claim set, collected all providers involved.
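A minimal pandas sketch of this scoping procedure; the file name and column names (provider_id, customer_id, provider_postcode) are assumptions for illustration, not the fund's actual schema.

import pandas as pd

claims = pd.read_csv("claims.csv")  # assumed columns: provider_id, customer_id, provider_postcode
TARGET_POSTCODE = "2000"  # hypothetical target area

# 1. All providers in the target postcode.
seed = claims.loc[claims["provider_postcode"] == TARGET_POSTCODE, "provider_id"].unique()
# 2. All customers related to those providers through claims.
customers = claims.loc[claims["provider_id"].isin(seed), "customer_id"].unique()
# 3. All claims of those customers.
scoped = claims[claims["customer_id"].isin(customers)]
# 4. All providers involved in the resulting claim set.
providers = scoped["provider_id"].unique()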

Filtering

After clustering, the result set was still extremely large: over a hundred thousand claims and a couple of thousand providers. The particular fund in this study uses internal provider codes. Internal providers are considered trusted, and far more claims run through them than through individual providers; retaining them would therefore have biased our analysis, so internal providers were eliminated from our data set. Our aim was to extract a subset of data containing the top 100 health service providers of interest. The interest of the insurance fund is the amount of benefit paid to these providers; hence, most of our data filtering is based on dollar values. The first steps in filtering included the following (a pandas sketch of these filters follows the list):

• calculating the accumulated benefit over the years per provider and keeping only the top 100,
• noting that, from the network analysis perspective, a customer must link to at least two different providers before a connection between providers can be predicted, and that those links need to be significant,


• removing customers who did not have claims through at least two providers, and
• removing transactions that are below the average of their customer-provider pair group.
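The sketch below implements these filters with pandas; the claims table and its columns (provider_id, customer_id, benefit) are assumed for illustration.

import pandas as pd

claims = pd.read_csv("claims.csv")  # assumed columns: provider_id, customer_id, benefit

# Keep the top 100 providers by accumulated benefit.
top100 = claims.groupby("provider_id")["benefit"].sum().nlargest(100).index
claims = claims[claims["provider_id"].isin(top100)]

# Keep customers with claims through at least two distinct providers.
n_prov = claims.groupby("customer_id")["provider_id"].nunique()
claims = claims[claims["customer_id"].isin(n_prov[n_prov >= 2].index)]

# Drop transactions below the average of their customer-provider pair.
pair_mean = claims.groupby(["customer_id", "provider_id"])["benefit"].transform("mean")
claims = claims[claims["benefit"] >= pair_mean]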

Data preparation for analysis

The Organizational Risk Analyzer (ORA) (Carley & Reminga, 2004) application from Carnegie Mellon University was used for the analysis because it implements the folding of networks based on shared links. The first step was to format the relational data extract acquired from the insurance fund according to the needs of the social network analysis application. To do so, the following steps were taken in a first pass (sketched in code after the list):

• selecting the total dollar value of benefits grouped by customer-provider pair, year and month,
• exporting the result into a comma-separated value file, and
• inserting two new columns manually: source node class and destination node class.

The dollar value summaries served as link weights, and the two new columns were filled with constant values: customer and provider. The combined year and month field was given in the database and was used as a network identifier; thus, once imported into ORA, it further clustered the data based on time. Time information can also be used to analyse network evolution and dynamics.
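A pandas sketch of this first pass; the input columns and the exact shape of the ORA import file are assumptions for illustration.

import pandas as pd

claims = pd.read_csv("claims.csv")  # assumed columns: customer_id, provider_id, benefit, year_month

# Total benefit per customer-provider pair and per year-month network.
links = (claims
         .groupby(["customer_id", "provider_id", "year_month"], as_index=False)["benefit"]
         .sum())

# Constant node-class columns, one for each end of the link.
links["source_class"] = "customer"
links["target_class"] = "provider"

links.to_csv("ora_links.csv", index=False)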

GPS location of providers

Analysis of the link weights was performed based on the flow of dollars as well as on the distances between the providers. Identifying the GPS coordinates of the providers from their addresses was a prerequisite to calculating distances with reasonable precision. The Google Maps database was used to assign each provider the appropriate longitude and latitude pair. On the basis of the list of coordinates of all providers, ORA automatically created finer regions than postcodes. The finer granularity enabled us to group providers considered local to each other more precisely.

Shortest geographical distance of providers

Distances between linked providers were calculated and assigned as link weights. ORA was helpful in creating network data from relational data; however, the output of this operation was an adjacency matrix. For our purposes the direction of a link is not important; the matrix is therefore symmetric, and half of the output is redundant. Furthermore, to add distances it was only necessary to have data on established links, yet the majority of the matrix was filled with zeros. Such a matrix is difficult to process with standard applications such as a spreadsheet. The NCOL format (Adai et al, 2004) for social network data is much more condensed and is also suitable for further processing in relational database management systems: it has only three columns (from node, to node and link weight), and only established links are recorded. To overcome the above issue, we wrote a simple Python script that converts an adjacency matrix into NCOL format:

#!/usr/bin/env python
# Convert an ORA adjacency-matrix CSV (node labels in the first row and
# first column) into NCOL format: one "from;to;weight" line per link.
import csv
import re
import sys

m = []
r = csv.reader(open(sys.argv[1], 'rb'))
for row in r:
    m.append([i for i in row])

of = open(re.sub(".csv", "-ncol", sys.argv[1]), 'w')
# Walk the lower triangle only: the matrix is symmetric, so the upper half
# is redundant; skip zero entries so only established links are written.
for i in range(2, len(m)):
    for j in range(1, i):
        if int(m[i][j]) > 0:
            of.write("%s;%s;%s\n" % (m[i][0], m[0][j], int(m[i][j])))
of.close()
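For example, if the script is saved as adj2ncol.py (an illustrative name), running python adj2ncol.py matrix.csv writes the semicolon-separated NCOL output to matrix-ncol next to the input file.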



After the conversion, the results were loaded into a new relational database table, alongside another table holding provider identifiers with their longitude and latitude values. The two tables were matched on the provider identifiers, which resulted in coordinate-to-coordinate pairs. The great-circle distance calculation was implemented in a custom spreadsheet macro to return the distance between linked providers (a Python equivalent is sketched below).
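The macro itself is not reproduced in the paper; the following is a standard haversine implementation of the great-circle distance, shown here as an assumed equivalent.

import math

def great_circle_km(lat1, lon1, lat2, lon2):
    # Haversine formula; the Earth radius is approximated as 6371 km.
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 6371.0 * 2 * math.asin(math.sqrt(a))

# Illustrative coordinates of two linked providers.
print(great_circle_km(-33.8688, 151.2093, -33.9173, 151.2313))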

Percentage of shared customer base

In order to achieve higher precision in the analysis, the proportion of customers shared between linked providers was taken into consideration. To assign this attribute to each linked provider pair, a relational approach similar to the one used for calculating distances was applied. Our first step was to convert the adjacency network data into binary format and then fold the network based on shared customers. This resulted in a matrix in which each self-loop is a provider's total number of customers and each link weight is the number of shared customers. Dividing the two gives the proportion, which was calculated for both nodes in a pair, and the higher value was used (Bolton & Hand, 2002). A matrix sketch of this folding step follows.
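A minimal numpy sketch of the folding computation, assuming B is the binary provider-by-customer incidence matrix; the values are illustrative.

import numpy as np

# B[i, c] = 1 if provider i has at least one claim from customer c.
B = np.array([
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

F = B @ B.T          # folded network: diagonal = totals, off-diagonal = shared
totals = np.diag(F)

# Shared-customer proportion for providers i and j: the shared count divided
# by each provider's own total, keeping the higher of the two values.
i, j = 0, 1
proportion = max(F[i, j] / totals[i], F[i, j] / totals[j])
print(F[i, j], proportion)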

Limitations

Looking at the final core network figure, it must be noted that it is filtered by certain statistical thresholds. To obtain a comprehensive picture, evaluation of related providers from the entire data set is desirable; in such an extension, an egocentric approach can be used to keep the analysed data at a manageable level. The distances calculated might not be accurate owing to unclean data on provider addresses; however, they proved precise enough to pinpoint extremes. The data used in the case study were collected from a single fund, whereas in reality health service providers claim from a range of funds. This type of cross-fund data analysis, however, is prohibited by Australian public policy on privacy.

In addition, it is not possible to verify the link prediction results other than through experts. Public policy on privacy is one obstacle, but it is also very likely that service providers would not answer honestly about their connections with other providers if they were fraudulent. In other cases, validation is impossible because certain providers are now out of business.

Health Insurance Claims Processing Case Study

There are unique contextual features that influence certain aspects of the case study. The room for inconsistency between interpretations is fairly large, because there are no common standards for codes and rules; the compliance gap in health insurance claims can therefore be quite substantial. To put error detection into the context of our case study, claim processing will be detailed. By law, misinterpretation of a rule should be labelled as an unintentional act, or 'error', while deliberate misinterpretation or deception leading to any kind of unfair advantage is 'fraud' (Sparrow, 1996). The challenge is to differentiate reliably between the two cases, since 'it is not possible to determine the presence of fraud from the data alone' (Bay et al, 2006). The use of thresholds is common, to filter distinct member and provider profiles against the standards and averages of similar profile clusters. Profiles with values above the thresholds can be used as a basis of feedback to the particular provider or member. If continuous abuse of the rules is recognised, whether by a fund member or a service provider, some form of higher-level disciplinary action should be taken. The case study presented in this paper is based on data acquired from a large national health insurance company in Australia. Health insurance funds are in a position to collect sensitive data; however, the use of these data is regulated by governmental privacy laws, which differ substantially in strictness from country to country. In Australia, these laws are extremely protective; even information that in other countries would commonly be supplied and used cannot be shared. Public policy is therefore an obstacle to collecting data for such research across several insurance funds, because linking entities based on a unique identifier is prohibited, even though it would be technically possible: electronic transactions run through a centralised Health Industry Claims and Payments Service, similar to credit card payments, so inter-fund identifiers are already present both for health service providers (terminal identifier) and for patients (card identifier). Researching data from a single fund therefore captures only the tip of the iceberg. If a provider serves customers from a range of different funds and the transactions are approximately evenly distributed, it is possible to stay below the statistical thresholds of every fund's detection system. If access were granted to the big picture, it would become obvious in an instant that the sums and patterns of the transactions were out of the ordinary.

Figure 3 Demonstration of topological similarity link prediction based on common neighbours.

Figure 4 Predicted core network of providers extracted from real data with customers around them.

There are three approaches to identifying possible connections between providers: (i) probabilistic models supported by machine learning algorithms such as Relational Bayesian, Markov and Dependency Networks; (ii) node-wise similarity; and (iii) topological similarity. Probabilistic algorithms are powerful but very complex and computationally expensive (Hasan & Zaki, 2011). Node-wise prediction, which is based on information distance and similarity, disregards link information; it also requires supervised methods, and the available training data were not sufficient for this. The topological similarity method works on node-based patterns of common neighbours. In Figure 3, a demonstration network with two different node classes is shown: providers are dots and members are rectangles. The left-hand side is the normal network information, representing the connections based on the data available; the right-hand side is the pure predicted network of providers. Common neighbours is the most straightforward of the topological similarity models, compared with alternatives such as Jaccard's coefficient, Adamic/Adar and preferential attachment. Owing to its relative simplicity and readily available implementation, this method was selected for application to the data (a sketch follows).
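The sketch below reproduces the idea of Figure 3 with networkx: a small, invented bipartite claims network is folded onto the provider side, and a provider-provider link is predicted wherever two providers share members, weighted by the number of common neighbours.

import networkx as nx
from networkx.algorithms import bipartite

# Toy bipartite claims network: providers (P*) linked to members (M*).
B = nx.Graph()
B.add_edges_from([
    ("P1", "M1"), ("P2", "M1"),
    ("P2", "M2"), ("P3", "M2"), ("P3", "M3"),
])

providers = [n for n in B if n.startswith("P")]

# Fold onto providers: edge weight = number of shared members.
P = bipartite.weighted_projected_graph(B, providers)
print(list(P.edges(data=True)))  # e.g., P1-P2 via M1, P2-P3 via M2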

Figure 5 Predicted core network of providers in a filtered data set.

Once connections are identified, their significance can indicate the extent to which they are suspicious. The link between providers and members is established through claims. The insurance company reimburses the fund member with a fraction of the service fee, defined as the benefit, based on the insurance policy in effect. From the insurance fund's point of view the total amount is irrelevant; fraud protection only takes the benefit amount into consideration. Therefore, the sum of benefits paid to a fund member for services performed by a health service provider establishes the link, with its weight expressed in dollar value. The weights themselves do not signify irregularity; however, they convey the possible impact. To differentiate between coincidental and suspicious links, we propose a combined measure of (i) geographical distance (Phua et al, 2010) and (ii) shared customer base. (i) Members visiting different providers at large geographical distances from each other is unusual practice, since most people visit practitioners at convenient distances, so linked providers would generally be in the same area. Consequently, providers in physical proximity to each other and linked through shared customers are considered normal, as long as they provide different types of services. Another normal scenario is a patient referred to a specialist, who may be far away; however, if (ii) the proportion of common neighbours relative to the entire client base is high, the link looks irregular. Combining (i) and (ii) by simple addition is expected to yield a good predictor of irregularity (a sketch follows).
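A minimal sketch of the combined measure; the paper specifies simple addition, while the distance normalisation used here to make the two terms comparable is an assumption.

def suspicion_score(distance_km, shared_proportion, max_distance_km=100.0):
    # Scale distance into [0, 1] (assumed normalisation), then add the
    # shared-customer proportion, which is already in [0, 1].
    return min(distance_km / max_distance_km, 1.0) + shared_proportion

print(suspicion_score(42.0, 0.80))  # distant pair with a large shared base
print(suspicion_score(1.5, 0.05))   # nearby pair with few shared customers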


A threshold of a minimum of ten shared customers was chosen, based on a histogram of this value, after performing common-neighbours link detection on our sample. Imposing this limit results in the core network of 13 providers shown in Figure 4. To analyse this limited graph further, customer nodes were hidden to focus on provider connections within the network, and visualisation parameters were set to identify nodes by type of speciality. The result is presented in Figure 5. As discussed earlier, the sum of the distance and the shared customer base proportion is used as the link weight to indicate significance: the more weight a link carries, the more suspicious the association. Further, if a node is more central than others, it is more connected to other suspicious nodes; centrality is visualised by node size in the figure. Whenever nodes rendered in the same font are linked to each other, the same type of health service provider is sharing customers. All this information provides clues for spotting irregularities (a sketch of the core extraction follows).
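Continuing the projected provider graph P from the earlier sketch, the core network can be extracted and node sizes derived as follows; the threshold matches the paper, while the size scaling is an illustrative assumption.

import networkx as nx

MIN_SHARED = 10  # minimum shared customers, induced from the histogram

core = nx.Graph()
core.add_edges_from(
    (u, v, d) for u, v, d in P.edges(data=True) if d["weight"] >= MIN_SHARED
)

# Size each node by degree centrality as a visual proxy for suspicion.
sizes = {n: 300 + 3000 * c for n, c in nx.degree_centrality(core).items()}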

Conclusions

We have presented a comprehensive method for analysing health insurance claims data, combining social network methods with traditional profiling. A focus group interview with industry experts confirmed that the highlighted situations are indeed suspicious. The expert panel further validated that the fund's existing fraud prediction methods had not been able to pick out these providers and classify them as suspicious. We are waiting for the experts to verify our results by conducting further manual data analysis and, in case of reasonable doubt, on-site investigations.


We see great possibilities to continue and extend this work. Our next step will be to analyse network dynamics over time and look for clues in the evolution of the network. Moreover, further studies could open possibilities for correlating the results with other moderating variables, such as feedback, and for combining our proposed predictor variable with the risk index already used at the fund.


Furthermore, we have existing cases of investigated fraud that can help us to develop a set of typical social network measures for potentially fraudulent networks.

References

ADAI AT, DATE SV, WIELAND S and MARCOTTE EM (2004) LGL: creating a map of protein function with an algorithm for visualizing very large biological networks. Journal of Molecular Biology 340(1), 179–190.
BAY S, KUMARASWAMY K, ANDERLE MG, KUMAR R and STEIER DM (2006) Large scale detection of irregularities in accounting data. In Proceedings of the Sixth International Conference on Data Mining (ICDM '06), 18–22 December, pp 75–86, Hong Kong, China.
BOLTON RJ and HAND DJ (2002) Statistical fraud detection: a review. Statistical Science 17(3), 235–249.
BURT RS (1995) Structural Holes: The Social Structure of Competition. Harvard University Press, Cambridge.
CAHILL MH, LAMBERT D, PINHEIRO JC and SUN DX (2002) Detecting fraud in the real world. In The Handbook of Massive Data Sets (ABELLO J, PARDALOS PM and RESENDE MGC, Eds), pp 911–929. Kluwer Academic Publishers, Dordrecht.
CARLEY KM and REMINGA J (2004) ORA: Organization Risk Analyzer. Technical report, Carnegie Mellon University, Institute for Software Research International. http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=ADA460034.
CHAMBLISS WJ (1984) White collar crime and criminology. Contemporary Sociology 13(2), 160–162.
FAWCETT T and PROVOST F (1997) Adaptive fraud detection. Data Mining and Knowledge Discovery 1(3), 291–316.
FREEMAN LC (1979) Centrality in social networks: conceptual clarification. Social Networks 1(3), 215–239.
GRANOVETTER M (1973) The strength of weak ties. American Journal of Sociology 78(6), 1360–1380.
GRANOVETTER M (1983) The strength of weak ties: a network theory revisited. Sociological Theory 1(1), 201–233.
HANNEMAN RA and RIDDLE M (2001) Social Network Analysis. University of California, Riverside.
HASAN MA and ZAKI MJ (2011) A survey of link prediction in social networks. In Social Network Data Analytics (AGGARWAL C, Ed), pp 243–275. Springer, New York.
LLORIA MB (2008) A review of the main approaches to knowledge management. Knowledge Management Research & Practice 6(1), 77–89.
MAJOR JA and RIEDINGER DR (2002) EFD: a hybrid knowledge/statistical-based system for the detection of fraud. Journal of Risk and Insurance 69(3), 309–324.
PHUA C, LEE V, SMITH K and GAYLER R (2010) A comprehensive survey of data mining-based fraud detection research. Computing Research Repository (abs/1009.6119).
SPARROW MK (1996) Health care fraud control: understanding the challenge. Journal of Insurance Medicine 28(2), 86–96.
WHEELER R and AITKEN S (2000) Multiple algorithms for fraud detection. Knowledge-Based Systems 13(2), 93–99.

About the authors

Szabolcs Feczak is a Ph.D. student at the Complex Systems Research Group of the University of Sydney.

Liaquat Hossain is Professor and Director of the Complex Systems Research Group and Project Management Program at the University of Sydney, Australia. He is also Visiting Professor at the Department of Informatics of Lund University, Sweden.

Sven Carlsson is Professor and Head of the Department of Informatics at Lund University, Sweden.
