Distributed Data Mining in Credit Card Fraud Detection

Philip K. Chan, Florida Institute of Technology
Wei Fan, Andreas L. Prodromidis, and Salvatore J. Stolfo, Columbia University
CREDIT CARD TRANSACTIONS continue to grow in number, taking an ever-larger share of the US payment system and leading to a higher rate of stolen account numbers and subsequent losses by banks. Improved fraud detection thus has become essential to maintain the viability of the US payment system. Banks have used early fraud warning systems for some years. Large-scale data-mining techniques can improve on the state of the art in commercial practice. Scalable techniques to analyze massive amounts of transaction data and efficiently compute fraud detectors in a timely manner are an important problem, especially for e-commerce. Besides scalability and efficiency, the fraud-detection task exhibits technical problems that include skewed distributions of training data and nonuniform cost per error, both of which have not been widely studied in the knowledge-discovery and data-mining community. In this article, we survey and evaluate a number of techniques that address these three main issues concurrently. Our proposed methods of combining multiple learned fraud detectors under a "cost model" are general and demonstrably useful; our empirical results demonstrate that we can significantly reduce loss due to fraud through distributed data mining of fraud models.
This scalable black-box approach for building efficient fraud detectors can significantly reduce loss due to illegitimate behavior. In many cases, the authors' methods outperform a well-known, state-of-the-art commercial fraud-detection system.
In today's increasingly electronic society and with the rapid advances of electronic commerce on the Internet, the use of credit cards for purchases has become convenient and necessary. Credit card transactions have become the de facto standard for Internet and Web-based e-commerce. The US government estimates that credit cards accounted for approximately US $11 billion in Internet sales during 1998. This figure is expected to grow rapidly each year. However, the growing number of credit card transactions provides more opportunity for thieves to steal credit card numbers and subsequently commit fraud. When banks lose money because of credit card fraud, cardholders pay for all of that loss through higher interest rates, higher fees, and reduced benefits. Hence, it is in both the banks' and the
cardholders' interest to reduce illegitimate use of credit cards by early fraud detection. For many years, the credit card industry has studied computing models for automated detection systems; recently, these models have been the subject of academic research, especially with respect to e-commerce. The credit card fraud-detection domain presents a number of challenging issues for data mining:

• There are millions of credit card transactions processed each day. Mining such massive amounts of data requires highly efficient techniques that scale.
• The data are highly skewed: many more transactions are legitimate than fraudulent. Typical accuracy-based mining techniques can generate highly accurate fraud detectors by simply predicting that all
transactions are legitimate, although this is equivalent to not detecting fraud at all.
• Each transaction record has a different dollar amount and thus has a variable potential loss, rather than a fixed misclassification cost per error type, as is commonly assumed in cost-based mining techniques.

Our approach

Our approach addresses the efficiency and scalability issues in several ways. We divide
a large data set of labeled transactions (either fraudulent or legitimate) into smaller subsets, apply mining techniques to generate classifiers in parallel, and combine the resultant base models by metalearning from the classifiers' behavior to generate a metaclassifier.¹ Our approach treats the classifiers as black boxes so that we can employ a variety of learning algorithms. Besides extensibility, combining multiple models computed over all available data produces metaclassifiers that
can offset the loss of predictive performance that usually occurs when mining from data subsets or sampling. Furthermore, when we use the learned classifiers (for example, during transaction authorization), the base classifiers can execute in parallel, with the metaclassifier then combining their results. So, our approach is highly efficient in generating these models and also relatively efficient in applying them.
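To make the division of labor concrete, here is a minimal sketch of the training step, assuming scikit-learn-style estimators and a standard process pool; the function names and choice of libraries are our own illustration, not the JAM system's API.

```python
from concurrent.futures import ProcessPoolExecutor
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

def fit_one(job):
    """Train a single base classifier on one labeled data subset."""
    (X, y), make_estimator = job
    return make_estimator().fit(X, y)

def train_base_models(subsets, estimator_factories):
    """Train one classifier per (subset, algorithm) pair, in parallel.

    The base learners are treated as black boxes: any factory returning
    an object with fit/predict can be plugged in.
    """
    jobs = [((X, y), make) for (X, y) in subsets for make in estimator_factories]
    with ProcessPoolExecutor() as pool:
        return list(pool.map(fit_one, jobs))

# Example: two off-the-shelf learners standing in for C4.5, CART, Ripper, and Bayes.
# base_models = train_base_models(subsets, [DecisionTreeClassifier, GaussianNB])
```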
Another parallel approach focuses on parallelizing a particular algorithm on a particular parallel architecture. However, a new algorithm or architecture requires a substantial amount of parallel-programming work. Although our architecture- and algorithm-independent approach is not as efficient as some fine-grained parallelization approaches, it lets users plug different off-the-shelf learning programs into a parallel and distributed environment with relative ease and eliminates the need for expensive parallel hardware.
Furthermore, because our approach could generate a potentially large number of classifiers from the concurrently processed data subsets, and therefore potentially require more computational resources during detection, we investigate pruning methods that identify redundant classifiers and remove them from the ensemble without significantly degrading the predictive performance. This pruning technique increases the learned detectors' computational performance and throughput.
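As an illustration of this kind of ensemble pruning, the sketch below greedily keeps only the classifiers that improve a validation-set vote. It is a generic stand-in: the article does not spell out its pruning criterion here, so the accuracy-based selection rule is our assumption.

```python
import numpy as np

def majority_vote(models, X):
    """Combine the binary {0, 1} predictions of an ensemble by majority vote."""
    votes = np.stack([m.predict(X) for m in models])
    return (votes.mean(axis=0) >= 0.5).astype(int)

def greedy_prune(models, X_valid, y_valid, max_keep):
    """Forward selection: repeatedly add the classifier that most improves
    validation accuracy, stopping once max_keep classifiers are kept."""
    kept, remaining = [], list(models)
    while remaining and len(kept) < max_keep:
        best = max(remaining,
                   key=lambda m: (majority_vote(kept + [m], X_valid) == y_valid).mean())
        kept.append(best)
        remaining.remove(best)
    return kept
```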
The issue of skewed distributions has not been studied widely because many of the data sets used in research do not exhibit this characteristic. We address skewness by partitioning the data set into subsets with a desired distribution, applying mining techniques to the subsets, and combining the mined classifiers by metalearning (as we have already discussed). Other researchers attempt to remove unnecessary instances from the majority class; instances that are in the borderline region (noise or redundant exemplars) are candidates for removal. In contrast, our approach keeps all the data for mining and does not change the underlying mining algorithms.

We address the issue of nonuniform cost by developing the appropriate cost model for the credit card fraud domain and biasing our methods toward reducing cost. This cost model determines the desired distribution just mentioned. AdaCost (a cost-sensitive version of AdaBoost) relies on the cost model for updating weights in the training distribution. (For more on AdaCost, see the "AdaCost algorithm" sidebar.) Naturally, this cost model also defines the primary evaluation criterion for our techniques. Furthermore, we investigate techniques to improve the cost performance of a bank's fraud detector by importing remote classifiers from other banks and combining this remotely learned knowledge with locally stored classifiers. The law and competitive concerns restrict banks from sharing information about their customers with other banks. However, they may share black-box fraud-detection models. Our distributed data-mining approach provides a direct and efficient solution to sharing knowledge without sharing data. We also address possible incompatibility of data schemata among different banks.

We designed and developed an agent-based distributed environment to demonstrate our distributed and parallel data-mining techniques. The JAM (Java Agents for Metalearning) system not only provides distributed data-mining capabilities, it also lets users monitor and visualize the various learning agents and derived models in real time. Researchers have studied a variety of algorithms and techniques for combining multiple computed models. The JAM system provides generic features to easily implement any of these combining techniques (as well as a large collection of base-learning algorithms), and it has been broadly available for use. The JAM system is available for download at http://www.cs.columbia.edu/~sal/JAM/PROJECT.
Credit card data and cost models
Chase Bank and First Union Bank, members of the Financial Services Technology Consortium (FSTC), provided us with real credit card data for this study. The two data sets contain credit card transactions labeled as fraudulent or legitimate. Each bank supplied 500,000 records spanning one year, with a 20% fraud and 80% nonfraud distribution for Chase Bank and 15% versus 85% for First Union Bank. In practice, fraudulent transactions are much less frequent than the 15% to 20% observed in the data given to us; these data might have been cases where the banks have difficulty in determining legitimacy correctly. In some of our experiments, we deliberately create more skewed distributions to evaluate the effectiveness of our techniques under more extreme conditions.

Bank personnel developed the schemata of the databases over years of experience and continuous analysis to capture important information for fraud detection. We cannot reveal the details of the schema beyond what we have described elsewhere.² The records of one schema have a fixed length of 117 bytes each and about 30 attributes, including the binary class label (fraudulent/legitimate transaction). Some fields are numeric and the rest categorical. Because account identification is not present in the data, we cannot group transactions into accounts. Therefore, instead of learning behavior models of individual customer accounts, we build overall models that try to differentiate legitimate transactions from fraudulent ones. Our models are customer-independent and can serve as a second line of defense, the first being customer-dependent models.

Most machine-learning literature concentrates on model accuracy (either training error or generalization error on hold-out test data, computed as overall accuracy, true-positive or false-positive rates, or return-on-cost analysis). This domain provides a considerably different metric to evaluate the learned models' performance: models are evaluated and rated by a cost model. Due to the different dollar amount of each credit card transaction and other factors, the cost of failing to detect a fraud varies with each transaction. Hence, the cost model for this domain relies on the sum and average of loss caused by fraud. We define

$$\textit{CumulativeCost} = \sum_{i=1}^{n} \textit{Cost}(i)$$

and

$$\textit{AverageCost} = \frac{\textit{CumulativeCost}}{n},$$
where Cost(i) is the cost associated with transaction i, and n is the total number of transactions. After consulting with a bank representative, we jointly settled on a simplified cost model that closely reflects reality. Because it takes time and personnel to investigate a potentially fraudulent transaction, each investigation incurs an overhead. Other related costs (for example, the operational resources needed for the fraud-detection system) are consolidated into overhead. So, if the amount of a transaction is smaller than the overhead, investigating the transaction is not worthwhile even if it is suspicious. For example, if it takes $10 to investigate a potential loss of $1, it is more economical not to investigate it. Therefore, assuming a fixed overhead, we devised the cost model shown in Table 1 for each transaction, where tranamt is the amount of a credit card transaction. The overhead threshold, for obvious reasons, is a closely guarded secret and varies over time. The values used here are probably reasonable as bounds for this data set, but the true values are probably significantly lower. We evaluated all our empirical studies using this cost model.

Table 1. Cost model assuming a fixed overhead.

Outcome                             Cost
Miss (false negative, FN)           tranamt
False alarm (false positive, FP)    overhead if tranamt > overhead, or 0 if tranamt <= overhead
Hit (true positive, TP)             overhead if tranamt > overhead, or tranamt if tranamt <= overhead
Normal (true negative, TN)          0
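Translated into code, the cost model of Table 1 looks roughly like the following. This is a minimal sketch; the function names and the idea of passing the (secret) overhead as a parameter are our own illustration.

```python
def transaction_cost(tranamt, predicted_fraud, actual_fraud, overhead):
    """Cost of one transaction under the fixed-overhead model of Table 1."""
    if predicted_fraud and tranamt > overhead:
        # Worth investigating: we pay the overhead whether the challenge
        # is a hit (TP) or a false alarm (FP).
        return overhead
    # Either we did not challenge the transaction, or it was too small to
    # investigate; fraud then costs the full amount (a miss, or an
    # uninvestigated hit), and a legitimate transaction costs nothing.
    return tranamt if actual_fraud else 0.0

def cumulative_cost(transactions, overhead):
    """CumulativeCost; dividing by len(transactions) gives AverageCost."""
    return sum(transaction_cost(amt, pred, actual, overhead)
               for amt, pred, actual in transactions)
```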
Skewed distributions

Given a skewed distribution, we would like to generate a training set of labeled transactions with a desired distribution, without removing any data, that maximizes classifier performance. In this domain, we found that determining the desired distribution is an experimental art and requires extensive empirical tests to find the most effective training distribution.

In our approach, we first create data subsets with the desired distribution (determined by extensive sampling experiments). Then we generate classifiers from these subsets and combine them by metalearning from their classification behavior. For example, if the given skewed distribution is 20:80 and the desired distribution for generating the
best models is 50:50, we randomly divide the majority instances into four partitions and form four data subsets by merging the minority instances with each of the four partitions containing majority instances. That is, the minority instances replicate across the four data subsets to generate the desired 50:50 distribution in each distributed training set.

For concreteness, let N be the size of the data set with a distribution of x:y (x is the percentage of the minority class) and u:v be the desired distribution. The number of minority instances is N × x, and the desired number of majority instances in a subset is N × x × v/u. The number of subsets is the number of majority instances (N × y) divided by the number of desired majority instances in each subset, which is N × y divided by N × x × v/u, or y/x × u/v. So, we have y/x × u/v subsets, each of which has N × x minority instances and N × x × v/u majority instances.

The next step is to apply a learning algorithm or algorithms to each subset. Because the learning processes on the subsets are independent, the subsets can be distributed to different processors and each learning process can run in parallel. For massive amounts of data, our approach can substantially improve speed for superlinear-time learning algorithms.

The generated classifiers are combined by metalearning from their classification behavior. We have described several metalearning strategies elsewhere.¹ To simplify our discussion, we only describe the class-combiner (or stacking) strategy.³ This strategy composes a metalevel training set by using the base classifiers' predictions on a validation set as attribute values and the actual classification as the class label. This training set then serves for training a metaclassifier. For integrating subsets, the class-combiner strategy is more effective than the voting-based techniques. When the learned models are used during online fraud detection, transactions feed into the learned base classifiers and the metaclassifier then combines their predictions. Again, the base classifiers are independent and can execute in parallel on different processors. In addition, our approach can prune redundant base classifiers without affecting the cost performance, making it relatively efficient in the credit card authorization process.
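The following sketch pairs the subset-generation arithmetic just described with a bare-bones class-combiner. It assumes NumPy arrays and scikit-learn-style estimators; every name in it is our own illustration rather than the article's code.

```python
import numpy as np

def make_subsets(minority, majority, u=0.5, v=0.5, seed=0):
    """Form training subsets with the desired minority:majority ratio u:v.

    All N*x minority (fraud) records are replicated into every subset;
    the N*y majority records are split across y/x * u/v subsets holding
    N*x*v/u majority records each.
    """
    rng = np.random.default_rng(seed)
    majority = majority[rng.permutation(len(majority))]
    per_subset = int(len(minority) * v / u)          # N*x*v/u
    n_subsets = max(1, len(majority) // per_subset)  # y/x * u/v
    return [np.concatenate([minority, chunk])
            for chunk in np.array_split(majority[:n_subsets * per_subset],
                                        n_subsets)]

def train_class_combiner(base_models, X_valid, y_valid, meta_model):
    """Class-combiner (stacking): base predictions on a validation set
    become the metalevel attributes; the true label is the class."""
    meta_X = np.column_stack([m.predict(X_valid) for m in base_models])
    return meta_model.fit(meta_X, y_valid)

def combined_predict(base_models, meta_model, X):
    """Online use: base classifiers (parallelizable) feed the metaclassifier."""
    meta_X = np.column_stack([m.predict(X) for m in base_models])
    return meta_model.predict(meta_X)
```

With the 20:80 example from the text, make_subsets yields exactly four 50:50 subsets, matching the y/x × u/v = 4 derivation above.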
Table 2. Cost and savings in the credit card fraud domain using class-combiner (cost ± 95% confidence interval). [The table body is garbled in this copy; the surviving entries are costs of 20.07 ± .13, 23.64 ± .96, and 29.08 ± 1.60, and savings figures of 46, 36, and 21.]
Experiments and results. To evaluate our class-combiner approach to skewed class distributions, we performed a set of experiments using the credit card fraud data from Chase.⁴ We used transactions from the first eight months (10/95-5/96) for training, the ninth month (6/96) for validating, and the twelfth month (9/96) for testing. (Credit card transactions have a natural two-month business cycle: the time to bill a customer is one month, followed by a one-month payment period, so the true label of a transaction cannot be determined in less than two months' time. Hence, models built from data in one month cannot rationally be applied to fraud detection in the next month. We therefore test our models on data that is at least two months newer.)

Based on the empirical results on the effects of class distributions, the desired distribution is 50:50. Because the given distribution is 20:80, four subsets are generated from each month, for a total of 32 subsets. We applied four learning algorithms (C4.5, CART, Ripper, and Bayes) to each subset and generated 128 base classifiers. Based on our experience with training metaclassifiers, Bayes is generally more effective and efficient, so it is the metalearner for all the experiments reported here. Furthermore, to investigate whether our approach is indeed fruitful, we ran experiments on the class-combiner strategy applied directly to the original data sets from the first eight months (that is, with the given 20:80 distribution). We also evaluated how individual classifiers generated from each month perform without class-combining.

Table 2 shows the cost and savings from the class-combiner strategy using the 50:50 distribution (128 base classifiers), the average of individual CART classifiers generated using the desired distribution (10 classifiers), class-combiner using the given distribution (32 base classifiers: 8 months × 4 learning algorithms), and the average of individual classifiers using the given distribution (the average of 32 classifiers). (We did not perform experiments on simply replicating the minority instances to achieve 50:50 in one single data
set because this approach increases the training-set size and is not appropriate in domains with large amounts of data, one of the three primary issues we address here.) Compared to the other three methods, class-combining on subsets with a 50:50 fraud distribution clearly achieves a significant increase in savings: at least $110,000 for the month (6/96). When the overhead is $50, more than half of the losses were prevented. Surprisingly, we also observe that the
50:50 distribution (generated by ignoring some data) achieved significantly more savings than combining classifiers trained from all eight months' data with the given distribution. This reaffirms the importance of employing the appropriate training class distribution in this domain. Class-combiner also contributed to the performance improvement. Consequently, utilizing the desired training distribution together with class-combiner provides a synergistic approach to data mining with nonuniform class and cost distributions.

Perhaps more importantly, how do our techniques perform compared to the bank's existing fraud-detection system? We label the current system "COTS" (commercial off-the-shelf system) in Table 2. COTS achieved significantly less savings than our techniques at the three overhead amounts we report in this table. This comparison might not be entirely accurate because COTS has much more training data than we have, and it might be optimized to a different cost model (which might even be the simple error rate). Furthermore, unlike COTS, our techniques are general for problems with skewed distributions and do not utilize any domain knowledge in detecting credit card fraud; the only exception is the cost model used for evaluation and search guidance. Nevertheless, COTS' performance on the test data provides some indication of how the existing fraud-detection system behaves in the real world. We also evaluated our method with more skewed distributions (by downsampling minority instances): 10:90, 1:99, and 1:999.
As we discussed earlier, the desired distribution is not necessarily 50:50; for instance, the desired distribution is 30:70 when the given distribution is 10:90. With 10:90 distributions, our method reduced the cost significantly more than COTS. With 1:99 distributions, our method did not outperform COTS. Neither method achieved any savings with 1:999 distributions. To characterize the conditions under which our techniques are effective, we calculate R, the ratio of the overhead amount to the average cost: R = Overhead/AverageCost. Our approach is significantly more effective than the deployed COTS when R < 6. Both methods are ineffective when R > 24. So, under a reasonable cost model with a fixed overhead cost for challenging transactions as potentially fraudulent, when the number of fraudulent transactions is a very small percentage of the total, it is financially undesirable to detect fraud; the loss due to this fraud is yet another cost of conducting business. However, filtering out "easy" (or low-risk) transactions (the data we received were possibly filtered by a similar process) can reduce a high overhead-to-loss ratio. The filtering process can use fraud detectors that are built based on individual customer profiles, which are now in use by many credit card companies. These individual profiles characterize the customers' purchasing behavior. For example, if a customer regularly buys groceries at a particular supermarket or has set up a monthly payment for phone bills, these transactions are close to no risk; hence, purchases with similar characteristics can be safely authorized without further checking. Reducing the overhead through streamlining business operations and increased automation will also lower the ratio.
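As a worked illustration of this ratio (with made-up numbers, not the article's data): if the average cost per transaction is $10, an overhead of $50 gives R = 5, inside the regime where the distributed detectors pay off, while an overhead of $300 gives R = 30, where neither approach saves money.

```python
def overhead_to_cost_ratio(overhead, cumulative_cost, n_transactions):
    """R = Overhead / AverageCost, using the cost definitions above."""
    average_cost = cumulative_cost / n_transactions
    return overhead / average_cost

# Thresholds reported in the article: the approach beats COTS when R < 6,
# and neither approach achieves savings when R > 24.
```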
Knowledge sharing through bridging

Much of the prior work on combining multiple models assumes that all models originate from different (not necessarily
distinct) subsets of a single data set as a means to increase accuracy (for example, by imposing probability distributions over the instances of the training set, or by stratified sampling, subsampling, and so forth) and not as a means to integrate distributed information. Although the JAM system addresses the latter problem by employing metalearning techniques, integrating classification models derived from distinct and distributed databases might not always be feasible. In all cases considered so far, all classification models are assumed to originate from databases of identical schemata. Because classifiers depend directly on the underlying data's format, minor differences in the schemata between databases derive incompatible classifiers; that is, a classifier cannot be applied to data of different formats. Yet these classifiers may target the same concept. We seek to bridge these disparate classifiers in some principled fashion.

The banks seek to be able to exchange their classifiers and thus incorporate useful information in their systems that would otherwise be inaccessible to both. Indeed, for each credit card transaction, both institutions record similar information; however, they also include specific fields containing important information that each has acquired separately and that provides predictive value in determining fraudulent transaction patterns. To facilitate the exchange of knowledge (represented as remotely learned models) and take advantage of incompatible and otherwise useless classifiers, we need to devise methods that bridge the differences imposed by the different schemata.

Database compatibility. The incompatible schema problem impedes JAM from taking advantage of all available databases. Let's consider two data sites A and B with databases DB_A and DB_B having similar but not identical schemata. Without loss of generality, we assume that
Schema(DB_A) = {A_1, A_2, ..., A_n, A_{n+1}, C}
Schema(DB_B) = {B_1, B_2, ..., B_n, B_{n+1}, C}

where A_i and B_i denote the ith attribute of DB_A and DB_B, respectively, and C the class label (for example, the fraud/legitimate label in the credit card fraud illustration) of each instance. Without loss of generality, we further assume that A_i = B_i, 1 ≤ i ≤ n. As for the A_{n+1} and B_{n+1} attributes, there are two possibilities:
1. A_{n+1} ≠ B_{n+1}: The two attributes are of entirely different types drawn from distinct domains. The problem can then be reduced to two dual problems where one database has one more attribute than the other; that is,

Schema(DB_A) = {A_1, A_2, ..., A_n, A_{n+1}, C}
Schema(DB_B) = {B_1, B_2, ..., B_n, C}

where we assume that attribute B_{n+1} is not present in DB_B: the second database simply chose not to include it in its schema, and presumably the other attributes (including the common ones) have predictive value.

Bridging methods. There are two methods for handling the missing attributes.⁵

Method II: Learn a local model without the missing attribute and exchange. In this approach, database DB_A can learn two local models: one with the attribute A_{n+1}, which can be used locally by the metalearning agents, and one without it, which can be subsequently exchanged.
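A minimal sketch of this second bridging method, under the schema assumptions above; the use of NumPy and a scikit-learn decision tree (standing in for any black-box learner), and all function names, are our own illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def learn_local_and_exchangeable(X_full, y, extra_col):
    """Method II at site A: train one model on the full local schema and a
    second model that omits the attribute (column extra_col, i.e. A_{n+1})
    missing from site B's schema, so the second model can be exchanged."""
    local_model = DecisionTreeClassifier().fit(X_full, y)
    X_shared = np.delete(X_full, extra_col, axis=1)   # drop A_{n+1}
    exchange_model = DecisionTreeClassifier().fit(X_shared, y)
    return local_model, exchange_model
```

Site B can then apply the exchanged model directly to its own data, since that model references only the n common attributes.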