INTERNATIONAL JOURNAL OF PSYCHOLOGY, 1999, 34 (5/6), 460–464. Power Function ... The modified model produces power function forgetting curves.
Power Function Forgetting Curves as an Emergent Property of Biologically Plausible Neural Network Models

Sverker Sikström
University of Toronto, Canada
Empirical forgetting curve data have been shown to follow a power function. In contrast, many connectionist models predict either exponential decay or flat forgetting curves. This paper simulates power function forgetting curves in a Hopfield network modified to incorporate the more biologically realistic assumptions of bounded weights and a distribution of learning rates. The modified model produces power function forgetting curves. The bounded weights introduce exponential decay for individual weights, and summing exponential decays with different learning rates yields a power function forgetting curve. Because these assumptions are biologically reasonable, power function forgetting curves may be an emergent property of biological networks. The results fit empirical data and indicate that forgetting curves restrict the possible implementations of models of memory.

Résumé (translated from the French): Data on the forgetting curve have been shown to follow a power function. In contrast, several connectionist models predict either exponential decay or flat forgetting curves. This article simulates power function forgetting curves in a Hopfield network modified to incorporate the most biologically realistic assumptions possible. The modified model produces power functions as forgetting curves. The bounded weights introduce exponential decay for individual weights, and a power function when these decays are summed over different learning rates. Because these assumptions are biologically reasonable, power function forgetting curves could be an emergent property of biological networks. The results agree with empirical data and indicate that forgetting curves constrain the possible models of memory.
Laboratory studies of recognition and autobiographical data show a linear relationship between the logarithm of a measurement of memory (e.g. $d'$) and the logarithm of the time since the items were encoded, indicating a power function forgetting curve. Rubin and Wenzel (1996) gathered a database of 210 published data sets of forgetting curves, each of which had five or more data points, was smooth, and could be fitted by at least one function with a correlation coefficient of .90. They fitted the database with 105 different functions and found that the power function accounted for more variance than the exponential function or several other functions. However, on empirical grounds they found it difficult to distinguish between the power function and three other functions, namely the logarithm, the exponential in the square root of time, and the hyperbola in the square root of time. Accordingly, this paper makes no strong claim, on empirical grounds, as to whether the power function or one of the other three functions suggested by Rubin and Wenzel is the true function.
Crovitz and Schiffman (1974) suggested that a memory measurement (M) could be summarized by a power function of time (t):

$$M = b\,t^{-a} \tag{1}$$
where a and b are positive constants. It follows from Equation (1) that the logarithm of a memory measurement [log(M)] is linearly related to the logarithm of the time passed since encoding. In autobiographical memory, M is the number of memories per time unit. In laboratory studies M is $d'$, which is a measurement proportional to the underlying trace strength. To achieve a high degree of accounted variance, a large range of performance seems to be necessary. For example, the power function in autobiographical memory, where the range is typically three or four orders of magnitude, has been found to account for .95 to 1.00 of the variance. Anderson and Tweney (1997) argued that the experimental power function curves may be an artefact of averaging over subjects. However, Wixted and Ebbesen (1997) showed that the power function also fits better than the exponential function when data from the individual subjects are fitted. This indicates that the power function is not an artefact of averaging over subjects.

The purpose of this paper is to show that a Hopfield network modified in two respects shows power function forgetting curves. The modifications are bounded weights and a variance in the distribution of learning rates. The bounded weights make the weights decay exponentially. Power function forgetting curves are found when the exponential decays are summed over a distribution of learning rates. Bounded weights and a distribution of learning rates are biologically reasonable assumptions. Therefore, it is argued that a power function forgetting curve may be an emergent property of a biological neural network. The model is consistent with TECO (Sikström, 1996a, b, 1998), which has been applied to a wide set of memory phenomena (a full summary of TECO is beyond the scope of the present paper).

Requests for reprints should be addressed to Sverker Sikström, PhD, Department of Psychology, University of Toronto, 100 St George Street, Toronto, Ontario, Canada M5S 3G3 (Fax: +1 416 978 4811; Tel: +1 416 978 4518; E-mail: sverker@psych.utoronto.ca; Homepage: http://www.psych.utoronto.ca/~sverker/sikström.html).

© 1999 International Union of Psychological Science
A MODEL FOR POWER FUNCTION FORGETTING CURVES

First, the Hopfield network (Hopfield, 1982, 1984) is described. Activated nodes correspond to features in the represented information. Items are represented as patterns of activation. Each pattern consists of N nodes. The activation of node i at time t ($x_i^t$) can be in one of two states. The active state is represented as +1 and the inactive state as 0. The probability that a node is active is a (0 < a < 1). Each pattern is created by randomly setting exactly aN nodes to an active state and the other nodes to an inactive state. All nodes are connected to all other nodes in the network, and a weight is attached to each connection. The weight between node i and node j at time t is $w_{ij}^t$. The weight change ($\Delta w_{ij}^t$) for i = j is zero, and for i ≠ j is calculated by:

$$\Delta w_{ij}^t = \frac{1}{N}\,(x_i^t - a)(x_j^t - a) \tag{2}$$
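The learning rule in Equation (2) can be illustrated with a short sketch. This is not the author's code: the function names `make_pattern` and `weight_change`, the use of Python's `random` module, and the seed are assumptions made for illustration, and the parameter values N = 60 and a = 0.2 are the ones quoted later in the simulation section.

```python
import random


def make_pattern(n_nodes, a, rng):
    """Binary pattern with exactly round(a * n_nodes) active (1) nodes."""
    n_active = round(a * n_nodes)
    active = set(rng.sample(range(n_nodes), n_active))
    return [1 if i in active else 0 for i in range(n_nodes)]


def weight_change(pattern, a):
    """Equation (2): dw[i][j] = (x_i - a)(x_j - a) / N, zero on the diagonal."""
    n = len(pattern)
    return [[0.0 if i == j else (pattern[i] - a) * (pattern[j] - a) / n
             for j in range(n)]
            for i in range(n)]


# Hypothetical usage with the settings from the simulation section.
rng = random.Random(0)
x = make_pattern(60, 0.2, rng)
dw = weight_change(x, 0.2)
print(sum(x), "active nodes; largest weight change:", max(max(row) for row in dw))
```

Because each node contributes either (1 - a) or (0 - a), only three distinct magnitudes of weight change occur, which is what makes the expected absolute change easy to compute analytically.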
Retrieval from the network is conducted by presenting a pattern to the network. The retrieved pattern can be found by calculating the net input for each node:

$$net_i^t = \sum_{j=1}^{N} w_{ij}^t\, x_j^t \tag{3}$$

The encoded pattern can then be retrieved by synchronously activating the nodes to +1 if the argument is positive and otherwise to 0.

A standard Hopfield network produces flat forgetting curves because the probability of retrieval is independent of when the patterns are stored. To account for forgetting curves it is suggested that the weights should be bounded and that the learning rates should be different for each weight ($\eta_{ij}$). The boundaries are created by setting the weights to the maximum boundary (b) if the weight is above b. If the weight is below the minimum boundary (-b) then the weight is set to -b:

$$w_{ij}^{t+1} = \mathrm{Max}\big[\mathrm{Min}\big[w_{ij}^t + \eta_{ij}\,\Delta w_{ij}^t,\ b\big],\ -b\big] \tag{4}$$

where Min[ ] takes the minimum of the two arguments and Max[ ] takes the maximum of the two arguments. The slope of the forgetting curve is zero as long as the weight does not reach the boundary. When the boundary is reached the slope of the forgetting curve becomes negative because the boundary interferes with the weight changes. Forgetting due to the boundary at time (t) is equal to the probability that the weights "bump" into the boundary. This is equal to the expected¹ value of the absolute weight change ($\eta_{ij}\,\overline{\Delta w}$) divided by the distance between the low and the high boundaries (2b). The probability that the boundary is not reached ($\alpha_{ij}$) is then:

$$\alpha_{ij} = 1 - \mathrm{Min}\!\left[\frac{\eta_{ij}\,\overline{\Delta w}}{2b},\ 1\right] = 1 - \mathrm{Min}\!\left[\frac{2\,\eta_{ij}\,[a(1-a)]^2}{bN},\ 1\right] \tag{5}$$

where 0 < $\alpha_{ij}$ < 1. Let t represent the lag, or the encoded items between encoding and retrieval. Given a constant number of items encoded at each time period, t can be regarded as the time between encoding and retrieval. Performance over time for a single weight can be written as an exponential function of time. The performance over time ($d'(t)$) for the whole network can then be written as the average of the exponential functions with different decay parameters:²

$$d'(t) = d'(0)\,\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_{ij}^{t/2} = d'(0)\,\frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} e^{\ln(\alpha_{ij})\,t/2} \tag{6}$$
where $d'(0)$ is the $d'$ at time 0. Thus, it is predicted that bounded weights yield an exponential forgetting curve. This network will show the slowest possible forgetting curve (completely flat) if the learning rate (in relation to the boundary) is so slow that the boundary is never reached. The network can also show the fastest possible forgetting curve if the learning rate is so large that only the last item is stored in the weights. Forgetting curves of intermediate speed can be found by using learning rates between the fastest and the slowest possible. It should therefore be possible to find a distribution of weight changes that shows a power function forgetting curve by combining slow and fast forgetting rates in a suitable way.

The question is what distribution of learning rates yields a power function forgetting curve for the sum of exponentials. Mathematically, this is not an easy question. By using Laplace transformations, Newell and Rosenbloom (1981) argued that a rectangular distribution of exponentials results in an aggregate power function. However, more recently Kahana (personal communication, 6 June 1998) argued that it can be shown mathematically that any smooth probability distribution of decay parameters ($\alpha_{ij}$) in exponential functions yields an aggregated power function. This is also what is found in the simulations below.

¹ The expected absolute weight change ($\overline{\Delta w}$) is the absolute weight change ($\mathrm{abs}(\Delta w_i)$) times the probability for each weight change ($p_i$) summed over the four possible combinations of weight changes:

$$\overline{\Delta w} = \sum_{i=1}^{4} \mathrm{abs}(\Delta w_i)\,p_i = \frac{1}{N}\Big[\mathrm{abs}((0-a)(0-a))(1-a)^2 + \mathrm{abs}((0-a)(1-a))(1-a)a + \mathrm{abs}((1-a)(0-a))a(1-a) + \mathrm{abs}((1-a)(1-a))a^2\Big] = \frac{1}{N}\,[2a(1-a)]^2 \tag{9}$$

² The factor one half in the exponent is included because the weight changes are dependent. The standard deviation of the dependent mean weight changes is equal to the square root of the expected value of the independent weight changes.
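The quantities in Equations (5), (6), and (9) are simple enough to check numerically. The sketch below is illustrative only: the function names are hypothetical, and the parameter values are taken from the simulation section. Under this reading of Equation (5), a learning rate of 0.04 gives a retention probability of roughly 0.95, and a learning rate of 2 gives 0, so that only the most recent item survives. As a worked example of how a spread of decay parameters flattens the curve, if $\alpha$ were uniform on (0, 1) the average in Equation (6) would be $\int_0^1 \alpha^{t/2}\,d\alpha = 2/(t+2)$, which is not exponential and falls off approximately as $2t^{-1}$ for large t.

```python
def expected_abs_weight_change(a, n_nodes):
    """Equation (9): sum of |dw| * p over the four (x_i, x_j) combinations."""
    total = 0.0
    for xi in (0, 1):
        for xj in (0, 1):
            p = (a if xi else 1 - a) * (a if xj else 1 - a)
            total += abs((xi - a) * (xj - a)) * p
    return total / n_nodes


def retention(eta, a, n_nodes, b):
    """Equation (5): probability that a weight does not hit a boundary."""
    bump = eta * expected_abs_weight_change(a, n_nodes) / (2 * b)
    return 1.0 - min(bump, 1.0)


def relative_d_prime(t, alphas):
    """Equation (6) divided by d'(0): mean of alpha ** (t / 2)."""
    return sum(alpha ** (t / 2) for alpha in alphas) / len(alphas)


# Hypothetical usage with N = 60, a = 0.2, b = 0.00067.
for eta in (2.0, 0.04, 0.008):
    print("eta =", eta, "alpha =", retention(eta, 0.2, 60, 0.00067))

# Uniform alphas on a midpoint grid, compared with the closed form 2 / (t + 2).
grid = [(k + 0.5) / 10000 for k in range(10000)]
for t in (1, 8, 64):
    print("t =", t, relative_d_prime(t, grid), 2 / (t + 2))
```

The uniform-$\alpha$ case is chosen only because its integral has a closed form; the learning-rate distributions actually simulated in the paper induce different distributions of $\alpha_{ij}$.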
SIMULATION

A simulation was run to study how forgetting curves depend on bounded weights and the distribution of $\eta_{ij}$. The following settings were used: the number of encoded patterns (p) 64, the number of nodes (N) 60, the activation level (a) 0.2, and the boundary (b) 0.00067. Initially all weights were set to zero. First, 64 to-be-encoded patterns and 64 lure patterns were created. Then each of the 64 patterns (p) was encoded once in a temporal order. All weights in the network were changed using the same learning rule (as specified earlier). The 1, 2, 4, 8, 16, 32, and 64 latest encoded patterns were retrieved. No learning occurred during retrieval. Each simulation was repeated 500 times. The familiarity ($\mu$) of a retrieved pattern was calculated as the dot product between the net input ($net_i^t$) and the activation of the encoded pattern, scaled so that the expected value is zero ($x_i^t - a$), summed over the N nodes:

$$\mu = \sum_{i=1}^{N} net_i^t\,(x_i^t - a) \tag{7}$$
The results are presented as $d'$ calculated from the familiarity of the targets ($\mu_t$), the familiarity of the lures ($\mu_d$), and the standard deviation of the familiarity of the lures ($\sigma_d$):

$$d' = \frac{\mu_t - \mu_d}{\sigma_d} \tag{8}$$
The learning rate ($\eta_{ij}$) was varied as follows. Most prominent memory models use a constant learning rate, and several models have unbounded weights (e.g. CHARM, Metcalfe, 1991), whereas some are bounded (the context-to-item associations in Chappell and Humphreys' model, 1994). The effect of constant weight changes was simulated in Models 1A, 1B, and 1C. In Model 1A the weight change was set to a large value so that only the last item can be recalled from the network ($\eta_{ij} = 2$). In Model 1C the weight change was set so small that the boundaries cannot be reached, i.e. the weight change was practically unbounded ($\eta_{ij} = 0.008$). In Model 1B the learning rate was set to an arbitrarily chosen intermediate level ($\eta_{ij} = 0.04$).

In Models 1D, 1E, and 1F the learning rates were different for each connection. The distribution was set to a linear distribution (Model 1D, $\eta_{ij} = \eta$), an exponential distribution [Model 1E, $\eta_{ij} = \exp(\eta)$], and a power function (Model 1F, $\eta_{ij} = \eta^{-1}$), where $\eta$ is a random variable with a rectangular distribution, bounded so that $\eta_{ij}$ falls between 1 and 0.008. The learning rates were updated for each subject. However, the learning rate has to be constant during the simulation of each subject to produce appropriate forgetting curves.
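The procedure described above can be sketched as a small pure-Python simulation. This is a minimal illustration under several assumptions that the text does not settle: learning rates are drawn independently for each directed connection, "the k latest encoded pattern" is read as the pattern at lag k, lure familiarities are pooled across repetitions, and far fewer than 500 repetitions are run. All function names are hypothetical.

```python
import random
import statistics


def make_pattern(n, a, rng):
    """Binary pattern with exactly round(a * n) active nodes."""
    active = set(rng.sample(range(n), round(a * n)))
    return [1 if i in active else 0 for i in range(n)]


def encode(w, eta, x, a, b):
    """Equations (2) and (4): Hebbian step, then clip each weight to [-b, b]."""
    n = len(x)
    c = [xi - a for xi in x]
    for i in range(n):
        wi, ei, ci = w[i], eta[i], c[i]
        for j in range(n):
            if i != j:
                wi[j] = max(min(wi[j] + ei[j] * ci * c[j] / n, b), -b)


def familiarity(w, x, a):
    """Equations (3) and (7): mu = sum_i net_i * (x_i - a)."""
    active = [j for j, xj in enumerate(x) if xj]
    return sum(sum(w[i][j] for j in active) * (x[i] - a) for i in range(len(x)))


def simulate(draw_eta, n=60, a=0.2, b=0.00067, n_patterns=64,
             lags=(1, 2, 4, 8, 16, 32, 64), reps=5, seed=0):
    """Return {lag: d'} using Equation (8), pooling lures over repetitions."""
    rng = random.Random(seed)
    targets = {lag: [] for lag in lags}
    lures = []
    for _ in range(reps):
        # One fixed set of learning rates per simulated subject.
        eta = [[draw_eta(rng) for _ in range(n)] for _ in range(n)]
        w = [[0.0] * n for _ in range(n)]
        patterns = [make_pattern(n, a, rng) for _ in range(n_patterns)]
        for x in patterns:
            encode(w, eta, x, a, b)
        for lag in lags:
            targets[lag].append(familiarity(w, patterns[-lag], a))
        lures += [familiarity(w, make_pattern(n, a, rng), a)
                  for _ in range(n_patterns)]
    mu_d, sd_d = statistics.mean(lures), statistics.stdev(lures)
    return {lag: (statistics.mean(v) - mu_d) / sd_d for lag, v in targets.items()}


if __name__ == "__main__":
    models = {
        "constant eta = 0.04 (cf. Model 1B)": lambda rng: 0.04,
        "uniform eta in [0.008, 1] (cf. Model 1D)": lambda rng: rng.uniform(0.008, 1.0),
    }
    for name, draw in models.items():
        d = simulate(draw)
        print(name, {lag: round(v, 2) for lag, v in d.items()})
```

With so few repetitions the estimates are noisy, and $d'$ at long lags can come out near zero or negative, in which case its logarithm is undefined. Raising `reps` toward the 500 used in the paper trades run time for smoother curves.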
RESULTS

The results from the simulations are presented in Fig. 1a and 1b. It was predicted that the bounded weights should yield exponential forgetting curves, which appear as straight lines on a log-linear plot. This was also found for Simulations 1A, 1B, and 1C (Fig. 1a). Model 1B used an intermediate learning rate so that several items could be stored. The variance explained by an exponential function was 1.000. The predicted slope according to Equation (6) was $\alpha$ = 0.94, and the simulated slope was $\alpha$ = 0.93, indicating a reasonably good fit between predicted and simulated slopes. The forgetting was faster than a power function. The explained variance ($R^2$) on a power function was 0.78. Model 1C, where the weights were effectively unbounded (i.e. due to a very low learning rate), shows a flat "forgetting" curve independent of the time of learning. This forgetting curve is predicted from the Matrix Model (Humphreys, Bain, & Pike, 1989) and MINERVA II (Hintzman, 1987), among others. Model 1A, with maximally bounded weights, shows a forgetting curve where only the last encoded item can be retrieved (notice that the logarithmic curve is cut off so that very low $d'$ values are not displayed in the figure). Thus, the models with bounded weights and a constant learning rate show exponential forgetting curves.

In Models 1D to 1F, the learning rates were different for each weight and distributed as linear, exponential, and power functions. The results are shown in Fig. 1b. A good fit with a power function is represented in this graph by a straight line with a negative slope. The results fit a linear relationship on the log-log scale well (Model 1D $R^2$ = 0.993, Model 1E $R^2$ = 0.997, Model 1F $R^2$ = 0.993). Thus, these models are consistent with the power function forgetting curves often found empirically in long-term memory.
[Figure 1: two panels plotting ln($d'$) against the number of items encoded before the item was retrieved. Panel (a), linear x-axis: curves 1A Fastest, 1B Intermediate, 1C Slowest. Panel (b), log2 x-axis from 1 to 64: curves 1D linear, 1E exp, 1F power.]

FIG. 1. (a) Simulated data for a constant distribution of learning rates: The y-axis shows the natural logarithm (ln) of $d'$ and the x-axis the number of items encoded before the item was retrieved (i.e. time) on a linear scale. Model 1A shows the results for the fastest possible learning rate, Model 1B the results for an intermediate (i.e. $\eta_{ij}$ = 0.1) learning rate, and Model 1C the results for the slowest possible learning rate. (b) Simulated data using three different distributions of learning rates: The y-axis shows the natural logarithm of $d'$ and the x-axis the number of items encoded before the item was retrieved (i.e. time) on a log2 scale. The distributions are linear (1D), exponential (1E), and power (1F).

To summarize, power function forgetting curves were found using bounded weights and several different distributions of learning rates. Models with unbounded weights or a constant learning rate did not show a power function forgetting curve.
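The comparison between exponential and power function fits reported above comes down to two straight-line regressions: ln($d'$) against t for the exponential, and ln($d'$) against ln(t) for the power function. The sketch below shows one way to compute the explained variance for both. The function names are hypothetical, and the input data are synthetic curves rather than the simulated values from the paper.

```python
import math


def r_squared(xs, ys):
    """Squared Pearson correlation, i.e. R^2 of a straight-line fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


def fit_quality(lags, d_primes):
    """R^2 of an exponential fit (log-linear) and a power fit (log-log)."""
    log_d = [math.log(d) for d in d_primes]
    return {
        "exponential": r_squared(list(lags), log_d),
        "power": r_squared([math.log(t) for t in lags], log_d),
    }


# Synthetic example: an exact exponential and an exact power curve.
lags = [1, 2, 4, 8, 16, 32, 64]
print(fit_quality(lags, [2.0 * 0.95 ** t for t in lags]))
print(fit_quality(lags, [3.0 * t ** -0.5 for t in lags]))
```

On lags spaced in powers of two, an exact exponential is fitted only moderately well by a power function and vice versa, so the two forms are distinguishable on this design even with seven data points.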
DISCUSSION

A modified Hopfield model was proposed to account for power function forgetting curves by assuming bounded weights and a distribution of learning rates. The bounded weights introduce exponential decays for individual weights. Summing exponential decays with different learning rates (or decay parameters) yields a power function forgetting curve consistent with empirical data.
An important aspect of the model is that the assumptions are biologically plausible. Weights may be conceived of as synaptic plasticity. It is likely that synaptic plasticity in biological cells is bounded. The assumption of a positive distribution of learning rates in the present model is also plausible for neurological systems. Since these assumptions are likely to be true in biological neural networks, it is also reasonable to conclude that a power function forgetting curve may be an emergent property of biological networks. However, neurological data showing unbounded synaptic plasticity, or no distribution of synaptic plasticity, may potentially falsify the present theory. The author is unaware of explicit neurological data on the distribution of time constants for synaptic plasticity, or on boundaries for synaptic plasticity.

Most prominent memory models (e.g. the Matrix Model, Humphreys et al., 1989; the auto-associator in the model of Chappell & Humphreys, 1994; MINERVA, Hintzman, 1987; CHARM, Metcalfe, 1991) simply add the contribution of a newly encoded item to the memory vector and use unbounded "weights" with a constant learning rate. These models do not differentiate between times of encoding, so that all encoded items have the same expected probability of retrieval regardless of when they were encoded. These models predict a completely flat "forgetting" curve. None of the existing models referred to here have different learning rates in their current implementation. Chappell and Humphreys' (1994) model is the only model referred to in this paper that uses weight boundaries. This model has bounded weights in the connections from the representation of context to the items, whereas the item-to-item connections are unbounded. It may be possible to modify other models by introducing weight boundaries and a distribution of learning rates.

Although the present paper has dealt with long-term memory, the theory may be extended to short-term memory and serial position effects. For example, the recency effect may simply be a special case of the theory, whereas the primacy effect may be modelled by changing the learning rate according to the novelty of the encoding context.
REFERENCES

Albert, M.S., Butters, N., & Levin, J. (1979). Temporal gradients in the retrograde amnesia of patients with alcoholic Korsakoff's disease. Archives of Neurology, 36, 211–216.
Anderson, R.B., & Tweney, R.D. (1997). Artifactual power curves in forgetting. Memory and Cognition, 25(5), 724–730.
Chappell, M., & Humphreys, M.S. (1994). An auto-associative neural network for sparse representations: Analysis and application to models of recognition and cued recall. Psychological Review, 101(1), 103–128.
Crovitz, H.F., & Schiffman, H. (1974). Frequency of episodic memories as a function of their age. Bulletin of the Psychonomic Society, 4, 517–518.
Hintzman, D.L. (1987). Recognition and recall in MINERVA 2: Analysis of the "recognition failure" paradigm. In P. Morris (Ed.), Modelling cognition (pp. 215–229). London: Wiley.
Hintzman, D.L. (1988). Judgments of frequency and recognition memory in a multiple-trace memory model. Psychological Review, 95, 528–551.
Hopfield, J.J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554–2558.
Hopfield, J.J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, USA, 81, 3088–3092.
Humphreys, M.S., Bain, J.D., & Pike, R. (1989). Different ways to cue a coherent memory system: A theory for episodic, semantic, and procedural tasks. Psychological Review, 96, 208–233.
Metcalfe, J. (1991). Recognition failure and the composite memory trace in CHARM. Psychological Review, 98, 529–553.
Murdock, B.B. (1993). TODAM2: A model for the storage and retrieval of item, associative, and serial-order information. Psychological Review, 100, 183–203.
Newell, A., & Rosenbloom, P.S. (1981). Mechanisms of skill acquisition and the law of practice. In J.R. Anderson (Ed.), Cognitive skills and their acquisition (pp. 1–55). Hillsdale, NJ: Lawrence Erlbaum Associates Inc.
Rubin, D.C., & Wenzel, A.E. (1996). One hundred years of forgetting: A quantitative description. Psychological Review, 103, 734–760.
Sikström, P.S. (1996a). The TECO connectionist theory of recognition failure. European Journal of Cognitive Psychology, 8, 341–380.
Sikström, P.S. (1996b). TECO: A connectionist theory of successive episodic tests. Doctoral thesis, Umeå University.
Sikström, P.S. (1998). A connectionist model for novelty and familiarity in episodic memory. Manuscript submitted for publication.
Wixted, J.T., & Ebbesen, E.B. (1997). Genuine power curves in forgetting: A quantitative analysis of individual subject forgetting functions. Memory and Cognition, 25(5), 731–739.