A Learning-based Approach to Joint Content Caching and Recommendation at Base Stations

Dong Liu and Chenyang Yang
arXiv:1802.01414v1 [cs.NI] 22 Jan 2018
Beihang University, Beijing, China. Email: {dliu, cyyang}@buaa.edu.cn

Abstract—A recommender system can shape user demands, which in turn can be exploited to boost the caching gain. In this paper, we jointly optimize content caching and recommendation at base stations. We first propose a model to capture the impact of recommendation on user demands, which is controlled by a user-specific threshold. We then formulate a joint caching and recommendation problem that maximizes the cache-hit ratio, and show that it is NP-hard. To solve the problem efficiently, we develop a greedy algorithm. Since the user threshold is unknown in practice, we further propose an ε-greedy algorithm that learns the threshold via interactions with users. Simulation results show that the greedy algorithm achieves near-optimal performance and improves the cache-hit ratio significantly compared with prior works with and without recommendation. The ε-greedy algorithm learns the user threshold quickly, and achieves more than 1 − ε of the cache-hit ratio obtained by the greedy algorithm with known user threshold.
I. INTRODUCTION

Caching at base stations (BSs) has been acknowledged as a promising way to support explosively increasing traffic demands and improve user experience [1]. To increase the caching gain at the wireless edge, where each node has a limited storage size, various proactive caching solutions have been proposed, e.g., increasing the equivalent cache size seen by each user [2], devising ingenious coded multicast [3], or combining caching with cooperative transmission [4]. A more aggressive approach to increasing the caching gain is to shape the user demand itself to be more caching-friendly. Recommender systems, whose original goal is to relieve users from information overload by learning user preferences and recommending the contents that best match those preferences [5], have been leveraged for demand shaping in content distribution networks (CDNs) [6, 7] as well as in wireless networks [8–10]. The basic idea of demand shaping is that the recommender system does not necessarily recommend the contents that best match the taste of each individual user; instead, it can recommend contents that match the user preference adequately and are also attractive to other users. If users accept the recommendation results, user demands become less heterogeneous and less uncertain. In [6], the hit ratio of YouTube's caches was increased by reordering the related-video list so that already cached contents appear at the top of the list. In [7], the data downloading cost for the service provider was reduced by proactively pushing contents to users, where demand shaping was achieved by adjusting the rating of contents shown to each user. For mobile networks, although recommender systems and cache nodes are typically managed by different entities, i.e.,
content providers and mobile network operators (MNOs), respectively, there is an increasing trend toward the convergence of MNOs and content providers. In [8], recommendation was integrated with wireless edge caching, where the BS recommends all cached contents to every user so that the content popularity becomes more skewed and a higher caching gain is achieved. However, the preference of each user was assumed to be identical. Considering heterogeneous user preferences, [9] proposed a recommendation policy to improve the cache-hit ratio by recommending contents that are both cached and appealing to the user. In [10], the caching policy was optimized to maximize a "soft" cache-hit ratio by offering related contents in the cache when the originally requested content is not cached. In fact, content caching and recommendation are coupled with each other, since recommendation influences user demands, which in turn affect the optimal caching policy. Therefore, optimizing caching or recommendation alone cannot fully reap their benefits. Moreover, both works assumed that the user demands with recommendation (i.e., the probability of a user requesting each content after recommendation [9], or the probability of a user accepting a related-content recommendation [10]) are known, which however is unavailable in practice.

In this paper, we jointly optimize content caching and recommendation at BSs without assuming known user demands after recommendation. To capture the impact of recommendation on user demands, we first propose a model for user requests with recommendation, which is controlled by a user-specific threshold determining whether the user is prone to request the recommended content. We then formulate a joint caching and recommendation problem maximizing the cache-hit ratio, which is NP-hard. To solve the problem efficiently, we propose a greedy algorithm. In practice, the user threshold is unknown. Inspired by concepts in reinforcement learning [11], we propose an ε-greedy algorithm to trade off between learning the threshold via interactions with users (i.e., exploration) and achieving the maximal cache-hit ratio based on the currently estimated threshold (i.e., exploitation). Simulation results show that the greedy algorithm is near-optimal and improves the cache-hit ratio remarkably compared with prior works. The ε-greedy algorithm converges quickly and achieves more than 1 − ε of the performance obtained by the greedy algorithm with perfect knowledge of the user threshold.

The rest of the paper is organized as follows. In Section II, we present the system model. In Section III, we introduce the user demand model before and after recommendation. In Section IV, we formulate the joint optimization problem
and propose algorithms to find the solution with known and unknown user threshold. Simulation results are provided in Section V, and Section VI concludes the paper.

II. SYSTEM MODEL

We consider a cache-enabled cellular network where each BS is equipped with a cache and connected to the core network via a backhaul link. Assume that each user is associated with the closest BS. Then, we can focus on optimizing the caching and recommendation policy in a single cell.1 Suppose that there are $N_u$ users located in the cell and in total $N_f$ equal-sized contents that the users may request. In each time slot (e.g., a few hours or a day), the BS can cache at most $N_c$ contents and recommend $N_m$ contents to each user. Denote the caching policy in the $t$th time slot as $\mathbf{c}^{(t)} = [c_1^{(t)}, \cdots, c_{N_f}^{(t)}]^T$, where $c_f^{(t)} = 1$ if the $f$th content is cached at the BS, and $c_f^{(t)} = 0$ otherwise. The recommendation policy is denoted by $\mathbf{M}^{(t)} = [\mathbf{m}_1^{(t)}, \cdots, \mathbf{m}_{N_u}^{(t)}]^T$, where $\mathbf{m}_u^{(t)} = [m_{u1}^{(t)}, \cdots, m_{uN_f}^{(t)}]^T$ is the recommendation policy for the $u$th user, and $m_{uf}^{(t)} = 1$ if the BS recommends the $f$th content to the $u$th user, and $m_{uf}^{(t)} = 0$ otherwise. The set of contents recommended to the $u$th user is denoted by $\mathcal{M}_u \triangleq \{f \mid m_{uf}^{(t)} = 1\}$.

1 The proposed framework can be extended to dense networks where a user can associate with more than one BS by considering a distributed caching policy [2], which is not shown for conciseness.

When a user intends to request a content, e.g., opens a video application (app) on a mobile device, a recommendation list is first presented, e.g., shown on the home screen of the app. If the user has no strong preference for a specific content and the recommended contents match the user's taste adequately, the user is likely to click a content in the recommendation list. By contrast, if the user has already determined what content to watch, or the recommended contents do not match the user's taste at all, the user may simply ignore the recommendation. If the eventually requested content is cached at the BS, i.e., a cache hit occurs, the BS can deliver the requested content from the local cache to the user directly. Otherwise, the BS needs to fetch the content via the backhaul first and then transmit it to the user. With known user preferences, which can be learned at the content providers, the BS can decide what to recommend and can also determine what to cache [12].
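For concreteness, the decision variables above can be represented as binary arrays. The following is a minimal sketch; the variable names and toy sizes are ours, not from the paper:

```python
import numpy as np

# Toy sizes: users, contents, cache capacity, recommendation list size.
N_u, N_f, N_c, N_m = 4, 20, 5, 3

# Caching policy c^(t): binary vector with c[f] = 1 iff content f is cached;
# feasibility requires c.sum() <= N_c.
c = np.zeros(N_f, dtype=int)

# Recommendation policy M^(t): binary N_u x N_f matrix with M[u, f] = 1 iff
# content f is recommended to user u; each row must sum to N_m.
M = np.zeros((N_u, N_f), dtype=int)
```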
III. MODELING USER DEMANDS BEFORE AND AFTER RECOMMENDATION

In this section, we first introduce the inherent user demands before a user sees the recommendation list. Then, we provide a model reflecting user personality in terms of how easily a user is influenced by a recommendation.

A. Inherent User Preference

Assume that each content can be represented by a $K$-dimensional feature vector, which can be extracted explicitly from the content metadata in the form of tags, e.g., the genre (or topic) of a movie, or learned in a latent space by various representation learning methods, e.g., matrix factorization [5] or deep neural networks [13]. Denote the feature vector of the $f$th content as $\mathbf{x}_f = [x_{f1}, \cdots, x_{fK}]^T$, where $x_{fk}$ reflects the relevance of the $f$th content to the $k$th feature. Similarly, we assume that the $u$th user can be represented by a $K$-dimensional feature vector $\mathbf{y}_u = [y_{u1}, \cdots, y_{uK}]^T$, where $y_{uk}$ reflects the interest of the $u$th user in the $k$th feature. Since a user's interest in each feature changes slowly compared with the duration of a time slot, the user feature vector can be learned from the user's content request history. The inner product $\mathbf{x}_f^T \mathbf{y}_u$ then reflects the attractiveness of the $f$th content to the $u$th user.

The inherent user preference is the probability distribution of user requests over all contents without recommendation. Denote $p_{uf}$ as the probability that the $u$th user requests the $f$th content, conditioned on the user requesting a content without recommendation. Based on the multinomial logit model in discrete choice theory [14], which is often used in economics to describe, explain, and predict choices among multiple discrete alternatives, $p_{uf}$ can be expressed as

$$p_{uf} = \frac{\exp(\mathbf{x}_f^T \mathbf{y}_u)}{\sum_{f'=1}^{N_f} \exp(\mathbf{x}_{f'}^T \mathbf{y}_u)} \qquad (1)$$

Such a function is also known as the softmax function in the machine learning literature.
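As a quick illustration of (1), a minimal sketch in Python; the helper name is ours, and we assume a content feature matrix X of shape (N_f, K) and a user feature vector y of length K:

```python
import numpy as np

def inherent_preference(X, y):
    """Eq. (1): p_uf proportional to exp(x_f^T y_u), normalized over contents."""
    scores = X @ y                 # x_f^T y_u for every content f
    scores -= scores.max()         # subtract max for numerical stability
    e = np.exp(scores)
    return e / e.sum()             # a probability vector over the N_f contents

# Example: random features for 20 contents and one user in a K = 8 space.
rng = np.random.default_rng(0)
p_u = inherent_preference(rng.normal(size=(20, 8)), rng.normal(size=8))
assert abs(p_u.sum() - 1.0) < 1e-9
```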
Since our focus is to jointly optimize caching and recommendation, we assume that the feature vectors $\mathbf{x}_f$ and $\mathbf{y}_u$ are perfectly learned, and hence the inherent user preference is known a priori.

B. Impact of Recommendation on User Demands

We model the personality of a user, in terms of the likelihood of accepting a recommendation, by a threshold $\theta_u$ ($0 \le \theta_u \le 1$), which determines whether the user is prone to request a recommended content. Denote $\boldsymbol{\theta} = [\theta_1, \cdots, \theta_{N_u}]$ as the threshold vector of all users. Specifically, if the inherent preference is above the threshold, i.e., $p_{uf} \ge \theta_u$, which means that the $f$th content attracts the $u$th user sufficiently, then the $u$th user regards the $f$th content as a candidate content to request if it is recommended. We call the set of contents in $\mathcal{M}_u$ that satisfy $p_{uf} \ge \theta_u$ the candidate subset of $\mathcal{M}_u$, denoted by $\mathcal{A}_u \triangleq \{f \mid p_{uf} \ge \theta_u, f \in \mathcal{M}_u\}$. The candidate subset limits the contents that the user may request from the recommendation list. With this model, the generative process of the user demands after recommendation is shown in Fig. 1.

Intuitively, the recommendation list will be more appealing to the user if it contains more contents that are sufficiently attractive (i.e., $\mathcal{A}_u$ is large), while it will cause information overload if the list contains too many contents (i.e., $N_m$ is large). Therefore, we assume that the probability that the $u$th user requests a content from the recommendation list is $q_u = \frac{|\mathcal{A}_u|}{N_m}$, where $|\cdot|$ denotes the cardinality of a set.
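A small sketch of the candidate subset $\mathcal{A}_u$ and the acceptance probability $q_u$; the helper name is ours:

```python
def candidate_subset(p_u, rec_list, theta_u):
    """A_u = {f in M_u : p_uf >= theta_u} and q_u = |A_u| / N_m."""
    A_u = [f for f in rec_list if p_u[f] >= theta_u]
    q_u = len(A_u) / len(rec_list)   # rec_list holds the N_m recommended contents
    return A_u, q_u
```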
When the user requests a content from the recommendation list, the request again follows the multinomial logit model, so the probability of the $u$th user requesting the $f$th content is

$$\tilde{p}_{uf} = \begin{cases} \dfrac{\exp(\mathbf{x}_f^T \mathbf{y}_u)}{\sum_{f' \in \mathcal{A}_u} \exp(\mathbf{x}_{f'}^T \mathbf{y}_u)}, & f \in \mathcal{A}_u \\ 0, & f \notin \mathcal{A}_u \end{cases} \qquad (2)$$

Otherwise, the user ignores the recommendation with probability $1 - q_u$ and requests a content according to the inherent preference $p_{uf}$.

[Fig. 1. A generative model of user demands with recommendation: when a user wants to request a content, with probability $q_u$ the user requests a content $f \in \mathcal{A}_u$ from the recommendation list with probability $\tilde{p}_{uf}$; otherwise (with probability $1 - q_u$) the user requests a content according to the inherent preference $p_{uf}$.]

From this model, we can see that when the user threshold $\theta_u$ is too high, or the recommended contents do not match the inherent user preference adequately, the candidate subset is empty, i.e., $\mathcal{A}_u = \emptyset$, and the recommendation does not affect the user's request. This is very different from the assumption in [9] that recommendation always boosts the request probability of every recommended content equally. On the contrary, when the user threshold $\theta_u$ is low, or all the recommended contents match the taste of the user sufficiently so that $|\mathcal{A}_u| = N_m$, the user only requests contents in the recommendation list. Considering that $N_m \ll N_f$ in reality, the number of possible contents that a user may request shrinks significantly compared with the case without recommendation. The uncertainty and heterogeneity of user demands can thus be reduced via effective recommendation, which suggests a possible caching gain. Since the threshold and the preference for each content may vary drastically among users due to different personalities, the recommendation policy should be designed carefully toward the maximal caching gain.

IV. JOINT CONTENT CACHING AND RECOMMENDATION

In this section, we formulate a joint content caching and recommendation problem, and solve the problem with known and unknown user threshold, respectively.

A. Problem Formulation

Based on the model in Section III-B, the probability that the $u$th user requests the $f$th content after recommendation can be expressed as

$$q_{uf}(\mathbf{m}_u^{(t)}, \theta_u) = \frac{\sum_{f'=1}^{N_f} a_{uf'} m_{uf'}^{(t)}}{N_m} \tilde{p}_{uf} + \left(1 - \frac{\sum_{f'=1}^{N_f} a_{uf'} m_{uf'}^{(t)}}{N_m}\right) p_{uf} \qquad (3)$$

where $a_{uf} = 1$ if $p_{uf} \ge \theta_u$ and $a_{uf} = 0$ otherwise, $\frac{\sum_{f'=1}^{N_f} a_{uf'} m_{uf'}^{(t)}}{N_m} = \frac{|\mathcal{A}_u|}{N_m} = q_u$ is the probability that the $u$th user requests from the recommendation list, and $\tilde{p}_{uf} = \frac{a_{uf} m_{uf}^{(t)} p_{uf}}{\sum_{f'=1}^{N_f} a_{uf'} m_{uf'}^{(t)} p_{uf'}}$ is derived by combining (1) and (2).

Then, the joint content caching and recommendation problem maximizing the cache-hit ratio can be formulated as

Problem 1:
$$\max_{\mathbf{c}^{(t)}, \mathbf{M}^{(t)}} \; h(\mathbf{c}^{(t)}, \mathbf{M}^{(t)}, \boldsymbol{\theta}) \qquad (4a)$$
$$\text{s.t.} \quad \sum_{f=1}^{N_f} c_f^{(t)} \le N_c \qquad (4b)$$
$$\sum_{f=1}^{N_f} m_{uf}^{(t)} = N_m, \; \forall u \qquad (4c)$$

where $h(\mathbf{c}^{(t)}, \mathbf{M}^{(t)}, \boldsymbol{\theta}) \triangleq \sum_{u=1}^{N_u} \sum_{f=1}^{N_f} s_u q_{uf}(\mathbf{m}_u^{(t)}, \theta_u) c_f^{(t)}$ is the cache-hit ratio of the BS, $s_u$ is the probability that a request is sent from the $u$th user, which reflects the activity level of the user [12], (4b) is the cache size constraint, and (4c) is the recommendation list size constraint.
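For reference, a direct vectorized evaluation of (2) and (3) might look as follows; this is a sketch, and the array shapes and names are our assumptions:

```python
import numpy as np

def request_prob_after_rec(p_u, m_u, theta_u, N_m):
    """Eq. (3): request probabilities of user u after recommendation.

    p_u: (N_f,) inherent preferences; m_u: (N_f,) 0-1 recommendation row.
    """
    a_u = (p_u >= theta_u).astype(float)   # a_uf = 1 iff p_uf >= theta_u
    q_u = (a_u * m_u).sum() / N_m          # probability of requesting from the list
    mass = (a_u * m_u * p_u).sum()
    # Eq. (2) via (1): renormalize p_uf over the candidate subset A_u.
    p_tilde = (a_u * m_u * p_u) / mass if mass > 0 else np.zeros_like(p_u)
    return q_u * p_tilde + (1.0 - q_u) * p_u
```

Note that an empty candidate subset gives $q_u = 0$, so the expression correctly falls back to the inherent preference $p_{uf}$.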
Consider a special case of Problem 1 where $a_{uf} = 1, \forall u, f$. With any given caching policy $\mathbf{c}^{(t)}$, the problem in this special case becomes a product assortment problem, which is known to be NP-hard [14]. Hence, Problem 1 is also NP-hard. In the sequel, we propose a greedy algorithm to solve the problem.

B. Greedy Algorithm for Known User Threshold

From Problem 1, it is not hard to see that when $\mathbf{M}^{(t)}$ is given, the problem is a special case of the knapsack problem with non-negative profits and unit weights [15]. Hence, the optimal caching policy is to cache the $N_c$ contents with the highest values of $v_f \triangleq \sum_{u=1}^{N_u} s_u q_{uf}(\mathbf{m}_u^{(t)}, \theta_u)$, i.e., the $N_c$ most popular contents after recommendation. We can express the optimal caching policy under a given recommendation policy $\mathbf{M}^{(t)}$ as a function $\mathbf{c}(\mathbf{M}^{(t)}, \boldsymbol{\theta})$.

In the greedy algorithm, we first set the recommendation lists of all users to empty, i.e., $\mathbf{M}^{(t)} = \mathbf{0}$. Then, at each iteration, we add to one user's recommendation list the single content that maximizes the cache-hit ratio. The iteration stops when every user has been recommended $N_m$ contents. The details are provided in Algorithm 1.

The complexity of computing $\mathbf{c}(\mathbf{M}^{(t)}, \boldsymbol{\theta})$ is $O(N_f \log N_f)$ due to sorting $v_f$. Hence, the complexity of solving the problem in step 3 of Algorithm 1 is $O(|\mathcal{U}||\mathcal{M}_u| N_f \log N_f)$. Since $|\mathcal{U}||\mathcal{M}_u| \le N_u N_f$ and the algorithm stops after $N_u N_m$ iterations, the overall complexity of the greedy algorithm is at most $O(N_u^2 N_m N_f^2 \log N_f)$, which is much smaller than the complexity $O\big(\binom{N_f}{N_m}^{N_u} N_f \log N_f\big)$ of solving Problem 1 by exhaustive search.

To further reduce the complexity of Algorithm 1 without sacrificing much performance, we can initialize $\mathcal{M}_u = \{f \mid p_{uf} \ge \theta_u\}$, which means that we only choose contents to recommend from the candidate contents.2 This maximizes the probability that the user requests a content from the recommendation list, i.e., $q_u$. Supposing that the number of contents satisfying $p_{uf} \ge \theta_u$ is $N_a$ for each user, the complexity of Algorithm 1 is then reduced to $O(N_u^2 N_m N_a N_f \log N_f)$, where $N_a \ll N_f$ if $\theta_u$ is not too low. We refer to this algorithm as the low-complexity version of Algorithm 1. We call the caching and recommendation policy obtained by Algorithm 1 with known user threshold the oracle policy.

2 If $|\mathcal{M}_u| < N_m$, we add contents to $\mathcal{M}_u$ sequentially in descending order of $p_{uf}$ until $|\mathcal{M}_u| = N_m$.
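Before the pseudocode, the inner caching step $\mathbf{c}(\mathbf{M}^{(t)}, \boldsymbol{\theta})$ reduces to a single sort; a minimal sketch where Q[u, f] holds $q_{uf}$ under the current recommendation policy (helper names are ours):

```python
import numpy as np

def cache_given_rec(Q, s, N_c):
    """c(M, theta): cache the N_c contents with the largest
    v_f = sum_u s_u * q_uf (post-recommendation popularity)."""
    v = s @ Q                          # (N_f,) popularity after recommendation
    c = np.zeros(Q.shape[1], dtype=int)
    c[np.argsort(v)[::-1][:N_c]] = 1   # keep the N_c most popular contents
    return c, v
```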
Algorithm 1 Greedy algorithm
Input: $p_{uf}$, $s_u$, $\theta_u$ ($u = 1, \cdots, N_u$, $f = 1, \cdots, N_f$)
1: Initialize $\mathbf{M}^{(t)} = \mathbf{0}$, the set of potential contents to recommend to the $u$th user $\mathcal{M}_u = \{1, \cdots, N_f\}$, and the set of users with fewer than $N_m$ recommended contents $\mathcal{U} = \{1, \cdots, N_u\}$.
2: while $\mathcal{U}$ is not empty do
3:   $[u^*, f^*] = \arg\max_{u \in \mathcal{U}, f \in \mathcal{M}_u} h(\mathbf{c}^{(t)}, \mathbf{M}^{(t)} + \Delta_{uf}, \boldsymbol{\theta})$, where $\mathbf{c}^{(t)} = \mathbf{c}(\mathbf{M}^{(t)} + \Delta_{uf}, \boldsymbol{\theta})$ and $\Delta_{uf}$ is an $N_u \times N_f$ 0-1 matrix with a single "1" element in the $u$th row and $f$th column.
4:   $\mathbf{M}^{(t)} \leftarrow \mathbf{M}^{(t)} + \Delta_{u^* f^*}$
5:   $\mathcal{M}_{u^*} \leftarrow \mathcal{M}_{u^*} \setminus f^*$
6:   if the $u^*$th user has been recommended $N_m$ contents then
7:     $\mathcal{U} \leftarrow \mathcal{U} \setminus u^*$
8:   end if
9: end while
Output: $\mathbf{M}^{(t)}$, $\mathbf{c}^{(t)} = \mathbf{c}(\mathbf{M}^{(t)}, \boldsymbol{\theta})$
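Putting the pieces together, Algorithm 1 could be prototyped as below; a sketch under our naming, reusing request_prob_after_rec from the earlier sketch (the nested loops make the $O(N_u^2 N_m N_f^2 \log N_f)$ cost visible):

```python
import numpy as np

def hit_ratio(P, M, theta, s, N_c, N_m):
    """Cache-hit ratio h(c(M, theta), M, theta) under the induced cache."""
    Q = np.stack([request_prob_after_rec(P[u], M[u], theta[u], N_m)
                  for u in range(P.shape[0])])
    v = s @ Q                              # post-recommendation popularity
    c = np.zeros(P.shape[1], dtype=int)
    c[np.argsort(v)[::-1][:N_c]] = 1
    return float(((s @ Q) * c).sum()), c

def greedy_algorithm_1(P, theta, s, N_c, N_m):
    """Algorithm 1: add the (user, content) pair with the best marginal gain.

    Assumes N_f >= N_m; P[u, f] = p_uf, theta[u] = theta_u, s[u] = s_u.
    """
    N_u, N_f = P.shape
    M = np.zeros((N_u, N_f), dtype=int)
    for _ in range(N_u * N_m):             # N_u * N_m additions in total
        best_h, best_uf = -1.0, None
        for u in np.where(M.sum(axis=1) < N_m)[0]:
            for f in np.where(M[u] == 0)[0]:
                M[u, f] = 1                # tentatively recommend f to u
                h, _ = hit_ratio(P, M, theta, s, N_c, N_m)
                M[u, f] = 0
                if h > best_h:
                    best_h, best_uf = h, (u, f)
        M[best_uf] = 1
    h, c = hit_ratio(P, M, theta, s, N_c, N_m)
    return M, c, h
```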
C. ε-Greedy Algorithm for Unknown User Threshold

As a parameter reflecting user behavior, $\theta_u$ is not known a priori. Fortunately, we can estimate $\theta_u$ via interactions with users in each time slot. Denote $\hat{\theta}_u^{(t)}$ as the estimate of $\theta_u$, and $\hat{\boldsymbol{\theta}}^{(t)} \triangleq [\hat{\theta}_1^{(t)}, \cdots, \hat{\theta}_{N_u}^{(t)}]$. Define an indicator function $I(u, f) = 1$ if the $u$th user requests the $f$th content from the recommendation list, and $I(u, f) = 0$ otherwise. We can make two observations from the user demand model with recommendation.

Observation 1: If $I(u, f) = 1$, we can infer that the threshold is no higher than the inherent preference of the $u$th user for the $f$th content, i.e., $\theta_u \le p_{uf}$.

Observation 2: After ranking the inherent preferences of the $u$th user in descending order as $p_{uf_1} \ge p_{uf_2} \ge \cdots \ge p_{uf_{N_f}}$ and dividing $[0, 1]$ into $N_f + 1$ subintervals $(p_{uf_1}, 1], (p_{uf_2}, p_{uf_1}], \cdots, [0, p_{uf_{N_f}}]$, we can see that the user demands after recommendation only depend on which subinterval $\theta_u$ lies in, not on the exact value of $\theta_u$. Define $\tilde{\theta}_u \triangleq p_{uf_n}$, where $p_{uf_n} \ge \theta_u > p_{uf_{n+1}}$,3 as the right end-point of the subinterval that $\theta_u$ lies in, and $\tilde{\boldsymbol{\theta}} \triangleq [\tilde{\theta}_1, \cdots, \tilde{\theta}_{N_u}]$. Then, $q_{uf}(\mathbf{M}^{(t)}, \tilde{\theta}_u) = q_{uf}(\mathbf{M}^{(t)}, \theta_u)$ for all $\mathbf{M}^{(t)}$, which means that the caching and recommendation policy obtained by Algorithm 1 based on $\tilde{\boldsymbol{\theta}}$ is the same as the oracle policy.

3 We define $p_{uf_0} \triangleq 1$ and $p_{uf_{N_f+1}} \triangleq 0$ to ensure mathematical rigor.
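Observation 1 directly yields the monotone threshold update used later in Algorithm 2; a minimal sketch, where P[u, f] holds the inherent preference $p_{uf}$ (names are ours):

```python
def update_threshold_estimate(theta_hat, u, f, P):
    """If user u requested content f from the list (I(u, f) = 1), then
    theta_u <= p_uf by Observation 1, so lower the estimate if needed."""
    if theta_hat[u] > P[u, f]:
        theta_hat[u] = P[u, f]
    return theta_hat
```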
Based on Observation 1, we first set the initial estimate to its upper bound, i.e., $\hat{\boldsymbol{\theta}}^{(0)} = \mathbf{1}$. Then, in each time slot, the BS recommends and caches contents using Algorithm 1 based on the current estimate $\hat{\boldsymbol{\theta}}^{(t)}$. This is exploitation, since the cache-hit ratio is maximized based on the current knowledge of $\boldsymbol{\theta}$. After observing the user demands in the current time slot, we update $\hat{\theta}_u^{(t+1)} = p_{uf}$ for the next time slot if $I(u, f) = 1$ and $\hat{\theta}_u^{(t)} > p_{uf}$. However, if we always recommend the contents given by Algorithm 1 with the estimate $\hat{\boldsymbol{\theta}}^{(t)}$, the recommendation tends to be conservative (i.e., only contents that best match the user's inherent preference are recommended), since $\hat{\theta}_u^{(t)} \ge \theta_u$ due to the initialization of $\hat{\boldsymbol{\theta}}^{(t)}$. Consequently, the estimated user threshold cannot be updated sufficiently to converge to $\tilde{\boldsymbol{\theta}}$, which in turn keeps the recommendation results conservative and prevents convergence to the oracle policy. Therefore, it is necessary to also recommend contents that are not given by Algorithm 1. This is exploration, since it enables us to improve the estimate of $\boldsymbol{\theta}$.

Exploitation is the right thing to do to maximize the cache-hit ratio in a single time slot, but exploration may yield a better cache-hit ratio in the long run. Inspired by the trial-and-error approach to balancing exploration and exploitation in reinforcement learning [11], we propose an ε-greedy algorithm to solve the joint content caching and recommendation problem with unknown user threshold. In the ε-greedy algorithm, the BS either applies Algorithm 1 based on the current estimate $\hat{\boldsymbol{\theta}}^{(t)}$ to obtain the caching and recommendation policy (with probability $1 - \epsilon$), or recommends contents in the set $\mathcal{R}_u \triangleq \{f \mid p_{uf} < \hat{\theta}_u^{(t)}\}$ randomly to each user (with probability $\epsilon$). The details are provided in Algorithm 2.

Algorithm 2 ε-greedy algorithm
Input: $p_{uf}$, $s_u$, $\epsilon$
1: Initialize $\hat{\boldsymbol{\theta}}^{(0)} = \mathbf{1}$
2: for $t = 1, 2, 3, \cdots$ do
3:   Generate a uniformly distributed random variable $\mathrm{rand} \in [0, 1]$.
4:   if $\mathrm{rand} > \epsilon$ then    ▷ Exploitation step
5:     Obtain $\mathbf{M}^{(t)}$ and $\mathbf{c}^{(t)}$ by Algorithm 1 based on the estimated threshold $\hat{\boldsymbol{\theta}}^{(t-1)}$.
6:   else    ▷ Exploration step
7:     Set $\mathbf{M}^{(t)}$ by recommending contents in $\mathcal{R}_u$ randomly to each user, and set $\mathbf{c}^{(t)} = \mathbf{c}(\mathbf{M}^{(t)}, \hat{\boldsymbol{\theta}}^{(t-1)})$.
8:   end if
9:   Observe the user demands in time slot $t$.
10:  for each user-content request tuple $(u, f)$ during slot $t$ do
11:    if $I(u, f) = 1$ and $\hat{\theta}_u^{(t)} > p_{uf}$ then
12:      Update $\hat{\theta}_u^{(t+1)} \leftarrow p_{uf}$
13:    end if
14:  end for
15: end for

The convergence of the ε-greedy algorithm is characterized by the following proposition.

Proposition 1: The average number of time slots needed for $\hat{\theta}_u^{(t)}$ to converge to $\tilde{\theta}_u$ is upper bounded by

$$\bar{T} \triangleq \frac{N_f \binom{N_f}{N_m}}{\epsilon \binom{N_f - 1}{N_m - 1} \left(1 - \left(1 - \frac{\min_{u, f \in \mathcal{F}_u}\{p_{uf}\}}{N_m}\right)^{N_q(u)}\right)}$$

where $N_q(u)$ denotes the number of requests sent by the $u$th user in a time slot and $\mathcal{F}_u \triangleq \{f \mid p_{uf} > 0\}$.

Proof: See Appendix.

Since $\bar{T}$ is finite, when the number of time slots is sufficiently large we have $\hat{\boldsymbol{\theta}}^{(t)} = \tilde{\boldsymbol{\theta}}$. Then, the caching and recommendation policy in the exploitation step of Algorithm 2 is the same as the oracle policy, according to Observation 2. Considering that the exploitation probability is $1 - \epsilon$, the cache-hit ratio of Algorithm 2 is at least $1 - \epsilon$ of that of the oracle policy.
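One time slot of Algorithm 2 might be sketched as follows, reusing greedy_algorithm_1, hit_ratio, and update_threshold_estimate from the earlier sketches; the padding rule when $|\mathcal{R}_u| < N_m$ is our assumption, as the paper does not specify it:

```python
import numpy as np

def epsilon_greedy_slot(P, theta_hat, s, N_c, N_m, eps, rng):
    """One time slot of Algorithm 2: exploit w.p. 1 - eps, else explore."""
    N_u, N_f = P.shape
    if rng.random() > eps:                       # exploitation step
        M, c, _ = greedy_algorithm_1(P, theta_hat, s, N_c, N_m)
    else:                                        # exploration step
        M = np.zeros((N_u, N_f), dtype=int)
        for u in range(N_u):
            R_u = np.where(P[u] < theta_hat[u])[0]  # R_u = {f : p_uf < theta_hat_u}
            pool = R_u if len(R_u) >= N_m else np.arange(N_f)  # pad (our choice)
            M[u, rng.choice(pool, size=N_m, replace=False)] = 1
        _, c = hit_ratio(P, M, theta_hat, s, N_c, N_m)
    return M, c

# After observing the slot's requests, call update_threshold_estimate(...)
# for every content that a user requested from the recommendation list.
```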
V. SIMULATION RESULTS

[Fig. 3. Cache-hit ratio versus user threshold, $N_c = 10$ (x-axis from 0 to $15/N_f$; y-axis: cache-hit ratio from 0.3 to 0.7). Legend: Greedy; Rec UP - Cache Adj - Rec Adj; Cache Pop - Rec UP; Cache Pop - Rec Pop; Cache Pop - No Rec.]
In Fig. 3, we show the impact of the user threshold. As expected, the performance of the methods with recommendation decreases with the threshold, because user demands are less affected by recommendation when the thresholds are high. Again, the proposed greedy algorithm outperforms all existing methods.

[Fig. 4. Convergence performance with learned user threshold; cache-hit ratio (0.3 to 0.9) versus time slot $t$ (0 to 200). In each time slot, 100 user requests arrive randomly. Legend: Greedy (oracle); $\epsilon = 1/t^{0.75}$; $\epsilon = 0.1$; $\epsilon = 0.01$; $\epsilon = 0$; Cache Pop - Rec UP; Cache Pop - Rec Pop; Cache Pop - No Rec.]

In Fig. 4, we compare the performance of the proposed ε-greedy algorithm with existing methods when the user threshold is unknown. Note that the method in [9] is not applicable in this case, since it requires knowledge of the user demands after recommendation. To show the tradeoff between exploration and exploitation, we compare different values of $\epsilon$. When $\epsilon = 0$, the algorithm always exploits but
never explores. Although the algorithm converges very fast at the beginning, its performance plateaus quickly. As $\epsilon$ increases, the probability of exploration increases. The performance of the algorithm with $\epsilon = 0.1$ approaches that of the oracle policy earlier than the algorithm with $\epsilon = 0.01$, but it only exploits about 90% of the time. The algorithm with $\epsilon = 0.01$ learns more slowly, but eventually outperforms the algorithm with $\epsilon = 0.1$. When reducing $\epsilon$ properly over time, e.g., by setting $\epsilon = \frac{1}{t^{0.75}}$, the performance improves faster at the beginning and achieves a higher cache-hit ratio than all the other methods as $t$ increases.

VI. CONCLUSION

In this paper, we jointly optimized content caching and recommendation at base stations. We proposed a low-complexity greedy algorithm to solve the optimization problem efficiently, and proposed an ε-greedy algorithm to learn a threshold reflecting user personality via interactions with users. Simulation results showed that the greedy algorithm achieves a near-optimal cache-hit ratio. The ε-greedy algorithm can learn the user threshold quickly and improve the cache-hit ratio significantly compared with existing solutions.

APPENDIX

We consider the worst-case scenario where $\theta_u \le \min_f \{p_{uf}\}$ and $\hat{\theta}_u^{(t)}$ is only updated in the exploration step. In this case, the number of updates required to obtain $\hat{\theta}_u^{(t)} = \tilde{\theta}_u$ is at most $N_f$. Denote $n_u(t) = 1$ if $\hat{\theta}_u^{(t)}$ is updated in the $t$th time slot, and $n_u(t) = 0$ otherwise. Define $T(\delta_u)$ as the number of time slots needed to update $\hat{\theta}_u^{(t)}$ $N_f$ times when the update probability in each time slot is $\delta_u$. Then, according to Wald's equation in martingale theory [16], we obtain $N_f = \mathbb{E}\big[\sum_{t=1}^{T(\delta_u)} n_u(t)\big] = \mathbb{E}[T(\delta_u)] \, \mathbb{E}[n_u(t)] = \mathbb{E}[T(\delta_u)] \, \delta_u$, from which we have

$$\mathbb{E}[T(\delta_u)] = N_f / \delta_u \qquad (6)$$

According to Observation 1, to update $\hat{\theta}_u^{(t)}$ in the $t$th time slot, the recommendation list should contain at least one content, say the $f$th content, satisfying $p_{uf} < \hat{\theta}_u^{(t)}$, and the user must request the $f$th content from the recommendation list. Since $\hat{\theta}_u^{(t)}$ is only updated in the exploration step, $\delta_u$ is lower bounded by

$$\delta_u \ge \frac{\epsilon \binom{N_f - 1}{N_m - 1}}{\binom{N_f}{N_m}} \left(1 - \left(1 - \frac{\min_{u, f \in \mathcal{F}_u}\{p_{uf}\}}{N_m}\right)^{N_q(u)}\right) \qquad (7)$$
where $\binom{N_f-1}{N_m-1} \big/ \binom{N_f}{N_m}$ is a lower bound on the probability that the recommendation list contains only one content $f$ satisfying $p_{uf} < \hat{\theta}_u^{(t)}$, $\frac{1}{N_m} \min_{u, f \in \mathcal{F}_u}\{p_{uf}\}$ is a lower bound on the probability that the $u$th user requests the $f$th content from the recommendation list according to the user demand model, and hence $1 - \big(1 - \frac{\min_{u, f \in \mathcal{F}_u}\{p_{uf}\}}{N_m}\big)^{N_q(u)}$ is a lower bound on the probability that the $u$th user requests the $f$th content from the recommendation list in the $t$th time slot. Then, by substituting (7) into (6), Proposition 1 is proved.

REFERENCES

[1] D. Liu, B. Chen, C. Yang, and A. F. Molisch, "Caching at the wireless edge: design aspects, challenges, and future directions," IEEE Commun. Mag., vol. 54, no. 9, pp. 22–28, Sept. 2016.
[2] N. Golrezaei, K. Shanmugam, A. G. Dimakis, A. F. Molisch, and G. Caire, "Femtocaching: Wireless video content delivery through distributed caching helpers," in Proc. IEEE INFOCOM, 2012.
[3] M. A. Maddah-Ali and U. Niesen, "Fundamental limits of caching," IEEE Trans. Inf. Theory, vol. 60, no. 5, pp. 2856–2867, May 2014.
[4] A. Liu and V. K. N. Lau, "Mixed-timescale precoding and cache control in cached MIMO interference network," IEEE Trans. Signal Process., vol. 61, no. 24, pp. 6320–6332, Dec. 2013.
[5] M. D. Ekstrand, J. T. Riedl, J. A. Konstan et al., "Collaborative filtering recommender systems," Foundations and Trends in Human–Computer Interaction, vol. 4, no. 2, pp. 81–173, 2011.
[6] D. K. Krishnappa, M. Zink, C. Griwodz, and P. Halvorsen, "Cache-centric video recommendation: An approach to improve the efficiency of YouTube caches," ACM Trans. Multimedia Comput. Commun. Appl., vol. 11, no. 4, pp. 48:1–48:20, Jun. 2015.
[7] J. Tadrous, A. Eryilmaz, and H. E. Gamal, "Proactive content download and user demand shaping for data networks," IEEE/ACM Trans. Netw., vol. 23, no. 6, pp. 1917–1930, Dec. 2015.
[8] K. Guo, C. Yang, and T. Liu, "Caching in base station with recommendation via Q-learning," in Proc. IEEE WCNC, 2017.
[9] L. E. Chatzieleftheriou, M. Karaliopoulos, and I. Koutsopoulos, "Caching-aware recommendations: Nudging user preferences towards better caching performance," in Proc. IEEE INFOCOM, 2017.
[10] P. Sermpezis, T. Spyropoulos, L. Vigneri, and T. Giannakas, "Femto-caching with soft cache hits: Improving performance through recommendation and delivery of related content," in Proc. IEEE GLOBECOM, 2017.
[11] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 1998.
[12] D. Liu and C. Yang, "Optimizing caching policy at base stations by exploiting user preference and spatial locality," arXiv:1710.09983 [cs.IT]. [Online]. Available: http://arxiv.org/abs/1710.09983
[13] A. Singhal, P. Sinha, and R. Pant, "Use of deep learning in modern recommendation system: A summary of recent works," International J. Comput. Appl., vol. 180, no. 7, pp. 17–22, Dec. 2017.
[14] P. Rusmevichientong et al., "Assortment optimization under the multinomial logit model with random choice parameters," Production and Operations Management, vol. 23, no. 11, pp. 2023–2039, 2014.
[15] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems. Springer, 2004.
[16] D. Williams, Probability with Martingales. Cambridge University Press, 1991.