Nash-Reinforcement Learning (N-RL) for Developing Coordination Strategies in Non-Transferable Utility Games

Kaveh Madani
Centre for Environmental Policy, Imperial College London, London, U.K.

Milad Hooshyar
Department of Civil, Environmental and Construction Engineering, University of Central Florida, Orlando, Florida, U.S.A.

Sina Khatami
Department of Infrastructure Engineering, The University of Melbourne, Melbourne, Australia

Ali Alaeipour
City of San Diego, San Diego, CA, U.S.A.

Aida Moeini
Department of Obstetrics and Gynecology, Shahid Beheshti University of Medical Sciences, Tehran, Iran

Abstract: Social (central) planning is commonly used in the literature to optimize the system-wide efficiency and utility of multi-operator systems. Central planning tries to maximize the system's benefits by coordinating the operators' strategies and reducing externalities, assuming that all parties are willing to cooperate. This assumption implies that operators are willing to base their decisions on group rationality rather than individual rationality, even if increased group benefits result in reduced benefits for some agents. It limits the applicability of the social planner's solutions, as perfect cooperation among agents is often infeasible in the real world. Because decisions in human systems are normally based on individual rationality, cooperative game theory methods are normally employed to address this major limitation of social planner methods. Game theory methods revise the social planner's solution such that group benefits are increased and no agent's cooperative gain is less than its non-cooperative gain. However, in most cases utility is assumed to be transferrable, and the literature has not sufficiently focused on non-transferrable utility games. In such games, parties are willing to cooperate and coordinate their strategies to increase their benefits, but have no ability to compensate each other to promote cooperation. To a good extent, the transferrable utility assumption is due to the complexity of the calculations needed to find the best response strategies of agents in non-cooperative and cooperative modes, especially in multi-period games. By combining reinforcement learning and the Nash bargaining solution, this paper develops a new method for applying cooperative game theory to complex multi-period non-transferrable utility games. For illustration, the suggested method is applied to two numerical examples in which two hydropower operators seek to develop a fair and efficient cooperation mechanism to increase their gains.

Keywords: agent-based modeling (ABM); reinforcement learning; game theory; reservoir operation; water resources

I. INTRODUCTION

Efficient operation of multi-agent systems is always challenging due to the externalities resulting from the self-optimizing attitudes of the agents and decision making based on individual rationality [1]. Such externalities normally result in deviation from the Pareto-optimal system solution that can be reached if decisions are based on group rationality. While agent-based modeling and non-cooperative game theory approaches [2-9] can help identify the reachable/stable solutions of multi-agent systems, by their nature they are not very helpful in identifying the Pareto-optimal (efficient) solutions. On the other hand, the social planner (system-wide optimization) approaches that have been commonly used for finding optimal solutions of multi-agent systems normally fail to develop practical solutions [10]. This is because these methods inherently assume that the agents are willing to base their decisions on group rationality and fully cooperate to maximize the obtainable system benefits, even if their cooperation results in decreased benefits for some agents.

Cooperative game theory methods have been employed to address the major gaps in our understanding of multi-agent systems and to make the social planner solutions practical [11-14]. These methods help redistribute the total obtainable benefits under cooperation such that no agent has an incentive to act non-cooperatively (as in the status quo) or to join any sub-coalition of agents. While game theoretic allocation of the incremental benefits of cooperation can make the social planner's solution feasible, reallocation of benefits is only possible if utility is transferrable. However, utility is not necessarily transferrable in practice for different reasons, including the lack of a compensation mechanism, high transaction costs, and incompatible utility functions and units. The cooperative game theory literature mainly focuses on transferrable utility (TU) games, and applications of cooperative game theory to non-transferrable utility (NTU) games are limited [15-18]. This is mainly because of the more complex nature of NTU problems and the challenge of calculating the obtainable benefits of all possible coalitions under their best response strategies. Therefore, by combining the Nash bargaining method [19] and reinforcement learning [20], this paper proposes a new method for facilitating the application of cooperative game theory to complex multi-agent NTU games. This work extends the scope of the authors' recent work [21] on developing a game theory-reinforcement learning (GT-RL) method for TU games.

II. METHOD

The Nash bargaining solution [19] uses the following optimization model to allocate the benefits of cooperation between two agents:

\[ \max \; (u_1^* - u_1)(u_2^* - u_2) \tag{1} \]

subject to:

\[ u_i^* \geq u_i \quad \forall i = 1, 2 \tag{2} \]

\[ \sum_{i \in \{1,2\}} u_i^* = S \tag{3} \]

where \(u_i^*\) is the cooperative share of agent \(i\) and \(u_i\) is the non-cooperative (status-quo) utility of agent \(i\). \(S\) represents the total obtainable benefits through cooperation, assuming that the agents' utilities are measured in the same unit. In TU games, the value of \(S\) is first calculated using the social planner (total welfare maximizing) approach, which maximizes the total utility of the players regardless of how the utilities are distributed among the agents. The Nash bargaining solution is then applied to redistribute the total benefits, ensuring that the optimal solution based on group rationality is also optimal based on individual rationality.
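For the TU case, the model in Equations 1-3 admits a closed-form solution. The short derivation below is a sketch that assumes the total cooperative benefit is at least the sum of the status-quo utilities, i.e. \(S \geq u_1 + u_2\), so that Equation 2 can be satisfied; it shows that the two agents split the cooperative surplus equally.

```latex
\begin{align*}
u_2^* &= S - u_1^* && \text{(from Equation 3)} \\
\frac{d}{du_1^*}\Big[(u_1^* - u_1)(S - u_1^* - u_2)\Big]
      &= S - 2u_1^* + u_1 - u_2 = 0 && \text{(first-order condition for Equation 1)} \\
u_1^* &= u_1 + \tfrac{1}{2}\,(S - u_1 - u_2), \qquad
u_2^* = u_2 + \tfrac{1}{2}\,(S - u_1 - u_2)
\end{align*}
```

Because the surplus \(S - u_1 - u_2\) is non-negative under the stated assumption, both shares satisfy Equation 2.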

When utility is not transferrable, the described procedure is not applicable, as redistribution of utilities is not an option. Instead, the Nash bargaining model can be solved directly, with the cooperative and non-cooperative utilities defined as functions of the actions taken by the players in the cooperative and non-cooperative cases, respectively [17]. In this case, the value of \(S\) is not known a priori, so Equation 3 cannot be used as a constraint. Other system-specific constraints are needed to define the options' dependency/feasibility, which help the optimization model determine the upper bounds of the players' utilities based on the feasible action set.

The \(u_i\) values (the agents' status-quo utilities) must be known to determine the fair distribution of utilities under cooperation. The non-cooperative utilities are what the players can gain under their best response strategies, i.e., at the non-cooperative Nash equilibrium [22]. Identifying the best response strategies and the non-cooperative equilibrium is challenging in complex multi-step problems in which parties try to maximize their benefits over the course of the game in an interactive environment. As suggested by [21], reinforcement learning [20] is by its nature an appropriate evolutionary optimization method for finding the best response strategies of players in complex multi-period games. This method is used here to find the \(u_i\) values.

Reinforcement learning (RL) is a computational method in which a learner (agent) is trained to take optimal actions through interaction with its environment [20]. RL comprises two main elements: the agent (learner) and the learning environment. Through a learning process, the agent identifies the set of optimal actions that affect the environment. The environment is defined as the set of feasible states that the agent may visit. The aim of learning is to find the optimal action for the agent in each feasible state. Q-Learning [23, 24] is the RL method used in this study. Example applications of Q-Learning include [21, 25-27].

Once the \(u_i\) values are determined, the Nash bargaining solution can be applied to find the agents' optimal strategies, leading to fair cooperative utility (\(u_i^*\)) allocations. Q-Learning can be applied to solve the non-linear optimization problem that has the Nash product (Equation 1) as its objective function. The cooperative solution developed with the proposed N-RL method results in increased total benefits by increasing the benefits of both parties (satisfying Equation 2), with no need for side payments or utility transfer.

For illustration, the proposed N-RL method is applied to two examples with different complexity levels.

III. NUMERICAL EXAMPLE I

In this problem (Fig. 1), two independent hydropower reservoirs share the electricity market. While the two reservoir operating agents make operations decisions independently, the electricity prices are affected by the total hydroelectricity production. Therefore, each operator's decisions produce externalities that affect the other operator's revenues.

Fig. 1. Schematic of the hydroelectricity market sharing problem I.

Table I shows the storage capacities of the two reservoirs, their maximum releases (hydropower plant inlet capacities), and their annual inflows. In this problem, it is assumed that hydropower generation is a linear function of release. In other words, the energy head is considered to be constant in both reservoirs, which is a reasonable assumption for high-elevation (high-head) hydropower facilities [28]. Based on the revenue curve concept [29, 30], total revenue in the shared market is assumed to be a function of total hydroelectric power generation (Fig. 2). This figure shows how the marginal benefit of electricity production decreases with increasing generation.

TABLE I. STORAGE CAPACITY, MAXIMUM RELEASE AND INFLOW OF RESERVOIRS IN PROBLEM I (IN MILLION CUBIC METERS (MCM))

| | Reservoir 1 | Reservoir 2 |
|---|---|---|
| Storage Capacity | 200 | 100 |
| Maximum Release | 150 | 100 |
| Inflow | 150 | 100 |

Fig. 2. Hydropower Generation Revenue Curve in the Shared Electricity Market (Problem I).

The net benefits of the two reservoir operators (\(B_1\) and \(B_2\)) are calculated using the following equations:

\[ B_1 = \frac{TR}{G_1 + G_2} \cdot G_1 - c_1 G_1 \tag{4} \]

\[ B_2 = \frac{TR}{G_1 + G_2} \cdot G_2 - c_2 G_2 \tag{5} \]

where \(G_1\) and \(G_2\) are the annual electricity generation by hydropower plants 1 and 2, respectively; \(TR\) is the total hydroelectricity generation revenue in the shared market, determined using Fig. 2; and \(c_1\) and \(c_2\) are the unit hydroelectricity generation costs for reservoirs 1 and 2, respectively. The ratio \(TR/(G_1+G_2)\) in the first term of Equations 4 and 5 is the average annual price of electricity in the shared market. Here, \(c_1\) and \(c_2\) are assumed to be 2.1 and 2.3, respectively.

The numerical example is solved using the proposed N-RL method. First, the problem is solved using the 2-agent RL algorithm in the non-cooperative mode [21] to find the non-cooperative Nash equilibrium. In this case, both operators try to maximize their generation by passing all available water through the turbines, leading to a profit of $9.75 million for operator 1 and $4.50 million for operator 2. After the non-cooperative utilities of the players are determined, the N-RL method is applied to determine the operation policies that lead to increased benefits for both players based on the Nash bargaining solution. Coordination of hydroelectricity production strategies results in a profit of $10.83 million for operator 1 and $5.72 million for operator 2. In the cooperative case, the operators reduce the total annual electricity generation, and the resulting higher net benefit per unit of electricity more than compensates for the reduction in energy production.

To better understand the difference between the results of the Nash bargaining solution under the TU and NTU conditions, the game is also solved with the TU assumption, as done by [21]. In this case, the cooperative benefits are calculated using the social planner solution, in which the total profit of the system is maximized. The social planner solution leads to a total profit of $17.14 million, with $15.01 million from generation by operator 1 and $2.13 million from generation by operator 2. Based on this solution, operator 2 has to significantly reduce its production due to its higher per-unit production cost. This solution will not be acceptable to player 2, as it results in a lower profit than in the non-cooperative case ($2.13 million versus $4.50 million). However, in a TU game, player 1 can compensate player 2 to make the cooperative solution feasible. The level of compensation can be determined using the Nash bargaining model (Equations 1-3) with profit values only. In this case, \(S\) is set equal to $17.14 million, the total obtainable benefit under cooperation. The Nash bargaining model allocates $11.20 million to player 1 and $5.94 million to player 2. Table II shows the shares of players 1 and 2 under different modes of cooperation and utility transferability assumptions. As expected, the players' profits are higher when utility is transferrable. Nevertheless, increased utilities can also be expected in NTU games. The proposed N-RL method can help detect operation policies that result in fair utility increases under cooperation without a need for side payments.

TABLE II. RESERVOIR PROFITS ($ MILLION) UNDER DIFFERENT MODES OF COOPERATION AND UTILITY TRANSFERABILITY ASSUMPTIONS IN NUMERICAL EXAMPLE I

| Method | Reservoir 1 | Reservoir 2 | Total |
|---|---|---|---|
| Non-Cooperative | 9.75 | 4.50 | 14.25 |
| Cooperative Nash (NTU) | 10.83 | 5.72 | 16.55 |
| Social Planner | 15.01 | 2.13 | 17.14 |
| Cooperative Nash (TU) | 11.20 | 5.94 | 17.14 |
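The four solution concepts compared in Example I can be illustrated on a toy version of the game. The sketch below is not the paper's implementation. It replaces the revenue curve of Fig. 2 with a hypothetical linear average-price curve (`INTERCEPT` and `SLOPE` are made-up values), takes generation equal to release, and uses exhaustive grid search and best-response iteration in place of Q-Learning. Its profits therefore do not reproduce Table II, although the qualitative pattern is the same. The last line applies the equal-surplus split implied by Equations 1-3 to the profit values reported in the text.

```python
import numpy as np

# Hypothetical linear average-price curve standing in for Fig. 2 (NOT the paper's data):
# price = INTERCEPT - SLOPE * (G1 + G2), so TR = price * (G1 + G2) is concave.
INTERCEPT, SLOPE = 12.0, 0.022
C1, C2 = 2.1, 2.3            # unit costs quoted in the text
G1_MAX, G2_MAX = 150, 100    # maximum releases from Table I; generation taken equal to release

g1 = np.arange(G1_MAX + 1, dtype=float)[:, None]   # operator 1 choices (rows)
g2 = np.arange(G2_MAX + 1, dtype=float)[None, :]   # operator 2 choices (columns)
price = INTERCEPT - SLOPE * (g1 + g2)               # plays the role of TR / (G1 + G2)
B1 = (price - C1) * g1                              # Equation 4 evaluated on the grid
B2 = (price - C2) * g2                              # Equation 5 evaluated on the grid


def noncooperative(B1, B2, max_iter=100):
    """Best-response iteration on the grid (a stand-in for the 2-agent RL step)."""
    i, j = 0, 0
    for _ in range(max_iter):
        i_new = int(np.argmax(B1[:, j]))       # operator 1 best response to j
        j_new = int(np.argmax(B2[i_new, :]))   # operator 2 best response to i_new
        if (i_new, j_new) == (i, j):
            break
        i, j = i_new, j_new
    return i, j


def nash_ntu(B1, B2, d1, d2):
    """Maximize the Nash product (Equation 1) subject to Equation 2, without transfers."""
    gain1, gain2 = B1 - d1, B2 - d2
    product = np.where((gain1 >= 0) & (gain2 >= 0), gain1 * gain2, -np.inf)
    return np.unravel_index(np.argmax(product), product.shape)


def social_planner(B1, B2):
    """Maximize total profit regardless of how it is distributed."""
    total = B1 + B2
    return np.unravel_index(np.argmax(total), total.shape)


def nash_tu_split(d1, d2, S):
    """Closed-form solution of Equations 1-3: each player gets half of the surplus."""
    half = (S - d1 - d2) / 2.0
    return d1 + half, d2 + half


nc = noncooperative(B1, B2)
d1, d2 = B1[nc], B2[nc]
ntu = nash_ntu(B1, B2, d1, d2)
sp = social_planner(B1, B2)
S = B1[sp] + B2[sp]
tu1, tu2 = nash_tu_split(d1, d2, S)

print("Non-cooperative      ", d1, d2)
print("Cooperative Nash NTU ", B1[ntu], B2[ntu])
print("Social planner       ", B1[sp], B2[sp])
print("Cooperative Nash TU  ", tu1, tu2)
print("TU split for the reported Example I profits:", nash_tu_split(9.75, 4.50, 17.14))
```

In this toy setting the social planner row leaves operator 2 worse off than in the non-cooperative row, which is the reason side payments are needed in the TU case and why the NTU solution must be searched for directly over the operators' actions.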

IV. NUMERICAL EXAMPLE II

For the second example, a more complicated reservoir operation problem is solved with the proposed method. In this problem, two dams in series generate hydroelectric energy for a single electricity market (Fig. 3). The properties of the reservoirs are presented in Table III. It is assumed that the reservoirs have similar hydrological conditions. The average annual inflows into reservoirs 1 and 2 are 2463 MCM and 1817 MCM, respectively. The monthly variation of inflows is shown in Fig. 4.

Fig. 3. Schematic of the hydroelectricity market sharing problem II.

TABLE III. STORAGE CAPACITY, MONTHLY MAXIMUM RELEASE AND MAXIMUM MONTHLY GENERATION OF RESERVOIRS IN PROBLEM II

| | Reservoir 1 | Reservoir 2 |
|---|---|---|
| Storage Capacity (MCM) | 1800 | 2200 |
| Monthly Maximum Release (MCM) | 600 | 1000 |
| Maximum Monthly Generation (MWh) | 210 | 280 |

Fig. 4. Monthly variation of inflows in problem II.

In this example, the per-unit energy production cost is assumed to be constant for the two reservoirs. Revenue, as a surrogate for the operators' profit [28-30], is calculated using the monthly hydroelectricity market revenue curves (Fig. 5), as explained in the previous example. Assuming that operators 1 and 2 generate \(G_1\) and \(G_2\) units of electricity in a given month, the monthly revenue of each operator is calculated using the following equation (i = 1, 2):

\[ B_i = \frac{TR}{G_1 + G_2} \cdot G_i \tag{6} \]

Fig. 5. Monthly Hydropower Generation Revenue Curves in the Shared Electricity Market (Problem II).

As in the previous example, to find the best coordination strategies based on the Nash bargaining solution (Equation 1), the non-cooperative benefits of the operators under their best response strategies first need to be calculated using the 2-agent RL algorithm [21]. The major difference between this example and the previous one is the number of time steps. In this example, the agents try to maximize the sum of their monthly revenues, and their release and storage decisions in earlier months can affect their revenues in later months. The problem is assumed to be deterministic, giving the two operating agents perfect foresight of future conditions. In this case, the annual revenues of the operators at the non-cooperative Nash equilibrium are $102.24 million and $131.12 million for operators 1 and 2, respectively.

After the non-cooperative shares are determined, the proposed N-RL method can be applied. Given that the operators consider the sum of the monthly revenues, Equation 1 takes the form of the Nash solution for linked games [17], where bargaining is over 12 linked games. In other words, utility in Equation 1 represents the total revenue over the whole year (12 months). The revenues in this case are $105.2 million and $140.3 million for operators 1 and 2, respectively.

To provide a basis for comparing the results under the TU and NTU cases, the maximum obtainable benefit of the system was also calculated using a social planner RL agent [21]. The total benefit of the system in this case is $249.1 million. The contributions of reservoirs 1 and 2 to the overall revenue are $92.2 million and $156.9 million, respectively. Reallocation of the social planning mode benefits using the Nash bargaining solution (Equation 1) results in revenues of $110.1 million for operator 1 and $139.0 million for operator 2. Table IV shows the shares of players 1 and 2 under different modes of cooperation and utility transferability assumptions. As in the previous example, the operators' revenues are higher when utility is transferrable. Nevertheless, increased utilities can also be achieved through coordination of strategies in NTU games.
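The multi-period structure of this example, in which early releases change the storage available later, is what makes a learning method such as Q-Learning attractive. The sketch below shows tabular Q-Learning on a deliberately small, single-operator reservoir problem. It is not the 2-agent algorithm of [21], and every number in it (`INFLOW`, `PRICE`, `CAPACITY`, `MAX_RELEASE`, `S0`) is hypothetical. Because the toy problem is deterministic, as Problem II is assumed to be, a learning rate of 1 with purely random exploration is used, and the learned policy is compared against an exact dynamic-programming benchmark.

```python
import numpy as np
from functools import lru_cache

# All numbers below are hypothetical; they are NOT the data of Problem II.
T = 4                            # number of periods ("months")
INFLOW = [2, 1, 0, 1]            # inflow per period, in storage units
PRICE = [1.0, 2.0, 3.0, 2.0]     # intercept of each period's marginal revenue
CAPACITY, MAX_RELEASE, S0 = 3, 2, 1


def revenue(t, r):
    """Concave revenue from releasing r units in period t."""
    return PRICE[t] * r - 0.5 * r * r


def feasible(t, s):
    """Releases allowed in period t when storage is s (0 up to the available water)."""
    return range(min(MAX_RELEASE, s + INFLOW[t]) + 1)


def step(t, s, r):
    """Mass balance with spill: storage at the start of the next period."""
    return min(CAPACITY, s + INFLOW[t] - r)


def q_learning(episodes=10000, alpha=1.0, gamma=1.0, seed=0):
    """Tabular Q-Learning with a uniformly random behaviour policy (off-policy)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((T, CAPACITY + 1, MAX_RELEASE + 1))
    for _ in range(episodes):
        s = S0
        for t in range(T):
            # Feasible actions are 0..k, so a random index is also a random action.
            a = int(rng.integers(len(feasible(t, s))))
            s_next = step(t, s, a)
            if t + 1 < T:
                future = max(Q[t + 1, s_next, b] for b in feasible(t + 1, s_next))
            else:
                future = 0.0     # no value beyond the final period
            Q[t, s, a] += alpha * (revenue(t, a) + gamma * future - Q[t, s, a])
            s = s_next
    return Q


def greedy_rollout(Q):
    """Follow the learned greedy policy from the initial storage."""
    s, total, releases = S0, 0.0, []
    for t in range(T):
        a = max(feasible(t, s), key=lambda b: Q[t, s, b])
        releases.append(a)
        total += revenue(t, a)
        s = step(t, s, a)
    return releases, total


@lru_cache(maxsize=None)
def optimum(t, s):
    """Exact benchmark by backward recursion (dynamic programming)."""
    if t == T:
        return 0.0
    return max(revenue(t, r) + optimum(t + 1, step(t, s, r)) for r in feasible(t, s))


Q = q_learning()
releases, total = greedy_rollout(Q)
print("learned releases:", releases, "total revenue:", total)
print("dynamic-programming optimum:", optimum(0, S0))
```

In a two-operator setting each agent's reward would also depend on the other agent's release through the shared revenue curve (Equation 6), which is the interaction the 2-agent algorithm of [21] is designed to handle.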

TABLE IV. RESERVOIR REVENUES ($ MILLION) UNDER DIFFERENT MODES OF COOPERATION AND UTILITY TRANSFERABILITY ASSUMPTIONS IN NUMERICAL EXAMPLE II

| Method | Reservoir 1 | Reservoir 2 | Total |
|---|---|---|---|
| Non-Cooperative | 102.2 | 131.1 | 233.4 |
| Cooperative Nash (NTU) | 105.2 | 140.3 | 245.5 |
| Social Planner | 92.2 | 156.9 | 249.1 |
| Cooperative Nash (TU) | 110.1 | 139.0 | 249.1 |

V. CONCLUSIONS

This paper developed a new method for applying the Nash bargaining solution to NTU games. The proposed method combines the Nash bargaining solution with reinforcement learning to facilitate finding the best response strategies of the players in the non-cooperative mode. Finding such strategies is normally complicated in multi-period complex games with nonlinear utility functions. Using two numerical examples, it was shown how the proposed N-RL method can be used to solve NTU benefit allocation problems. Although the benefits of the players are normally lower when utility is not transferrable, NTU cooperative solutions are easier to implement because of lower transaction costs and no need for side payments. Therefore, the proposed method can develop practical solutions for increasing the efficiency of complex multi-operator systems with minimal transaction costs.

REFERENCES

[1] K. Madani, "Game theory and water resources," J. Hydrol., vol. 381, no. 3-4, pp. 225-238, 2010.
[2] L. Fang, K.W. Hipel, and M. Kilgour, "The Graph Model approach to environmental conflict resolution," J. Environ. Manag., vol. 27, no. 2, pp. 195-212, 1988.
[3] E. Bonabeau, "Agent-based modeling: Methods and techniques for simulating human systems," Proceedings of the National Academy of Sciences of the United States of America, vol. 99, no. 3, pp. 7280-7287, 2002.
[4] K. Madani, and K.W. Hipel, "Non-cooperative stability definitions for strategic analysis of generic water resources conflicts," Water. Resour. Manag., vol. 25, no. 8, pp. 1949-1977, 2011.
[5] T.L. Ng, J.W. Eheart, X. Cai, and J.B. Braden, "An agent-based model of farmer decision-making and water quality impacts at the watershed scale under markets for carbon allowances and a second-generation biofuel crop," Water Resour. Res., vol. 47, no. 9, W09519, 2011.
[6] K. Madani, and A. Dinar, "Non-cooperative Institutions for Sustainable Management of Common Pool Resources: Application to groundwater," Ecol. Econ., vol. 74, pp. 34-45, 2012a.
[7] N. Nguyen, J. Shortle, P.M. Reed, and T. Nguyen, "Water Quality Trading with Asymmetric Information, Uncertainty, and Transaction Costs: A Stochastic Agent-based Simulation," Resour. Energy Econ., vol. 35, no. 1, pp. 60-90, 2013.
[8] M. Giuliani, and A. Castelletti, "Assessing the value of cooperation and information exchange in large water resources systems by agent-based optimization," Water Resour. Res., vol. 49, no. 7, pp. 3912-3926, 2013.
[9] E.M. Zechman, "Integrating evolution strategies and genetic algorithms with agent-based modeling for flushing a contaminated water distribution system," J. Hydroinform., vol. 15, no. 3, pp. 798-812, 2013.
[10] L. Read, K. Madani, and B. Inanloo, "Optimality versus stability in water resource allocation," J. Environ. Manage., vol. 133, pp. 343-354, 2014.
[11] L. Wang, L. Fang, and K.W. Hipel, "Basin-wide cooperative water resources allocation," Eur. J. Oper. Res., vol. 190, no. 3, pp. 798-817, 2008.
[12] K. Madani, and A. Dinar, "Cooperative Institutions for Sustainable Management of Common Pool Resources: Application to groundwater," Water Resour. Res., vol. 48, no. 9, W09553, 2012b.
[13] M. Sadegh, N. Mahjouri, and R. Kerachian, "Optimal Inter-Basin Water Allocation Using Crisp and Fuzzy Shapley Games," Water. Resour. Manag., vol. 24, no. 10, pp. 2291-2310, 2010.
[14] S. Asgary, A. Afshar, and K. Madani, "A Cooperative Game Theoretic Framework for Joint Resource Management in Construction," J. Constr. Eng. Manage., vol. 140, no. 3, 04013066, 2014.
[15] A. Ratner, and D. Yaron, "Regional cooperation in the use of irrigation water, efficiency and game theory analysis of income distribution," Agr. Econ., vol. 4, no. 1, pp. 45-58, 1990.
[16] A. Dinar, A. Ratner, and D. Yaron, "Evaluating cooperative game theory in water resources," Theor. Decis., vol. 32, no. 1, pp. 1-20, 1992.
[17] K. Madani, "Hydropower licensing and climate change: Insights from cooperative game theory," Adv. Water Resour., vol. 34, no. 2, pp. 174-183, 2011.
[18] A. Dinar, and G. Nigatu, "Distributional considerations of international water resources under externality: The case of Ethiopia, Sudan and Egypt on the Blue Nile," Water Resources and Economics, vol. 2, no. 3, pp. 1-16, 2013.
[19] J.F. Nash, "The bargaining problem," Econometrica, vol. 18, no. 2, pp. 155-162, 1950.
[20] R. Sutton, and A. Barto, "Reinforcement Learning: An Introduction," MIT Press, Cambridge, Mass, 2000.
[21] K. Madani, and M. Hooshyar, "A Game Theory-Reinforcement Learning (GT-RL) Method to Develop Optimal Operations Policies for Multi-Reservoir Multi-Operator Systems," J. Hydrol., accepted.
[22] J.F. Nash, "Non-cooperative games," Ann. Math., vol. 54, no. 2, pp. 286-295, 1951.
[23] C. Watkins, "Learning from delayed rewards," PhD Thesis, University of Cambridge, England, 1989.
[24] C. Watkins, and P. Dayan, "Q-Learning," Mach. Learn., vol. 8, pp. 279-292, 1992.
[25] J.H. Lee, and J.W. Labadie, "Stochastic Optimization of Multireservoir Systems via Reinforcement Learning," Water Resour. Res., vol. 43, no. 11, W11408, 2007.
[26] M. Mahootchi, "Storage System Management Using Reinforcement Learning Techniques and Nonlinear Models," PhD Dissertation, University of Waterloo, Canada, 2009.
[27] A. Castelletti, S. Galelli, and M. Restelli, "Tree-Based Reinforcement Learning for Optimal Water Reservoir Operation," Water Resour. Res., vol. 46, no. 6, W09507, 2010.
[28] K. Madani, and J.R. Lund, "Modeling California's high-elevation hydropower systems in energy units," Water Resour. Res., vol. 45, no. 9, W09413, 2009.
[29] M. Guégan, C.B. Uvo, and K. Madani, "Developing a module for estimating climate warming effects on hydropower pricing in California," Energ. Policy, vol. 42, pp. 261-271, 2012.
[30] K. Madani, M. Guégan, and C.B. Uvo, "Climate change impacts on high-elevation hydroelectricity in California," J. Hydrol., vol. 510, pp. 153-163, 2014.