The Iterated Prisoners’ Dilemma 20 Years On
ADVANCES IN NATURAL COMPUTATION Series Editor:
Xin Yao (University of Birmingham, UK)
Assoc. Editors: Hans-Paul Schwefel (University of Dortmund, Germany) Byoung-Tak Zhang (Seoul National University, South Korea) Martyn Amos (University of Liverpool, UK)
Published Vol. 1:
Applications of Multi-Objective Evolutionary Algorithms Eds: Carlos A. Coello Coello (CINVESTAV-IPN, Mexico) and Gary B. Lamont (Air Force Institute of Technology, USA)
Vol. 2:
Recent Advances in Simulated Evolution and Learning Eds: Kay Chen Tan (National University of Singapore, Singapore), Meng Hiot Lim (Nanyang Technological University, Singapore), Xin Yao (University of Birmingham, UK) and Lipo Wang (Nanyang Technological University, Singapore)
Vol. 3:
Recent Advances in Artificial Life Eds: H. A. Abbass (University of New South Wales, Australia), T. Bossomaier (Charles Sturt University, Australia) and J. Wiles (The University of Queensland, Australia)
Vol. 4:
The Iterated Prisoners’ Dilemma Eds: Graham Kendall (The University of Nottingham, UK) Xin Yao (The University of Birmingham, UK)
Advances in Natural Computation, Vol. 4
The Iterated Prisoners’ Dilemma 20 Years On
Graham Kendall The University of Nottingham, UK
Xin Yao Siang Yew Chong The University of Birmingham, UK
World Scientific
NEW JERSEY • LONDON • SINGAPORE • BEIJING • SHANGHAI • HONG KONG • TAIPEI • CHENNAI
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
Advances in Natural Computation — Vol. 4 THE ITERATED PRISONERS’ DILEMMA 20 Years On Copyright © 2007 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN-13 978-981-270-697-3 ISBN-10 981-270-697-6
Printed in Singapore.
Contents

List of Contributors

Chapter 1
The Iterated Prisoner’s Dilemma: 20 Years On
Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei Li and Xin Yao

Chapter 2
Iterated Prisoner’s Dilemma and Evolutionary Game Theory
Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei Li and Xin Yao

Chapter 3
Learning IPD Strategies Through Co-evolution
Siang Yew Chong, Jan Humble, Graham Kendall, Jiawei Li and Xin Yao

Chapter 4
How to Design a Strategy to Win an IPD Tournament
Jiawei Li

Chapter 5
An Immune Adaptive Agent for the Iterated Prisoner’s Dilemma
Oscar Alonso and Fernando Niño

Chapter 6
Exponential Smoothed Tit-for-Tat
Michael Filzmoser

Chapter 7
Opponent Modelling, Evolution, and The Iterated Prisoner’s Dilemma
Philip Hingston, Dan Dyer, Luigi Barone, Tim French and Graham Kendall

Chapter 8
On Some Winning Strategies for the Iterated Prisoner’s Dilemma
Wolfgang Slany and Wolfgang Kienreich

Chapter 9
Error-Correcting Codes for Team Coordination within a Noisy Iterated Prisoner’s Dilemma Tournament
Alex Rogers, Rajdeep K. Dash, Sarvapali D. Ramchurn, Perukrishnen Vytelingum and Nicholas R. Jennings

Chapter 10
Is it Accidental or Intentional? A Symbolic Approach to the Noisy Iterated Prisoner’s Dilemma
Tsz-Chiu Au and Dana Nau
List of Contributors

Oscar Alonso, Computer Systems and Industrial Engineering Department, National University of Colombia, Bogota Colombia
Email: [email protected]

Tsz-Chiu Au, Department of Computer Science and Institute for Systems Research, University of Maryland, College Park, MD 20742 USA
Email: [email protected]

Luigi Barone, Department of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway, Crawley, WA, 6009 Australia
Email: [email protected]

Siang Yew Chong, School of Computer Science, University of Birmingham, Birmingham, B15 2TT UK
Email: [email protected]

Rajdeep K. Dash, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ UK
Email: [email protected]

Dan Dyer, Department of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway, Crawley, WA, 6009 Australia
Email: [email protected]

Michael Filzmoser, School of Business Administration, Economics, and Statistics, University of Vienna, Vienna, A-1210 Austria
Email: [email protected]

Tim French, Department of Computer Science and Software Engineering, The University of Western Australia, 35 Stirling Highway, Crawley, WA, 6009 Australia
Email: [email protected]

Philip Hingston, School of Computer and Information Science, Edith Cowan University - Mt Lawley Campus, 2 Bradford Street, Mt Lawley, WA 6050 Australia
Email: [email protected]

Jan Humble, School of Computer Science and Information Technology, University of Nottingham, Nottingham, NG8 1BB UK
Email: [email protected]

Nicholas R. Jennings, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK
Email: [email protected]

Graham Kendall, School of Computer Science and Information Technology, University of Nottingham, Nottingham, NG8 1BB UK
Email: [email protected]

Wolfgang Kienreich, Know-Center, Inffeldgasse 21a/II, 8010 Graz Austria
Email: [email protected]

Jiawei Li, Robot Institute, Harbin Institute of Technology, Heilongjiang, 150001, P. R. China
Email: lijiawei [email protected]

Dana Nau, Department of Computer Science and Institute for Systems Research, University of Maryland, College Park, MD 20742 USA
Email: [email protected]

Fernando Niño, Computer Systems and Industrial Engineering Department, National University of Colombia, Bogota Colombia
Email: [email protected]

Sarvapali D. Ramchurn, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ UK
Email: [email protected]

Alex Rogers, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ UK
Email: [email protected]

Wolfgang Slany, Institut für Softwaretechnologie, Inffeldgasse 16b/II, TU Graz, A-8010 Graz Austria
Email: [email protected]

Perukrishnen Vytelingum, Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ UK
Email: [email protected]

Xin Yao, School of Computer Science, University of Birmingham, Birmingham, B15 2TT UK
Email: [email protected]

Chapter 1
The Iterated Prisoner’s Dilemma: 20 Years On
Siang Yew Chong¹, Jan Humble², Graham Kendall², Jiawei Li²,³, Xin Yao¹
¹University of Birmingham, ²University of Nottingham, ³Harbin Institute of Technology
1.1. Introduction

In 1984, Robert Axelrod reported the results of two iterated prisoner’s dilemma (IPD) competitions [Axelrod (1984)]. The book was to be a catalyst for much of the research in this area since that time. It is unlikely that you would write a scientific paper about IPD without citing Axelrod’s 1984 book. The book is even more remarkable in that it is accessible to a general audience as well as being an important source of inspiration for the scientific community.

In 2001, whilst attending the Congress on Evolutionary Computation (CEC) conference, we were discussing some of the presentations we had seen which reported some of the latest work on the iterated prisoner’s dilemma. We were paying tribute to the fact that Axelrod’s book had stood the test of time when somebody made a casual comment suggesting that we should re-run the competition in 2004, to celebrate the 20th anniversary. And so this book was born.

Of course, between the conversation in Hawaii and the publication of this book, a lot of people have done a lot of work. Not least of all Robert Axelrod, who was good enough to give up his time to present a plenary talk at the CEC conference in 2004. At that talk he presented his latest work, which investigates evolution on a grid-based world.

We owe a debt of thanks to the UK’s EPSRC (Engineering and Physical Sciences Research Council). This is the largest of the UK research councils that fund research in the UK. When we returned from Hawaii, we submitted a proposal,[a] which requested a small amount of funds (£23,718) in order to re-run, and extend, the competitions that Axelrod had run 20 years earlier. The funds we received from EPSRC allowed us to run two competitions, one in 2004 and one in 2005.

The entrants to the competitions were invited to submit a chapter for consideration in this book. These chapters underwent a peer review process (see later in this chapter for an acknowledgement of the reviewers) and those chapters that were successful form the latter part of this book. As editors, we feel fortunate to have several winning, second and third place entries reported in this book. This affords the reader the opportunity to learn, first hand from the authors, what made these strategies so successful and, perhaps, to use some of the ideas and innovations in their own strategies for future competitions.
[a] The EPSRC grant reference numbers are GR/S63465/01 and GR/S63472/01.

1.2. Iterated Prisoner’s Dilemma

Almost every chapter in this book has its own description of the iterated prisoner’s dilemma. As each chapter can be read in isolation, and for completeness, we present our own interpretation of the IPD here, along with a short review of some of the important work in the area.

The prisoner’s dilemma (PD) and iterated prisoner’s dilemma (IPD) have been a rich source of research material since the 1950s. However, the publication of Axelrod’s book [Axelrod (1984)] in the 1980s was largely responsible for bringing this research to the attention of other areas, outside of game theory, including evolutionary computing, evolutionary biology, networked computer systems and promoting cooperation between opposing countries [Goldstein (1991); Fogel (1993); Axelrod and D’Ambrosio (1995)]. Despite the large literature base that now exists (see, for example, [Poundstone (1992); Boyd and Lorberbaum (1987); Maynard Smith (1982); Davis (1997); Axelrod (1997)]), this is an on-going area of research, with Darwen and Yao [Darwen and Yao (1995, 2001); Yao and Darwen (1999)] carrying out some recent work. Their 2001 work [Darwen and Yao (2001)] extends the prisoner’s dilemma by offering more choices, other than simply “cooperate” or “defect,” and by providing indirect interactions (reputation).

When you play the prisoner’s dilemma you have to decide whether to cooperate with an opponent, or defect. Both you and your opponent make a choice and then your decisions are revealed. You receive a payoff according to the following matrix (where the top line of each cell is the payoff to the column player).
                 Cooperate    Defect
Cooperate        R = 3        T = 5
                 R = 3        S = 0
Defect           S = 0        P = 1
                 T = 5        P = 1
• R is a Reward for mutual cooperation. Therefore, if both players cooperate then both receive a reward of 3 points.
• If one player defects and the other cooperates then one player receives the Temptation to defect payoff (5 in this case) and the other player (the cooperator) receives the Sucker payoff (zero in this case).
• If both players defect then they both receive the Punishment for mutual defection payoff (1 in this case).

The question arises: what should you do in such a game?

• Suppose you think the other player will cooperate. If you cooperate then you will receive a payoff of 3 for mutual cooperation. If you defect then you will receive the Temptation to defect payoff of 5. Therefore, if you think the other player will cooperate then you should defect, to give you a payoff of 5.
• But what if you think the other player will defect? If you cooperate, then you get the Sucker payoff of zero. If you defect then you would both receive the Punishment for mutual defection of 1 point. Therefore, if you think the other player will defect, you should defect as well.

So, you should defect, no matter what option your opponent chooses. Of course, the same logic holds for your opponent. And, if you both defect you receive a payoff of 1 each, whereas the better outcome would have been mutual cooperation with a payoff of 3. The payoff for an individual is less than that which could have been achieved by two cooperating players; thus the dilemma, and the research challenge of finding strategies that promote mutual cooperation.

In defining a prisoner’s dilemma, certain conditions have to hold. The values we used above, to demonstrate the game, are not the only values that could have been used, but they do have to adhere to the conditions listed below.

Firstly, the order of the payoffs is important. The best a player can do is T (temptation to defect). The worst a player can do is to get the sucker payoff, S. If the two players cooperate then the reward for that mutual cooperation, R, should be better than the punishment for mutual defection, P. Therefore, the following must hold.

T > R > P > S.    (1.1)

Secondly, players should not be allowed to get out of the dilemma by taking it in turns to exploit each other. Or, to be a little more precise, the players should not play the game so that they end up with half the time being exploited and the other half of the time exploiting their opponent. In other words, an even chance of being exploited or doing the exploiting is not as good an outcome as both players mutually cooperating. Therefore, the reward for mutual cooperation should be greater than the average of the payoff for the temptation and the sucker. That is, the following must hold.

R > (S + T)/2.    (1.2)
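The two conditions, and the argument that defection is the best reply to either choice, can be checked mechanically. The following is a minimal Python sketch using the payoff values from the matrix above; the names `PAYOFF`, `is_prisoners_dilemma` and `best_reply` are illustrative only and are not taken from the competition software.

```python
# Payoff values from the matrix in the text (T, R, P, S).
T, R, P, S = 5, 3, 1, 0

# PAYOFF[(my_move, their_move)] is my payoff; "C" = cooperate, "D" = defect.
PAYOFF = {
    ("C", "C"): R,  # mutual cooperation
    ("C", "D"): S,  # I cooperate, opponent defects: sucker payoff
    ("D", "C"): T,  # I defect, opponent cooperates: temptation payoff
    ("D", "D"): P,  # mutual defection
}


def is_prisoners_dilemma(t, r, p, s):
    """Check conditions (1.1) and (1.2) for a candidate set of payoffs."""
    return t > r > p > s and r > (s + t) / 2


def best_reply(their_move):
    """Return the move that maximises my one-shot payoff against their_move."""
    return max("CD", key=lambda mine: PAYOFF[(mine, their_move)])


if __name__ == "__main__":
    print(is_prisoners_dilemma(T, R, P, S))
    print(best_reply("C"), best_reply("D"))
```

Under these payoffs the best reply is to defect whichever move the opponent makes, which is the one-shot argument given above. Changing R to 2 would keep the ordering (1.1) but violate (1.2), since taking turns to exploit each other would then pay 2.5 on average.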
Playing a “one-shot” prisoner’s dilemma, it is not difficult to decide which strategy to adopt, but the question arises: can cooperation evolve from playing the game over and over again, against the same opponent? If you know how many times you are to play, then there is an argument that the game is exactly the same as playing the “one-shot” prisoner’s dilemma. This is based on the observation that you will defect on the last iteration, as that is the sensible thing to do when you are, in effect, playing a single iteration. Knowing this, it is sensible to defect on the second to last one as well; and this logic can be applied all the way back to the first iteration. However, this reasoning cannot be used when the number of iterations is infinite, as you know there is always another iteration. In practice, this translates to not knowing when the game will end.

Experiments using human players [Scodel (1962, 1963); Minas et al. (1960); Scodel and Philburn (1959); Scodel et al. (1959); Scodel et al. (1960)] showed that they generally did not cooperate, even when it should have been obvious that the other person would cooperate as long as they did. It has been a long term aim to find strategies which cause players to cooperate. If players would only cooperate then their payoff, over an indefinite number of games, could be maximised, rather than tending towards defection and hoping the other player would cooperate.

In 1979 Axelrod organised a prisoner’s dilemma competition and invited game theorists to submit their strategies [Axelrod (1980a)]. Fourteen entries were received, with an extra one being added (defect or cooperate with equal probability). Each strategy played against every other strategy, and against itself. The winner was Anatol Rapoport, who submitted the simple strategy Tit-for-Tat, which cooperates on the first move and then does whatever the opponent did on the previous move. In a second tournament [Axelrod (1980b)], 62 entries were received but, again, the winner was Tit-for-Tat. These two competitions formed the basis of his important book [Axelrod (1984)].

The prisoner’s dilemma has a modern day version in the form of the TV show “Shafted”, a game show recently screened on terrestrial TV in the UK (note that this show is not a true prisoner’s dilemma as defined by Rapoport [Rapoport (1996)], but it does demonstrate that the ideas have wider applicability). At the end of the show two contestants have accumulated a sum of money and they have to decide whether to share the money or to try to get all the money for themselves. Their decision is made without the knowledge of what the other person has decided to do. If both contestants cooperate then they share the money. If they both defect then they both receive nothing. If one cooperates and the other defects, the one that defected gets all the money and the contestant that cooperated gets nothing.

Although the prisoner’s dilemma, in the context of game theory, has been an active research area for at least 50 years [Scodel (1962); Scodel (1963); Minas et al. (1960); Scodel and Philburn (1959); Scodel et al. (1959); Scodel et al. (1960)] (it can be traced back to von Neumann and Morgenstern [von Neumann and Morgenstern (1944)] and, of course, John Nash [Nash (1950, 1953)]), it is still an active research area with, among other research aims, researchers trying to evolve strategies [O’Riordan (2000)] that promote cooperation.
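The tournament format described above (every strategy meets every other strategy and itself) can be sketched in a few lines of Python. This is an illustration only, not the code used by Axelrod or by the 2004 and 2005 competitions. The strategy signatures, the fixed number of rounds and the choice to count a self-play score once are all assumptions made for the example.

```python
# Payoffs as (payoff to first player, payoff to second player).
PAYOFF = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}


def tit_for_tat(own_history, opp_history):
    """Cooperate first, then repeat the opponent's previous move."""
    return "C" if not opp_history else opp_history[-1]


def always_defect(own_history, opp_history):
    return "D"


def always_cooperate(own_history, opp_history):
    return "C"


def play_match(strat_a, strat_b, rounds):
    """Play a fixed number of rounds and return both total scores."""
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strat_a(hist_a, hist_b)
        move_b = strat_b(hist_b, hist_a)
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b


def round_robin(strategies, rounds):
    """Each strategy meets every other strategy once, and itself once."""
    totals = {name: 0 for name in strategies}
    names = list(strategies)
    for i, a in enumerate(names):
        for b in names[i:]:
            score_a, score_b = play_match(strategies[a], strategies[b], rounds)
            totals[a] += score_a
            if a != b:  # assumption: a self-play score is counted once
                totals[b] += score_b
    return totals


if __name__ == "__main__":
    pool = {"TFT": tit_for_tat, "ALLD": always_defect, "ALLC": always_cooperate}
    print(round_robin(pool, rounds=10))
```

The ranking produced by such a sketch depends heavily on which strategies are in the pool, so a three-strategy example should not be expected to reproduce the outcome of Axelrod’s tournaments.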
Recent research has also considered the prisoner’s dilemma where there are more than two choices and more than two players. Darwen and Yao have shown that offering more choices leads to less cooperation [Darwen and Yao (2001)], although reputation may help [Darwen and Yao (2002); Yao and Darwen (1999)]. Birk [Birk (1999)] used a multi-player IPD. His model had continuous degrees of cooperation (as opposed to the binary choice of cooperate or defect). He used a robotic environment and showed that a justified-snobism strategy, which tries to cooperate slightly more than the average, is successful and evolutionarily stable (that is, it cannot be invaded by another strategy). O’Riordan and Bradish (2000) also simulated a multi-player game where the players are involved in many types of games. They show that cooperation can emerge in a high percentage of 2-player games.

As well as the academic papers on the subject, there are many books devoted to game theory and/or the prisoner’s dilemma. The 1997 book by Axelrod [Axelrod (1997)] reproduces a range of his papers (with commentary), ranging from 1986 through to 1997. The papers consider areas such as promoting cooperation using a genetic algorithm, coping with noise and promoting norms.
1.3. Contents of the Book

This book does not have to be read from cover to cover. Each chapter can be read independently, with most of the chapters describing the IPD. This was a conscious decision by the editors, as we realised that the book would be dipped into and we did not want to make any chapter dependent on any other. Also, each chapter has its own set of references, rather than having one complete list of references at the end of the book. The book is structured as follows.

Chapter 1

This chapter provides a general introduction to the book. In keeping with the rest of the book, we also briefly describe the IPD, as well as briefly describing each chapter. This chapter also presents the results of the two competitions that we ran in 2004 and 2005.

Chapter 2

Chapter 2 (“Iterated Prisoner’s Dilemma and Evolutionary Game Theory”) reviews some of the important work in IPD, with particular emphasis (in the latter part of the chapter) on evolutionary game theory. The chapter contains over 250 references, which we hope will be a good starting point for other researchers who are looking to start work in this area. We have concentrated on the evolutionary aspects of IPD for two reasons. Firstly, this seemed to be an area that was exploited in the entries we received. Secondly, the literature on IPD is truly vast (perhaps only exceeded by the literature on the traveling salesman problem), so we had to draw some boundaries and, given the close links that this competition had with the Congress on Evolutionary Computation, it seemed appropriate to report on the evolutionary aspects of IPD. We apologise to any authors who feel their work should have been included in this chapter. We hope you understand that we simply could not list every paper. However, if you would like to drop us an email, we would be happy to consider the inclusion of the reference in any later editions.

Chapter 3

Chapter 3 (“Learning IPD Strategies Through Co-evolution”) reviews another area of IPD that has received scientific interest in recent years: that of co-evolution. This chapter also discusses an extension to the classic IPD formulation, in which there are more than two players and they have more than two choices. Similar to chapter 2, there is an extensive list of references for the interested reader.

Chapter 4

This chapter reports the winning strategy from competition 4, from the event held in 2005. This competition mimics the original ones held by Axelrod. Only one entry was allowed per person, to stop the cooperating strategies that had dominated the first competition. Although we believe that having cooperating strategies is a valid tactic, some competitors felt that this did not truly mimic the original competitions. For this reason we introduced an additional competition for the 2005 event. The result was a win for Jiawei Li, who details his winning strategy in chapter 4, which is entitled “How to Design a Strategy to Win an IPD Tournament”.

Chapter 5

The strategy in this chapter attempts to model its opponent using an artificial immune system. It is interesting to see how relatively new methodologies are being used for problems such as IPD, demonstrating that there is a continuous flow of new ideas which might just be shown to be superior to all other methods so far. Whilst not appearing in the top ten of any of the competitions that it entered, it does present an exciting new research direction for IPD tournaments.
Chapter 6

Michael Filzmoser reports on a variation of tit-for-tat, which he calls Exponential Smoothed Tit-for-Tat. Whereas tit-for-tat only considers the last move of the opponent, exponential smoothed tit-for-tat considers the complete history of the opponent. This discussion is extended to IPD with noise, as well as the more common IPD, where the actions by the player are reliably reported.

Chapter 7

In chapter 7 (“Opponent Modelling, Evolution, and The Iterated Prisoner’s Dilemma”), the authors explore the idea of modelling an opponent. Their strategy does this by playing tit-for-tat for the first 50 moves, whilst trying to model the moves played by the opponent. After 50 moves, subsequent moves are based on the model that has been built. It is interesting to compare this strategy (which came 3rd in competition 4 in 2005) with the strategy described in chapter 4, which also uses a type of modelling but over a shorter time period. Perhaps this explains why the chapter 4 strategy was able to achieve better payoffs, as it was able to exploit opponents much earlier in the game?

Chapter 8

The strategies reported in this chapter were entered in both the 2004 and 2005 events, and performed well in many of the competitions, winning competition 1 in the 2005 event. This chapter, more than any other, touches on the debate about cooperating strategies, which is why we introduced competition 4 in the 2005 event. If you followed the discussion at the time, many entrants (with some justification) questioned whether allowing multiple strategies from one person was in the spirit of the original Axelrod competitions. Whilst we agreed with this, and so introduced a single entry rule in 2005, we also argue that these competitions were about the research that was being carried out, and some of the chapters in this book report on those results. Of course, as the authors of chapter 8 admit, there are still ways of flouting the rules by submitting cooperating entries under different names. We hope that the other entrants will accept this in the spirit of research under which this was done. As the authors point out, the organisers failed to recognise that cooperating strategies had been submitted, but, as they also say, this is a theoretically difficult problem. We would also like to take this opportunity to apologise to the authors of chapter 8 for missing their OTFT strategy from some of the competitions. It is still unclear to us why this happened.

Chapter 9

A team from Southampton, who took the first three places in competition 1 in the 2004 event, present chapter 9. Their chapter is an excellent example of how strategies can cooperate. As strategies have no mechanism to interact directly, the only way to recognise one of your collaborators is to somehow communicate through the defect/cooperate choices that you make.

Chapter 10

One of the competitions that we ran included noise, with some low probability. By noise, we mean that a defect or cooperate signal might be misinterpreted. This final chapter by Tsz-Chiu Au and Dana Nau explores this issue using a strategy they call Derived Belief Strategy. It attempts to model the opponent and then judge whether a choice has been affected by noise. The strategy performed very well in the competition, even when up against strategies which were cooperating.
1.4. Celebrating the 20th Anniversary: The Competitions

We ran two events. The first was held during the Congress on Evolutionary Computation in 2004 (June 19-23, Portland, Oregon, USA) and the second at the Computational Intelligence and Games Conference in 2005 (April 4-6, Essex, UK). At the 2004 event we ran three competitions, with an additional competition being held in 2005.

(1) The first competition aimed to emulate the original Axelrod competition. We received some enquiries about whether multiple entries were allowed. As we had not stated this as a restriction, we allowed it (but did state we had the right to limit the number, else running the competition might become intractable). At the time, we did not realise the controversy that this decision would cause, which is why we modified the competitions in the 2005 event.
(2) The second competition had noise in it. Each decision had a 0.1 probability of being misinterpreted.
(3) The third competition allowed competitors to submit a strategy to an IPD that has more than one player and more than one payoff, that is, multi-player and multi-choice.
(4) The fourth competition (which was only run in 2005) emulated the original Axelrod competition. The definition was exactly the same as competition 1, but we only allowed one entry per person.

The payoff table we used for competitions 1, 2 and 4 is shown in Table 1.1. The payoff table for competition 3 is shown in Table 1.2.

Table 1.1. Payoff table for all IPD competitions except for the IPD with multiple players and multiple choices. In each cell, the top line is the payoff to the column player.

                 Cooperate    Defect
Cooperate        R = 3        T = 5
                 R = 3        S = 0
Defect           S = 0        P = 1
                 T = 5        P = 1
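To illustrate what a 0.1 probability of misinterpretation can do, the sketch below lets two tit-for-tat players observe each other through a noisy channel. It is a hypothetical model written for this illustration: the text does not say exactly how the competition software applied noise, so the assumption here is that each player’s observation of the other’s move is flipped independently with the given probability. The names `observe` and `tft_vs_tft` are invented for the example.

```python
import random

NOISE = 0.1  # probability of misinterpretation stated for competition 2


def flip(move):
    return "D" if move == "C" else "C"


def observe(move, rng, noise=NOISE):
    """Return the move as seen by the other player (assumed noise model)."""
    return flip(move) if rng.random() < noise else move


def tft_vs_tft(rounds, rng, noise=NOISE):
    """Two tit-for-tat players with noisy observation.

    Returns the number of rounds in which both actually cooperated.
    """
    seen_by_a, seen_by_b = [], []  # what each player believes the other did
    mutual_cooperation = 0
    for _ in range(rounds):
        move_a = seen_by_a[-1] if seen_by_a else "C"
        move_b = seen_by_b[-1] if seen_by_b else "C"
        if move_a == "C" and move_b == "C":
            mutual_cooperation += 1
        seen_by_a.append(observe(move_b, rng, noise))
        seen_by_b.append(observe(move_a, rng, noise))
    return mutual_cooperation


if __name__ == "__main__":
    rng = random.Random(2004)  # fixed seed so a run can be repeated
    print(tft_vs_tft(200, rng))
```

Without noise the two players cooperate in every round. With noise, a single misread move can start a run of retaliation, which is why strategies for the noisy competition try to judge whether a defection was intended.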
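Table 1.2. Payoff table for IPD competition with multiple players and multiple payoffs. Rows give Player A’s level of cooperation and columns give Player B’s level of cooperation; each entry is the payoff to Player A.

Player A level | Player B level of cooperation
               |   1   |  3/4  |  1/2  |  1/4  |   0
1              |   4   |   3   |   2   |   1   |   0
3/4            | 4 1/4 | 3 1/4 | 2 1/4 | 1 1/4 |  1/4
1/2            | 4 1/2 | 3 1/2 | 2 1/2 | 1 1/2 |  1/2
1/4            | 4 3/4 | 3 3/4 | 2 3/4 | 1 3/4 |  3/4
0              |   5   |   4   |   3   |   2   |   1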
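The entries of Table 1.2 are consistent with a simple linear rule: the payoff to Player A is 1 + 4b - a, where a and b are the levels of cooperation of Players A and B. This closed form is inferred here from the table values and is not stated in the text, so it should be treated as an assumption. A short Python sketch regenerates the grid from it:

```python
from fractions import Fraction

# Cooperation levels used in Table 1.2, from full cooperation to full defection.
LEVELS = [Fraction(1), Fraction(3, 4), Fraction(1, 2), Fraction(1, 4), Fraction(0)]


def payoff_a(level_a, level_b):
    """Payoff to Player A (assumed linear form, inferred from the table)."""
    return 1 + 4 * level_b - level_a


# Rows indexed by Player A's level, columns by Player B's level.
TABLE = [[payoff_a(a, b) for b in LEVELS] for a in LEVELS]

if __name__ == "__main__":
    for a, row in zip(LEVELS, TABLE):
        print(str(a), [str(value) for value in row])
```

The four corners of the grid give a two-choice game with T = 5, R = 4, P = 1 and S = 0, which satisfies conditions (1.1) and (1.2).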
To support the competitions, we developed a software framework. This is discussed in the Appendix, and a URL is supplied so that the software can be downloaded.

1.5. Competition Results

In the following tables we present the top ten entries from each of the competitions. The full listings of the results can be seen at http://www.prisoners-dilemma.com. Also available on the web site is a log containing all the interactions that took place.
Table 1.3. Results from 2004 event, competition 1. There were 223 entries (19 web based entries, 195 java based entries and 9 standard entries (RAND, NEG, ALLC, ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank | Player | Strategy | Won | Drawn | Lost | Total Points
1 | Gopal Ramchurn | StarSN (StarSN) | 105 | 21 | 98 | 117,057
2 | Gopal Ramchurn | StarS (StarS) | 113 | 48 | 63 | 110,611
3 | Gopal Ramchurn | StarSL (StarSL) | 115 | 46 | 63 | 110,511
4 | GRIM (GRIM Trigger) 1 | GRIM (GRIM Trigger) | 120 | 76 | 28 | 100,611
5 | Wolfgang Kienreich | OTFT (Omega tit for tat) | 90 | 70 | 64 | 100,604
6 | Wolfgang Kienreich | ADEPT (ADEPT Strategy) | 95 | 72 | 57 | 96,291
7 | Wolfgang Kienreich | EMP (Emperor) | 90 | 73 | 61 | 95,927
8 | Emp 1 Bingzhong Wang | () | 31 | 94 | 99 | 94,161
9 | Hannes Payer | PRobbary (PRobbary Historylength 2) | 95 | 75 | 54 | 94,123
10 | Nanlin Jin | HCO (HCO) | 27 | 95 | 102 | 93,953
Table 1.4. Results from 2004 event, competition 2. There were 223 entries (19 web based entries, 195 java based entries and 9 standard entries (RAND, NEG, ALLC, ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank | Player | Strategy | Won | Drawn | Lost | Total Points
1 | Gopal Ramchurn | StarSN (StarSN) | 42 | 2 | 180 | 93,962
2 | Colm O’Riordan | Mem1 (Mem1) | 5 | 1 | 218 | 83,049
3 | Gopal Ramchurn | CoordinateCDCSIAN (CoordinateCDCSIAN) | 158 | 6 | 60 | 83,015
4 | Gopal Ramchurn | PoorD (PoorD) | 190 | 7 | 27 | 82,890
5 | Wolfgang Kienreich | OTFT (Omega tit for tat) | 158 | 8 | 58 | 82,838
6 | Wayne Davis | ltft (ltft) | 66 | 8 | 150 | 82,765
7 | GRIM (GRIM Trigger) 1 | GRIM (GRIM Trigger) | 184 | 7 | 33 | 82,591
8 | Gopal Ramchurn | MooD (MooD) | 193 | 3 | 28 | 82,578
9 | Gopal Ramchurn | AITFT (AITFT) | 60 | 9 | 155 | 82,504
10 | Gopal Ramchurn | GSTFT (GSTFT) | 64 | 9 | 151 | 82,502
Table 1.5. Results from 2004 event, competition 3. There were 15 entries. Note that there is only one round in this competition.

Rank | Player | Strategy | Total Points
1 | Gopal Ramchurn | AgentSoton (SOTON AGENT) | 3,756
2 | Gopal Ramchurn | HarshTFT (HarshTFT) | 3,756
3 | Deirdre Murrihy | PCurvepower1Memory2 (Penalty Curve of 1 using opponent’s previous 2 moves) | 3,738
4 | Deirdre Murrihy | PCurvepower2Memory2 (Penalty Curve of 2 using opponent’s previous 2 moves) | 3,738
5 | Deirdre Murrihy | PCurvepower0.5Memory2 (Penalty Curve of 0.5 using opponent’s previous 2 moves) | 3,738
6 | Enda Howley | PCurvepower2 (Penalty Curve of 2 using opponent’s previous move) | 3,738
7 | Enda Howley | PCurvepower1 (Penalty Curve of 1 using opponent’s previous move) | 3,738
8 | Enda Howley | PCurvepower0.5 (Penalty Curve of 0.5 using opponent’s previous move) | 3,738
9 | Wolfgang Kienreich | CNHM (CosaNostra Hitman) | 3,738
10 | Wolfgang Kienreich | CNHM (CosaNostra Hitman) | 3,738
Table 1.6. Results from 2005 event, competition 1. There were 192 entries (41 web based entries, 142 java based entries and 9 standard entries (RAND, NEG, ALLC, ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank | Player | Strategy | Won | Drawn | Lost | Total Points
1 | Wolfgang Kienreich | CNGF (CosaNostra Godfather) | 48 | 96 | 49 | 100,905
2 | Jia-wei Li | IMM01 (Intelligent Machine Master 01) | 46 | 112 | 35 | 98,922
3 | Carlos G. Tardon | CLAS- (CLAS-) | 23 | 95 | 75 | 92,174
4 | Perukrishnen Vytelingum | SWIN (Soton Agent RA Competition 1) | 61 | 44 | 88 | 90,918
5 | Constantin Ionescu | LORD (the lord strategy) | 20 | 102 | 71 | 87,617
6 | GRIM (GRIM Trigger) 1 | GRIM (GRIM Trigger) | 73 | 114 | 6 | 84,805
7 | Tsz-Chiu Au | LSF (Learning of opponent strategy with forgiveness) | 28 | 94 | 71 | 84,698
8 | Tsz-Chiu Au | DBStft (DBS with TFT) | 23 | 97 | 73 | 83,867
9 | Richard Brunauer | PRobberyL2 (PRobberyL2) | 14 | 98 | 81 | 83,837
10 | Carlos G. Tardon | CLAS2 (CLAS2) | 72 | 96 | 25 | 83,746
Table 1.7. Results from 2005 event, competition 2. There were 165 entries (26 web based entries, 130 java based entries and 9 standard entries (RAND, NEG, ALLC, ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

Rank | Player | Strategy | Won | Drawn | Lost | Total Points
1 | Perukrishnen Vytelingum | BWIN (S2Agent1 ZEUS Competition 2) | 85 | 1 | 80 | 73,330
2 | Jia-wei Li | IMM01 (Intelligent Machine Master 01) | 108 | 7 | 51 | 70,506
3 | Tsz-Chiu Au | DBSy (DBS (version y)) | 35 | 3 | 128 | 68,370
4 | Tsz-Chiu Au | DBSz (DBS (version z)) | 27 | 3 | 136 | 68,339
5 | Tsz-Chiu Au | DBSpl (DBS with learning prevention) | 37 | 2 | 127 | 67,979
6 | Tsz-Chiu Au | DBSd (Derivative Belief Strategy (version d)) | 42 | 6 | 118 | 67,392
7 | Tsz-Chiu Au | DBSx (DBS (version x)) | 19 | 9 | 138 | 66,719
8 | Tsz-Chiu Au | TFTIc (TFT improved (ver. c)) | 41 | 4 | 121 | 66,409
9 | Tsz-Chiu Au | DBSf (Derivative Belief Strategy (version f)) | 48 | 2 | 116 | 66,269
10 | Tsz-Chiu Au | TFTIm (TFT improved (ver. m)) | 38 | 3 | 125 | 66,239
Table 1.8. Results from 2005 event, competition 3. There were 34 entries. Note that there is only one round in this competition.

| Rank | Player | Strategy | Total Points |
|---|---|---|---|
| 1 | Perukrishnen Vytelingum | $AgentSoton ($SOTON AGENT) | 7,558 |
| 2 | Deirdre Murrihy | PCurvepower1Memory2 (Penalty Curve of 1 using opponent’s previous 2 moves) | 7,521 |
| 3 | Deirdre Murrihy | PCurvepower2Memory2 (Penalty Curve of 2 using opponent’s previous 2 moves) | 7,521 |
| 4 | Deirdre Murrihy | PCurvepower0.5Memory2 (Penalty Curve of 0.5 using opponent’s previous 2 moves) | 7,521 |
| 5 | Enda Howley | PCurvepower2 (Penalty Curve of 2 using opponent’s previous move) | 7,521 |
| 6 | Enda Howley | PCurvepower1 (Penalty Curve of 1 using opponent’s previous move) | 7,521 |
| 7 | Enda Howley | PCurvepower0.5 (Penalty Curve of 0.5 using opponent’s previous move) | 7,521 |
| 8 | Wolfgang Kienreich | CNHM (CosaNostra Hitman) | 7,521 |
| 9 | Wolfgang Kienreich | CNHM (CosaNostra Hitman) | 7,521 |
| 10 | Wolfgang Kienreich | CNHM (CosaNostra Hitman) | 7,521 |
Table 1.9. Results from 2005 event, competition 4. There were 50 entries (26 web-based entries, 15 Java-based entries and 9 standard entries (RAND, NEG, ALLC, ALLD, TFT, STFT, TFTT, GRIM, Pavlov)).

| Rank | Player | Strategy | Won | Drawn | Lost | Total Points |
|---|---|---|---|---|---|---|
| 1 | Jia-wei Li | APavlov (Adaptive Pavlov) | 11 | 34 | 6 | 30,096 |
| 2 | Wolfgang Kienreich | OTFT (Omega tit for tat) | 9 | 36 | 6 | 29,554 |
| 3 | Philip Hingston | Mod (Modeller) | 7 | 36 | 8 | 29,003 |
| 4 | Bruno Beaufils | GRAD (Gradual) | 8 | 32 | 11 | 28,707 |
| 5 | Tim Romberg | tro1 (tro1) | 13 | 32 | 6 | 28,692 |
| 6 | Richard Brunauer | DETerminatorL6C4 (DETerminatorL6C4) | 12 | 32 | 7 | 28,523 |
| 7 | Hannes Payer | DETerminatorL4C4 (DETerminatorL4C4) | 11 | 33 | 7 | 28,292 |
| 8 | Bennett McElwee | LOOKDB (LookaheadDB) | 22 | 11 | 18 | 28,110 |
| 9 | Gerhard Mitterlechner | PRobberyM5C4 (PRobberyM5C4) | 11 | 32 | 8 | 27,893 |
| 10 | Wayne Davis | ltft (ltft) | 1 | 44 | 6 | 27,834 |
1.6. Acknowledgements

We would like to thank the following people who acted as reviewers for the chapters in this book.

• Muhammad A. Ahmad
• Oscar Alonso
• Dan Ashlock
• Tsz-Chiu Au
• Carlos Eduardo Rodriguez Calderon
• Michel Charpentier
• Wayne Davis
• Jörg Denzinger
• Eugene Eberbach
• Michael Filzmoser
• Nelis Franken
• Nicholas Gessler
• Michal Glomba
• Philip Hingston
• Enda Howley
• Nick Jennings
• Nanlin Jin
• Jacint Jordana
• Wolfgang Kienreich
• Eun-Youn Kim
• Jia-wei Li
• Helmut A. Mayer
• Bennett McElwee
• Gerhard Mitterlechner
• Colm O’Riordan
• Sarvapali Ramchurn
• Alex Rogers
• Tim Romberg
• Darryl A. Seale
• Wolfgang Slany
• Elpida Tzafestas
• Perukrishnen Vytelingum
• Georgios N. Yannakakis
• Lukas Zebedin
Appendix: Software Framework

A software library and a corresponding application were developed to make it easy to implement prisoner’s dilemma strategies and to run tournament competitions between populations of them. Although a vast array of software is available for the same purpose, none of it met all of our feature requirements. For several of our experiments we required a game engine that would, among other things, handle a continuous [normalised] range of moves, arbitrarily sized payoff matrices, different types of signal noise, multiple (> 2) strategies per game, and logging of partial and completed game results.

The software suite was developed in Java, allowing ease of development and web deployment. New strategies are easily added by implementing a subclass of the Strategy class. The principal requirements are implementations of the getMove() and reset() methods, which return the current strategy move and clear the strategy state between games, respectively.

Currently we define two types of games: standard and multi-player. A standard game involves two competing strategies playing for a number of rounds, and should mimic the basic game mechanics in the competitions run by Axelrod. A multi-player game involves several competing strategies, each obtaining a payoff from every opponent it plays against on each round. A tournament involves every participating strategy and differs for standard and multi-player games. A standard tournament pits every strategy against every other (including itself) in a standard game [à la round robin]. A multi-player tournament plays a single multi-player game.

An option is available to draw the number of rounds to be played from a Gaussian distribution, so as to discourage strategies from exploiting knowledge of a predefined or static game length for an unfair advantage. There is also an option to introduce noise into the output moves, in principle to test the robustness of the algorithms.
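The design described above can be illustrated with a small sketch. The library itself is written in Java and its actual signatures are not reproduced here; the Python code below only mirrors the described contract (a move method and a reset method), a standard game, a round-robin tournament that includes self-play, a Gaussian number of rounds, and noise on output moves. Every name in it (Strategy, get_move, play_game, round_robin), the payoff values, and the convention that only one copy's score is credited in self-play are assumptions made for illustration.

```python
import random

C, D = "C", "D"
# Example payoff values (assumed): (my move, opponent's move) -> my payoff.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}


class Strategy:
    """Hypothetical base class mirroring the getMove()/reset() contract."""

    def reset(self):
        """Clear any state kept between games (nothing to clear by default)."""

    def get_move(self, my_history, opp_history):
        raise NotImplementedError


class AllD(Strategy):
    def get_move(self, my_history, opp_history):
        return D


class TitForTat(Strategy):
    def get_move(self, my_history, opp_history):
        # Cooperate first, then copy the opponent's previous move.
        return opp_history[-1] if opp_history else C


def flip(move):
    return D if move == C else C


def play_game(a, b, rounds, noise=0.0, rng=random):
    """Play one standard two-player game and return both total payoffs."""
    a.reset()
    b.reset()
    hist_a, hist_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = a.get_move(hist_a, hist_b)
        move_b = b.get_move(hist_b, hist_a)
        # Noise: each output move is flipped with probability `noise`.
        if rng.random() < noise:
            move_a = flip(move_a)
        if rng.random() < noise:
            move_b = flip(move_b)
        hist_a.append(move_a)
        hist_b.append(move_b)
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
    return score_a, score_b


def round_robin(strategies, mean_rounds=200, sd_rounds=0.0, noise=0.0, seed=0):
    """Every strategy meets every other strategy and a copy of itself.

    `strategies` maps a name to a Strategy class. In self-play only the
    first copy's score is credited (an assumed convention).
    """
    rng = random.Random(seed)
    names = list(strategies)
    totals = {name: 0 for name in names}
    for i, name_a in enumerate(names):
        for name_b in names[i:]:
            # Game length drawn from a Gaussian, at least one round.
            rounds = max(1, round(rng.gauss(mean_rounds, sd_rounds)))
            a, b = strategies[name_a](), strategies[name_b]()
            score_a, score_b = play_game(a, b, rounds, noise, rng)
            totals[name_a] += score_a
            if name_b != name_a:
                totals[name_b] += score_b
    return totals
```

Calling `round_robin({"TFT": TitForTat, "AllD": AllD})` runs a noise-free tournament of fixed length; setting `sd_rounds` or `noise` above zero switches on the two options described in the text.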
Besides the programming API, a graphical user interface is available to set up and run PD tournament competitions (see Figure 1.1). The software monitors and allows users to log the output of a tournament with different degrees of detail. However, detailed logs will degrade performance. Besides the standard 2 × 2 payoff matrix for classic games, there is the ability to define an arbitrarily sized payoff matrix allowing for a wider range of allowable moves. Moves are normalised and payoffs are calculated from the closest allowable move in the payoff matrix.
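The rule that payoffs are taken from the closest allowable move can be sketched as follows. This is not the library's code: it assumes that moves are normalised to [0, 1], that the allowable moves of an n × n payoff matrix are evenly spaced over that interval, and that 0 stands for full defection and 1 for full cooperation. The function names and the example matrix are hypothetical.

```python
def nearest_level(move, n_levels):
    """Index of the allowable move closest to a normalised move in [0, 1]."""
    if not 0.0 <= move <= 1.0:
        raise ValueError("move must be normalised to [0, 1]")
    # Allowable moves are assumed evenly spaced: 0, 1/(n-1), ..., 1.
    # Adding 0.5 before truncating rounds halves upwards.
    return min(n_levels - 1, int(move * (n_levels - 1) + 0.5))


def payoff(matrix, my_move, opp_move):
    """Payoff to the row player; matrix[i][j] is indexed by (my level, opponent level)."""
    n = len(matrix)
    return matrix[nearest_level(my_move, n)][nearest_level(opp_move, n)]


# Classic 2 x 2 case with index 0 = defect and 1 = cooperate (assumed ordering).
CLASSIC = [[1, 5],   # I defect:    opponent defects -> 1, opponent cooperates -> 5
           [0, 3]]   # I cooperate: opponent defects -> 0, opponent cooperates -> 3
```

With the classic matrix, a move of 0.9 is treated as cooperation and a move of 0.2 as defection; a larger matrix simply adds intermediate levels without changing the lookup.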
Fig. 1.1. IPD tournament application.
A number of standard classic strategies are included in the library. The software can be downloaded from http://prisoners-dilemma.com.

References

Axelrod R. (1980a). Effective Choices in the Prisoner’s Dilemma, Journal of Conflict Resolution, 24, pp. 3-25.
Axelrod R. (1980b). More Effective Choices in the Prisoner’s Dilemma, Journal of Conflict Resolution, 24, pp. 379-403.
Axelrod R. M. (1984). The Evolution of Cooperation (Basic Books, New York).
Axelrod R. and D’Ambrosio L. (1995). Announcement for Bibliography on the Evolution of Cooperation, Journal of Conflict Resolution, 39, p. 190.
Axelrod R. (1997). The Complexity of Cooperation (Princeton University Press).
Birk A. (1999). Evolution of Continuous Degrees of Cooperations in an N-Player Iterated Prisoner’s Dilemma, Technical Report under review, Vrije Universiteit Brussel, AI-Laboratory.
Boyd R. and Lorberbaum J. P. (1987). No Pure Strategy is Evolutionary Stable in the Repeated Prisoner’s Dilemma, Nature, 327, pp. 58-59.
Darwen P. and Yao X. (2002). Co-Evolution in Iterated Prisoners Dilemma with Intermediate Levels of Cooperation: Application to Missile Defense, International Journal of Computational Intelligence and Applications, 2, 1, pp. 83-107.
Darwen P. and Yao X. (1995). On Evolving Robust Strategies for Iterated Prisoners Dilemma, In Progress in Evolutionary Computation, LNAI, 956, pp. 276-292.
Darwen P. and Yao X. (2001). Why More Choices Cause Less Cooperation in Iterated Prisoner’s Dilemma, Proc. Congress of Evolutionary Computation, pp. 987-994.
Davis M. (1997). Game Theory: A Nontechnical Introduction (Dover Publications).
Fogel D. (1993). Evolving Behaviours in the Iterated Prisoners Dilemma, Evolutionary Computation, 1, 1, pp. 77-97.
Goldstein J. (1991). Reciprocity in Superpower Relations: An Empirical Analysis, International Studies Quarterly, 35, pp. 195-209.
Maynard Smith J. (1982). Evolution and the Theory of Games (Cambridge University Press).
Minas J. S., Scodel A., Marlowe D. and Rawson H. (1960). Some Descriptive Aspects of Two-Person, Non-Zero-Sum Games, II, Journal of Conflict Resolution, 4, pp. 193-197.
Nash J. (1950). The Bargaining Problem, Econometrica, 18, pp. 150-155.
Nash J. (1953). Two-Person Cooperative Games, Econometrica, 21, pp. 128-140.
O’Riordan C. and Bradish S. (2000). Experiments in the Iterated Prisoner’s Dilemma and the Voter’s Paradox, 11th Irish Conference on Artificial Intelligence and Cognitive Science.
O’Riordan C. (2000). A Forgiving Strategy for the Iterated Prisoner’s Dilemma, Journal of Artificial Societies and Social Simulation, 3, 1.
Poundstone W. (1992). Prisoner’s Dilemma (Doubleday).
Rapoport A. (1996). Optimal policies for the prisoners dilemma, Tech report No. 50, Psychometric Laboratory, Univ. North Carolina, NIH Grant, MH10006.
Scodel A. and Philburn R. (1959). Some Personality Correlates of Decision Making under Conditions of Risk, Behavioral Science, 4, pp. 19-28.
Scodel A., Minas J. S., Ratoosh P. and Lipetz M. (1959). Some Descriptive Aspects of Two-Person, Non-Zero-Sum Games, Journal of Conflict Resolution, 3, pp. 114-119.
Scodel A. and Minas J. S. (1960). The Behavior of Prisoners in a “Prisoner’s Dilemma” Game, Journal of Psychology, 50, pp. 133-138.
Scodel A. (1962). Induced Collaboration in Some Non-Zero-Sum Games, Journal of Conflict Resolution, 6, pp. 335-340.
Scodel A. (1963). Probability Preferences and Expected Values, Journal of Psychology, 56, pp. 429-434.
von Neumann J. and Morgenstern O. (1944). Theory of Games and Economic Behavior (Princeton University Press).
Yao X. and Darwen P. (1999). How Important is Your Reputation in a MultiAgent Environment, Proc. of the 1999 IEEE Conference on Systems, Man and Cybernetics, IEEE Press, Piscataway, NJ, USA, pp. II-575 – II-580, Oct.
Chapter 2

Iterated Prisoner’s Dilemma and Evolutionary Game Theory

Siang Yew Chong¹, Jan Humble², Graham Kendall², Jiawei Li²,³, Xin Yao¹

University of Birmingham¹, University of Nottingham², Harbin Institute of Technology³
2.1. Introduction

The prisoner’s dilemma is a type of non-zero-sum game in which two players try to maximize their payoff by cooperating with, or betraying, the other player. The term non-zero-sum indicates that whatever benefits accrue to one player do not necessarily imply similar penalties imposed on the other player. The prisoner’s dilemma was originally framed by Merrill Flood and Melvin Dresher, working at RAND Corporation in 1950. Albert W. Tucker formalized the game with prison sentence payoffs and gave it the “Prisoner’s Dilemma” name. The classical prisoner’s dilemma (PD) is as follows:

Two suspects, A and B, are arrested by the police. The police have insufficient evidence for a conviction and, having separated the two prisoners, visit each of them to offer the same deal: if one testifies for the prosecution against the other and the other remains silent, the betrayer goes free and the silent accomplice receives the full 10-year sentence. If both stay silent, the police can sentence both prisoners to only six months in jail for a minor charge. If each betrays the other, each will receive a two-year sentence. Each prisoner must choose whether to betray the other or to remain silent. However, neither prisoner knows for sure what choice the other prisoner will make. So the question this dilemma poses is: What will happen? How will the prisoners act?
The general form of the PD is represented as the following matrix [Scodel et al. (1959)]:

| | Prisoner 2: Cooperate | Prisoner 2: Defect |
|---|---|---|
| Prisoner 1: Cooperate | (R, R) | (S, T) |
| Prisoner 1: Defect | (T, S) | (P, P) |
where R, S, T, and P denote Reward for mutual cooperation, Sucker’s payoff, Temptation to defect, and Punishment for mutual defection respectively, and T > R > P > S and R > (S + T)/2. The first constraint motivates each player to play noncooperatively, and the second removes any incentive to alternate between cooperation and defection [Rapoport (1966, 1999)]. Neither prisoner knows the choice of his accomplice. Even if they were able to talk to each other, neither could be sure that he could trust the other. The “dilemma” faced by the prisoners is that, whatever the other does, each is better off confessing than remaining silent. However, the payoff when both confess is worse for each player than the outcome they would have received if they had both remained silent.

Traditional game theory predicts the outcome of the PD to be mutual defection, based on the concept of Nash equilibrium. Defection is the dominant choice because it yields the higher payoff whatever the other player does (T > R and P > S), and mutual defection is a Nash equilibrium because, once both players defect, neither has anything to gain by unilaterally changing strategy [Hardin (1968); Nash (1950, 1951, 1996)].

In the Iterated Prisoner’s Dilemma (IPD) game, two players have to choose their moves repeatedly, and have memory of their previous behaviors. Because players who defect in one round can be “punished” by defections in subsequent rounds and those who cooperate can be rewarded by cooperation, the appropriate strategy for self-interested players is no longer obvious in IPD games. If the precise length of an IPD is known to the players, then the optimal strategy is to defect on each round (often called All Defect or AllD) [Luce and Raiffa (1957)]. This single rational strategy, which is deduced by propagating the single-stage Nash equilibrium of mutual defection backwards through every stage of the game, prevents players from cooperating to achieve higher payoffs [Selten (1965, 1983, 1988); Noldeke and Samuelson (1993)].
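The two constraints and the dominance argument are easy to check mechanically. The following sketch is only an illustration: the function names are hypothetical, and the second example encodes the prison sentences of the classical story as negative payoffs in years, which is an assumed encoding rather than something stated in the text.

```python
def is_prisoners_dilemma(T, R, P, S):
    """True if the payoffs satisfy T > R > P > S and R > (S + T) / 2."""
    return T > R > P > S and 2 * R > S + T


def best_reply(opp_move, T, R, P, S):
    """Move ("C" or "D") that maximises my one-shot payoff against opp_move."""
    payoff = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    return max("CD", key=lambda my_move: payoff[(my_move, opp_move)])


# Commonly used payoff values (T, R, P, S) = (5, 3, 1, 0).
AXELROD = dict(T=5, R=3, P=1, S=0)
# Classical story as negative years in jail: go free, 6 months, 2 years, 10 years.
SENTENCES = dict(T=0.0, R=-0.5, P=-2.0, S=-10.0)
```

For both payoff sets, `best_reply` returns "D" against either opponent move, which is the dominance property, while mutual cooperation still pays each player more than mutual defection (R > P).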
If the game has infinite length, or at least the players are not aware of the length of the game, backward induction is no longer effective and there exists the possibility that cooperation can take place. In fact, there is still controversy about whether or not backward induction can be applied to infinite (or finite) IPDs [Sobel (1975, 1976); Kavka (1986); Becker and Cudd (1990); Binmore (1997); Binmore et al. (2002); Bovens (1997)]. However, in IPD experiments, it was not uncommon to see people cooperate to gain a greater payoff not only in repeated games but even in one-shot games [Cooper et al. (1996); Croson (2000); Davis and Holt (1999); Milinski and Wedekind (1998)]. Traditional game theory interprets the cooperation phenomena in IPDs by means of reputation [Fudenberg and Maskin (1986); Kreps and Wilson (1982); Milgrom and Roberts (1982)], incomplete information [Harsanyi (1967); Kreps et al. (1982); Sarin (1999)], or bounded rationality [Anthonisen (1999); Harborne (1997); Radner (1980, 1986); Simon (1955, 1990); Vega-Redondo (1994)].

Evolutionary game theory differs from classical game theory in focusing on the dynamics of strategy change in a population more than on the properties of strategy equilibria. In evolutionary game theory, the IPD is an ideal experimental platform for studying how cooperation arises and persists, which is considered to be impossible in a static or deterministic environment.

The IPD attracted wide interest after Robert Axelrod’s famous book “The Evolution of Cooperation”. In 1979, Robert Axelrod organized a prisoner’s dilemma tournament and solicited strategies from game theorists [Axelrod (1980a, 1980b)]. Each of the 14 entries competed against all others (including itself) over a sequence of 200 moves. The specific payoff function used is as follows.
| | Prisoner 2: Cooperate | Prisoner 2: Defect |
|---|---|---|
| Prisoner 1: Cooperate | (3, 3) | (0, 5) |
| Prisoner 1: Defect | (5, 0) | (1, 1) |
The winner of the tournament was “tit-for-tat” (TFT), submitted by Anatol Rapoport. TFT always cooperates on the first move and then mimics whatever the other player did on the previous move. In a second tournament with 62 entries, the winner was again TFT. Axelrod discovered that “greedy” strategies tended to do very poorly in the long run, while “altruistic” strategies did better when the PD was repeated over a long period of time with many players. Genetic algorithms were then introduced to show how these altruistic strategies evolve in populations that are initially dominated by selfishness.

The prisoner’s dilemma is therefore of interest to the social sciences such as economics, politics and sociology, to the biological sciences such as ethology and evolutionary biology, and to applied mathematics such as evolutionary computing. Many social and natural processes, for example arms races between states and price setting by duopolistic firms, have been abstracted into models in which independent groups or individuals are engaged in PD games [Brelis (1992); Bunn and Payne (1988); Hauser (1992); Hemelrijk (1991); Surowiecki (2004)].
The optimal strategy for the one-shot PD game is simply defection. However, in the IPD game the optimal strategy depends upon the strategies of the possible opponents. For example, the strategy of Always Cooperate (AllC) is dominated by the strategy of Always Defect (AllD), and AllD is optimal in a population consisting of AllD and AllC. However, in a population consisting of AllD, AllC, and TFT, AllD is not necessarily the optimal strategy. Which strategy is optimal is determined by all the strategies in the population. Although TFT proved efficient in many IPD tournaments and was long considered to be the best basic strategy, it can be defeated in some specific circumstances [Beaufils, Delahaye and Mathieu (1996); Wu and Axelrod (1994)]. Therefore, there is lasting interest among game theorists in finding optimal strategies, or at least novel strategies that outperform TFT in IPD tournaments.

Since Axelrod, two types of approaches have been developed to test the efficiency or robustness of a strategy and, further, to derive optimal strategies:

(1) Round-robin tournaments.
(2) Evolutionary dynamics.

A round-robin tournament shows the efficiency of a strategy in competing with others, while an ecological simulation illustrates the evolutionary robustness of a strategy in terms of its number of descendants or its survivability in a certain environment. Many novel strategies have been developed and analyzed by means of these approaches.

By using round-robin tournaments, the interactions between different strategies can be observed and analyzed. If the statistical distribution of opposing strategies can be determined, an optimal counter-strategy can be derived mathematically. For example, if the population consists of 50% TFT and 50% AllC, the optimal strategy should cooperate with TFT and defect against AllC in order to maximize the payoff.
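The dependence of the best reply on the population mix can be made concrete with a small calculation. The sketch below is illustrative only. It assumes the payoff values (T, R, P, S) = (5, 3, 1, 0), a fixed game length of 200 moves and no noise, and it scores a few candidate strategies against a 50/50 mix of TFT and AllC. The `prober` function is one hypothetical way to tell the two opponents apart: it defects twice and then cooperates only if the opponent retaliated on the second move. All function names are assumptions.

```python
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def all_c(mine, theirs):
    return "C"


def all_d(mine, theirs):
    return "D"


def tft(mine, theirs):
    # Cooperate first, then copy the opponent's previous move.
    return theirs[-1] if theirs else "C"


def prober(mine, theirs):
    """Defect twice, then cooperate only if the opponent retaliated on move 2."""
    if len(mine) < 2:
        return "D"
    return "C" if theirs[1] == "D" else "D"


def score(player, opponent, rounds=200):
    """Total payoff to `player` in one game against `opponent`."""
    mine, theirs, total = [], [], 0
    for _ in range(rounds):
        a = player(mine, theirs)
        b = opponent(theirs, mine)
        mine.append(a)
        theirs.append(b)
        total += PAYOFF[(a, b)]
    return total


def expected_score(player, mix, rounds=200):
    """Average payoff against a population given as (strategy, share) pairs."""
    return sum(share * score(player, opp, rounds) for opp, share in mix)


MIX = [(tft, 0.5), (all_c, 0.5)]
```

Comparing `expected_score(s, MIX)` for `s` in `(all_c, tft, all_d, prober)` shows that unconditional defection gains little from the TFT half of the population, whereas a strategy that discriminates between the two opponent types can exploit AllC while still cooperating with TFT.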
It is easy to design such a strategy: it defects on the first two moves, and then always plays C if the opponent defected on the second move, and always plays D otherwise. A similar concept in analyzing optimal strategies is the Bayesian Nash equilibrium, which is widely used in experimental economics [Bedford and Meilijson (1997); Gilboa and Schmeidler (2001); Kagel and Roth (1995); Kalai and Lehrer (1993); Rubinstein (1998)].

In evolutionary dynamics, processes like natural selection are simulated: individuals with low scores die off, and those with high scores flourish. The evolutionary rule that describes what future states follow from the current state is fixed and deterministic: for a given time interval only one future state follows from the current state [Katok and Hasselblatt (1996)]. The most common form of the evolution rule is the replicator equation, which assumes infinite populations, continuous time, complete mixing, and that strategies breed true. Given a population of strategies and the dynamic equations, the evolutionary process can be simulated, and how strategies evolve in the population over a short or long time period can be shown. Optimal strategies can be developed in this way [Axelrod (1987); Darwen and Yao (1995, 1996, 2001); Lindgren (1992); Miller (1996)].

2.2. Strategies in IPD Tournaments

Axelrod was the first to attempt to search for efficient strategies by means of IPD tournaments [Axelrod (1980a, 1980b)]. TFT had long been studied as a strategy for the IPD game [Komorita, Sheposh and Braver (1968); Rapoport and Chammah (1965)]. However, it was only after Axelrod’s tournaments that TFT became well known. According to Axelrod, several conditions are necessary for a strategy to be successful. These conditions include:

Nice. The most important condition is that the strategy must be “nice”. That is, it will not defect before its opponent does. Almost all of the top-scoring strategies are nice. Therefore even a purely selfish strategy, acting for purely selfish reasons, should not be the first to defect.

Retaliating. Axelrod contended that a successful strategy must not be a blind optimist. It must always retaliate. An example of a non-retaliating strategy is AllC. This is a very bad choice, as “nasty” strategies will ruthlessly exploit such strategies.

Forgiving. Another quality of successful strategies is that they must be forgiving. Though they will retaliate, they will fall back to cooperating if the opponent does not continue to defect. This stops long runs of revenge and counter-revenge, thus maximising payoffs.
Clear. The last quality is being clear, that is, making it easier for other strategies to predict its behavior so as to facilitate mutual cooperation. Stochastic strategies, however, are not clear because of the uncertainty in their choices.

In a further study, Axelrod noted that just a few of the 62 entries in the second tournament had an appreciable influence on the performance of a given strategy. He utilized eight strategies as opponents for a simulated evolving population based on a genetic algorithm approach [Axelrod (1987)]. The population consisted of deterministic strategies that use the outcomes of the three previous moves to determine the current move. The simulation was conducted using a population of 20 strategies, drawn from a total of 2^70 possible strategies, which were played repeatedly against the eight representatives. Mutation and crossover were used to generate new strategies. The typical results indicated that populations initially moved toward mutual defection, but subsequently evolved toward mutual cooperation. Moreover, most of the strategies that evolved in the simulation actually resembled TFT, having the properties of “Nice”, “Forgiving”, and “Retaliating”.

Although TFT has been considered to be the most successful strategy in the IPD for several decades, there is still some controversy about it. Traditional game theory seems to lack a theoretical explanation for strategies like TFT. TFT is not subgame perfect, and there are always subgame perfect equilibria that dominate TFT according to the Folk Theorem [Binmore (1992); Hargreaves and Varoufakis (1995); Myerson (1991); Rubinstein (1979); Selten (1965, 1975)]. On the other hand, whether or not TFT is the most efficient single strategy in the IPD game is still unclear; therefore, many researchers are attempting to develop novel strategies that can outperform TFT.

2.2.1. Heterogeneous TFTs

Since TFT had such success in IPD tournaments and experiments, it is natural to expect that TFT may be improved by slightly modifying its rule. Many heterogeneous TFTs have been developed in order to overcome TFT’s shortcomings or to adapt to a certain environment, for example the IPD with noise. Tit-for-Two-Tats (TFTT), Generous TFT (GTFT), and Contrite TFT (CTFT) are examples of such strategies.
A situation that TFT does not handle well is a long series of mutual retaliations evoked by an occasional defection. The deadlock can be broken if the co-player behaves more generously than TFT and forgives at least one defection. TFTT retaliates with defection only after two successive defections and thus attempts to avoid becoming involved in mutual retaliations. Usually, TFTT performs well in a population with more cooperative strategies but does poorly in a population with more permanently defective strategies.

Similar to TFTT, Benevolent TFT (BTFT) always cooperates after cooperation and normally defects after defection, but occasionally BTFT responds to defection with cooperation in order to break up a series of mutual obstructions [Komorita, Sheposh and Braver (1968)]. In the experiments of Manarini (1998) and Micko (1997), fixed-interval BTFT strategies were shown to be superior to, or at least equivalent to, TFT in terms of cooperation as well as in terms of cumulative payoff. However, BTFT tends to produce irregularly alternating exploitations and sometimes resorts to mutual retaliation.

Allowing some percentage of the other player’s defections to go unpunished has been widely accepted as a good way to cope with noise [Molander (1985); May (1987); Axelrod and Dion (1988); Bendor et al. (1991); Godfray (1992); Wu and Axelrod (1994)]. A reciprocating strategy such as TFT can be modified to forgive the other player’s defection at a certain rate in order to decrease the influence of noise. GTFT behaves like TFT but cooperates with probability q = min[1 − (T − R)/(R − S), (R − P)/(T − P)] when it would otherwise defect. This prevents a single error from echoing indefinitely. For example, in the case of T = 5, R = 3, P = 1, and S = 0, q = 1/3. GTFT is said to take over the dominant position in a population of homogeneous TFT strategies in an evolutionary environment with noise [Nowak and Sigmund (1992)].
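The forgiveness probability and the two forgiving rules above can be written down in a few lines. This is a minimal sketch under stated assumptions: the function names are hypothetical, moves are encoded as the strings "C" and "D", and exact fractions are used so that the worked value q = 1/3 can be checked without rounding error.

```python
from fractions import Fraction
import random


def gtft_generosity(T, R, P, S):
    """q = min[1 - (T - R)/(R - S), (R - P)/(T - P)] as an exact fraction."""
    return min(1 - Fraction(T - R, R - S), Fraction(R - P, T - P))


def gtft_move(opp_last, q, rng=random):
    """Generous TFT: copy cooperation; forgive a defection with probability q."""
    if opp_last is None or opp_last == "C":
        return "C"
    return "C" if rng.random() < float(q) else "D"


def tftt_move(opp_history):
    """Tit-for-Two-Tats: defect only after two successive defections."""
    return "D" if opp_history[-2:] == ["D", "D"] else "C"
```

For (T, R, P, S) = (5, 3, 1, 0) the two terms of the minimum are 1 − 2/3 = 1/3 and 2/4 = 1/2, so `gtft_generosity(5, 3, 1, 0)` evaluates to `Fraction(1, 3)`, in agreement with the value quoted in the text.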
In a noisy environment, retaliating against an unintended defection often leads to permanent bilateral retaliation. Therefore, forgiving a defection evoked by an unintended defection allows a quick recovery from error. This is based upon the idea that one shouldn’t be provoked by the other player’s response to one’s own unintended defection [Sugden (1986); Boyd (1989)]. The CTFT strategy has three states: “contrite”, “content” and “provoked”. It begins in the content state, cooperating, and stays there unless there is a unilateral defection. If it was the victim while content, it becomes provoked and defects until a cooperation from the other player causes it to become content. If it was the defector while content, it becomes contrite and cooperates. When contrite, it becomes content only after it has successfully cooperated. CTFT can correct its own unintended defection in a noisy environment. If one of two CTFT players defects, the defecting player will contritely cooperate on the next move and the other player will defect, and then both will be content to cooperate on the following move. However, CTFT is not effective at correcting the other player’s error. For example, if CTFT is playing TFT and the TFT player defects by accident, the retaliation will continue until another error occurs. In an ecological simulation with noise, GTFT and CTFT competed with the 63 rules of the Second Round of the Computer Tournament for the Prisoner’s Dilemma [Axelrod (1984)]. CTFT became the dominant strategy, making up 97% of the population at generation 2000 [Wu and Axelrod (1994)].

2.2.2. Pavlov (Win-Stay Lose-Shift)

A possible drawback of TFT is that it performs poorly in a noisy environment. Assume that a population of TFT strategies plays the IPD with one another in a noisy environment, where every choice may occasionally be implemented in error. Although a TFT strategy cooperates with its twin at the beginning, it falls out of cooperation as soon as the other player’s action is misinterpreted, and this in turn induces the other player’s defection in the next round. Therefore, after an error, the game turns into a CD, DC, CD . . . cycle. If a second error happens, the outcome is as likely to fall into mutual defection as it is to resume cooperation. Cooperation between TFT strategies is easily broken even when the noise frequency is low [Donninger (1986); Kraines and Kraines (1995)].

The Pavlov strategy, also known as Win-Stay Lose-Shift or Simpleton [Rapoport and Chammah (1965)], has been shown to outperform TFT in environments with noise [Fudenberg and Maskin (1990); Kraines and Kraines (1995, 2000)]. Pavlov cooperates when both sides have cooperated or both have defected on the previous move, and defects otherwise.
Pavlov and TFT are both memory-one strategies, in which players only remember and make use of their own move and their opponent’s move on the last round. The major difference between Pavlov and TFT is that Pavlov chooses to cooperate after mutual defection, whereas TFT defects, and this helps Pavlov resume cooperation with cooperative strategies, such as TFT, in a noisy environment. When restricted to an environment of memory-one agents interacting in iterated Prisoner’s Dilemma games with a 1% noise level, Pavlov is the only cooperative strategy and one of the very few that cannot be invaded by a similar strategy [Nowak and Sigmund (1993, 1995)].
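The different ways in which TFT and Pavlov recover from a single error can be traced with two lookup tables. The sketch below is an illustration under assumptions: each deterministic memory-one rule is stored as a dictionary keyed by (own last move, opponent's last move), two copies of the same rule play each other, and the first recorded round is the one in which player A defects by mistake while B cooperates. The names are hypothetical.

```python
# Deterministic memory-one rules: (own last move, opponent's last move) -> next move.
TFT = {("C", "C"): "C", ("C", "D"): "D", ("D", "C"): "C", ("D", "D"): "D"}
PAVLOV = {("C", "C"): "C", ("C", "D"): "D", ("D", "C"): "D", ("D", "D"): "C"}


def replay_after_error(rule, rounds=6):
    """Joint outcomes when two copies of `rule` continue after one error.

    The first outcome is the erroneous round (A plays D, B plays C).
    Each outcome is written as A's move followed by B's move.
    """
    a, b = "D", "C"
    outcomes = [a + b]
    for _ in range(rounds - 1):
        # Both players update simultaneously from the previous joint outcome.
        a, b = rule[(a, b)], rule[(b, a)]
        outcomes.append(a + b)
    return outcomes
```

Under these assumptions `replay_after_error(TFT)` alternates between DC and CD indefinitely, whereas `replay_after_error(PAVLOV)` passes through a single round of mutual defection and then returns to mutual cooperation.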
Simulation of the evolutionary dynamics of win-stay lose-shift strategies shows that these strategies are able to adapt to an uncertain environment even when the noise level is high [Posch (1997)]. In simulations of stochastic memory-one strategies for IPD games, Nowak and Sigmund (1993, 1995) report that cooperative agents using a Pavlov-type strategy eventually dominate a random population. Memory-one strategies can be expressed in the form S(p1, p2, p3, p4), where p1 denotes the probability of playing C (Cooperate) after a CC outcome, p2 denotes the probability of playing C after a CD outcome, p3 denotes the probability of playing C after a DC outcome, and p4 denotes the probability of playing C after a DD outcome. Most of the well-known strategies can be expressed in this form. For example, AllC = S(1, 1, 1, 1), AllD = S(0, 0, 0, 0), TFT = S(1, 0, 1, 0), Pavlov = S(1, 0, 0, 1). Noise is conveniently introduced by restricting the conditional probabilities p1, ..., p4 to lie strictly between 0 and 1. For example, S(0.999, 0.001, 0.999, 0.001) is a TFT strategy whose move is implemented in error with probability 0.001.

In a computer simulation starting from a population using the totally random strategy S(0.5, 0.5, 0.5, 0.5), the win-stay lose-shift strategy shows its evolutionary robustness in a noisy environment. A randomly generated mutant strategy is introduced every 100 generations, giving about 10^5 mutants over a total of 10^7 generations. The simulation results show that the populations are dominated by the win-stay lose-shift strategy in 33 of a total of 40 simulations. TFT strategies perform poorly in large part because they do not exploit overly cooperative strategies. Simulations reveal that Pavlov loses against AllD but can invade TFT, and that Pavlov cannot be invaded by AllD [Milinski (1993)].

2.2.3. Gradual

The Gradual strategies are like TFT but respond to the opponent with a gradual pattern. Gradual acts as TFT, except when it is time to forgive and remember the past.
It cooperates on the first move and continues to do so as long as the other player cooperates. After the first defection of the other player, it defects once and cooperates twice; after the second defection of the opponent, it defects twice and cooperates twice; . . . after the nth defection it reacts with n consecutive defections and then calms down its opponent with two cooperations [Beaufils, Delahaye and Mathieu (1996)]. Both round-robin competitions and ecological evolution experiments have been conducted in order to compare the performance of Gradual with that of TFT.
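The Gradual rule can be sketched as a small state machine. The description above leaves some details open, so the code below fixes them by assumption: every defection by the opponent is counted, including those made while Gradual is punishing or calming, but a new punishment starts only once the current sequence is finished and only if the opponent defected on the immediately preceding move. The class and method names are hypothetical.

```python
class Gradual:
    """Sketch of Gradual: n defections after the n-th defection, then two cooperations."""

    def __init__(self):
        self.reset()

    def reset(self):
        self.opp_defections = 0  # total opponent defections seen so far
        self.queue = []          # moves already committed to (punish, then calm)

    def get_move(self, opp_last):
        """Return "C" or "D"; opp_last is the opponent's previous move (None on move 1)."""
        if opp_last == "D":
            self.opp_defections += 1
        if self.queue:
            # Still inside a punish-and-calm sequence.
            return self.queue.pop(0)
        if opp_last == "D":
            n = self.opp_defections
            # Play the first of n defections now; queue the rest plus two cooperations.
            self.queue = ["D"] * (n - 1) + ["C", "C"]
            return "D"
        return "C"
```

Against an opponent that defects only on its third and tenth moves, this version answers the first defection with one D followed by two C moves, and the second with two D moves followed by two C moves.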
Gradual wins in experiments where round-robin competitions are conducted with several well-known strategies, such as TFT and GRIM. In ecological evolutionary experiments, Gradual and TFT show the same type of evolution, with a quantitative difference in favor of Gradual, which is far ahead of all other survivors when the population has stabilised. However, these results are sufficient to demonstrate only that TFT is not always the best; they do not prove that Gradual always outperforms TFT. Gradual receives fewer points than TFT when interacting with AllD because Gradual forgives too many defections. Therefore, if there are many defecting strategies like AllD in the competition, it is possible for TFT to outperform Gradual.

Beaufils, Delahaye and Mathieu (1996) try to improve the performance of Gradual by using a genetic algorithm. 19 different genes are used and a fitness function evaluates the quality of the strategies. Several new strategies are found after 150 generations of evolution. One of them beats Gradual and TFT in a round-robin tournament, as well as in an ecological simulation. In both cases it finished first, just in front of Gradual, with TFT two or three places behind and a wide gap in the score or in the size of the stabilised population. The evolutionary dynamics of populations including Gradual have also been studied in Delahaye and Mathieu (1996), Doebeli and Knowlton (1998), Glomba, Filak, and Kwasnicka (2005), and Beaufils, Delahaye, and Mathieu (1996).

2.2.4. Adaptive strategies

From the viewpoint of automation, the strategies in IPD games can be regarded as automatic agents with or without feedback mechanisms. Most well-known IPD strategies are not adaptive because their responses to any given opponent are fixed. It is impossible to improve their performance since the parameters of their responding mechanism cannot be adjusted. However, there are still some strategies in IPDs which are adaptive.
Although there is still no experimental evidence of adaptive strategies outperforming non-adaptive ones in IPD games, adaptive strategies are worth studying since creatures with higher intelligence are all adaptive. There have been two approaches to developing adaptive strategies. Firstly, adaptive mechanisms can be implemented by making the parameters of a non-adaptive strategy adjustable. Secondly, new adaptive strategies can be developed by using evolutionary computation, reinforcement
learning, and other computational techniques [Darwen and Yao (1995, 1996)]. Tzafestas (2000a, 2000b) introduced the adaptive tit-for-tat (ATFT) strategy, which embeds an adaptive factor into the conventional TFT strategy. ATFT keeps the advantages of tit-for-tat in the sense of retaliating and forgiving, and adds some behavioural gradualness that shows as fewer oscillations between Cooperate and Defect. It uses an estimate of the opponent’s behavior, whether cooperative or defecting, and reacts to it in a tit-for-tat manner. To represent degrees of cooperation and defection, a continuous variable named “world”, which ranges from 0 (total defection) to 1 (total cooperation), is used. The ATFT strategy can then be formulated as a simple model:

    If (opponent played C in the last cycle) then
        world = world + r*(1-world)
    else
        world = world + r*(0-world)
    If (world >= 0.5) play C, else play D

Here r is the adaptation rate. The TFT strategy corresponds to the case of r = 1 (immediate convergence to the opponent’s current move). Clearly, ATFT is an extension of the conventional TFT strategy. Simulations of spatial IPD games between ATFT, AllD, AllC, and TFT on a 2D grid show that ATFT is fairly stable and resistant to perturbations. Since a fairly small adaptation rate r allows more gradual behavior, ATFT tends to be more robust than TFT in a noisy environment.

Since evolutionary computation has been widely used in simulating the dynamics of IPD games, it is natural to consider obtaining IPD strategies directly by using evolutionary approaches [Lindgren (1991); Fogel (1993); Darwen and Yao (1995, 1996)]. Axelrod (1987) studied how to find effective strategies by using genetic algorithms as the simulation method. He established an initial population of strategies that are deterministic and use the outcomes of the three previous moves to make a choice in the current move.
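Returning briefly to the ATFT update rule stated earlier in this section, a minimal Python sketch is given here. The class name `AdaptiveTitForTat`, the default rate of 0.2, and the initial value of `world` (1.0, so that the first move is cooperative) are assumptions for illustration; the text does not specify them.

```python
class AdaptiveTitForTat:
    """Illustrative sketch of the ATFT rule with a fixed adaptation rate."""

    def __init__(self, rate=0.2, world=1.0):
        self.rate = rate    # adaptation rate r; r = 1 reduces to TFT
        self.world = world  # assumed initial estimate (1 = total cooperation)

    def move(self, opponent_last=None):
        # Shift the estimate towards 1 after a C and towards 0 after a D.
        if opponent_last == "C":
            self.world += self.rate * (1.0 - self.world)
        elif opponent_last == "D":
            self.world += self.rate * (0.0 - self.world)
        return "C" if self.world >= 0.5 else "D"


def atft_against(opponent_moves, rate):
    """Return ATFT's moves against a fixed opponent move sequence."""
    player, last, played = AdaptiveTitForTat(rate=rate), None, []
    for opponent_move in opponent_moves:
        played.append(player.move(last))
        last = opponent_move
    return "".join(played)
```

Under these assumptions, `atft_against("CDDCD", rate=1.0)` reproduces TFT's replies, while with `rate=0.2` the player tolerates three consecutive defections and retaliates only after the fourth, which illustrates the gradualness mentioned above.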
By means of playing IPD games between one another, successful strategies are selected to have more offspring. Then the new population will display patterns of behavior that are more like those of the successful strategies of the previous population, and less like those of the unsuccessful ones. As the evolution process continues, the strategies with relatively high scores will flourish while the unsuccessful strategies die out. Simulation results show
that most of the strategies that were evolved in the simulation actually resemble TFT and do substantially better than TFT. However, it would not be accurate to say that these strategies are better than TFT because they are probably not very robust in other environments [Axelrod (1987)]. Many researchers have found that evolved strategies may lack robustness, i.e., the strategies did well against the local population, but failed when something new and innovative appeared [Lindgren (1991); Fogel (1993)]. Darwen and Yao (1996) applied a technique to prevent the genetic algorithm from converging to a single optimum and attempted to develop new IPD strategies without human intervention. They conclude that adding static opponents to the round-robin tournament improves the results of the final population.

Optimal strategies can be determined only if the strategy of the opponent is known. By means of reinforcement learning, model-based strategies with the ability of on-line identification of an opponent can be built [Sandholm and Crites (1996); Freund et al. (1995); Schmidhuber (1996)]. How can a player acquire a model of its opponent’s strategy? One possible source of information available to the player is the history of the game. Another possible source is observed games between the opponent and other agents. In the case of IPD games, a player can infer a model of the opponent from the outcomes of past moves and then adapt its strategy during the game. Reinforcement learning (RL) is based on the idea that the tendency to produce an action should be strengthened if it produces favorable results, and weakened if it produces unfavorable results [Watkins (1989); Watkins and Dayan (1992); Kaelbling and Moore (1996)]. A model-based RL approach generates expectations about the opponent’s behavior by making use of a model of its strategy [Carmel and Markovitch (1997, 1998)].
It is well suited for use in an IPD tournament against an unknown opponent because of its small computational complexity. The major problem in designing a model-based strategy (MBS) is the risk involved in exploration, and thus the trade-off between exploitation and exploration. An exploring action taken by the MBS tests unfamiliar aspects of the opponent, which can yield a more accurate model of the opponent. However, this action also carries the risk of putting the MBS into a much worse position. For example, in order to distinguish the strategy AllC from GRIM and TFT in an IPD tournament, an MBS has to defect at least once and therefore loses the chance to cooperate with GRIM. The exploratory action affects not only the current payoff but also the future rewards [Berry and Fristedt (1985)]. There have been several approaches developed to solve this
problem [Berry and Fristedt (1985); Gittins (1989); Sutton (1990); Narendra and Thathachar (1989); Kaelbling (1993); Moore and Atkeson (1993); Carmel and Markovitch (1998)]. Since the number of possible strategies for a repeated game is usually infinite, computational complexity is another problem that needs to be addressed [Ben-Porath (1990); Carmel and Markovitch (1998)]. Effective MBSs have seldom been recorded in round-robin IPD tournaments. However, the strategy that won competition 4 in the 2005 IPD tournament, Adaptive Pavlov, is such a strategy [Prisoner’s dilemma tournament result (2005)]. Furthermore, it seems that each of the strategies that ranked above TFT incorporated a mechanism to explore the opponent.

2.2.5. Group strategies

In the 2004 IPD competition [20th-anniversary Iterated Prisoner’s Dilemma competition], a team from Southampton University led by Professor N. Jennings introduced a group of strategies, which proved to be more successful than Tit-for-Tat (see chapter 9). The strategies in the group were designed to recognise each other through a known series of five to ten moves at the start. Once two Southampton players recognised each other, they would take on their “master” or “slave” roles: a master always defects while a slave always cooperates, so that the master wins the maximum points. If the program recognised that another player was not a Southampton entry, it would immediately defect to minimise the score of the opposition. The Southampton group strategies succeeded in defeating any non-grouped strategies and won the top three positions in the competition [Prisoner’s dilemma tournament result (2004)]. According to Grossman (2004), it was difficult to tell whether a group strategy would really beat TFT because most of the “slave” group members received far lower scores than the average level and were ranked at the bottom of the table. The average score of the group strategies is not necessarily higher than that of TFT.
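The recognition-then-role behaviour described above can be illustrated with a small Python sketch. Everything specific in it is hypothetical: the five-move code `HANDSHAKE`, the class `ColludingPlayer`, and the helper `play` are invented for illustration, since the actual Southampton signalling sequence and implementation are not given in the text.

```python
HANDSHAKE = "CDDCD"  # hypothetical recognition code, not the real one


class ColludingPlayer:
    """Illustrative group member: signal first, then act as master or slave."""

    def __init__(self, role):
        assert role in ("master", "slave")
        self.role = role
        self.opponent_history = []

    def move(self, opponent_last=None):
        if opponent_last is not None:
            self.opponent_history.append(opponent_last)
        t = len(self.opponent_history)  # index of the current round
        k = min(t, len(HANDSHAKE))
        matches = "".join(self.opponent_history[:k]) == HANDSHAKE[:k]
        if not matches:
            return "D"           # outsider detected: defect from now on
        if t < len(HANDSHAKE):
            return HANDSHAKE[t]  # still signalling
        return "D" if self.role == "master" else "C"


class AllC:
    """Unconditional cooperator, used here as an outsider."""

    def move(self, opponent_last=None):
        return "C"


def play(a, b, rounds):
    """Play two players against each other and return both move strings."""
    last_a = last_b = None
    moves_a, moves_b = [], []
    for _ in range(rounds):
        move_a, move_b = a.move(last_b), b.move(last_a)
        moves_a.append(move_a)
        moves_b.append(move_b)
        last_a, last_b = move_a, move_b
    return "".join(moves_a), "".join(moves_b)
```

In this sketch a master and a slave both play the code for five rounds, after which the master defects and the slave cooperates; against `AllC` the mismatch is detected on the second signalled move and the group member defects thereafter.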
The significance of group strategies perhaps lies in their evolutionary character. None of the known strategies in IPD games is an evolutionarily stable strategy [Boyd and Lorberbaum (1987)]. The strategies that are most likely to be evolutionarily stable, such as AllD or GRIM, can resist the invasion of some types of strategies but cannot resist the invasion of others. For example, a small group of TFT strategies cannot invade a large population of AllD; however, STFT can. There exists the possibility
that TFT can successfully invade a population of AllD indirectly. Suppose that a large population of AllD is continuously attacked by small groups of STFT. Because every invasion leaves a small positive proportion of STFT in the population of AllD, the number of STFT increases gradually. When the number of STFT is large enough, a small group of TFT can successfully invade and AllD will die out. However, group strategies may be evolutionarily stable. By cooperating with group members and defecting against non-group members, a population of group strategies can prevent any outsider from successfully invading. This is, perhaps, the real value of group strategies.

2.3. Evolutionary Dynamics in Games

Traditional game theorists have developed several effective approaches to study static games based on the assumption of rationality. By using von Neumann-Morgenstern utility, refinements of Nash equilibrium, and reasoning, both cooperative and non-cooperative games are analyzed within a theoretical framework. However, in the area of repeated games, especially in games where dynamics are concerned, few approaches from traditional game theory are available. Evolutionary game theory provides novel approaches to solve dynamic games. If the precise length of an IPD is known to the players, then the optimal strategy is to defect on each round. If the game has infinite length, or at least the players are not aware of the length of the game, cooperation becomes possible [Dugatkin (1989); Darwen and Yao (2002); Akiyama and Kaneko (1995); Doebeli, Blarer, and Ackermann (1997); Axelrod (1999); Glance and Huberman (1993, 1994); Ikegami and Kaneko (1990); Schweitzer (2002)].
Nowak and May (1992, 1993) showed that cooperators and defectors coexist in certain circumstances by introducing spatial evolutionary games, in which two types of players (cooperators who always cooperate and defectors who always defect) are placed in a two-dimensional spatial array. In each round, every individual plays the PD game with its immediate neighbors. The selection scheme is that each lattice site is occupied in the next round either by its original owner or by one of the neighbors, depending on who scores the highest total in the current round. Simulation results show that cooperators remain a considerable percentage of the population in some cases, and that defectors can invade any lattice site but cannot occupy the whole area.
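One round of this update rule can be sketched in Python as follows. This is a simplified illustration rather than a reproduction of Nowak and May's program: the function name `spatial_pd_step`, the eight-cell (Moore) neighbourhood, the wrap-around boundaries, the exclusion of self-interaction, and the rule that ties leave a site with its current owner are all assumptions made here.

```python
def spatial_pd_step(grid, T, R, P, S):
    """Return the lattice after one round of play and imitation.

    grid is a list of rows containing "C" or "D". Assumes at least a
    3 x 3 lattice so that the eight wrapped neighbours are distinct.
    """
    n, m = len(grid), len(grid[0])
    payoff = {("C", "C"): R, ("C", "D"): S, ("D", "C"): T, ("D", "D"): P}
    offsets = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)
               if (di, dj) != (0, 0)]

    def neighbours(i, j):
        # Periodic (wrap-around) boundaries are an assumption of this sketch.
        return [((i + di) % n, (j + dj) % m) for di, dj in offsets]

    # Total payoff of every site against its eight neighbours.
    score = [[sum(payoff[(grid[i][j], grid[a][b])] for a, b in neighbours(i, j))
              for j in range(m)] for i in range(n)]

    new_grid = []
    for i in range(n):
        row = []
        for j in range(m):
            best = (i, j)  # the current owner keeps the site on ties
            for a, b in neighbours(i, j):
                if score[a][b] > score[best[0]][best[1]]:
                    best = (a, b)
            row.append(grid[best[0]][best[1]])
        new_grid.append(row)
    return new_grid
```

As a usage example, placing a single defector in the middle of a 5 × 5 lattice of cooperators and applying the step with T = 2.8, R = 1.1, P = 0.1, and S = 0 (payoff values reported by Nowak and May (1993)) turns the defector's eight neighbours into defectors after one round, and the whole lattice after two.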
When the parameters of the payoff matrix are set to be T = 2.8, R = 1.1, P = 0.1, and S = 0 and the initial state is set to be a random mixture of the two types of strategies, the evolutionary dynamics of the local interaction model lead to a state where each player chooses the strategy Defect, the only ESS in the prisoner’s dilemma. Figure 2.1 shows that the population converges to a state where everyone defects and no Cooperate strategy survives after 5 generations.
Fig. 2.1. Spatial Prisoner’s Dilemma with the values T = 2.8, R = 1.1, P = 0.1, and S = 0 [Nowak and May (1993)]. Panels show Generations 1, 2, 3, and 6.
However, when the parameters of the payoff matrix are set to T = 1.2, R = 1.1, P = 0.1, and S = 0, the evolutionary dynamics do not converge to the stable state of defection. Instead, they reach a stable oscillating state in which cooperators and defectors coexist and some regions are occupied in turn by different strategies.
Fig. 2.2. Spatial Prisoner’s Dilemma with the values T = 1.2, R = 1.1, P = 0.1, and S = 0 [Nowak and May (1993)]. Panels show Generations 1, 2, 19, and 20.
Moreover, when the parameters of payoff matrix are set to be T = 1.61, R = 1.01, P = 0.01, and S = 0, the evolutionary dynamics lead to a chaotic state: regions occupied predominantly by Cooperators may be successfully
invaded by Defectors, and regions occupied predominantly by Defectors may be successfully invaded by Cooperators.
Fig. 2.3. Spatial Prisoner’s Dilemma with the values T = 1.61, R = 1.01, P = 0.01, and S = 0 [Nowak and May (1993)]. Panels show Generations 1, 3, 13, and 15.
If the starting configurations are sufficiently symmetrical, this spatial version of the PD game can generate chaotically changing spatial patterns, in which cooperators and defectors both persist indefinitely. For example, suppose we set R = 1, P = 0.01, S = 0 and T = 1.4, and start from a state in which every individual in a square 69 × 69 lattice is a cooperator except for a single defector in the middle of the lattice. The structure of the evolving lattice then varies like a kaleidoscope, and the ever-changing sequences of spatial patterns can be very beautiful, as shown in Fig. 2.4. The role of spatial interaction in the evolution of cooperation is further studied by Durrett and Levin (1998), Schweitzer, Behera, and Mühlenbein (2002), and Ifti, Killingback, and Doebeli (2004).
Fig. 2.4. Spatial Prisoner’s Dilemma with the values T = 1.4, R = 1, P = 0.01, and S = 0, where Blue, Red, Green, and Yellow denote cooperators, defectors, new cooperators, and new defectors respectively [Nowak and May (1993)]. Panels show Generations 10, 40, 4000, and 6000.
2.3.1. Evolutionarily stable strategy

Just like the Nash equilibrium in traditional game theory, the Evolutionarily Stable Strategy (ESS) is an important concept used in the theoretical analysis of evolutionary games. According to Maynard Smith (1982), an ESS is a strategy such that, if all the members of a population adopt it, then no mutant strategy could invade the population under the influence of natural selection. ESS can be seen as an equilibrium refinement of the Nash equilibrium. Suppose that a player in a game can choose between two strategies: I and J. Let E(J, I) denote the payoff he receives if he chooses the strategy J while all other players choose I. Then, the strategy I is evolutionarily stable if either

(1) E(I, I) > E(J, I), or
(2) E(I, I) = E(J, I) and E(I, J) > E(J, J)

is true for all J ≠ I [Maynard Smith and Price (1973); Maynard Smith (1982)]. Thomas (1985) rewrites the definition of ESS in a different form. Following the terminology given in the first definition above, we have

(1) E(I, I) ≥ E(J, I), and
(2) E(I, J) > E(J, J) whenever E(I, I) = E(J, I).

From this alternative form of the definition, we find that the set of ESSs is just a subset of the Nash equilibria. The benefit of this refinement of Nash equilibrium is not just to eliminate weak Nash equilibria, but to provide an efficient mathematical tool for dynamic games.

Following the concept of ESS, two approaches to evolutionary game theory have been developed. The first approach directly applies the concept of ESS to analyze static games. The second approach simulates the evolutionary process of dynamic games by constructing a dynamic model, which may take into consideration the factors of the population, replication dynamics, and strategy fitness.

As an example of using ESS in static games, consider the Hawk-Dove game. Two types of animals employ different means to obtain resources (a favorable habitat, for example): Hawk always fights for a resource while Dove never fights.
Let V denote the value of the resource, which can be considered the Darwinian fitness of an individual obtaining the resource, as described by Maynard Smith (1982). Let E(H, D) denote the payoff to a Hawk against a Dove opponent. If we assume that (1) whenever two Hawks meet, conflict eventually results and the two individuals are equally likely to be injured, (2) the cost of the conflict reduces individual fitness by some constant value C, (3) when a Hawk meets a Dove, the Dove immediately retreats and the Hawk obtains the resource, and (4) when two Doves meet the resource is shared equally between them, the payoff matrix for the Hawk-Dove game looks like this:

                  Hawk                    Dove
    Hawk   ((V-C)/2, (V-C)/2)           (V, 0)
    Dove         (0, V)               (V/2, V/2)
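The two ESS conditions of Maynard Smith given earlier can be checked mechanically against this matrix. The Python sketch below is an illustration only; the function names `hawk_dove_payoffs` and `is_pure_ess` and the numerical values of V and C are chosen here for the example and do not come from the text.

```python
def hawk_dove_payoffs(V, C):
    """Payoff E[(own, other)] to a player using `own` against `other`."""
    return {
        ("H", "H"): (V - C) / 2,
        ("H", "D"): V,
        ("D", "H"): 0,
        ("D", "D"): V / 2,
    }


def is_pure_ess(I, E, strategies=("H", "D")):
    """Check conditions (1) and (2) for pure strategy I against every J != I."""
    for J in strategies:
        if J == I:
            continue
        strict = E[(I, I)] > E[(J, I)]
        neutral = E[(I, I)] == E[(J, I)] and E[(I, J)] > E[(J, J)]
        if not (strict or neutral):
            return False
    return True
```

With the illustrative values V = 4 and C = 2, `is_pure_ess("H", ...)` returns `True` and `is_pure_ess("D", ...)` returns `False`; with V = 2 and C = 4, neither pure strategy passes the check.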
Then, it is easy to verify that the strategy Dove is not an ESS because E(D, D) < E(H, D), which means that a pure population of Doves can be invaded by a Hawk mutant. When the value V of the resource is greater than the cost C of injury, the strategy Hawk is an ESS because E(H, H) > E(D, H), which means that a Dove mutant cannot invade a group of Hawks. If V < C, the Hawk-Dove game becomes the game of Chicken, which originated from the 1955 movie Rebel Without a Cause. Neither pure Hawk nor pure Dove is an ESS in this game. However, there is an ESS if mixed strategies are permitted [Bishop and Cannings (1978)].

An evolutionarily stable state is a dynamical property of a population: the population returns to using a strategy, or mix of strategies, if it is perturbed from that strategy, or mix of strategies [Maynard Smith (1982)]. A population playing an ESS must be evolutionarily stable because it is impossible for any mutant to invade it. Many biologists and sociologists attempt to explain animal and human behavior and social structures in terms of ESS [Cohen and Machalek (1988); Mealey (1995)]. However, a dynamic game does not necessarily converge to a stable state in which an ESS is prevalent. For example, using a spatial model in which each individual plays the Prisoner’s Dilemma with his or her neighbors, Nowak and May (1992, 1993) show that the result of the game depends on the specific form of the payoff matrix.

Now imagine a population of players in a society where each one has to play the Prisoner’s Dilemma with another, and whether or not one can survive and breed is determined by one’s payoff in the game. How will the population evolve? In order to show the evolutionary process of the population, a model of dynamics that takes time t into consideration is needed.

2.3.2. Genetic algorithm

A genetic algorithm maintains a population of sample points from the search space. Each point is represented by a string of characters, known
as a genotype [Holland (1975, 1992, 1995)]. Given a fitness function to evaluate the points, a genetic algorithm initializes a population of solutions randomly and then improves it through repeated application of mutation, crossover, and selection operators.

The common methodology to study the evolutionary dynamics in games is through replicator equations. Replicator equations usually assume infinite populations, continuous time, complete mixing, and that strategies breed true [Taylor (1979); Maynard Smith (1982); Weibull (1995); Hofbauer and Sigmund (1998)]. Originating in biology and then introduced into evolutionary game theory by Taylor and Jonker (1978), replicator equations provide a continuous dynamic model for evolutionary games. Consider a population of n types of strategies, and let $x_i$ be the frequency of type i. Let A be the n × n payoff matrix. With the assumptions that the population is infinitely large, strategies are completely mixed, and the $x_i$ are differentiable functions of time t, a strategy’s fitness, or expected payoff, can be written as $(Ax)_i$ if strategies meet one another randomly. The average fitness of the population as a whole can be written as $x^T A x$. Then, the replicator equation is

$$\dot{x}_i = x_i \left( (Ax)_i - x^T A x \right) \tag{2.1}$$
Evolutionary games with a replicator dynamic as described in (2.1) converge to a state in which strategies with high fitness flourish in the population. For the Prisoner’s Dilemma, the expected fitness values of the strategies Cooperate and Defect, $E_C$ and $E_D$ respectively, are

$$E_C = x_C R + x_D S, \qquad E_D = x_C T + x_D P \tag{2.2}$$

where $x_C$ and $x_D$ denote the proportions of the strategies Cooperate and Defect in the population respectively. Let $\bar{E}$ denote the average fitness of the entire population, so that

$$\bar{E} = x_C E_C + x_D E_D \tag{2.3}$$

Then, the replicator equations for this game are

$$\frac{dx_C}{dt} = x_C (E_C - \bar{E}), \qquad \frac{dx_D}{dt} = x_D (E_D - \bar{E}) \tag{2.4}$$

Since $T > R$ and $P > S$, $E_D - E_C = x_C (T - R) + x_D (P - S) > 0$ holds, and therefore $E_D > \bar{E} > E_C$. It follows that $dx_C/dt < 0$ and $dx_D/dt > 0$. This means that the number of the strategies of Cooperate will always decline while the number of the strategies of Defect increases as the
game goes on. Sooner or later, the proportion of the population choosing the strategy Cooperate will, in theory, fall to zero. Besides replicator dynamics, there exist other types of dynamics equations that can be used in modeling evolutionary systems [Akin (1993); Thomas (1985); Bomze (1998, 2002); Balkenborg and Schlag (2000); Cressman, Garay and Hofbauer (2001); Weibull (1995); Hofbauer (1996); Gilboa and Matsui (1991); Matsui (1992); Fudenberg and Levine (1998); Skyrms (1990); Swinkels (1993); Smith and Gray (1994)]. Lindgren (1995) and Hofbauer and Sigmund (2003) have given comprehensive reviews of them.

In general, dynamic games are of great complexity. How an evolutionary system evolves depends not only on the population and dynamic structures but also on where the evolution starts. Because of dynamic interactions between multiple players, especially players with intelligence, genetic algorithms may converge towards local optima rather than the global optimum. Also, operating on dynamic data sets is difficult, as genomes begin to converge early on towards solutions which may no longer be valid for later data [Michalewicz (1999); Schmitt (2001)]. Analysis of evolutionary dynamic systems is not just a problem of evolutionary game theory, but a new direction in applied mathematics [Garay and Hofbauer (2003); Gaunersdorfer (1992); Gaunersdorfer, Hofbauer, and Sigmund (1991); Hofbauer (1981, 1984, 1996); Krishna and Sjöström (1998); Plank (1997); Smith (1995); Zeeman (1993); Zeeman and Zeeman (2002, 2003)].
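The decline of cooperators predicted by the replicator equations (2.2) to (2.4) for the two-strategy Prisoner’s Dilemma can be seen by integrating them numerically. The Python sketch below uses a simple forward Euler scheme; the function name `replicator_pd`, the step size, the number of steps, and the payoff values in the usage note are illustrative assumptions, not part of the original analysis.

```python
def replicator_pd(x_c, T, R, P, S, dt=0.01, steps=1000):
    """Euler integration of the two-strategy replicator equations.

    x_c is the initial proportion of Cooperate; the proportion of Defect
    is 1 - x_c. Returns the list of x_c values over time.
    """
    trajectory = [x_c]
    for _ in range(steps):
        x_d = 1.0 - x_c
        e_c = x_c * R + x_d * S          # expected fitness of Cooperate
        e_d = x_c * T + x_d * P          # expected fitness of Defect
        e_bar = x_c * e_c + x_d * e_d    # average fitness of the population
        x_c += dt * x_c * (e_c - e_bar)  # Euler step for dx_C/dt
        trajectory.append(x_c)
    return trajectory
```

For instance, starting from 90% cooperators with the commonly used illustrative payoffs T = 5, R = 3, P = 1, S = 0, the proportion of cooperators decreases at every step and is close to zero at the end of the run, in line with the argument that $dx_C/dt < 0$ throughout.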
2.3.3. Strategies

Which strategies should be involved in evolutionary dynamics is a difficult question. One approach is to consider a large number of representative strategies, for example Axelrod (1984), Dacey and Pendegraft (1988), and Akimov and Soutchanski (1994), since it is impossible to enumerate all possible strategies. However, it is difficult to say which strategies should be included and which should not, and there is little comparability between evolutionary processes with different strategies because the selection of strategies may have great influence on the outcome of the dynamics. Another approach is to study the interactions between specific strategies, for example Nowak and Sigmund (1990, 1992) and Goldstein and Freeman (1990). In this way, it is convenient to make clear the relationship between strategies in the evolutionary process; however, some of the generality of complex evolutionary systems is lost.
Strategies in PD games (or in non-PD games) can be characterized as either deterministic or stochastic. Deterministic strategies leave nothing to chance and respond to the opponent with predetermined actions; stochastic strategies, however, leave some uncertainty in their choices. Oskamp (1971) presents a thorough review of the early studies on the strategies involved in PD games and non-PD games, for example AllD, TFT, and many stochastic strategies that play C or D with certain probabilities [Lave (1965); Bixenstine, Potash, and Wilson (1963); Solomon (1960); Crumbaugh and Evans (1967); Wilson (1969); Oskamp and Perlman (1965); Sermat (1967); Heller (1967); Knapp and Podell (1968); Lynch (1968); Swingle and Coady (1967); Whitworth and Lucker (1969)].

After Axelrod’s IPD tournament, memory-one strategies, which interact with the opponent according to both sides’ behavior in the previous move, became prevalent. TFT, Pavlov, Grim Trigger, and many other memory-one strategies have been analyzed in a variety of environments: round-robin tournaments and evolutionary dynamics with or without noise [Nowak and Sigmund (1990, 1992, 1993); Pollock (1989); Wedekind and Milinski (1996); Milinski and Wedekind (1998); Sigmund (1995); Stephens (2000); Stephens, Mclinn and Stevens (2002); Sandholm and Crites (1996); Doebeli and Knowlton (1998); Brauchli, Killingback and Doebeli (1999); Sasaki, Taylor and Fudenberg (2000)].

No strategy has been shown to be superior in a dynamic environment, and even deterministic cooperators can invade defectors in specific circumstances. It is not sensible to discuss which strategy is best unless the context is defined. Comparing TFT with GTFT, Grim (1995) suggests that, in the non-stochastic Axelrod models, it is TFT that is the general winner; within a purely stochastic model, the greater generosity of GTFT pays off; in a model with both stochastic and spatial elements, a level of generosity twice that of GTFT proves optimal.
Pavlov has an obvious advantage over TFT in noisy environments [Nowak and Sigmund (1993); Kraines and Kraines (1995)]. In an evolutionary process where AllC, AllD, TFT, and GTFT strategies are involved, evolution starts off toward defection but then veers toward cooperation. TFT strategies play a key role in invading the population of defectors. However, GTFT strategies and then more generous AllCs gradually become dominant once cooperation is widely established, and this provides an opportunity to AllD to invade again [Nowak and Sigmund (1992)]. Additionally, Selten and Stoecker (1986) have studied the end game behavior in finite IPD supergames, and find that cooperative behaviors last until shortly before the end of the supergame.
Machine learning approaches have been introduced into evolutionary game theory to develop adaptive strategies, especially those for IPD games [Carmel and Markovitch (1996, 1997, 1998); Littman (1994); Tekol and Acan (2003); Hingston and Kendall (2004)]. Adaptive strategies, at least in theory, have obvious advantages over fixed strategies. Among the set of adaptive strategies, there may be an evolutionarily stable strategy for IPD games and a potential winner of future IPD tournaments.

2.3.4. Population

Population size and structure are of great importance in evolutionary dynamics. In general, evolutionary processes in a large population are quite different from those in small populations [Maynard Smith (1982); Fogel and Fogel (1995); Fogel, Fogel and Andrew (1997, 1998); Ficici and Pollack (2000)]. Young and Foster (1991) have studied stochastic effects in a population consisting of three strategies: AllD, AllC, and TFT. They show that the outcome of the evolutionary process depends crucially on the amount of noise, which is inversely proportional to the population size. The more people there are, the more the random variations in their behavior are smoothed out in the population proportions. For large populations, the system tends to drift from TFT to AllC, which is then invaded by AllD. As a result, most of the players behave as AllD, even though initially most players may have started as TFT. They conclude that cooperation is viable in the short run, but not stable in the long run in a large population.

Boyd and Richerson (1988, 1989) suggest that reciprocity is unlikely to evolve in large groups as a result of natural selection because reciprocators punish defection by withholding future cooperation, which also penalizes other cooperators in the group. Boyd and Richerson (1990, 1992) analyze a model in which the punishment response to defection is directed solely at defectors.
In this model, cooperation reinforced by retribution can lead to the evolution of cooperation in different ways. It is possible that strategies which cooperate and punish defectors, strategies which cooperate only if punished, and strategies which cooperate but do not punish coexist in the long run, and it is also possible that only one type persists. As the group size grows larger, however, the conditions under which cooperators survive become harder to satisfy. Glance and Huberman (1994) discuss how to achieve cooperation in groups of various sizes in n-person PD games and find that there are two
stable points in large groups: either there is a great deal of cooperation or very little. Cooperation is more likely in smaller groups than in larger ones, and there is greater cooperation when players are allowed more communication with each other. Large random fluctuations are related to group size. Groups beyond a certain size may experience increased difficulty of informational exchange and coordination; further, reneging on contracts is likely to be prevalent, as each member may expect that the effect of his/her action on other members will be diluted.

However, Dugatkin (1990) finds that cooperation may invade large populations more easily than smaller ones, but it is likely to represent a smaller proportion of the population in larger groups. In order to consider the potential importance of the relationship between population size and cooperative behaviour, two N-person game theoretical models are presented. The results show that cooperation is frequently not a pure evolutionarily stable strategy, and that many metapopulations should be polymorphic for both cooperators and defectors.

It is well accepted that communication among members of a society leads to more cooperative behaviors [Insko et al. (1987); Orbell, Kragt, and Dawes (1988)]. Insko et al. (1987, 1988, 1990, 1993) explore the role of communication in interindividual-intergroup discontinuity in the context of the extended PD game, which adds a third, withdrawal choice to the usual cooperative and uncooperative choices. Interindividual-intergroup discontinuity is the tendency of intergroup relations to be more competitive and less cooperative than interindividual relations. The lesser tendency of individuals to cooperate when there is no communication with the opponent partially explains the group discontinuity.

Choice and refusal of partners may accelerate the emergence of cooperation.
Experiments have shown that people who are given the option of playing or not are more likely to choose to play if they are themselves planning to cooperate. More cooperative players are more likely to anticipate that others will be cooperative [Orbell and Dawes (1993)]. Defecting players are likely to be alienated by cooperators [Schuessler (1989); Kitcher (1992); Batali and Kitcher (1994)]. In the N-person PD game, players may be able to change groups if they are not satisfied with the size of their groups [Hirshleifer and Rasmusen (1989)]. The option of choice and refusal of partners in IPD means that players will attempt to select partners rationally. Analytical studies reveal that the subtle interplay between choice and refusal in N-player IPD games can result in various long-run player interaction patterns: mutual cooperation; mixed mutual cooperation and mutual defection; parasitism; and wallflower seclusion. Simulation studies
indicate that choice and refusal can accelerate the emergence of cooperation in evolutionary IPD games [Stanley, Ashlock, and Tesfatsion (1994); Stanley, Ashlock and Smucker (1995)]. The effects on evolution of freedom to play, reciprocity and interchange, coalitions and alliances, and various group sizes have also been studied [Orbell and Robyn (1993); Alexander and Frans (1992); Glance and Bernardo (1994); Hemelrijk (1991)]. In a specific scenario, the prestructuration of the population may determine the evolution of the patterns of interaction that constitute the final social structure [Eckert, Koch, and Mitlöhner (2005)].

2.3.5. Selection scheme

Evolutionary selection schemes can be characterized as either generational or steady-state schemes [Thierens (1997)]. In generational schemes, which are widely used in evolutionary game theory, each generation of a population is replaced in one step by a new generation. In a system with a steady-state scheme, only a small percentage of the population is replaced in each generation. Evolutionary selection schemes can be further subdivided into pure and elitist selection schemes, according to whether or not there is an overlap between successive generations. Pure selection schemes allow no overlap between successive generations: all parents from the previous generation are discarded and the next generation is filled entirely with offspring from these parents. In elitist schemes, successive generations may overlap: parents with higher fitness are transferred to the next generation and only poorly performing parents are replaced [Mitchell (1996)].

Pure selection schemes are commonly used in IPD research [Axelrod (1987); Axelrod and Dion (1988); Huberman and Glance (1993); Akimov and Soutchanski (1994); Mill (1996)]. These schemes use fitness-proportional selection of the parents in combination with single-point crossover or use a random uniform simple set to select the fittest agent to produce offspring.
A robust society of cooperators emerges only if the level of competition between the players is neither too small nor too large. In elitist selection schemes, the population is first shuffled randomly and partitioned into pairs of parents. Then, each pair of parents creates two offspring, and a local competition between parents and their offspring is held. Finally, the best two of the four players in each family (two parents and their two offspring) are transferred to the next generation [Thierens and Goldberg (1994)]. In this case, stable societies of highly cooperative players evolve. This shows that a suitable model of the selection process is of crucial importance in terms of simulating
Iterated Prisoner’s Dilemma and Evolutionary Game Theory
real-world economic situations [Ficici, Melnik, and Pollack (2000); Bragt, Kemenade and Poutré (2001)]. Selection is clearly an important genetic operator, but opinion is divided over the importance of crossover versus mutation. Some argue that crossover is the most important, while mutation is only necessary to ensure that potential solutions are not lost [Grefenstette, Ramsey and Schultz (1990); Wilson (1987)]. Others argue that crossover in a largely uniform population only serves to propagate innovations originally found by mutation, and in a non-uniform population crossover is nearly always equivalent to a very large mutation [Spears (1992)].
2.4. Evolution of Cooperation
A fundamental problem in evolutionary game theory is to explain how cooperation can emerge in a population of self-interested individuals. Axelrod (1984, 1987) attributes the emergence of cooperation to the “shadow of the future”: the likelihood and importance of future interaction. This implies that the rewards from cooperation are a mutually expected payoff, so that cooperating is a rational choice for self-interested individuals [Martinez-Coll and Hirshleifer (1991)]. Axelrod’s work has been subjected to a number of criticisms because his conclusions conflict with traditional game theory [Binmore (1994, 1998)]. One example is Nachbar’s criticism that “Axelrod mistakenly ran an evolutionary simulation of the finitely repeated Prisoners’ Dilemma. Since the use of a Nash equilibrium in the finitely repeated Prisoners’ Dilemma necessarily results in both players always defecting, we then wouldn’t need a computer simulation to know what would survive if every strategy were present in the initial population of entries. The winning strategies would never co-operate.” [Nachbar (1992)]. There are also arguments that the conflict stems from the assumption of Von Neumann-Morgenstern utility.
According to Spiro (1988), the problem with Axelrod’s argument is the oft-discussed problem of interpersonal utility comparison. Axelrod’s argument, and indeed all game-theoretic modeling, welfare economics, and utilitarian moral philosophy, would require that it be possible to measure and compare the utilities of different people. The problem with this assumption is that it is quite impossible to construct a scale of measurement for human preferences [Rothbard (1997)]. Although evolutionary game theory is aimed primarily towards dynamic games, while traditional game theory deals with non-dynamic games, there are still areas of intersection, for instance in the field of repeated games.
Furthermore, although evolutionary game theory mainly depends on experiments and computer simulations, its theoretical foundations, i.e. individual utility (or preference) and payoff-maximizing, stem from traditional game theory. Controversies about Axelrod’s work reflect the bifurcation between evolutionary approaches and the basic assumptions of game theory. Based on the assumption of “rational players”, traditional game theory regards a finite repeated game as a combination of many singleton games. “Backward induction” is applied in order to dissect the link between these singleton games, and then each of them can be analyzed statically [Harsanyi and Selten (1988)]. The concept of backward induction was first employed by Von Neumann and Morgenstern (1944) and then developed by Selten (1965, 1975) based on Nash equilibrium. First, one determines the optimal strategy of the player who makes the last move of the game. Then, the optimal action of the next-to-last moving player is determined taking the last player’s action as given. The process continues in this way backwards through time until all players’ actions have been determined. Subgame perfect Nash equilibrium deduced directly from backward induction is an equilibrium such that players’ strategies constitute a Nash equilibrium in every subgame of the original game [Aumann (1995)]. Selten proved that any game which can be broken into “sub-games” containing a sub-set of all the available choices in the main game will have a subgame perfect Nash equilibrium. In the case of a finite number of iterations in IPD games, the unique subgame perfect Nash equilibrium is AllD. However, many psychological and economic experiments have shown that subjects would not necessarily apply a strategy like AllD [Kahn and Murnighan (1993); McKelvey and Palfrey (1992); Cooper et al. (1996)]. 
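The backward induction argument for the finitely repeated Prisoner's Dilemma can be traced with a small Python sketch. It rests on assumptions that are not part of the text: the payoff values T=5, R=3, P=1, S=0 are illustrative, and the names `stage_equilibria` and `backward_induction` are hypothetical. The key step is that, once all later rounds have been solved, their outcome is fixed, so the continuation payoff is a constant that cannot change the best responses in the current round.

```python
# Stage-game payoffs to (row, column); T=5, R=3, P=1, S=0 are assumed values.
T, R, P, S = 5, 3, 1, 0
ACTIONS = ("C", "D")
PAYOFF = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}

def stage_equilibria(continuation):
    """Pure Nash equilibria of one round, given fixed continuation payoffs."""
    def total(profile):
        u1, u2 = PAYOFF[profile]
        return u1 + continuation[0], u2 + continuation[1]

    equilibria = []
    for a1 in ACTIONS:
        for a2 in ACTIONS:
            u1, u2 = total((a1, a2))
            # Each player's action must be a best response to the other's.
            best1 = all(u1 >= total((d, a2))[0] for d in ACTIONS)
            best2 = all(u2 >= total((a1, d))[1] for d in ACTIONS)
            if best1 and best2:
                equilibria.append((a1, a2))
    return equilibria

def backward_induction(n_rounds):
    """Solve the last round first, then work backwards to round 1."""
    continuation = (0, 0)  # nothing follows the final round
    plan = []
    for _ in range(n_rounds):
        (profile,) = stage_equilibria(continuation)  # unique in the PD
        plan.append(profile)
        u1, u2 = PAYOFF[profile]
        continuation = (continuation[0] + u1, continuation[1] + u2)
    plan.reverse()  # plan[0] is now round 1
    return plan, continuation

plan, totals = backward_induction(10)
```

With these assumed payoffs, every round of `plan` is mutual defection, which is the AllD outcome described in the text; the experiments cited there are of interest precisely because human subjects often depart from it.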
Game theorists explain these experimental results in terms of incomplete information, reputation, and bounded rationality, which are all based on theoretical analysis [Harsanyi (1967); Kreps et al. (1982); Simon (1990); Bolton (1991); Bolton and Ockenfels (2000); Binmore et al. (2002); Samuelson (2001)]. In some sense, Axelrod’s work parallels these explanations, although his approach is entirely different. Until a sound theoretical explanation is established, the problem of how cooperation emerges remains unsolved. As to the problem of how cooperation can persist during evolution, sufficient evidence has been provided to support the point that cooperation can survive and flourish in a wide range of circumstances, provided that certain conditions are satisfied. Nowak and Sigmund (1990) have shown that cooperation can emerge among a population of randomly chosen reactive strategies, as long as a stochastic version of TFT is added to the population. If cooperators
can recognize each other with the help of some label, they can increase their payoff by interacting selectively with one another [Frank (1988)]. Social norms aid cooperation in many ways [Bendor and Mookherjee (1990); Kandori (1992); Sethi and Somanathan (1996)]. As to the influence of payoff variations, Mueller (1988) finds that payoff settings with increasing values of T relative to P promote cooperative behaviour, whereas Fogel (1993) finds that smaller values for T promote the evolution of cooperative behaviour. Nachbar (1992) selects a payoff setting strongly favouring the relative reward of cooperating and finds that this setting elicits an increased degree of cooperation. Kirchkamp (1995) finds that the value of S becomes less important with longer memory. Also, the effects of population structure, repetition, and noise have been studied [Hirshleifer and Coll (1988); Mueller (1988); Boyd (1989); Marinoff (1992); Hoffmann (2001)]. To end, we note that Binmore (1998) stated: “. . .One simply cannot get by without learning the underlying theory. Without any knowledge of the theory, one has no way of assessing the reliability of a simulation and hence no idea of how much confidence to repose in the conclusions that it suggests”.
There is still a need for an underlying theory for IPD tournaments. Evolutionary game theory has provided us with many experimental approaches; however, better theoretical explanations are still needed. Even though IPD tournaments have been run for over 40 years, we suspect there will be more as we search for new strategies and new theories which explain the complex interactions that take place. Finally, this review has been restricted to the IPD literature. Even so, we have not been able to include every article and there are, no doubt, omissions. However, we hope that this chapter has provided enough information for the interested reader to follow up on.
References
Akimov V. and Soutchanski M. (1994) Automata simulation of N-person social dilemma games, Journal of Conflict Resolution, 38, pp. 138-148.
Akin E. (1993) The general topology of dynamical systems, American Mathematical Society, Providence.
Akiyama E. and Kaneko K. (1995) Evolution of cooperation, differentiation, complexity and diversity in an iterated three-person game, Artificial Life, 2, pp. 293-304.
Alexander H. and Frans B. (1992) Coalitions and Alliances in Humans and Other Animals. Oxford: Oxford University Press.
Anthonisen N. (1999) Strong rationalizability for two-player noncooperative games, Economic Theory, 13, pp. 143-169.
Aumann R. (1995) Backward Induction and Common Knowledge of Rationality, Games and Economic Behavior, 18, pp. 6-19.
Axelrod R. (1980a) Effective choice in the prisoner’s dilemma, Journal of Conflict Resolution, 24, pp. 3-25.
Axelrod R. (1980b) More effective choice in the prisoner’s dilemma, Journal of Conflict Resolution, 24, pp. 379-403.
Axelrod R. M. (1984) The Evolution of Cooperation (BASIC Books, New York).
Axelrod R. (1987) The evolution of strategies in the iterated prisoner’s dilemma, In Davis L., Genetic Algorithms and Simulated Annealing, pp. 32-41.
Axelrod R. (1999) The Complexity of Cooperation: Agent-based Models of Competition and Collaboration. University Press, Princeton, NJ.
Axelrod R. and Dion D. (1988) The further evolution of cooperation, Science, 242, pp. 1385-1390.
Axelrod R. and Hamilton W. (1981) The evolution of cooperation, Science, 211, 4489, pp. 1390-1396.
Balkenborg D. and Schlag K. (2000) Evolutionarily stable sets, International Journal of Game Theory, 29, pp. 571-595.
Batali J. and Kitcher P. (1994) Evolutionary dynamics of altruistic behaviour in optional and compulsory versions of the iterated prisoner’s dilemma, In Rodney A. and Maes P. Artificial Life IV. MIT Press, pp. 343-348.
Beaufils B., Delahaye J., and Mathieu P. (1996) Our meeting with gradual: A good strategy for the iterated prisoner’s dilemma, Proceedings of the Artificial Life V, pp. 202-209.
Becker N. and Cudd A. (1990) Indefinitely repeated games: a response to Carroll, Theory and Decision, 28, pp. 189-195.
Bendor J. and Mookherjee D. (1990) Norms, third-party sanctions, and cooperation, Journal of Law, Economics, and Organization, 6, pp. 33-63.
Bendor R., Kramer M., and Stout S. (1991) When in doubt: cooperation in a noisy prisoner’s dilemma, Journal of Conflict Resolution, 35, pp. 691-719.
Ben-porath E. (1990) The complexity of computing a best response automaton in repeated games with mixed strategies, Games and Economic Behavior, 2, pp. 1-12.
Berry D. and Fristedt B. (1985) Bandit problems: sequential allocation of experiments. Chapman and Hall, London.
Binmore K. (1992) Fun and games. Lexington, MA: D.C. Heath and Company.
Binmore K. (1994) Playing fair game theory and the social contract I. MIT Press.
Binmore K. (1997) Rationality and backward induction, Journal of Economic Methodology, 4, pp. 23-41.
Binmore K. (1998) Review of R. Axelrod’s ‘The complexity of cooperation: agent based models of competition and collaboration’, Journal of Artificial Societies and Social Simulation, 1, 1.
Binmore K., McCarthy J., Ponti G., Samuelson L. and Shaked A. (2002) A backward induction experiment, Journal of Economic Theory, 104, pp. 48-88.
Bishop D. and Cannings C. (1978) A generalized war of attrition, Journal of Theoretical Biology, 70, pp. 85-124.
Bixenstine V., Potash H., and Wilson K. (1963) Effects of level of cooperative choice by the other player on choices in a Prisoner’s Dilemma game, Journal of Abnormal and Social Psychology, 66, pp. 308-313.
Bolton G. (1991) A comparative model of bargaining: theory and evidence, The American Economic Review, 81, 5, pp. 1096-1136.
Bolton G. and Ockenfels A. (2000) ERC: a theory of equity, reciprocity, and competition, The American Economic Review, 90, pp. 166-193.
Bomze I. (1998) Uniform barriers and evolutionarily stable sets, Game Theory, Experience, Rationality, pp. 225-244.
Bomze I. (2002) Regularity vs. degeneracy in dynamics, games, and optimization: a unified approach to different aspects, SIAM Review, 44, pp. 394-414.
Boyd R. (1989) Mistakes allow evolutionary stability in the repeated prisoner’s dilemma game, Journal of Theoretical Biology, 136, 11, pp. 47-56.
Boyd R. (1992) The evolution of reciprocity when conditions vary, Harcourt A. and Frans B. (eds.) Alliance formation among male baboons: shopping for profitable partners. Oxford: Oxford University Press, pp. 473-489.
Boyd R. and Lorberbaum J. (1987) No pure strategy is evolutionarily stable in the repeated Prisoner’s Dilemma game, Nature, 327, pp. 58-59.
Boyd R. and Richerson P. (1988) The evolution of reciprocity in sizable groups, Journal of Theoretical Biology, 132, pp. 337-356.
Boyd R. and Richerson P. (1989) The evolution of indirect reciprocity, Social Networks, 11, pp. 213-236.
Boyd R. and Richerson P. (1990) Group selection among alternative evolutionarily stable strategies, Journal of Theoretical Biology, 145, pp. 331-342.
Boyd R. and Richerson P. (1992) Punishment allows the evolution of cooperation (or anything else) in sizable groups, Ethology and Sociobiology, 13, pp. 171-195.
Bovens L. (1997) The backward induction argument for the finite iterated prisoner’s dilemma and the surprise exam paradox, Analysis, 57, 3, pp. 179-186.
Bragt D., Kemenade C. and Poutré H. (2001) The influence of evolutionary selection schemes on the iterated prisoner’s dilemma, Computational Economics, 17, pp. 253-263.
Brauchli K., Killingback T. and Doebeli M. (1999) Evolution of cooperation in spatially structured populations, Journal of Theoretical Biology, 200, pp. 405-417.
Brelis M. (1992) Reputed mobster defends his honor. Boston Globe, 1, pp. 23.
Bunn G. and Payne R. (1988) Tit-for-tat and the negotiation of nuclear arms control, Arms Control, 9, pp. 207-233.
Carmel D. and Markovitch S. (1996) Learning models of intelligent agents, Proceedings of the 13th National Conference on Artificial Intelligence and the 8th Innovative Applications of Artificial Intelligence Conference, 2, pp. 62-67.
Carmel D. and Markovitch S. (1997) Model-based learning of interaction strategies in multi-agent systems, Journal of Experimental and Theoretical Artificial Intelligence, 10, 3, pp. 309-332.
Carmel D. and Markovitch S. (1998) How to explore your opponent’s strategy (almost) optimally, Proceedings of the International Conference on Multi Agent Systems, pp. 64-71.
Cohen L. and Machalek R. (1988) A general theory of expropriative crime: an evolutionary ecological approach, American Journal of Sociology, 94, 3, pp. 465-501.
Cooper R., Jong D., Forsythe R., and Ross T. (1996) Cooperation without reputation: experimental evidence from prisoner’s dilemma games, Games and Economic Behavior, 12, 2, pp. 187-218.
Cressman R., Garay J. and Hofbauer J. (2001) Evolutionary stability concepts for N-species frequency-dependent interactions, Journal of Theoretical Biology, 211, pp. 1-10.
Croson R. (2000) Thinking like a game theorist: Factors affecting the frequency of equilibrium play, Journal of Economic Behavior and Organization, 41, 3, pp. 299-314.
Crumbaugh C. and Evans G. (1967) Presentation format, other-person strategies, and cooperative behaviour in the prisoner’s dilemma, Psychological Reports, 20, pp. 895-902.
Dacey R. and Pendegraft N. (1988) The optimality of Tit-For-Tat, International Interactions, 15, pp. 45-64.
Darwen P. and Yao X. (1995) On evolving robust strategies for iterated prisoner’s dilemma, Progress in Evolutionary Computation, volume 956 in Lecture Notes in Artificial Intelligence, Springer, pp. 276-292.
Darwen P. and Yao X. (1996) Automatic modularization by speciation, IEEE International Conference on Evolutionary Computation, pp. 88-93.
Darwen P. and Yao X. (2001) Why more choices cause less cooperation in Iterated Prisoner’s Dilemma, Proceedings of the 2001 IEEE Congress on Evolutionary Computation.
Darwen P. and Yao X. (2002) Coevolution in iterated prisoner’s dilemma with intermediate levels of cooperation: Application to missile defense, International Journal of Computational Intelligence and Applications, 2, 1, pp. 83-107.
Davis D. and Holt C. (1999) Equilibrium cooperation in two-stage games: Experimental evidence, International Journal of Game Theory, 28, 1, pp. 89-109.
Delahaye J. and Mathieu P. (1996) Etude sur les dynamiques du Dilemme Itéré des Prisonniers avec un petit nombre de stratégies : Y a-t-il du chaos dans le Dilemme pur?, Publication Interne IT-294, Laboratoire d’Informatique Fondamentale de Lille.
Doebeli M., Blarer A., and Ackermann M. (1997) Population dynamics, demographic stochasticity, and the evolution of cooperation, Proceedings of the National Academy of Sciences of the USA, 94, pp. 5167-5171.
Doebeli M. and Knowlton N. (1998) The evolution of interspecific mutualisms, Proceedings of the National Academy of Sciences, 95, 15, pp. 8676-8680.
Donninger C. (1986) Is it always efficient to be nice?, In Paradoxical effects of social behavior, edited by Dickmann A. and Mitter P., Heidelberg, Germany: Physica Verlag, pp. 123-134.
Dugatkin L. (1989) N-person games and the evolution of cooperation: a model based on predator inspection in fish, Journal of Theoretical Biology, 142, pp. 123-135.
Dugatkin L. (1990) N-person Games and the Evolution of Co-operation: A Model Based on Predator Inspection in Fish, Journal of Theoretical Biology, 142, pp. 123-135.
Durrett R. and Levin S. (1998) Spatial aspects of interspecific competition, Theoretical Population Biology, 53, 1, pp. 30-43.
Eckert D., Koch S., and Mitlöhner J. (2005) Using the iterated prisoner’s dilemma for explaining the evolution of cooperation in open source communities, Proceedings of the First Conference on Open Source System, pp. 186-191.
Ficici S., Melnik O., and Pollack J. (2000) A game-theoretic investigation of selection methods used in evolutionary algorithms, Proceedings of the 2000 Congress on Evolutionary Computation, 2, pp. 880-887.
Ficici S. and Pollack J. (2000) Effects of finite populations on evolutionary stable strategies, Proceedings of the 2000 Genetic and Evolutionary Computation, pp. 927-934.
Fogel D. (1993) Evolving behaviors in the iterated prisoner’s dilemma, Evolutionary Computation, 1, 1, pp. 77-97.
Fogel D. and Fogel G. (1995) Evolutionary stable strategies are not always stable under evolutionary dynamics, Evolutionary Programming IV, pp. 565-577.
Fogel D., Fogel G., and Andrew P. (1997) On the instability of evolutionary stable strategies, BioSystems, 44, pp. 135-152.
Fogel G., Andrew P., and Fogel D. (1998) On the instability of evolutionary stable strategies in small populations, Ecological Modelling, 109, pp. 283-294.
Frank R. (1988) Passions within reason. The strategic role of the emotions, New York: W.W. Norton & Co.
Freund Y., Kearns M., Mansour Y., Ron D., Rubinfeld R., and Schapire R. (1995) Efficient algorithms for learning to play repeated games against computationally bounded adversaries, Proceedings of the Annual Symposium on the Foundations of Computer Science, pp. 332-341.
Fudenberg D. and Maskin E. (1986) The Folk Theorem in repeated games with discounting and incomplete information, Econometrica, 54, pp. 533-554.
Fudenberg D. and Maskin E. (1990) Evolution and cooperation in noisy repeated games, New Developments in Economic Theory, 80, pp. 274-279.
Fudenberg D. and Levine D. (1998) The theory of learning in games. MIT Press.
Garay B. and Hofbauer J. (2003) Robust permanence for ecological differential equations: minimax and discretizations, SIAM Journal on Mathematical Analysis, 34, pp. 1007-1093.
Gaunersdorfer A. (1992) Time averages for heteroclinic attractors, SIAM Journal on Applied Mathematics, 52, pp. 1476-1489.
Gaunersdorfer A., Hofbauer J., and Sigmund K. (1991) On the dynamics of asymmetric games, Theoretical Population Biology, 39, pp. 345-357.
Gilboa I. and Matsui A. (1991) Social stability and equilibrium, Econometrica, 59, pp. 859-867.
Gilboa I. and Schmeidler D. (2001) A theory of case-based decisions. Cambridge University Press.
Gittins J. (1989) Multi-armed bandit allocation indices. Wiley, Chichester, NY.
Glance N. and Huberman B. (1993) The outbreak of cooperation, Journal of Mathematical Sociology, 17, 4, pp. 281-302.
Glance N. and Huberman B. (1994) The dynamics of social dilemmas, Scientific American, 270, pp. 76-81.
Glomba M., Filak T., and Kwasnicka H. (2005) Discovering effective strategies for the iterated prisoner’s dilemma using genetic algorithms, 5th International Conference on Intelligent Systems Design and Applications, pp. 356-363.
Godfray H. (1992) The evolution of forgiveness, Nature, 355, pp. 206-207.
Goldstein J. and Freeman J. (1990) Three-Way Street: Strategic Reciprocity in World Politics. Chicago: University of Chicago Press.
Grefenstette J., Ramsey C., and Schultz A. (1990) Learning sequential decision rules using simulation models and competition, Machine Learning, 5, pp. 355-381.
Grim P. (1995) The greater generosity of the spatialized prisoner’s dilemma, Journal of Theoretical Biology, 173, pp. 242-248.
Grossman W. (2004) New tack wins Prisoner’s Dilemma, Wired News, Lycos.
Harborne S. (1997) Common belief of rationality in the finitely repeated prisoners’ dilemma, Games and Economic Behavior, 19, 1, pp. 133-143.
Hardin G. (1968) The tragedy of the commons, Science, 162, pp. 1243-1248.
Hargreaves H. and Varoufakis Y. (1995) Game theory: a critical introduction. Routledge, London.
Harsanyi J. (1967) Games with incomplete information played by Bayesian players, Management Science, 14, 3, pp. 159-182.
Harsanyi J. and Selten R. (1988) A General Theory of Equilibrium Selection in Games. Cambridge: MIT Press.
Hauser M. (1992) Costs of deception: cheaters are punished in rhesus monkeys (Macaca mulatta), Proceedings of the National Academy of Sciences, 89, pp. 12137-12139.
Heller J. (1967) The effects of racial prejudice, feedback strategy, and race on cooperative-competitive behaviour, Dissertation Abstracts, 27, pp. 2507-2508.
Hemelrijk C. (1991) Interchange of ‘Altruistic’ Acts as an Epiphenomenon, Journal of Theoretical Biology, 153, pp. 131-139.
Hingston P. and Kendall G. (2004) Learning versus evolution in iterated prisoner’s dilemma, Proceedings of Congress on Evolutionary Computation, pp. 364-372.
Hirshleifer J. and Coll J. (1988) What strategies can support the evolutionary emergence of cooperation?, Journal of Conflict Resolution, 32, 2, pp. 367-398.
Hirshleifer D. and Rasmusen E. (1989) Cooperation in a repeated prisoner’s dilemma with ostracism, Journal of Economic Behavior and Organization, 12, pp. 87-106.
Hofbauer J. (1981) On the occurrence of limit cycles in the Volterra-Lotka equation, Nonlinear Analysis, 5, pp. 1003-1007.
Hofbauer J. (1984) A difference equation model for the hypercycle, SIAM Journal on Applied Mathematics, 44, pp. 762-772.
Hofbauer J. (1996) Evolutionary dynamics for bimatrix games: a Hamiltonian system, Journal of Mathematical Biology, 34, pp. 675-688.
Hofbauer J. and Sigmund K. (1998) Evolutionary games and population dynamics. Cambridge University Press.
Hofbauer J. and Sigmund K. (2003) Evolutionary game dynamics, Bulletin of the American Mathematical Society, 40, pp. 479-519.
Hoffmann R. (2001) The ecology of cooperation, Theory and Decision, 50, pp. 101-118.
Holland J. (1975) Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor.
Holland J. (1992) Genetic algorithm, Scientific American, 267, 4, pp. 44-50.
Holland J. (1995) Hidden Order - How adaptation builds complexity, Reading, Mass.: Addison-Wesley.
Huberman B. and Glance N. (1993) Evolutionary games and computer simulations, Proceedings of the National Academy of Sciences, 90, pp. 7716-7718.
Ifti M., Killingback T., and Doebeli M. (2004) Effects of neighborhood size and connectivity on the spatial continuous prisoner’s dilemma, Journal of Theoretical Biology, 231, pp. 97-106.
Ikegami T. and Kaneko K. (1990) Computer symbiosis - emergence of symbiotic behavior through evolution, Physica D, 42, pp. 235-243.
Insko C., Pinkley R., Hoyle R., Dalton B., Hong G., Slim R., Landry P., Holton B., Ruffin P., and Thibaut J. (1987) Individual-group discontinuity: the role of intergroup contact, Journal of Experimental Social Psychology, 23, pp. 250-267.
Insko C., Hoyle R., Pinkley R., and Hong G. (1988) Individual-group discontinuity: the role of a consensus rule, Journal of Experimental Social Psychology, 24, pp. 505-519.
Insko C., Schopler J., Hoyle R., Dardis G., and Graetz K. (1990) Individual-group discontinuity as a function of fear and greed, Journal of Personality and Social Psychology, 58, pp. 68-79.
Insko C., Schopler J., Drigotas S., Graetz K., Kennedy J., Cox C., and Bornstein G. (1993) The role of communication in interindividual-intergroup discontinuity, Journal of Conflict Resolution, 37, pp. 108-138.
Kaelbling L. (1993) Learning in embedded systems. The MIT Press, Cambridge, MA.
Kaelbling L. and Moore A. (1996) Reinforcement learning: a survey, Journal of Artificial Intelligence Research, 4, pp. 237-285.
Kagel J. and Roth A. (1995) The Handbook of Experimental Economics. Princeton University Press.
Kahn L. and Murnighan J. (1993) Conjecture, uncertainty, and cooperation in Prisoners’ Dilemma games: Some Experimental Evidence, Journal of Economic Behavior and Organization, 22, pp. 91-117.
Kalai E. and Lehrer E. (1993) Rational learning leads to Nash equilibrium, Econometrica, 61, 5, pp. 1019-1045.
Kandori M. (1992) Social norms and community enforcement, The Review of Economic Studies, 59, 1, pp. 63-80.
Katok A. and Hasselblatt B. (1996) Introduction to the modern theory of dynamical systems. Cambridge ISBN 0521575575.
Kavka G. (1986) Hobbesean Moral and Political Theory. Princeton: Princeton University Press.
Kirchkamp O. (1995) Spatial Evolution of Automata in the Prisoners’ Dilemma. University of Bonn SFB 303, Discussion Paper B-330.
Kitcher P. (1992) Evolution of altruism in repeated optional games, Working Paper of University of California at San Diego.
Knapp W. and Podell J. (1968) Mental patients, prisoners, and students with simulated partners in a mixed-motive game, Journal of Conflict Resolution, 12, pp. 235-241.
Komorita S., Sheposh J., and Braver S. (1968) Power, the use of power, and cooperative choice in a two-person game, Journal of Personality and Social Psychology, 8, pp. 134-142.
Kraines D. and Kraines V. (1995) Evolution of learning among Pavlov strategies in a competitive environment with noise, The Journal of Conflict Resolution, 39, 3, pp. 439-466.
Kraines D. and Kraines V. (2000) Natural selection of memory-one strategies for the iterated Prisoner’s Dilemma, Journal of Theoretical Biology, 203, pp. 335-355.
Kreps D., Milgrom P., Roberts J., and Wilson R. (1982) Rational cooperation in the finitely repeated prisoner’s dilemma, Journal of Economic Theory, 27, pp. 245-252.
Kreps D. and Wilson R. (1982) Reputation and imperfect information, Journal of Economic Theory, 27, pp. 253-279.
Krishna V. and Sjöström T. (1998) On the convergence of fictitious play, Mathematics of Operations Research, 23, pp. 479-511.
Lave L. (1965) Factors affecting cooperation in the prisoner’s dilemma, Behavioral Science, 10, pp. 26-38.
Lindgren K. (1991) Evolutionary phenomena in simple dynamics, In Christopher G., et al. Santa Fe Institute Studies in the Sciences of Complexity. 10, pp. 295-312.
Lindgren K. (1992) Evolutionary phenomena in simple dynamics, In Langton C. (ed.) Artificial Life II. Addison-Wesley.
Lindgren K. (1995) Evolutionary dynamics in game-theoretic models, The economy as an evolving complex system II, Santa Fe Institute.
Littman M. (1994) Markov games as a framework for multiagent reinforcement learning, Proceedings of the 11th International Conference on Machine Learning, pp. 157-163.
Luce R. and Raiffa H. (1957) Games and decisions. New York: Wiley.
Lynch G. (1968) Defense preference and cooperation and competition in a game, Dissertation Abstracts, 29, pp. 1174.
Manarini S. (1998) The prisoner’s dilemma, experiments for the study of cooperation. Strategies, theories and mathematical models, Ph.D. thesis, University of Padova.
Marinoff L. (1992) Maximizing expected utilities in the Prisoner’s Dilemma, Journal of Conflict Resolution, 36, 1, pp. 183-216.
Martinez-Coll J. and Hirshleifer J. (1991) The limits of reciprocity, Rationality and Society, 3, pp. 35-64.
Matsui A. (1992) Best response dynamics and socially stable strategies, Journal of Economic Theory, 57, pp. 343-362.
May R. (1987) More evolution of cooperation, Nature, 327, pp. 15-17.
Maynard Smith J. and Price G. (1973) The logic of animal conflict, Nature, 246, pp. 15-18.
Maynard Smith J. (1982) Evolution and the Theory of Games, Cambridge University Press.
McKelvey R. and Palfrey T. (1992) An experimental study of the centipede game, Econometrica, 60, pp. 803-836.
Mealey L. (1995) The sociobiology of sociopathy: an integrated evolutionary model, Behavioral and Brain Sciences, 18, 3, pp. 523-599.
Michalewicz Z. (1999) Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag.
Micko H. (1997) Benevolent tit for tat strategies with fixed intervals between offers of cooperation, Meeting of Experimental Psychologists, pp. 250-256.
Micko H. (2000) Experimental Matrix games, In Open and Distance Learning-Mathematical Psychology, Institut für Sozial- und Persönlichkeitspsychologie, Universität Bonn.
Milgrom P. and Roberts J. (1982) Predation, reputation and entry deterrence, Journal of Economic Theory, 27, pp. 280-312.
Milinski M. (1993) Cooperation wins and stays, Nature, 364, pp. 12-13.
Milinski M. and Wedekind C. (1998) Working memory constrains human cooperation in the prisoner’s dilemma, Proceedings of the National Academy of Sciences of the United States of America, 95, 23, pp. 13755-13758.
Miller J. (1996) The coevolution of automata in the repeated prisoner’s dilemma, Journal of Economic Behavior and Organization, 29, pp. 87-112.
Mitchell M. (1996) An introduction to Genetic Algorithms. The MIT Press, Cambridge MA.
Molander P. (1985) The optimal level of generosity in a selfish, uncertain environment, Journal of Conflict Resolution, 29, pp. 611-618.
Moore A. and Atkeson C. (1993) Prioritized sweeping: reinforcement learning with less data and less real time, Machine Learning, 13, pp. 103-130.
Mueller U. (1988) Optimal retaliation for optimal cooperation, Journal of Conflict Resolution, 31, 4, pp. 692-724.
Myerson R. (1991) Game Theory, Analysis of Conflict. Cambridge, Harvard University Press.
Nachbar J. (1992) Evolution in the finitely repeated Prisoners’ Dilemma, Journal of Economic Behavior and Organization, 19, pp. 307-326.
Narendra K. and Thathachar M. (1989) Learning automata: an introduction. Prentice-Hall, Englewood Cliffs, NJ.
Nash J. (1950) Equilibrium points in n-person games, Proceedings of the National Academy of Sciences of the USA, 36, 1, pp. 48-49.
Nash J. (1951) Non-cooperative games, The Annals of Mathematics, 54, 2, pp. 286-295.
Nash J. (1996) Essays on Game Theory. Elgar. Cheltenham.
Noldeke G. and Samuelson L. (1993) An evolutionary analysis of backward and forward induction, Games and Economic Behavior, 5, pp. 425-454.
Nowak M., Bonhoeffer S., and May R. (1994) More spatial games, International Journal of Bifurcation and Chaos, 4, 1, pp. 33-56.
Nowak M. and May R. (1992) Evolutionary games and spatial chaos, Nature, 359, pp. 826-829.
Nowak M. and May R. (1993) The spatial dilemmas of evolution, International Journal of Bifurcation and Chaos, 3, pp. 35-78.
Nowak M. and Sigmund K. (1990) The evolution of stochastic strategies in the prisoner’s dilemma, Acta Applicandae Mathematicae, 20, pp. 247-265.
Nowak M. and Sigmund K. (1992) Tit for tat in heterogeneous populations, Nature, 359, pp. 250-253.
Nowak M. and Sigmund K. (1993) A strategy of win-stay lose-shift that outperforms Tit-for-Tat in the Prisoner’s Dilemma game, Nature, 364, pp. 56-58.
Nowak M., Sigmund K. and El-Sedy E. (1995) Automata, repeated games, and noise, Journal of Mathematical Biology, 33, pp. 703-722.
Orbell J., Kragt A., and Dawes R. (1988) Explaining discussion-induced cooperation, Journal of Personality and Social Psychology, 54, pp. 811-819.
Orbell J. and Dawes R. (1993) Social welfare, cooperator’s advantage, and the option of not playing the game, American Sociological Review, pp. 787-800.
Orbell J. and Robyn M. (1993) Social welfare, cooperators’ advantage, and the option of not playing the game, American Sociological Review, 58, pp. 787-800.
Oskamp S. (1971) Effects of programmed strategies on cooperation in the prisoner’s dilemma and other mixed-motive games, The Journal of Conflict Resolution, 15, 2, pp. 225-259.
Oskamp S. and Perlman D. (1965) Factors affecting cooperation in a prisoner’s dilemma game, Journal of Conflict Resolution, 9, pp. 359-374.
Plank M. (1997) Some qualitative differences between the replicator dynamics of two player and n player games, Nonlinear Analysis, 30, pp. 1411-1417.
Pollock G. (1989) Evolutionary Stability of Reciprocity in a Viscous Lattice, Social Networks, 11, pp. 175-212.
Posch M. (1997) Win Stay-Lose Shift: An Elementary Learning Rule for Normal Form Games, Working Paper of Santa Fe Institute, http://ideas.repec.org/p/wop/safire/97-06-056e.html.
Prisoner’s dilemma tournament result (2004) http://www.prisoners-dilemma.com/results/cec04/ipd cec04 full run.html.
Prisoner’s dilemma tournament result (2005) http://www.prisoners-dilemma.com/results/cig05/cig05.html.
Iterated Prisoner’s Dilemma and Evolutionary Game Theory
Radner R. (1980) Collusive behaviour in non-cooperative epsilon-equilibria in oligopolies with long but finite lives, Journal of Economic Theory, 22, pp. 136-154.
Radner R. (1986) Can bounded rationality resolve the prisoner’s dilemma, In Mas-Colell A. and Hildenbrand W. Essays in Honor of Gerard Debreu, pp. 387-399.
Rapoport A. (1966) Optimal policies for the prisoner’s dilemma, Technical Report No. 50 Psychometric Laboratory, University of North Carolina, MH-10006.
Rapoport A. (1999) Two-person Game Theory. Dover Publications, New York.
Rapoport and Chammah (1965) Prisoner’s dilemma: a study in conflict and cooperation. Ann Arbor: University of Michigan Press.
Rothbard M. (1997) Toward a Reconstruction of Utility and Welfare Economics, In The Logic of Action One: Method, Money, and the Austrian School, pp. 211-55.
Rubinstein A. (1979) Equilibrium in super games with the overtaking criterion, Journal of Economic Theory, 21, pp. 1-9.
Rubinstein A. (1998) Modeling bounded rationality. The MIT Press.
Samuelson L. (2001) Introduction to the evolution of preferences, Journal of Economic Theory, 97, pp. 225-230.
Sandholm T. and Crites R. (1996) Multiagent reinforcement learning in the iterated Prisoner’s Dilemma, Biosystems, 37, 1-2, pp. 147-66.
Sarin R. (1999) Simple play in the prisoner’s dilemma, Journal of Economic Behavior and Organization, 40, 1, pp. 105-113.
Sasaki A., Taylor C. and Fudenberg D. (2000) Emergence of cooperation and evolutionary stability in finite populations, Nature, 428, pp. 646-650.
Schmidhuber J. (1996) A general method for multi-agent learning and incremental self-improvement in unrestricted environments, In Yao X. (ed.) Evolutionary Computation: Theory and Applications. Scientific Publications Co.
Schmitt L. (2001) Theory of genetic algorithms, Theoretical Computer Science, 259, pp. 1-61.
Schuessler R. (1989) Exit threats and cooperation under anonymity, Journal of Conflict Resolution, 33, pp. 728-749.
Schweitzer F. (2002) Modeling Complexity in Economic and Social Systems. World Scientific, Singapore.
Schweitzer F., Behera L., and Mühlenbein H. (2002) Evolution of cooperation in a spatial prisoner’s dilemma, Advances in Complex Systems, 5, 2-3, pp. 269-299.
Scodel A., Minas J., Ratoosh P., and Lipetz M. (1959) Some descriptive aspects of two-person non-zero sum games, Journal of Conflict Resolution, 3, pp. 114-119.
Selten, R. (1965) Spieltheoretische behandlung eines oligopolmodells mit nachfragetragheit, Zeitschrift fur die Gesamte Staatswissenschaft, 12, pp. 301-324.
Selten, R. (1975) Reexamination of the perfectness concept for equilibrium points in extensive games, International Journal of Game Theory, 4, pp. 25-55.
Selten R. (1983) Evolutionary stability in extensive two-person games, Mathematical Social Science, 5, pp. 269-363.
Selten R. (1988) Evolutionary stability in extensive two-person games: correction and further development, Mathematical Social Science, 16, pp. 223-266.
Selten R. and Stoecker R. (1986) End behaviour in sequences of finite Prisoner’s Dilemma supergames: a learning theory approach, Journal of Economic Behaviour and Organisation, 7, pp. 47-70.
Sethi R. and Somanathan E. (1996) The evolution of social norms in common property resource use, The American Economic Review, 86, 4, pp. 766-788.
Simon H. (1955) A behavioral model of rational choice, Quarterly Journal of Economics, 69, 1, pp. 99-118.
Simon H. (1990) A mechanism for social selection and successful altruism, Science, 250, 4988, pp. 1665-1668.
Sermat V. (1967) Cooperative behaviour in a mixed-motive game, Journal of Social Psychology, 62, pp. 217-239.
Sigmund K. (1995) Games of Life: Explorations in Ecology, Evolution and Behaviour. Penguin, Harmondsworth.
Skyrms B. (1990) The Dynamics of Rational Deliberation. Harvard UP.
Smith H. (1995) Monotone dynamical systems: an introduction to the theory of competitive and cooperative systems, AMS Mathematical Surveys and Monographs, 41.
Smith R. and Gray B. (1994) Co-adaptive genetic algorithms: an example in Othello strategy, Proceedings of the 1994 Florida Artificial Intelligence Research Symposium, pp. 259-264.
Sobel J. (1975) Reexamination of the perfectness concept of equilibrium in extensive games, International Journal of Game Theory, 4, pp. 25-55.
Sobel J. (1976) Utility maximization in iterated Prisoner’s Dilemmas, Dialogue, 15, pp. 38-53.
Solomon L. (1960) The influence of some types of power relationships and game strategies upon the development of interpersonal trust, Journal of Abnormal and Social Psychology, 61, pp. 223-230.
Spears W. (1992) Crossover or mutation? Foundations of Genetic Algorithms 2, FOGA-92, edited by Whitley D., California: Morgan Kaufmann.
Spiro D. (1988) The state of cooperation in theories of state cooperation: the evolution of a category mistake, Journal of International Affairs, 42, pp. 205-225.
Stanley E., Ashlock D., and Smucker M. (1995) Iterated prisoner’s dilemma with choice and refusal of partners: Evolutionary results, Lecture Notes in Artificial Intelligence, 929, pp. 490-502.
Stanley E., Ashlock D., and Tesfatsion L. (1994) Iterated prisoner’s dilemma with choice and refusal of partners, In Christopher G. Artificial Life III. Addison-Wesley, pp. 131-176.
Stephens D. (2000) Cumulative benefit games: achieving cooperation when players discount the future, Journal of Theoretical Biology, 205, 1, pp. 1-16.
Stephens D., Mclinn C., and Stevens J. (2002) Discounting and Reciprocity in an Iterated Prisoner’s Dilemma, Science, 298, 5601, pp. 2216-2218.
Sugden, R. (1986) The Economics of Cooperation, Rights and Welfare. Basil Blackwell.
Surowiecki J. (2004) The Wisdom of Crowds: Why the Many Are Smarter Than the Few and How Collective Wisdom Shapes Business, Economies, Societies and Nations. Little, Brown.
Sutton R. (1990) Integrated architectures for learning, planning, and reacting based on approximating dynamic programming, Proceedings of the 7th International Conference on Machine Learning, pp. 216-224.
Swingle P. and Coady H. (1967) Effects of the partner’s abrupt strategy change upon subject’s responding in the prisoner’s dilemma, Journal of Personality and Social Psychology, 5, pp. 357-363.
Swinkels J. (1993) Adjustment dynamics and rational play in games, Games and Economic Behavior, 5, pp. 455-84.
Taylor, P. D. (1979) Evolutionarily stable strategies with two types of players, Journal of Applied Probability, 16, pp. 76-83.
Taylor, P. and Jonker, L. (1978) Evolutionary stable strategies and game dynamics, Mathematical Biosciences, 40, pp. 145-156.
Tekol Y. and Acan A. (2003) Ants can play Prisoner’s Dilemma, Proceedings of the 2003 Congress on Evolutionary Computation, pp. 1151-1157.
Thierens D. (1997) Selection schemes, elitist recombination, and selection intensity, Proceedings of the 7th International Conference on Genetic Algorithms, pp. 152-159.
Thierens D. and Goldberg D. (1994) Elitist recombination: an integrated selection recombination GA, Proceedings of the First IEEE Conference on Evolutionary Computation, pp. 508-512.
Thomas B. (1985) On evolutionarily stable sets, Journal of Mathematical Biology, 22, pp. 105-115.
Tzafestas E. (2000a) Toward adaptive cooperative behavior, Proceedings of the Simulation of Adaptive Behavior Conference, pp. 334-340.
Tzafestas E. (2000b) Spatial games with adaptive tit-for-tats, Proceedings of the 6th Parallel Problem Solving from Nature (PPSN-VI), pp. 507-516.
Young H. and Foster D. (1991) Cooperation in the Short and in the Long Run, Games and Economic Behavior, 3, pp. 145-156.
Vegaredondo F. (1994) Bayesian boundedly rational agents play the finitely repeated prisoner’s dilemma, Theory and Decision, 36, 2, pp. 187-206.
Von Neumann J. and Morgenstern O. (1944) Theory of Games and Economic Behavior. Princeton UP.
Watkins C. (1989) Learning from delayed rewards. Ph.D. thesis, King’s College, Cambridge, UK.
Watkins C. and Dayan P. (1992) Q-learning, Machine Learning, 8, 3, pp. 279-292.
Wedekind C. and Milinski M. (1996) Human cooperation in the simultaneous and the alternating Prisoner’s Dilemma: Pavlov versus Generous Tit-for-Tat, Proceedings of the National Academy of Sciences of the United States of America, 93, 7, pp. 2686-2689.
Weibull J. (1995) Evolutionary Game Theory. MIT Press, Cambridge, Mass.
Wilson W. (1969) Cooperation and the cooperativeness of the other player, Journal of Conflict Resolution, 13, pp. 110-117.
Wilson W. (1987) Classifier systems and the animat problem, Machine Learning, 2, pp. 199-228.
Whitworth R. and Lucker W. (1969) Effective manipulation of cooperation with college and culturally disadvantaged populations, Proceedings of 77th Annual Convention of American Psychological Association, 4, pp. 305-306.
Wu J. and Axelrod R. (1995) How to cope with noise in the Iterated Prisoner’s Dilemma, Journal of Conflict Resolution, 39, pp. 183-189.
Zeeman M. (1993) Hopf bifurcations in competitive three dimensional Lotka-Volterra systems, Dynamics and Stability of Systems, 8, pp. 189-217.
Zeeman E., Zeeman M. (2002) An n-dimensional competitive Lotka-Volterra system is generically determined by its edges, Nonlinearity, 15, pp. 2019-2032.
Zeeman E., Zeeman M. (2003) From local to global behavior in competitive Lotka-Volterra systems, Transaction of American Mathematical Society, 355, pp. 713-734.
Chapter 3 Learning IPD Strategies Through Co-evolution
Siang Yew Chong (1), Jan Humble (2), Graham Kendall (2), Jiawei Li (2,3), Xin Yao (1)
(1) University of Birmingham, (2) University of Nottingham, (3) Harbin Institute of Technology
3.1. Introduction

Complex behavioral interactions can be abstracted and modelled as a game. One aspect of modelling interactions that is of great interest is understanding the specific conditions that lead to cooperation between selfish individuals. The iterated prisoner’s dilemma (IPD) game is one famous example. In its classical form, two players engaged in repeated interactions are each given two choices: cooperate and defect [Axelrod (1984)]. The dilemma is that both players are better off mutually cooperating than mutually defecting, yet each is vulnerable to exploitation by a party who defects. Although the IPD game has become a popular model for studying the conditions under which cooperation occurs among selfish individuals, due in large part to a series of tournaments reported in [Axelrod (1980a,b)], it has also received much attention in many other areas of study and has been used to model social, economic, and biological interactions [Axelrod (1984)].

The classical IPD can be defined as a nonzero-sum, noncooperative, two-player game [Chellapilla and Fogel (1999)]. It is nonzero-sum because the benefits that one player obtains do not necessarily lead to equivalent penalties for the other player. It is noncooperative because it assumes no preplay communication between the two players. The IPD game can be formulated with a predefined payoff matrix that specifies the payoff a player receives for the choice it makes on a particular move, given the choice that the opponent makes. Referring
to the payoff matrix given in Figure 3.1, both players receive R (reward) units of payoff if both cooperate. They both receive P (punishment) units of payoff if they both defect. However, when one player cooperates while the other defects, the cooperator receives S (sucker) units of payoff while the defector receives T (temptation) units of payoff. In the IPD game, the values R, S, T, and P must satisfy the constraints T > R > P > S and R > (S + T)/2. The second constraint ensures that two players who take turns exploiting each other earn less than they would through sustained mutual cooperation. Axelrod [Axelrod (1980a,b)] used the following set of values: R = 3, S = 0, T = 5, and P = 1. However, any set of values can be used as long as it satisfies the IPD constraints. The game is played with both players choosing between the two alternatives over a series of moves (i.e., repeated interactions). Note that the game is fully symmetric, i.e., the same payoff matrix applies to both players.
                        Opponent (column)
                        Cooperate    Defect
Player     Cooperate    R, R         S, T
(row)      Defect       T, S         P, P

Fig. 3.1. The payoff matrix framework of a two-player, two-choice game. In each cell, the first payoff is assigned to the player (row) choosing the move, while the second payoff is assigned to the opponent (column).
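As a minimal sketch of the payoff structure in Fig. 3.1, the following Python fragment encodes the payoff matrix with the values used by Axelrod and checks the two IPD constraints. The names `is_valid_ipd`, `PAYOFF`, and `play` are illustrative choices made here and do not come from the cited studies.

```python
# Payoff values used by Axelrod (1980a,b); any values meeting the constraints would do.
R, S, T, P = 3, 0, 5, 1


def is_valid_ipd(r, s, t, p):
    """Check the two IPD constraints: T > R > P > S and R > (S + T) / 2."""
    return t > r > p > s and r > (s + t) / 2


# PAYOFF[(row_move, column_move)] -> (row_payoff, column_payoff)
# 'C' = cooperate, 'D' = defect.
PAYOFF = {
    ("C", "C"): (R, R),
    ("C", "D"): (S, T),
    ("D", "C"): (T, S),
    ("D", "D"): (P, P),
}


def play(moves_a, moves_b):
    """Sum the payoffs of two fixed, equal-length move sequences."""
    total_a = total_b = 0
    for a, b in zip(moves_a, moves_b):
        pay_a, pay_b = PAYOFF[(a, b)]
        total_a += pay_a
        total_b += pay_b
    return total_a, total_b
```

The dictionary is keyed by the pair of moves so that the symmetry of the game is visible directly: swapping the two moves swaps the two payoffs.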
For the simple case of the one-shot prisoner’s dilemma (both players make only one move), the rational play is to defect [Chellapilla and Fogel (1999)]. This can be seen by considering the payoff a player obtains for each choice in light of the opponent’s choice. A cooperating player receives either R (opponent cooperates) or S (opponent defects). A defecting player receives either T (opponent cooperates) or P (opponent defects). As such, from the (self-interested) player’s point of view, the rational play is to defect because, regardless of the opponent’s play, defection yields a higher payoff (T > R and P > S). However, when the game is iterated over many rounds and players can adopt strategies in which a response is based on what happened in previous moves, defection is not necessarily the best choice of play. Instead, many studies have shown cooperative play to be a viable
strategy, starting with the tournaments organized by Axelrod (reported in [Axelrod (1980a,b)]). More importantly, later studies (of which Axelrod himself was one of the early pioneers) showed that cooperative strategies can be learned from an initial, random population using evolutionary algorithms [Axelrod (1987); Fogel (1991, 1993); Darwen and Yao (1995)]. In particular, these studies (and many others) used a co-evolutionary learning approach. The motivation for the co-evolutionary learning approach is to learn strategy behaviors through an adaptation process on strategy representations based solely on interactions (i.e., game-play). This approach differs from the classical evolutionary game approach (and also the ecological game approach used in [Axelrod (1980b); Axelrod and Hamilton (1981)]), which is mainly concerned with frequency-dependent reproduction of fixed and predetermined strategies. As such, the co-evolutionary learning approach allows one to construct a game (i.e., specify the possible interactions between players, the rules that govern the interactions, and the payoffs) and then to search for effective game strategies without the need for human intervention (e.g., specifying viable strategies) [Chellapilla and Fogel (1999)].

Within the framework of co-evolutionary learning of game strategies, it is natural to explore more complex interactions that are closer to real-world interactions than highly abstracted models like the classical IPD. This review surveys studies that have applied the co-evolutionary learning approach to more complex IPD games since the tournaments organized by Axelrod that were held almost 20 years ago. In particular, focus is placed on the motivations for certain extensions to the classical IPD and the general observations made when co-evolutionary learning systems are used. 
The following section describes the framework of co-evolutionary learning and the general issues of co-evolving IPD strategies. Section 3.3 surveys studies that extend the classical IPD with more choices, noise, N-players, and others. The review concludes with some remarks on future directions for research in co-evolutionary learning of IPD strategies. It is emphasized again that this review focusses on the co-evolutionary learning approach to IPD games, rather than on all possible work related to IPD games.
3.2. Co-evolving Strategies for the IPD Game

3.2.1. Co-evolutionary Learning Framework

Co-evolutionary learning refers to a broad class of population-based, stochastic search algorithms that involve the simultaneous evolution of competing solutions (to a problem) with coupled fitness [Yao (1994)]. A co-evolutionary learning system can be implemented using evolutionary algorithms (EAs) [Fogel (1994a); Bäck et al. (1997)]. That is, a co-evolutionary learning system iteratively applies the processes of variation (e.g., mutation, crossover, and others) and selection (e.g., choosing solutions to procreate in the next iterative step) to the competing solutions in the population. With this view, the framework of co-evolutionary learning (and also that of EAs) can be illustrated as in Figure 3.2.
(1) Initialize the population, X(t=0)
(2) Evaluate the fitness of each individual through a comparison process with other individuals in X(t)
(3) Select parents from X(t) based on their evaluated fitness
(4) Generate offspring from parents to produce X(t+1)
(5) Repeat steps (2-4) until some termination criteria are reached

Fig. 3.2. The general framework of co-evolutionary learning.
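The five steps of Fig. 3.2 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the system of any cited study: strategies are five-character tuples (a first move plus one response per pair of previous moves), fitness is the mean payoff in a round robin that includes self-play, selection keeps the better half of the population (truncation selection, chosen here for brevity), and variation is a per-character flip. All function names and parameter values are hypothetical.

```python
import random

# Payoff to the first player, using Axelrod's values (R=3, S=0, T=5, P=1).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
# Possible (own previous move, opponent previous move) pairs, in genome order.
HISTORIES = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]


def random_strategy(rng):
    """Genome: first move followed by one response per history."""
    return tuple(rng.choice("CD") for _ in range(5))


def play_game(a, b, n_moves=20):
    """Play a fixed-length game and return the total payoffs (to a, to b)."""
    move_a, move_b = a[0], b[0]
    score_a = score_b = 0
    for _ in range(n_moves):
        score_a += PAYOFF[(move_a, move_b)]
        score_b += PAYOFF[(move_b, move_a)]
        next_a = a[1 + HISTORIES.index((move_a, move_b))]
        next_b = b[1 + HISTORIES.index((move_b, move_a))]
        move_a, move_b = next_a, next_b
    return score_a, score_b


def evaluate(population, n_moves=20):
    """Step 2: relative fitness, the mean payoff against every member (self-play included)."""
    return [
        sum(play_game(a, b, n_moves)[0] for b in population) / len(population)
        for a in population
    ]


def mutate(strategy, rng, rate=0.1):
    """Flip each move of the genome independently with probability `rate`."""
    flip = {"C": "D", "D": "C"}
    return tuple(flip[g] if rng.random() < rate else g for g in strategy)


def coevolve(pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    population = [random_strategy(rng) for _ in range(pop_size)]        # step 1
    for _ in range(generations):                                        # step 5
        fitness = evaluate(population)                                  # step 2
        ranked = sorted(range(pop_size), key=lambda i: fitness[i], reverse=True)
        parents = [population[i] for i in ranked[: pop_size // 2]]      # step 3
        offspring = [mutate(rng.choice(parents), rng)                   # step 4
                     for _ in range(pop_size - len(parents))]
        population = parents + offspring
    return population
```

Note that `evaluate` takes only the population as input: there is no external fitness function, so the score of a given strategy changes whenever the composition of the population changes.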
Co-evolutionary learning differs from EAs in how fitness, i.e., the quality or worth of a solution, is assigned (Step 2 in Fig. 3.2). EAs are often viewed and constructed in an optimization context, whereby an absolute fitness function is required to assign fitness values to contending solutions. With co-evolutionary learning, the fitness of a solution is obtained through its interactions with other contending solutions in the population. That is, a solution’s fitness in a co-evolutionary learning system is relative and dynamic: it not only depends on the population, but also changes as the composition of solutions in the population changes. Although the difference between co-evolutionary learning systems and traditional EAs appears small at first, in the context of certain problems it can lead to significantly different outcomes. For example, consider the problem of searching for optimal solutions. In many real-world
problems, designing a suitable fitness function that can guide the search for solutions can be very difficult, if it is possible at all [Yao (1994)]. With co-evolutionary learning, however, the need for such a fitness function is essentially removed. Instead, a co-evolutionary learning system only needs to be able to rank contending solutions based on how they compare with one another. Games are therefore well-suited, natural applications for co-evolutionary learning systems. In particular, although games can be approached from an optimization context, it may not be possible to construct a fitness function that fully represents the problem of the game and fully discriminates between solutions found through optimization algorithms. With co-evolutionary learning, however, the search can be directed to find better game strategies (e.g., strategies that defeat more opponents) as the evolutionary process continues [Chellapilla and Fogel (1999)].

For the IPD game in particular, many different approaches have been taken since Axelrod’s early study [Axelrod (1987)], which investigated a particular co-evolutionary learning system. As in the study of EAs (commonly known as Evolutionary Computation) [Yao (1994); Fogel (1994a); Bäck et al. (1997); Fogel (1995); Bäck (1996)], a wide variety of specific strategy representations and of selection and variation operators have been used in the co-evolutionary learning approach to the IPD game. A complete survey is beyond the scope of this chapter. Instead, the more popular choices are reviewed here. The important thing to note is that all the co-evolutionary learning systems used were based on the framework illustrated in Figure 3.2, i.e., they involved an adaptation process (variation and selection) on IPD strategies in some form of representation, based on interactions (game-play between strategies). 
For strategy representations, particularly for the deterministic and reactive IPD strategies that have been most studied, Axelrod and Lindgren [Axelrod (1987); Lindgren (1991)] were among the first to use binary strings of ones (cooperation) and zeroes (defection) to encode a look-up table (essentially a binary decision tree) representation. The look-up table determines the strategy’s response based on the pairs of previous moves made by the strategy and the opponent. Since the strategies require histories of previous moves in order to respond, they are also encoded with the histories assumed for the moves that precede the start of the game. We [Chong and Yao (2005)] recently introduced a look-up table representation that directly represents IPD strategies based on responses to previous moves. For the case of looking back at the previous pair of moves made by the strategy and the opponent, the direct look-up table represents the strategy’s responses as a
two-dimensional table. Each table element represents the response to a pair of previous moves. Instead of relying on fictitious histories to start the game, the direct look-up table specifies the first move directly. Fogel, among many others [Fogel (1991, 1993, 1996); Miller (1989); Stanley et al. (1995)], used finite state machines (FSMs) for their capability of representing complex behaviors of IPD strategies. With FSMs, the behavioral responses of an IPD strategy to previous moves depend on the states and the next-state transitions. The motivation for using FSMs rather than look-up tables is to have a behavioral representation of IPD strategies instead of a representation of responses based on histories of previous moves (see [Fogel (1993)] for a full discussion of the origin of using FSMs and evolution to simulate intelligent behaviors). In addition to the simple look-up table and FSM, neural network representations have also been studied [Harrald and Fogel (1996); Darwen and Yao (2000); Chong and Yao (2005); Franken and Engelbrecht (2005)]. Although neural networks are primarily used for their ability to provide nonlinear input-output responses [Chellapilla and Fogel (1999)], the initial motivation for using them to represent IPD strategies also included their capability to process and represent a continuous range of behaviors [Harrald and Fogel (1996)].

After selecting a strategy representation, the next step is to design variation operators that introduce variations of IPD strategies into the population. In most cases, variation operators depend on the strategy representation considered. For example, look-up tables encoded as binary strings can use crossover and bit-flip mutation, as in standard genetic algorithms [Axelrod (1987)]. 
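The direct look-up table and the bit-string variation operators described above can be sketched as follows, assuming a memory of one previous pair of moves. The class `DirectLookupTable`, its five-bit layout, and the operator names are hypothetical illustrations introduced here, not the encoding used in any cited study.

```python
import random

# Binary encoding as in the text: 1 = cooperate, 0 = defect.


class DirectLookupTable:
    """Memory-one strategy: an explicit first move plus a 2x2 response table."""

    def __init__(self, first_move, table):
        self.first_move = first_move
        self.table = table  # table[own_prev][opp_prev] -> next move

    def respond(self, own_prev=None, opp_prev=None):
        """Return the first move if there is no history, else look up the response."""
        if own_prev is None:
            return self.first_move
        return self.table[own_prev][opp_prev]

    def to_bits(self):
        """Flatten to [first move, t[0][0], t[0][1], t[1][0], t[1][1]]."""
        return [self.first_move] + [self.table[i][j] for i in (0, 1) for j in (0, 1)]

    @classmethod
    def from_bits(cls, bits):
        return cls(bits[0], [[bits[1], bits[2]], [bits[3], bits[4]]])


def one_point_crossover(a, b, point):
    """Swap the tails of two equal-length bit lists at `point`."""
    return a[:point] + b[point:], b[:point] + a[point:]


def bit_flip(bits, rng, rate):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in bits]


# Tit for tat: cooperate first, then copy the opponent's previous move.
TIT_FOR_TAT = DirectLookupTable(1, [[0, 1], [0, 1]])
```

Because the first move is stored explicitly, no pre-game history has to be invented in order to start play.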
For FSMs, variation operators may include altering a next-state transition, adding or removing states, and altering the output symbol (corresponding to making a choice). With neural networks, especially those with real-valued representations, self-adaptive mutations based on some probability distribution (e.g., Gaussian or Cauchy) can be used [Chong and Yao (2005)] (one of us has provided a comprehensive review of evolving neural networks in [Yao (1999)]). As for selecting IPD strategies for the next generation, many selection operators can be used (those found in EAs [Fogel (1994a); Bäck et al. (1997)]), not just the proportional selection used by Axelrod in the first study of co-evolving IPD strategies [Axelrod (1987)]. To obtain the fitness of a particular IPD strategy in the population, payoffs obtained from the IPD game are usually
used. In particular, many studies calculate the expected IPD-payoff-based fitness using a round-robin tournament in which all pairs of strategies compete, including the pair in which a strategy plays itself.

3.2.2. Shadow of the Future

In the IPD game, the shadow of the future refers to the situation whereby the number of moves of a game is known in advance. In this situation, there is no incentive to cooperate on the last move because there is no risk of retaliation from the opponent. However, if every player defects on the last move, then there is no incentive to cooperate on the move prior to the last one. If every player defects on the last two moves, then there is no incentive to cooperate on the move before that, and so forth. As such, we would end up with mutual defection on all moves. One popular way to address this issue and to allow cooperation to emerge is to end the game with a fixed probability on every move, thereby keeping the game length uncertain.

Most of the studies that used the co-evolutionary learning approach considered a fixed game length (number of moves) in all game plays. For example, Axelrod [Axelrod (1987)] and others such as [Fogel (1991, 1993); Chong and Yao (2005)] used 150 moves (with moves counted from 0). Other game lengths can be used, although the choice depends on the motivation of the study, e.g., a game length sufficiently long to allow strategies to reciprocate cooperation. In any case, a fixed game length can be used because the strategy representation cannot count the number of moves that have been played or how many remain.

3.2.3. Issues for Co-evolutionary Learning of IPD Strategies

For the IPD game, there are two main contexts in which co-evolutionary learning can be considered. First, co-evolutionary learning can be used to search for effective strategies, given the specific rules of the game that govern the complexity of strategy interactions. 
Second, a co-evolutionary learning system can serve as a model for investigating how certain conditions (e.g., game rules, the setup of the co-evolutionary learning system, or others) can lead to the evolution of certain behaviors.

In the context of using co-evolutionary learning to search for effective strategies, the main issue is to evolve IPD strategies that perform well against (e.g., defeat) a large number of opponents. Axelrod [Axelrod (1987)] used a co-evolutionary learning system and compared the evolved strategies
with the representative strategies (e.g., tit for tat) obtained from his earlier tournaments, which accounted for the average performance of all strategies that participated in the tournaments [Axelrod (1980a,b)]. He noted that some of the evolved strategies outperformed these representative strategies. Beyond the promising results obtained from evolving effective IPD strategies, the study in [Axelrod (1987)] had the important implication of specifying a principled method for determining the effectiveness (or robustness [Axelrod and Hamilton (1981)]) of evolved IPD strategies: testing them against some representative strategies. One of us (Yao) first framed this issue in the context of generalization [Darwen and Yao (1995); Yao et al. (1996)]. In particular, co-evolutionary learning is a machine learning system that can be analyzed for its generalization performance. Here, the generalization performance of a co-evolutionary learning system for the IPD game can be thought of as the performance of the best strategy in the population, or of the population itself (e.g., using a gating algorithm that effectively combines different IPD strategies in the population into a single strategy entity [Darwen (1996); Darwen and Yao (1997)]), against a large number of IPD strategies, especially those that the evolved strategies have not played against during evolution.

In the context of using co-evolutionary learning as a model to understand how, why, and which IPD strategy behaviors evolve, there are many issues that can be studied. First, one can consider the impact of specific IPD game specifications (e.g., payoff matrices [Fogel (1993)] and duration of interactions or game length [Fogel (1996)]) on evolved IPD strategy behaviors. 
Second, there are also studies that have focused on the impact of the interaction or game-play itself, including but not limited to noisy interactions [Julstrom (1997)], continuous behavioral responses [Harrald and Fogel (1996)], and the possibility of refusing to interact [Stanley et al. (1995)]. Third, the specific design of the co-evolutionary learning system itself can have an impact, whereby certain IPD behaviors are favored and persist for a long period (e.g., investigating whether systems that provide genotypic diversity actually lead to a diverse population of IPD strategies with a variety of behaviors [Darwen and Yao (2000, 2001, 2002)]).

3.3. Extending the IPD Game

The primary motivation in most studies that extend the classical IPD game is to model more complex IPD interactions that are closer to real-world
interactions. This section describes some of the extended IPD games that have been investigated using the co-evolutionary learning approach. Each subsection starts with the motivation for extending the IPD game in a specific manner and the important issues in studying the more complex IPD game. Each subsection then discusses, and concludes with, general observations obtained from the co-evolutionary learning of the particular extended IPD game.

3.3.1. Extending the IPD with More Choices

Several studies have extended the classical IPD with more than the two extreme choices. That is, there are intermediate choices between full cooperation and full defection that strategies can respond with. Fogel [Harrald and Fogel (1996)] investigated a continuous IPD game. We have investigated the IPD with multiple, discrete levels of cooperation [Darwen and Yao (2000, 2001, 2002); Chong and Yao (2005)], which can be used to approximate the continuous IPD game when the number of levels is sufficiently large.

The main motivation for extending the IPD with more choices is to allow the modelling of subtle behavioral interactions that are not possible with only two extreme choices. With the classical IPD game, the possible behaviors that strategies can exhibit are severely limited. For example, a strategy for the classical IPD game cannot play intermediate choices that allow some degree of exploitation of the opponent without risking retaliation from an otherwise cooperative opponent [Harrald and Fogel (1996)].

The co-evolutionary learning approach usually considers a neural network strategy representation because it can easily process a continuous range of behaviors (i.e., real numbers representing the degree of cooperation). Furthermore, for IPD games with multiple, discrete levels of cooperation, a neural network scales with the number of levels considered. 
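One way to picture why a neural network scales with the number of cooperation levels is sketched below: the network produces a real-valued output, and only the final rounding step depends on how many levels the game allows. This is an assumption-laden illustration, not the architecture of the cited studies. The single tanh hidden layer, the weight layout, the encoding of levels as evenly spaced values from -1 (full defection) to +1 (full cooperation), and all function names are hypothetical.

```python
import math


def cooperation_levels(n):
    """Return n evenly spaced levels from -1 (full defection) to +1 (full cooperation)."""
    if n < 2:
        raise ValueError("need at least two levels")
    return [-1 + 2 * k / (n - 1) for k in range(n)]


def quantize(x, levels):
    """Map a real-valued output to the nearest allowed cooperation level."""
    return min(levels, key=lambda level: abs(level - x))


def nn_response(weights, own_prev, opp_prev, levels):
    """Feed the previous pair of moves through a small tanh network.

    weights["hidden"] is a list of (w_own, w_opp, bias) tuples, one per hidden node;
    weights["out"] holds one weight per hidden node; weights["out_bias"] is a float.
    """
    hidden = [
        math.tanh(w_own * own_prev + w_opp * opp_prev + bias)
        for (w_own, w_opp, bias) in weights["hidden"]
    ]
    raw = math.tanh(sum(w * h for w, h in zip(weights["out"], hidden)) + weights["out_bias"])
    return quantize(raw, levels)
```

Passing `cooperation_levels(4)` or `cooperation_levels(64)` to the same network changes only the granularity of the response, whereas a look-up table would need a new entry for every additional pair of levels.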
Fogel [Harrald and Fogel (1996)] showed that for the IPD extended with a continuous range of choices, the evolution of cooperation is unstable, with fluctuations in average scores representing short periods of cooperation and defection. We have further shown that as the number of choices increases in the IPD game with multiple, discrete levels of cooperation, the evolution of cooperation becomes more difficult to achieve [Darwen and Yao (2000, 2001, 2002)]. From these studies, it appears that a co-evolving population of IPD
strategies has a higher tendency of evolving to play full defection. However, this does not mean that evolution to cooperation is not possible, or that cooperative behaviors that persist cannot be evolved. For example, it has been shown that evolving cooperative behaviors depends on the complexity of the strategy representation that is used. In the case of neural networks, the number of nodes in the hidden layer can affect whether the co-evolutionary learning system produces IPD strategies with cooperative responses [Harrald and Fogel (1996)].

In addition to the complexity of strategy representation, another important factor for evolving cooperative strategies is behavioral diversity. Early studies [Darwen and Yao (2000, 2001)] have shown that genetic diversity (i.e., variations at the genotypic level of strategy representations) does not equate to behavioral diversity (i.e., variations of IPD strategy responses) in the population. Without sufficient behavioral diversity, the co-evolving population can overspecialize to a specific strategy behavior that is vulnerable to invasion (e.g., cycles between tit for tat, naive cooperators, and defectors). As such, increasing the level of genetic diversity in the co-evolutionary learning system does not necessarily lead to an increase in behavioral diversity that can help with the evolution of cooperative strategies.

We have recently further shown that strategy representation is also an important factor in introducing behavioral diversity in the co-evolutionary learning system [Chong and Yao (2005)]. We considered the n-choice IPD game, which was obtained based on the following linear interpolation:

$$p_A = 2.5 - 0.5\,c_A + 2\,c_B, \qquad -1 \le c_A, c_B \le 1,$$
where $p_A$ is the payoff to player A, given that $c_A$ and $c_B$ are the cooperation levels of the choices that players A and B make, respectively. Fogel [Harrald and Fogel (1996)] also considered a similar interpolation process. However, we considered multiple, discrete levels of cooperation. For example, we used the four-choice IPD game, where the four cooperation levels are represented as +1 (full cooperation), +1/3, −1/3, and −1 (full defection). These choices can be used with the linear interpolation equation shown above to obtain the payoff. Figure 3.3 illustrates the payoff matrix of a four-choice IPD game that was used [Chong and Yao (2005)]. Note that in generating the payoff matrix for an n-choice IPD game, the following conditions must be satisfied [Chong and Yao (2005)]:
                            PLAYER B
                    +1      +1/3    −1/3    −1
            +1      4       2 2/3   1 1/3   0
PLAYER A    +1/3    4 1/3   3       1 2/3   1/3
            −1/3    4 2/3   3 1/3   2       2/3
            −1      5       3 2/3   2 1/3   1

Fig. 3.3. The payoff matrix for the two-player four-choice IPD used in [Chong and Yao (2005)]. Each element of the matrix gives the payoff for Player A.
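As a minimal sketch, the entries of Fig. 3.3 can be reproduced directly from the linear interpolation, and the three conditions that the chapter requires of an n-choice payoff matrix can be checked by enumeration. The function names (`payoff`, `payoff_matrix`, `satisfies_conditions`) are illustrative and do not come from the cited work; exact fractions are used only to avoid rounding.

```python
from fractions import Fraction

# Four cooperation levels, from full cooperation (+1) to full defection (-1).
LEVELS = [Fraction(1), Fraction(1, 3), Fraction(-1, 3), Fraction(-1)]


def payoff(c_a, c_b):
    """Payoff to player A: p_A = 2.5 - 0.5*c_A + 2*c_B."""
    return Fraction(5, 2) - Fraction(1, 2) * c_a + 2 * c_b


def payoff_matrix(levels):
    """Rows are player A's choice, columns are player B's choice."""
    return [[payoff(c_a, c_b) for c_b in levels] for c_a in levels]


def satisfies_conditions(levels):
    """Check the three n-choice IPD conditions on every admissible tuple."""
    for c_a in levels:
        for c_a2 in levels:              # c_a2 plays the role of c'_A
            for c_b in levels:
                # (1) a lower cooperation level pays more against a fixed c_B
                if c_a < c_a2 and not payoff(c_a, c_b) > payoff(c_a2, c_b):
                    return False
                for c_b2 in levels:      # c_b2 plays the role of c'_B
                    if not c_b < c_b2:
                        continue
                    # (2) the more cooperative pair of choices pays more
                    if c_a <= c_a2 and not payoff(c_a, c_b) < payoff(c_a2, c_b2):
                        return False
                    # (3) alternating does not beat steady cooperation
                    if c_a < c_a2:
                        mean = (payoff(c_a, c_b2) + payoff(c_a2, c_b)) / 2
                        if not payoff(c_a2, c_b2) > mean:
                            return False
    return True


if __name__ == "__main__":
    for row in payoff_matrix(LEVELS):
        print("  ".join(f"{str(p):>5}" for p in row))
    print(satisfies_conditions(LEVELS))
```

Running the script prints the matrix row by row as improper fractions (for example, 8/3 for 2 2/3), followed by the result of the condition check.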
(1) For $c_A < c'_A$ and constant $c_B$: $p_A(c_A, c_B) > p_A(c'_A, c_B)$,
(2) For $c_A \le c'_A$ and $c_B < c'_B$: $p_A(c_A, c_B) < p_A(c'_A, c'_B)$, and
(3) For $c_A < c'_A$ and $c_B < c'_B$: $p_A(c'_A, c'_B) > \frac{1}{2}\left(p_A(c_A, c'_B) + p_A(c'_A, c_B)\right)$.

These conditions are analogous to those for the classical IPD. The first condition ensures that defection always pays more. The second condition ensures that mutual cooperation has a higher payoff than mutual defection. The third condition ensures that alternating between cooperation and defection does not pay in comparison to just playing cooperation.

We investigated two strategy representations: neural networks and the direct look-up table. We considered these two strategy representations because they allow us to investigate the impact of strategy representation on the introduction and maintenance of variations of behavioral responses in the population of IPD strategies. On the one hand, the neural network indirectly represents the input-output response mappings of IPD strategies, with possibilities of many-to-one mappings between representations and actual behavioral responses [Fogel (1994b); Atmar (1994)]. On the other hand, the direct look-up table directly represents the input-output response mappings of IPD strategies. We hypothesized that a more direct representation of IPD strategies will allow more behavioral variations to be introduced and maintained in the population through co-evolution.

For the neural network representation, we used a fixed-architecture feedforward multilayer perceptron (MLP) [Chong and Yao (2005)]. Specifically, the neural network consists of an input layer, a single hidden layer of ten nodes, and an output node. The network is fully connected and strictly layered (i.e., there is no short-cut connection from the input layer to the output node). The transfer (activation) function used for all nodes is the hyperbolic
tangent function, tanh(x). The input layer consists of the following four input nodes:
(1) The neural network’s previous choice, i.e., level of cooperation, in [−1, +1].
(2) The opponent’s previous level of cooperation.
(3) An input of +1 if the opponent played a lower cooperation level compared to the neural network, and 0 otherwise.
(4) An input of +1 if the neural network played a lower cooperation level compared to the opponent, and 0 otherwise.

The input layer is a function of two variables (i.e., the neural network’s previous choice and the opponent’s previous choice) since the last two inputs are derived from the first two inputs. These additional inputs are to facilitate learning the recognition of being exploited and exploiting. Given the inputs, the neural network’s output determines the choice for its next move. The output is a real value between +1 and −1 that is discretized to either +1, +1/3, −1/3 or −1, depending on which discrete value the neural network output is closest to.

We used self-adaptive mutation as the variation operator for the real-valued representation of the neural networks [Chong and Yao (2005)]. This approach associates a neural network with a self-adaptive parameter vector $[\sigma_i(j)]$ that controls the mutation step size of the respective weights and biases of the neural network $[w_i(j)]$. Offspring neural networks ($[w'_i(j)]$ and $[\sigma'_i(j)]$) are generated from parent neural networks ($[w_i(j)]$ and $[\sigma_i(j)]$) through mutations. Two different mutations based on Gaussian and Cauchy distributions were used in order to further investigate the impact of indirect strategy representation on variation operators that could increase genetic diversity but not necessarily lead to an increase in behavioral diversity. For the self-adaptive Gaussian mutation, offspring neural networks are generated according to the following equations:

$$\sigma'_i(j) = \sigma_i(j)\,\exp(\tau\,N_j(0,1)), \quad i = 1, \ldots, 15,\; j = 1, \ldots, N_w,$$
$$w'_i(j) = w_i(j) + \sigma'_i(j)\,N_j(0,1), \quad i = 1, \ldots, 15,\; j = 1, \ldots, N_w,$$

where $N_w = 63$, $\tau = (2(N_w)^{0.5})^{-0.5} = 0.251$, and $N_j(0,1)$ is a Gaussian random variable (zero mean and standard deviation of one) resampled for every $j$. $N_w$ is the total number of weights, biases, and the pre-game inputs required for an IPD strategy based on memory length of one.
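The input encoding, the output discretization, and the self-adaptive Gaussian mutation described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in [Chong and Yao (2005)]: the function names are hypothetical, the two Gaussian draws for the step size and the weight are assumed to be independent, and ties in the discretization are broken in favor of the more cooperative level.

```python
import math
import random

N_W = 63                                # number of evolved parameters (from the text)
TAU = (2.0 * math.sqrt(N_W)) ** -0.5    # approximately 0.251
CHOICES = (1.0, 1.0 / 3.0, -1.0 / 3.0, -1.0)


def network_inputs(own_prev, opp_prev):
    """Build the four inputs from the two previous cooperation levels."""
    return [
        own_prev,
        opp_prev,
        1.0 if opp_prev < own_prev else 0.0,   # opponent played lower: being exploited
        1.0 if own_prev < opp_prev else 0.0,   # network played lower: exploiting
    ]


def discretize(output):
    """Map a real output in [-1, +1] to the closest of the four choices."""
    return min(CHOICES, key=lambda level: abs(level - output))


def gaussian_mutation(weights, sigmas, rng=random):
    """Return offspring (weights', sigmas').

    sigma'(j) = sigma(j) * exp(tau * N_j(0, 1))
    w'(j)     = w(j) + sigma'(j) * N_j(0, 1)
    Assumption: the two N_j(0, 1) draws are sampled independently.
    """
    new_sigmas = [s * math.exp(TAU * rng.gauss(0.0, 1.0)) for s in sigmas]
    new_weights = [w + s * rng.gauss(0.0, 1.0)
                   for w, s in zip(weights, new_sigmas)]
    return new_weights, new_sigmas


if __name__ == "__main__":
    rng = random.Random(0)
    parent_w = [0.0] * N_W
    parent_s = [0.05] * N_W             # initial step size chosen for illustration only
    child_w, child_s = gaussian_mutation(parent_w, parent_s, rng)
    print(round(TAU, 3), len(child_w), discretize(0.4))
```

Because the step sizes are multiplied by a log-normal factor, they stay positive and can grow or shrink over generations, which is what makes the mutation self-adaptive.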
For the self-adaptive Cauchy mutation, which is known to provide bigger changes to the neural network weights (i.e., more genetic diversity) [Yao et al. (1999)], the following equations are used:

$$\sigma'_i(j) = \sigma_i(j)\,\exp(\tau\,N_j(0,1)), \quad i = 1, \ldots, 15,\; j = 1, \ldots, N_w,$$
$$w'_i(j) = w_i(j) + \sigma'_i(j)\,C_j(0,1), \quad i = 1, \ldots, 15,\; j = 1, \ldots, N_w,$$

where $C_j(0,1)$ is a Cauchy random variable (centered at zero and with a scale parameter of 1) resampled for every $j$. All other variables remain the same as those in the self-adaptive Gaussian mutation.

For the direct look-up table representation, the details are illustrated in Fig. 3.4 [Chong and Yao (2005)], which shows the behavioral response of a four-choice IPD strategy. The element $m_{ij}$ specifies the choice to be made, given the inputs $i$ (player’s own previous choice) and $j$ (opponent’s previous choice). Rather than using pre-game inputs (two for memory length one strategies), the first move is specified independently. Each of the table elements can take any of the four possible choices (+1, +1/3, −1/3, −1).

                             Opponent’s Previous Move
                             +1      +1/3    −1/3    −1
                     +1      m11     m12     m13     m14
Player’s Previous    +1/3    m21     m22     m23     m24
Move                 −1/3    m31     m32     m33     m34
                     −1      m41     m42     m43     m44

Fig. 3.4. The look-up table representation for the two-player IPD with four choices and memory length one [Chong and Yao (2005)].
A simple mutation operator was used to generate offspring. Mutation replaces the original element, $m_{ij}$, by one of the other three possible choices with an equal probability. For example, if mutation occurs at $m_{13} = +1/3$, then the mutated element $m'_{13}$ can take either +1, −1/3, or −1 with an equal probability. Each table element has a fixed probability, $p_m$, of being replaced by one of the remaining three choices. The value of $p_m$ is not optimized. Crossover is not used in any of the experiments. With a direct representation of IPD strategy behaviors, a simple mutation is more than sufficient to provide behavioral diversity in the population.
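The look-up table representation and its mutation operator can be sketched in a few lines. The data layout below (a first move plus a 4×4 table of indices into the four cooperation levels) and the function names are assumptions made for illustration. The sketch also leaves the first move unchanged during mutation, which the text does not specify.

```python
import random

# Index 0..3 stands for the cooperation levels +1, +1/3, -1/3, -1.
LEVELS = (1.0, 1.0 / 3.0, -1.0 / 3.0, -1.0)


def random_strategy(rng):
    """A strategy is (first_move, table), where table[i][j] is m_ij."""
    first_move = rng.randrange(4)
    table = [[rng.randrange(4) for _ in range(4)] for _ in range(4)]
    return first_move, table


def respond(strategy, own_prev, opp_prev):
    """Look up the next choice given own previous move i and opponent's j."""
    _, table = strategy
    return table[own_prev][opp_prev]


def mutate(strategy, p_m, rng):
    """Replace each table element, with probability p_m, by one of the
    other three choices picked uniformly at random."""
    first_move, table = strategy
    child = [
        [rng.choice([c for c in range(4) if c != m]) if rng.random() < p_m else m
         for m in row]
        for row in table
    ]
    return first_move, child


if __name__ == "__main__":
    rng = random.Random(0)
    parent = random_strategy(rng)
    offspring = mutate(parent, 0.05, rng)
    changed = sum(a != b for ra, rb in zip(parent[1], offspring[1])
                  for a, b in zip(ra, rb))
    print("elements changed:", changed)
```

Because every table element maps directly to one behavioral response, any mutation that fires changes the strategy's behavior for that input pair, which is the property the text relies on.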
The following co-evolutionary procedure was used [Chong and Yao (2005)]:
(1) Generation step, t = 0: Initialize N/2 parent strategies, $P_i$, i = 1, 2, ..., N/2, randomly.
(2) Generate N/2 offspring, $O_i$, i = 1, 2, ..., N/2, from the N/2 parents using a variation operator.
(3) All pairs of strategies compete, including the pair where a strategy plays itself (i.e., a round-robin tournament). For N strategies in a population, every strategy competes in a total of N games.
(4) Select the best N/2 strategies based on the total payoffs of all games played. Increment the generation step, t = t + 1.
(5) Repeat steps 2 to 4 until the termination criterion (i.e., a fixed number of generations) is met.

In particular, we used N = 30, and repeated the co-evolutionary process for 600 generations (which is sufficiently long to observe an evolutionary outcome, e.g., persistent cooperation). A fixed game length of 150 iterations is used for all games. Experiments are repeated for 30 independent runs.

Note that additional steps were taken to ensure that the initial population has sufficient behavioral diversity in addition to genotypic diversity [Darwen and Yao (2000)] to avoid early convergence of results. All details are available in [Chong and Yao (2005)]. The procedure involves setting particular parameters for the specific strategy representation and resampling for new strategies so that each of the four choices (+1, +1/3, −1/3, −1) is played at approximately the same frequency and there is no bias towards playing a particular choice early in the evolution.

Results showed that fewer runs evolved to mutual cooperation in experiments that used neural network representations [Chong and Yao (2005)]. For example, some runs had intermediate outcomes while a few had defection outcomes (Fig. 3.5).
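For readers who want to trace the five-step procedure concretely, the following is a minimal sketch using look-up table strategies. It is an illustration under assumptions, not the authors' code: the helper names are hypothetical, ties in selection are broken by sort order, and the additional initialization steps that enforce behavioral diversity are omitted.

```python
import random

LEVELS = (1.0, 1.0 / 3.0, -1.0 / 3.0, -1.0)   # indices 0..3


def payoff(c_a, c_b):
    """Payoff to the player choosing c_a against c_b (linear interpolation)."""
    return 2.5 - 0.5 * c_a + 2.0 * c_b


def random_strategy(rng):
    """(first_move, table) with table[i][j] the response to previous moves i, j."""
    return rng.randrange(4), [[rng.randrange(4) for _ in range(4)] for _ in range(4)]


def mutate(strategy, p_m, rng):
    """Replace each table element with probability p_m by another choice."""
    first, table = strategy
    return first, [
        [rng.choice([c for c in range(4) if c != m]) if rng.random() < p_m else m
         for m in row]
        for row in table
    ]


def play(s_a, s_b, rounds):
    """Return total payoffs (to A, to B) over a fixed-length game."""
    a, b = s_a[0], s_b[0]
    total_a = total_b = 0.0
    for _ in range(rounds):
        total_a += payoff(LEVELS[a], LEVELS[b])
        total_b += payoff(LEVELS[b], LEVELS[a])
        a, b = s_a[1][a][b], s_b[1][b][a]     # both respond to the same history
    return total_a, total_b


def coevolve(n=30, generations=600, rounds=150, p_m=0.05, seed=0):
    rng = random.Random(seed)
    parents = [random_strategy(rng) for _ in range(n // 2)]        # step 1
    for _ in range(generations):                                   # step 5
        offspring = [mutate(p, p_m, rng) for p in parents]         # step 2
        population = parents + offspring
        scores = [0.0] * n
        for i in range(n):                                         # step 3
            for j in range(i, n):                                  # j == i is self-play
                pay_i, pay_j = play(population[i], population[j], rounds)
                scores[i] += pay_i
                if j != i:
                    scores[j] += pay_j
        ranked = sorted(range(n), key=lambda k: scores[k], reverse=True)
        parents = [population[k] for k in ranked[: n // 2]]        # step 4
    return parents


if __name__ == "__main__":
    survivors = coevolve(n=10, generations=20, rounds=30, seed=1)
    print(len(survivors))
```

The defaults mirror the settings quoted in the text (N = 30, 600 generations, 150 iterations per game); the example call uses much smaller values because a full run in pure Python is slow.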
This is quite different from the case for classical IPD games [Axelrod (1987); Darwen and Yao (1995)], where each run converged to mutual cooperation quite consistently and quickly. Increasing genetic diversity (e.g., using self-adaptive Cauchy mutation) does not necessarily lead to more behavioral diversity in the population, since some runs still evolved to intermediate or defection outcomes (Fig. 3.6). The results further illustrate that more choices have made cooperation more difficult to evolve. However, when direct look-up table representation was used, results
[Plot: average payoff (vertical axis, 1 to 4) against generation (horizontal axis, 0 to 600).]
Fig. 3.5. Five sample runs of a co-evolutionary learning system that used neural network representation with a self-adaptive Gaussian mutation in the four-choice IPD [Chong and Yao (2005)].
[Plot: average payoff (vertical axis, 1 to 4) against generation (horizontal axis, 0 to 600).]
Fig. 3.6. Five sample runs of a co-evolutionary learning system that used neural network representation with a self-adaptive Cauchy mutation in the four-choice IPD [Chong and Yao (2005)].
showed that the evolution to cooperation was not difficult [Chong and Yao (2005)]. For example, results showed that even when a simple mutation with a low probability of mutation (e.g., $p_m = 0.05$) was used, no run evolved to mutual defection even though intermediate outcomes were obtained (Fig. 3.7). However, increasing the probability of mutation resulted in all populations in all runs evolving to mutual cooperation. The results showed that the choice of strategy representation can have an impact on the evolution of cooperation if it allows for greater behavioral diversity in the population.

3.3.2. IPD with Noise

A natural extension to the classical IPD is to consider the impact of noisy interactions on the evolution of certain behaviors. Axelrod noted two types
[Plot: average payoff (vertical axis, 1 to 4) against generation (horizontal axis, 0 to 600).]
Fig. 3.7. Five sample runs of a co-evolutionary learning system that used direct look-up table representation with a simple mutation at pm = 0.05 in the four-choice IPD [Chong and Yao (2005)].
of noise, i.e., misimplementation and misperception, that can affect a strategy’s response to the opponent’s choice of play [Axelrod and Dion (1988)]. With misimplementation, the strategy knows that it has played a mistaken choice, but the opponent does not. With misperception, one or both interacting strategies may not know that a different choice was made.

The main motivation for this extension is to study the impact of noise on the learning of certain behaviors through co-evolution when interactions can be noisy. In particular, one issue that can be considered is whether cooperative strategies based on reciprocity (such as tit for tat) can still perform well when noise, which affects strategy behavioral response based on previous moves, is present.

Julstrom [Julstrom (1997)] investigated the effects of noise in the two-choice IPD through a co-evolutionary learning system. In particular, noise was modelled as mistakes. That is, there is a probability that the choice played by a strategy is changed to the other choice (e.g., defection is played instead of the original cooperation, and vice versa). Results from the experiments showed that noise (starting around 2%) can reduce the level of cooperation in the population.

Recently, we further extended the IPD game with more choices by introducing noise and used a co-evolutionary learning system as a model for investigations [Chong and Yao (2005)], which we have detailed in the earlier subsection. We also modelled noise as mistakes that a player makes. For the four-choice IPD game, a mistake occurs with a certain probability, $p_n$, which is fixed throughout a game: a strategy intends to play a particular choice but ends up playing a different choice instead. For example, with $p_n = 0.05$, there will be a 0.05 probability that, if +1/3 is intended to be played, one of the other three possible cooperation levels, i.e., +1, −1/3,
and −1, will be chosen uniformly at random.

Results from experiments again showed the importance of behavioral diversity for the evolution of cooperation for noisy IPD games with more choices. For noise introduced at very low probabilities (less than 1.5%, i.e., $p_n < 0.015$), evolution to cooperation is more likely than in the case when noise was not introduced. Strategies were observed to be more forgiving, confirming the predictions of other studies noted in [Axelrod and Dion (1988); Wu and Axelrod (1995)]. However, when noise was introduced at high probabilities (starting around 5%, i.e., $p_n = 0.05$), evolution to cooperation was more difficult. The population was more likely to evolve to defection.

Despite this, if the co-evolutionary learning system has sufficient behavioral diversity (e.g., using direct look-up table representation that allows for behavioral diversity to be introduced and maintained more easily and effectively), evolution of cooperation is not greatly affected [Chong and Yao (2005)]. Evolved strategies still played high levels of cooperation even when there are more choices to play and the interactions can be noisy, both of which can make cooperative behaviors more difficult to evolve. For example, Table 3.1 compares co-evolutionary learning systems with different levels of behavioral diversity, e.g., C-CEP (neural network and self-adaptive Gaussian mutation), C-FEP (neural network and self-adaptive Cauchy mutation), and C-PM05 (direct look-up table and mutation at $p_m = 0.05$), for different noise levels (%) [Chong and Yao (2005)]. Results show the number of runs for each experiment that evolved to mutual defection, e.g., average payoff less than 1.5. The table shows that no runs evolved to mutual defection when direct look-up table representation was used in the co-evolutionary learning system [Chong and Yao (2005)].

Table 3.1. Comparison of results for three different co-evolutionary learning systems.

Noise (%)    C-CEP    C-FEP    C-PM05
    0          4        1        0
    5          4        9        0
   10          7       11        0
   15          8       17        0
   20         18       26        0
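The noise model used for the four-choice game (an intended choice is replaced, with probability p_n, by one of the other three choices picked uniformly at random) can be sketched as below. The function name `apply_noise` and the index-based encoding of the choices are assumptions made for illustration.

```python
import random

# Index 0..3 stands for the cooperation levels +1, +1/3, -1/3, -1.
LEVELS = (1.0, 1.0 / 3.0, -1.0 / 3.0, -1.0)


def apply_noise(intended, p_n, rng):
    """Return the choice actually played.

    With probability p_n the intended choice (an index into LEVELS) is
    replaced by one of the other three choices, picked uniformly at random.
    """
    if rng.random() < p_n:
        return rng.choice([c for c in range(len(LEVELS)) if c != intended])
    return intended


if __name__ == "__main__":
    rng = random.Random(0)
    trials = 10000
    flips = sum(apply_noise(1, 0.05, rng) != 1 for _ in range(trials))
    print("observed mistake rate:", flips / trials)
```

In a game loop, the returned index would be used in place of the intended move when computing payoffs; whether each player's history records the intended or the actual move depends on which type of noise is being modelled.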
It should be noted that although both mutation and noise can be considered as sources of behavioral variation in models that encourage cooperation [McNamara et al. (2004)], they produce behavioral diversity differently. Mutation introduces strategies with different behaviors into the population. Noise allows other parts of a strategy’s behavior that are not played otherwise in a noiseless IPD game to be accessed. Our results [Chong and Yao (2005)] showed that noise does not necessarily promote behavioral diversity in the population that leads to a stable evolution to cooperation, although noise at low levels does help. With higher levels of noise, closer inspection of evolved strategies showed the population to overspecialize to a specific behavior that is vulnerable to invasion, leading to cyclic dynamics in the evolutionary process between cooperation and defection.

In particular, noise and mutation have different impacts on the evolutionary process [Chong and Yao (2005)]. For example, increasingly higher levels of noise lead to mutual defection outcomes. Given a very noisy environment, strategies overspecialized to play defection only. This was not observed in the noiseless case of the IPD with increasingly more mutations. For example, increasingly higher mutation rates in the co-evolutionary learning system that used direct look-up table representation did not lead to mutual defection outcomes. Strategies were not observed to overspecialize to play defection, or any specific play.

3.3.3. N-Player IPD

Real-world interactions may involve more than two players. One famous example is the “tragedy of the commons” [Hardin (1968)], which illustrates how players’ self-interested use of a particular public good for initial rewards leads to a situation where everyone loses out in the end. For the case of the IPD, the original two-player formulation can be extended to N-player interactions [Axelrod and Dion (1988)].
This allows for the study of whether cooperative behaviors are possible when interactions involve more than two players, since strategies that are effective for the two-player case may not be effective (or worse, fail) in large group interactions [Glance and Huberman (1994)].

One of us formulated an N-player IPD or NIPD game for investigations using the co-evolutionary learning approach [Yao and Darwen (1994)] (other studies include [Bankes (1994); Lindgren and Johansson (2001)]). The NIPD game is defined by the following three properties [Colman (1982)] (page 159):
• Each player faces two choices: cooperation and defection.
• Defection is dominant for each player, i.e., each player is better off defecting than cooperating regardless of how many other players cooperate.
• The dominant defection strategies intersect in a deficient equilibrium. In particular, the outcome if all players choose their non-dominant cooperation strategies is preferable from every player’s point of view to the one in which everyone chooses defection, but no one is motivated to deviate unilaterally from defection.

The payoff matrix (Fig. 3.8) for the NIPD game can then be constructed based on the following conditions that must be satisfied [Yao and Darwen (1994)]:
• $D_i > C_i$ for $0 \le i \le n - 1$.
• $D_{i+1} > D_i$ and $C_{i+1} > C_i$ for $0 \le i < n - 1$.
• $C_i > (D_i + C_{i-1})/2$ for $0 < i \le n - 1$.

A large number of values satisfy these conditions. For the study in [Yao and Darwen (1994)], the values are chosen such that if $n_c$ is the number of cooperators in the NIPD game, then the payoff for cooperation is $2n_c - 2$ and the payoff for defection is $2n_c + 1$ (Fig. 3.9). For this payoff matrix, the average per-move payoff $a$ can be calculated as follows if $N_c$ cooperative moves are made out of $N$ moves:

$$a = 1 + \frac{N_c}{N}(2n - 3),$$

which will allow the measurement of how common cooperation was by examining the average per-round payoff.
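The expression for the average per-move payoff follows from summing the payoffs within a single game. A short derivation is given here as a sketch; the symbols $k$ (number of cooperators in one game), $T$ (number of games) and $k_t$ (number of cooperators in game $t$) are introduced for this derivation only and do not appear in the original.

```latex
% One game with n players, of which k cooperate:
% each cooperator receives 2k - 2 and each defector receives 2k + 1.
\begin{align*}
\text{total payoff in one game} &= k(2k - 2) + (n - k)(2k + 1) \\
                                &= k(2n - 3) + n .
\end{align*}
% Over T games, with N = nT moves in total and N_c = \sum_t k_t cooperative moves:
\begin{align*}
a = \frac{1}{N}\sum_{t=1}^{T}\bigl(k_t(2n - 3) + n\bigr)
  = \frac{N_c}{N}(2n - 3) + \frac{nT}{N}
  = 1 + \frac{N_c}{N}(2n - 3).
\end{align*}
```

As a check on the end points, $a = 1$ when nobody ever cooperates ($N_c = 0$) and $a = 2n - 2$ when everybody always cooperates ($N_c = N$), matching the payoffs for universal defection and universal cooperation.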
               Number of cooperators among the remaining n−1 players
               0        1        2        …        n−1
Player A   C   C_0      C_1      C_2      …        C_{n−1}
           D   D_0      D_1      D_2      …        D_{n−1}

Fig. 3.8. The payoff matrix for the NIPD game. The value in the table gives the payoff to the player based on its choice of play [Yao and Darwen (1994)].
               Number of cooperators among the remaining n−1 players
               0        1        2        …        n−1
Player A   C   0        2        4        …        2(n−1)
           D   1        3        5        …        2(n−1)+1

Fig. 3.9. An example of the payoff matrix for the NIPD game [Yao and Darwen (1994)].
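The example payoffs of Fig. 3.9 can be written as a one-line function, which makes it easy to check the three NIPD conditions and the closed form for the average per-move payoff numerically. This is a sketch with hypothetical function names; it assumes each game is given as a list of booleans, one per player, with `True` meaning cooperation.

```python
def nipd_payoff(cooperates, others_cooperating):
    """Payoff with i other cooperators: C_i = 2i and D_i = 2i + 1."""
    return 2 * others_cooperating + (0 if cooperates else 1)


def satisfies_nipd_conditions(n):
    """Check the three conditions on C_0..C_{n-1} and D_0..D_{n-1}."""
    c = [nipd_payoff(True, i) for i in range(n)]
    d = [nipd_payoff(False, i) for i in range(n)]
    dominance = all(d[i] > c[i] for i in range(n))
    monotone = all(d[i + 1] > d[i] and c[i + 1] > c[i] for i in range(n - 1))
    no_alternation = all(c[i] > (d[i] + c[i - 1]) / 2 for i in range(1, n))
    return dominance and monotone and no_alternation


def average_payoff(games):
    """Average per-move payoff over a list of games (lists of booleans)."""
    total = 0
    moves = 0
    for game in games:
        k = sum(game)                       # cooperators in this game
        for cooperates in game:
            others = k - 1 if cooperates else k
            total += nipd_payoff(cooperates, others)
            moves += 1
    return total / moves


if __name__ == "__main__":
    n = 4
    games = [[True, True, False, False], [True, False, False, False]]
    n_c = sum(sum(g) for g in games)        # cooperative moves
    n_moves = n * len(games)                # all moves
    print(average_payoff(games), 1 + n_c / n_moves * (2 * n - 3))
```

For the two games in the example, both the direct average and the closed form evaluate to 2.875.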
NIPD game interactions took the form of a large number of groups of N players selected at random with replacement (e.g., 1000 NIPD games for a population of 100 strategies). Results from the experiments in [Yao and Darwen (1994)] showed that the group size (i.e., the value of N in the NIPD game) has a negative impact on the evolution of cooperation. As N increases, fewer runs evolve to cooperation. For example, in the case of memory two strategies, only one out of 20 runs had defection outcomes for 3IPD. However, the number of runs with defection outcomes increased to nine for 6IPD. Increasing N to 16 (i.e., 16IPD) resulted in all runs evolving to defection outcomes [Yao and Darwen (1994)].

3.3.4. Other Extensions

There are many other extensions to the classical IPD game, or even further extensions to already extended IPD games (such as the NIPD), that can be studied through a co-evolutionary learning approach. For example, we examined the impact of localized interactions in NIPD games in [Seo et al. (1999, 2000)]. The earlier study for the NIPD [Yao and Darwen (1994)] showed that the evolution of cooperation is more difficult to achieve through a co-evolutionary learning process as N increases. However, in some real-world interactions, it is unlikely that a player interacts with everybody (or that it has equal probability of interacting with anyone in the population). Instead, a player might interact with other specific players (e.g., neighbours, relatives, or at the workplace). Such localized interactions may involve spatial models [Nowak and May (1992); Ishibuchi and Namikawa (2005)]. In particular, localized interactions can have a positive impact on the evolution of cooperation in the NIPD game. That is, a population
structured in a spatial model is more likely to evolve cooperation [Seo et al. (1999); Lindgren and Johansson (2001)].

Another extension that can be considered is to incorporate indirect interactions into the IPD game, which originally only considers direct interactions between strategies. Most of the previous studies have focused on modelling either direct interactions (e.g., cooperative behaviors through direct reciprocity that involves repeated encounters, i.e., IPD games [Axelrod (1984)]) or indirect interactions (e.g., cooperative behaviors through mechanisms of indirect reciprocity such as reputation, where an individual receives cooperation from third parties due to the individual’s cooperative behaviors towards others [Nowak and Sigmund (1998b)]). However, it has been suggested that complex real-world interactions involve both direct and indirect interactions (although, for simplicity of modelling and analysis, only one type of interaction is considered at a time) [Nowak and Sigmund (1998a)]. For this aspect, we have investigated a model with both direct and indirect interactions [Yao and Darwen (1999)]. In particular, each strategy is tagged with a reputation score, which is calculated based on payoffs received from a small random sample of pre-games. A co-evolutionary approach was used to show that, with the addition of reputation, cooperative outcomes are possible and more likely even for the case of the IPD with more choices and shorter game durations [Yao and Darwen (1999)].

In addition, another extension is to consider the adaptation of payoff matrices. We recently conducted a preliminary study on evolving strategy payoff matrices, and how such an adaptation process can affect the learning of strategy behaviors [Chong and Yao (2006)]. The motivation for the study is to relax the assumption of having a fixed, symmetric payoff matrix for all evolving strategies.
This assumption may not be realistic, considering that not all players are similar in real-world interactions. We focus specifically on an adaptation process of the payoff matrix based on past behavioral interactions. In particular, a simple update rule that provides a reinforcement feedback process between strategy behaviors and payoff matrices during the co-evolutionary process is used. Results from experiments [Chong and Yao (2006)] showed that the evolutionary outcome is dependent on the adaptation process of both behaviors (i.e., strategy behavioral responses) and utility expectations that determine how behaviors are rewarded (i.e., strategy payoff matrices). Defection outcomes are more likely to be obtained if IPD-like update rules that favor the exploitation of opponents are used. However, cooperative outcomes can be easily obtained when mutualism-like update rules that favor mutual cooperation are used.
3.4. Conclusion and Future Directions
The greatest advantage and the most important feature of co-evolutionary learning is that the process of adapting the representation depends on the interactions between members of the population. In this respect, the co-evolutionary learning approach is well-suited to solving the problem of IPD games in two contexts. First, co-evolutionary learning can be used as a search algorithm for effective strategies without requiring human knowledge. All that is required is the rules of the game. Second, the adaptation process of strategy behaviors based on interactions in co-evolution provides a natural way to investigate conditions that lead to the evolution of certain behaviors. In both of these contexts, the advantage of co-evolutionary learning over other approaches is that strategy behaviors are not fixed or predefined. Instead, co-evolutionary learning provides a means to realize strategy behavioral responses that are not necessarily bounded by expert human knowledge, thus providing new insight into the problem.

Since the first study of co-evolutionary learning on the classical IPD by Axelrod [Axelrod (1987)], there has been a wide range of studies that further extended the classical IPD game with additional features such as, but not limited to, continuous or multiple levels of cooperation, noisy interactions, N-player interactions, spatial interactions, and indirect interactions. The motivation in all of these studies is to bridge the gap between the abstract IPD interactions and the complex real-world interactions. As such, by understanding the specific conditions that lead to the evolution of specific IPD strategy behaviors, these studies have further helped to provide a more in-depth view of complex real-world interactions such as those found in human society.

There is still much more that can be explored using the co-evolutionary learning approach.
One direction will be to further extend the more complex IPD games and investigate the impact of the additional extension. This is important because the extensions might interact with one another in some unknown and nonlinear fashion. Understanding these interactions will help to further unravel complex human interactions. Another direction will be to investigate a more rigorous approach to determine the robustness of evolved strategy behaviors. In this particular respect, the notion of generalization might provide a more natural approach for co-evolutionary learning, in addition to the classical evolutionary game theory approach of evolutionarily stable strategies.
References Atmar, W. (1994). Notes on the simulation of evolution, IEEE Transactions on Neural Networks 5, 1, pp. 130–147. Axelrod, R. (1980a). Effective choice in the prisoner’s dilemma, The Journal of Conflict Resolution 24, 1, pp. 3–25. Axelrod, R. (1980b). More effective choice in the prisoner’s dilemma, The Journal of Conflict Resolution 24, 3, pp. 379–403. Axelrod, R. (1984). The Evolution of Cooperation (Basic Books, New York). Axelrod, R. (1987). The evolution of strategies in the iterated prisoner’s dilemma, in L. D. Davis (ed.), Genetic Algorithms and Simulated Annealing, chap. 3 (Morgan Kaufmann, New York), pp. 32–41. Axelrod, R. and Dion, D. (1988). The further evolution of cooperation, Science 242, 4884, pp. 1385–1390. Axelrod, R. and Hamilton, W. D. (1981). The evolution of cooperation, Science 211, pp. 1390–1396. B¨ ack, T. (1996). Evolutionary Algorithms in Theory and Practice (Oxford University Press, New York). B¨ ack, T., Hammel, U. and Schwefel, H. P. (1997). Evolutionary computation: Comments on the history and current state, IEEE Transactions on Evolutionary Computation 1, 1, pp. 3–17. Bankes, S. (1994). Exploring the foundations of artificial societies: Experiments in evolving solutions to iterated n-player prisoner’s dilemma, in R. Brookes and P. Maes (eds.), Artificial Life IV (Addison-Wesley), pp. 337–342. Chellapilla, K. and Fogel, D. B. (1999). Evolution, neural networks, games, and intelligence, Proc. IEEE 87, 9, pp. 1471–1496. Chong, S. Y. and Yao, X. (2005). Behavioral diversity, choices, and noise in the iterated prisoner’s dilemma, IEEE Transactions on Evolutionary Computation 9, 6, pp. 540–551. Chong, S. Y. and Yao, X. (2006). Self-adaptive payoff matrices in repeated interactions, in 2006 IEEE Symposium on Computational Intelligence and Games (CIG’06) (IEEE Press, Piscataway, NJ), pp. 103–110. Colman, A. M. (1982). Game Theory and Experimental Games (Pergamon Press, Oxford). Darwen, P. and Yao, X. (1995). 
On evolving robust strategies for iterated prisoner’s dilemma, in Progress in Evolutionary Computation, Lecture Notes in Artificial Intelligence, Vol. 956, pp. 276–292. Darwen, P. and Yao, X. (2000). Does extra genetic diversity maintain escalation in a co-evolutionary arms race, International Journal of Knowledge-Based Intelligent Engineering Systems 4, 3, pp. 191–200. Darwen, P. and Yao, X. (2001). Why more choices cause less cooperation in iterated prisoner’s dilemma, in Proc. 2001 Congress on Evolutionary Computation (CEC’01) (IEEE Press, Piscataway, NJ), pp. 987–994. Darwen, P. and Yao, X. (2002). Co-evolution in iterated prisoner’s dilemma with intermediate levels of cooperation: Application to missile defense, International Journal of Computational Intelligence and Applications 2, 1, pp.
86
S. Y. Chong et al.
83–107. Darwen, P. J. (1996). Co-evolutionary Learning by Automatic Modularization with Speciation, Ph.D. thesis, University of New South Wales, Sydney, Australia. Darwen, P. J. and Yao, X. (1997). Speciation as automatic categorical modularization, IEEE Transactions on Evolutionary Computation 1, 2, pp. 101–108. Fogel, D. B. (1991). The evolution of intelligent decision making in gaming, Cybernetics and Systems: An International Journal 22, pp. 223–236. Fogel, D. B. (1993). Evolving behaviors in the iterated prisoner’s dilemma, Evolutionary Computation 1, 1, pp. 77–97. Fogel, D. B. (1994a). An introduction to simulated evolutionary optimization, IEEE Transactions on Neural Networks 5, 1, pp. 3–14. Fogel, D. B. (1994b). An introduction to simulated evolutionary optimization, IEEE Transactions on Neural Networks 5, 1, pp. 3–14. Fogel, D. B. (1995). Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (IEEE Press, Piscataway, NJ). Fogel, D. B. (1996). On the relationship between the duration of an encouter and the evolution of cooperation in the iterated prisoner’s dilemma, Evolutionary Computation 3, 3, pp. 349–363. Franken, N. and Engelbrecht, A. P. (2005). Particle swarm optimization approaches to coevolve strategies for the iterated prisoner’s dilemma, IEEE Transactions on Evolutionary Computation 9, 6, pp. 562–579. Glance, N. S. and Huberman, B. A. (1994). The dynamics of social dilemmas, Scientific American , pp. 58–63. Hardin, G. (1968). The tragedy of the commons, Science 162, pp. 1243–1248. Harrald, P. G. and Fogel, D. B. (1996). Evolving continuous behaviors in the iterated prisoner’s dilemma, BioSystems: Special Issue on the Prisoner’s Dilemma 37, pp. 135–145. Ishibuchi, H. and Namikawa, N. (2005). Evolution of iterated prisoner’s dilemma game strategies in structured demes under random pairing in game playing, IEEE Transactions on Evolutionary Computation 9, 6, pp. 552–561. Julstrom, B. A. (1997). 
Effects of contest length and noise on reciprocal altruism, cooperation, and payoffs in the iterated prisoner’s dilemma, in Proc. 7th International Conf. on Genetic Algorithms (ICGA’97) (Morgan Kauffman, San Francisco, CA), pp. 386–392. Lindgren, K. (1991). Evolutionary phenomena in simple dynamics, in C. G. Langton, C. Taylor, J. D. Farmer and S. Rasmussen (eds.), Artificial Life II (Addison-Wesley), pp. 295–312. Lindgren, K. and Johansson, J. (2001). Coevolution of strategies in n-person prisoner’s dilemma, in J. Crutchfield and P. Schuster (eds.), Evolutionary Dynamics - Exploring the Interplay of Selection, Neutrality, Accident, and Function (Addison-Wesley). Mcnamara, J. M., Barta, Z. and Houston, A. I. (2004). Variation in behaviour promotes cooperation in the prisoner’s dilemma, Nature 428, pp. 745–748. Miller, J. (1989). The coevolution of automata in the iterated prisoner’s dilemma, Tech. Rep. 89-003, Santa Fe Institute Report. Nowak, M. A. and May, R. M. (1992). Evolutionary games and spatial chaos,
Nature 355, pp. 250–253. Nowak, M. A. and Sigmund, K. (1998a). The dynamics of indirect reciprocity, Journal of Theoretical Biology 194, pp. 561–574. Nowak, M. A. and Sigmund, K. (1998b). Evolution of indirect reciprocity by image scoring, Nature 393, pp. 573–577. Seo, Y. G., Cho, S. B. and Yao, X. (1999). Emergence of cooperative coalition in nipd game with localization of interaction and learning, in Proc. IEEE 1999 Congress on Evolutionary Computation (CEC’99) (IEEE Press, Piscataway, NJ), pp. 877–884. Seo, Y. G., Cho, S. B. and Yao, X. (2000). Exploiting coalition in co-evolutionary learning, in Proc. IEEE 2000 Congress on Evolutionary Computation (CEC’00) (IEEE Press, Piscataway, NJ), pp. 1268–1275. Stanley, E. A., Ashlock, D. and Smucker, M. D. (1995). Prisoner’s dilemma with choice and refusal of partners: Evolutionary results, in Proc. Third European Conf. on Advances in Artificial Life, pp. 490–502. Wu, J. and Axelrod, R. (1995). How to cope with noise in the iterated prisoner’s dilemma, The Journal of Conflict Resolution 39, 1, pp. 183–189. Yao, X. (1994). Introduction, Informatica (Special Issue on Evolutionary Computation) 18, pp. 375–376. Yao, X. (1999). Evolving artificial neural networks, Proc. IEEE 87, 9, pp. 1423– 1447. Yao, X. and Darwen, P. (1999). How important is your reputation in a multiagent environment, in Proc. 1999 Conf. on Systems, Man, and Cybernetics (SMC’99) (IEEE Press, Piscataway, NJ), pp. 575–580. Yao, X. and Darwen, P. J. (1994). An experimental study of n-person iterated prisoner’s dilemma games, Informatica 18, pp. 435–450. Yao, X., Liu, Y. and Darwen, P. J. (1996). How to make best use of evolutionary learning, in R. Stocker, H. Jelinck, B. Burnota and T. Bossomaier (eds.), Complex Systems - From Local Interactions to Global Phenomena (IOS Press, Amsterdam), pp. 229–242. Yao, X., Liu, Y. and Lin, G. (1999). Evolutionary programming made faster, IEEE Transactions on Evolutionary Computation 3, 2, pp. 82–102.
Chapter 4 How to Design a Strategy to Win an IPD Tournament
Jiawei Li University of Nottingham, Harbin Institute of Technology
4.1. Introduction

Imagine that a player in an IPD tournament knows the strategy of each of his opponents. He will defect against opponents such as ALLC or ALLD and cooperate with opponents such as GRIM or TFT in order to maximize his payoff. This means that he can interact with each opponent optimally and receive higher payoffs. Although such a priori information is not available, one can identify a strategy during the game. For example, if a strategy has cooperated in each of the previous 10 rounds while its opponent defected, it seems sensible to deduce that it will always cooperate. In fact, each strategy gradually reveals itself during the IPD game, so it can often be identified after only a few rounds rather than at the end of the game. With an efficient identification mechanism, it is possible for a strategy to interact with most of its opponents optimally.

However, two main problems must be solved in designing an efficient identification mechanism. Firstly, it is impossible, in theory, for a strategy to identify an arbitrary opponent within a finite number of rounds because the number of possible strategies is huge. Only the types of strategies belonging to a predetermined finite set can be identified, and this set may be just a small proportion of all possible strategies, because identification is of no use if it takes too long. Secondly, there is a risk that exploring an opponent puts the player into a much worse position. In other words, such an action may have a negative effect on future rewards. For example, in order to distinguish between ALLC and GRIM, a strategy has to defect at least once and thereby loses the chance to cooperate with GRIM in the future.

In this chapter we will discuss how to resolve these problems, how to
design an identification mechanism for IPD games, and how the Adaptive Pavlov strategy, which was ranked first in Competition 4 of the 2005 IPD tournament, was designed.

4.2. Analysis of strategies involved in IPD games

Every strategy has disadvantages as well as advantages. A strategy may receive high payoffs when its opponent belongs to one set of strategies, and lower payoffs when its opponent belongs to another set. However, some strategies consistently do better than others in IPD tournaments.

The strategies involved in IPDs can be classified according to whether or not they respond to their opponents. One set of strategies is fixed and plays a predetermined action no matter what the opponent does. ALLD, ALLC and RAND are typical. Other strategies are more complicated, and their actions depend on their opponent’s behavior. TFT, for example, starts with COOPERATE and then repeats its opponent’s last move. The second set appears superior to the first, since strategies like TFT, TFTT and GRIM have always performed better than ’fixed’ strategies in past IPD tournaments.

The question, then, is what the optimal response to each opponent is. Is TFT’s imitation of the opponent’s last move the best response? Although TFT has been shown to be superior to many other strategies, it is not good enough to win every IPD tournament.

Let us consider a simulation of an IPD tournament with 9 players: ALLC, ALLD, RAND, GRIM, TFT, STFT, TFTT, TTFT, and Pavlov. Their strategies are described in Table 4.1. These strategies are simple and representative, and all have appeared in past IPD tournaments. The rule of our simulation is that each strategy plays a 200-round IPD game against every strategy (including itself). The payoffs in a round are shown in Fig. 4.1. The total payoff received by a strategy is the sum of its payoffs throughout the tournament.
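The simulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the reported experiments: all function names (play, tournament and the strategy functions) are hypothetical, TTFT is read as "defect if the opponent defected in either of the last two rounds", Pavlov's first move is assumed to be random, and a game of a strategy against itself is counted once.

```python
import random

C, D = "C", "D"
# Payoff to a player for (own move, opponent's move), as in Fig. 4.1.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

# Each strategy maps (own history, opponent's history) to its next move.
def allc(mine, theirs): return C
def alld(mine, theirs): return D
def rand(mine, theirs): return random.choice([C, D])
def grim(mine, theirs): return D if D in theirs else C
def tft(mine, theirs):  return theirs[-1] if theirs else C
def stft(mine, theirs): return theirs[-1] if theirs else D
def tftt(mine, theirs): return D if theirs[-2:] == [D, D] else C
def ttft(mine, theirs): return D if D in theirs[-2:] else C  # assumed reading

def pavlov(mine, theirs):
    if not mine:
        return random.choice([C, D])          # assumed random first move
    if PAYOFF[(mine[-1], theirs[-1])] >= 3:   # SUCCESS: payoff 5 or 3
        return mine[-1]
    return C if mine[-1] == D else D          # DEFEAT: play the other move

def play(s1, s2, rounds=200):
    """Return the total payoffs of two strategies over one game."""
    h1, h2, p1, p2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        p1 += PAYOFF[(m1, m2)]
        p2 += PAYOFF[(m2, m1)]
        h1.append(m1)
        h2.append(m2)
    return p1, p2

def tournament(players, rounds=200):
    """Round robin in which every strategy also meets a copy of itself."""
    totals = {name: 0 for name in players}
    names = list(players)
    for i, a in enumerate(names):
        for b in names[i:]:
            pa, pb = play(players[a], players[b], rounds)
            totals[a] += pa
            if a != b:
                totals[b] += pb
    return totals

PLAYERS = {"ALLC": allc, "ALLD": alld, "RAND": rand, "GRIM": grim,
           "TFT": tft, "STFT": stft, "TFTT": tftt, "TTFT": ttft,
           "Pavlov": pavlov}
```

Under these assumptions, play(tft, stft) returns (500, 500), whereas play(tftt, stft) returns (597, 602). Calling tournament(PLAYERS) several times and averaging the totals mirrors the repeated tournaments of the simulation; the totals differ from run to run because of RAND and Pavlov.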
Table 4.1. Description of the players of the IPD simulation.

Players   Descriptions
ALLC      This strategy always plays COOPERATE.
ALLD      This strategy always plays DEFECT.
RAND      It plays DEFECT or COOPERATE with 1/2 probability.
GRIM      Starts with COOPERATE, but after one defection always plays DEFECT.
TFT       Starts with COOPERATE, and then repeats opponent’s moves.
TFTT      Like TFT but it plays DEFECT after two consecutive defections.
STFT      Like TFT but in first move it plays DEFECT.
TTFT      Like TFT but it plays two DEFECT after opponent’s defection.
Pavlov    Result of each move is divided into two groups: SUCCESS (payoff 5 or 3)
          and DEFEAT (payoff 1 or 0). If the last result belongs to SUCCESS group
          it plays the same move, otherwise it plays the other move.

                         Player 1’s choice
Player 2’s choice        COOPERATE    DEFECT
COOPERATE                (3,3)        (5,0)
DEFECT                   (0,5)        (1,1)

Fig. 4.1. Payoffs table of the IPD tournament. The numbers in brackets denote the payoffs two players receive in a round of a game.

The results of the tournaments vary because the strategies of Pavlov and RAND involve random choices. To reduce this variability, the tournament is repeated several times and the average score of each strategy is calculated. Simulation results show that TFT, TFTT and GRIM achieve higher scores than the others and that their average scores across several tournaments are quite close. TFTT, however, wins individual tournaments more often than the others. For example, TFTT wins 11 tournaments out of a total of 20, while TFT wins 4 and GRIM wins 5. In addition, if Pavlov and RAND are removed, TFTT always wins.

One of the limitations of TFT is that it inevitably runs into a cycle of defecting-defected (TFT plays COOPERATE while its opponent defects, and then TFT plays DEFECT while its opponent cooperates) when its opponent happens to be STFT. However, if TFT cooperates once after its opponent defects, cooperation is achieved, resulting in higher payoffs. TFTT is superior to TFT in this regard, and this is why TFTT wins more tournaments than TFT in the above IPD simulation. It is easy to verify that TFT does not get lower scores than TFTT if STFT is removed from the simulation.

Thus, we can improve the strategy of TFT in this way: when TFT enters a cycle of defecting-defected (for example, a sequence of 3 pairs of defecting-defected), it chooses COOPERATE in two consecutive rounds. This modified TFT (MTFT) achieves higher payoffs than TFT when the opponent is STFT. By substituting MTFT for TFT, IPD
experiments show that MTFT gets the highest average score and wins more single tournaments than the others.

MTFT uses an identification technique. It identifies STFT by detecting the defecting-defected cycles during an IPD game. When the opponent is considered to be STFT, the optimal action (cooperating in two consecutive rounds) is carried out in order to maximize future payoffs. It is therefore natural to expect that MTFT can be further improved so that it identifies more strategies and then interacts with them optimally. In the following sections, an approach to identifying each strategy in a finite set is introduced. A strategy can interact with its opponents almost optimally by using this identification mechanism.

4.3. Estimation of possible strategies in an IPD tournament

In this section, we seek to define a finite set of types of strategies to be identified. Since the number of possible strategies for IPD is infinite, it is impossible to identify each of them in a finite number of rounds. For example, suppose that a strategy has cooperated in 10 consecutive rounds while its opponent defected continuously. Although it is very likely to be ALLC, there are always other possibilities. It may be GRIM with a trigger of 11 defections; it may be RAND that has just happened to play 10 consecutive COOPERATEs; or it may be a combination of ALLC and TFT that will behave as TFT in the following rounds. However, if only ALLC belongs to the identification set, those other possibilities are eliminated.

The choice of the identification set depends on prior knowledge and subjective estimation. Some strategies, like TFT, are likely to appear, while others are designated as default strategies. There are numerous strategies one can design for an IPD tournament. However, most of them seldom appear because their chances of winning are very small.
For example, a strategy might cooperate in the first two rounds, defect in the following two rounds, and then cooperate and defect alternately. Few players will adopt such a strategy because it is unlikely to win any IPD tournament. Strategies that usually win appear frequently, and the others appear infrequently.

We define two classes of IPD strategies: cooperating and defecting. Cooperating strategies, for example TFT and TFTT, wish to cooperate with their opponents and are never the first to defect. Defecting strategies, for example ALLD and Pavlov beginning with DEFECT (PavlovD), wish to defect in order to maximize their payoffs, and they always start by defecting.

The cooperating strategies differ in how they respond to the opponent’s defections. For example, TFTT is more forgiving than TFT as it retaliates only if its opponent has defected twice in a row. GRIM is sterner than TFT as it never forgives a defection. These strategies can be classified according to their responses to the opponent’s defections, as shown in Fig. 4.2. The rules are the same as those described in the previous simulation.
Forgiving   ALLC   TFTT   TFT   TTFT   GRIM   Stern

Fig. 4.2. The cooperating strategies.
The defecting strategies differ in how persistently they defect. PavlovD is a representative strategy in this set. It starts with DEFECT. If the opponent is too forgiving to retaliate, it defects forever. Otherwise, it tries to cooperate with the opponent.[a] The defecting strategies can be classified as shown in Fig. 4.3.

Defect less   PavlovD   STFT   ALLD   Defect more

Fig. 4.3. The defecting strategies.
Other simple strategies, which lack a clear objective, differ from the cooperating and defecting strategies and hardly ever get high scores in IPD tournaments. At present, most of the players in an IPD tournament will be cooperating strategies, since cooperating strategies have been dominant in most tournaments. There will also be a small number of defecting strategies. Based on the above idea, we have designed the Adaptive Pavlov strategy, which applies a simple mechanism to distinguish cooperating strategies and several representative defecting strategies.

[a] Although PavlovD tries to cooperate with an opponent when the opponent retaliates against its defection, it seldom succeeds. For example, even if PavlovD meets a forgiving strategy like TFTT, they cannot keep cooperating in the game. In fact, if PavlovD cooperated just one more time, cooperation could be achieved. We have examined a modified PavlovD (MPavlovD) strategy that starts with DEFECT and cooperates twice when the opponent retaliates. The simulation results show that MPavlovD always scores higher than PavlovD.

4.4. Interaction with a strategy optimally

For any strategy there must be another strategy that deals with it optimally. Because the strategies of ALLC, ALLD and RAND are independent of the opponent’s behavior, ALLD is the optimal strategy against them. Because GRIM, TFT, STFT and TTFT retaliate as soon as their opponent defects, the optimal strategy for their opponent is to always cooperate but defect in the last round. TFTT is more charitable and forgives a single defection; therefore, its opponent can maximize the payoff by alternately choosing DEFECT and COOPERATE. If Pavlov starts with COOPERATE, its opponent should always cooperate except in the last round; otherwise, its opponent should start with DEFECT and then always cooperate except in the last round. Table 4.2 shows the optimal strategies to deal with each strategy shown in Table 4.1.

Table 4.2. Optimal strategies to interact with a known strategy.

Strategies   Optimal strategy of opponent
ALLC         It always plays DEFECT.
ALLD         It always plays DEFECT.
RAND         It always plays DEFECT.
GRIM         It always plays COOPERATE except DEFECT in the last move.
TFT          It always plays COOPERATE except DEFECT in the last move.
TFTT         It starts with DEFECT, and then plays COOPERATE and DEFECT in turn.
STFT         It always plays COOPERATE except DEFECT in the last move.
TTFT         It always plays COOPERATE except DEFECT in the last move.
Pavlov       If Pavlov starts with DEFECT it starts with DEFECT, and then always
             plays COOPERATE except that it plays DEFECT in the last round; if
             Pavlov starts with COOPERATE it always plays COOPERATE except that
             it plays DEFECT in the last round.
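Some entries of Table 4.2 can be checked with a short calculation. The sketch below is only an illustration: the helper score and the two opponent functions are hypothetical names, the payoffs are those of Fig. 4.1, and a game length of 200 rounds is assumed as in the earlier simulation.

```python
C, D = "C", "D"
# Payoff to the player for (own move, opponent's move), as in Fig. 4.1.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

def tft(seen):
    """TFT's move given the moves it has seen from the other player."""
    return seen[-1] if seen else C

def tftt(seen):
    """TFTT defects only after two consecutive defections."""
    return D if seen[-2:] == [D, D] else C

def score(my_moves, opponent):
    """Total payoff of a fixed move sequence against a reactive opponent."""
    total, played = 0, []
    for move in my_moves:
        reply = opponent(played)      # the opponent sees only earlier moves
        total += PAYOFF[(move, reply)]
        played.append(move)
    return total

n = 200
cooperate_then_defect = [C] * (n - 1) + [D]   # Table 4.2 entry for TFT
alternate = [D, C] * (n // 2)                 # Table 4.2 entry for TFTT
```

With these definitions, score(alternate, tftt) is 800 and score(cooperate_then_defect, tftt) is 602, so alternating pays better against TFTT. Against TFT the order is reversed: score(cooperate_then_defect, tft) is 602 while score(alternate, tft) is only 500.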
Given an IPD tournament with n players, a player will win the tournament if it interacts with each of its opponents optimally. For example, a single ALLD will win when the other n − 1 players in an IPD tournament are all ALLC. Hence, the winning strategy of an IPD tournament must be optimal in interacting with most of the others.

Although the strategy of a player is unknown to his opponent before
a game, the strategy gradually emerges as the game progresses. It is not difficult for a human player to identify the strategy of his opponent, but it is more difficult for a computer program to do so. To make this feasible, a method is needed to distinguish each type of strategy from the others, so that a computer program can respond appropriately to different types of strategies. Under the assumption that every player belongs to a pre-defined finite set of strategies, an example is given to show how the method of identification is realized and how the winning strategy is designed.

Consider an IPD tournament with 10 players. Besides the players shown in Table 4.1, let us add a new player, MyStrategy (MS), which applies an identification mechanism to identify its opponent. The rules are the same as those described in the previous simulation.

MS starts with DEFECT. If its opponent chooses DEFECT in the first round, MS chooses COOPERATE in round two; otherwise MS chooses DEFECT. MS always chooses COOPERATE in the third round. In this way, most of the strategies can be identified after just three rounds.

For example, suppose that the choices of MS and its opponent in the first 3 rounds are as shown in Fig. 4.4. The strategy of the opponent can be confirmed to be RAND. Because the opponent starts with DEFECT, it must be one of ALLD, STFT, RAND and Pavlov. Since MS defects in the first round and the opponent cooperates in round two, it cannot be ALLD or STFT. Since both MS and the opponent cooperate in the second round, the opponent would not defect in the third round if it were Pavlov. Therefore, the opponent must be RAND. The optimal strategy against RAND is ALLD, and MS will behave as ALLD in the following rounds of the game.
                     Round 1    Round 2      Round 3
MS’s moves           Defect     Cooperate    Cooperate
Opponent’s moves     Defect     Cooperate    Defect

Fig. 4.4. A possible process of a game (shows that the opponent is RAND).
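The elimination argument used for Fig. 4.4 can be mechanized. The sketch below is one possible realization and not necessarily how MS was implemented: instead of a fixed lookup table it replays each deterministic candidate against the moves MS actually made and keeps the candidates whose replies match the observed ones. All names (CANDIDATES, ms_probe, consistent) are hypothetical, Pavlov is split into PavlovC and PavlovD according to its first move, and TTFT is read as "defect if the opponent defected in either of the last two rounds".

```python
C, D = "C", "D"
# Payoff to a player for (own move, opponent's move), as in Fig. 4.1.
PAYOFF = {(C, C): 3, (C, D): 0, (D, C): 5, (D, D): 1}

def make_pavlov(first):
    def pavlov(mine, theirs):
        if not mine:
            return first
        if PAYOFF[(mine[-1], theirs[-1])] >= 3:   # SUCCESS: repeat the move
            return mine[-1]
        return C if mine[-1] == D else D          # DEFEAT: switch the move
    return pavlov

# Deterministic candidates: (own history, opponent's history) -> next move.
CANDIDATES = {
    "ALLC": lambda mine, theirs: C,
    "ALLD": lambda mine, theirs: D,
    "GRIM": lambda mine, theirs: D if D in theirs else C,
    "TFT": lambda mine, theirs: theirs[-1] if theirs else C,
    "STFT": lambda mine, theirs: theirs[-1] if theirs else D,
    "TFTT": lambda mine, theirs: D if theirs[-2:] == [D, D] else C,
    "TTFT": lambda mine, theirs: D if D in theirs[-2:] else C,
    "PavlovC": make_pavlov(C),
    "PavlovD": make_pavlov(D),
}

def ms_probe(opp_moves):
    """MS's move in rounds 1 to 3, given the opponent's moves so far."""
    if len(opp_moves) == 0:
        return D
    if len(opp_moves) == 1:
        return C if opp_moves[0] == D else D
    return C

def consistent(ms_moves, opp_moves):
    """Candidates that would have replied to ms_moves exactly as observed."""
    names = [
        name for name, strat in CANDIDATES.items()
        if all(strat(opp_moves[:i], ms_moves[:i]) == opp_moves[i]
               for i in range(len(opp_moves)))
    ]
    return names or ["RAND"]   # no deterministic candidate fits
```

For the game of Fig. 4.4, consistent([D, C, C], [D, C, D]) returns ["RAND"]. After the probe D, D, C against an opponent that replies C, D, D, the result is ["GRIM", "TFT", "TTFT"], which illustrates why these three strategies need more than three rounds to be told apart.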
Some possible results of identification for the 9 strategies are listed in Table 4.3, where ’C’ denotes COOPERATE and ’D’ denotes DEFECT. Because the strategy RAND chooses its move randomly it may behave like
any other strategy during a short period; therefore, more rounds are needed to distinguish RAND from other strategies. If the process of a game differs from those shown in Table 4.3, the strategy of the opponent must be RAND.

Table 4.3. Identification of the 9 strategies.

Players                      Possible moves of two players                Identification result
                             (MyStrategy’s moves, then the opponent’s)
MyStrategy / The opponent    D C C D C C                                  Pavlov (RAND)
MyStrategy / The opponent    D C C D D D                                  ALLD (RAND)
MyStrategy / The opponent    D C C D D D C                                STFT (RAND)
MyStrategy / The opponent    D D C C C C C                                ALLC (RAND)
MyStrategy / The opponent    D D C C C C D                                TFTT (RAND)
MyStrategy / The opponent    D D C C C D C                                Pavlov (RAND)
MyStrategy / The opponent    D D C C C C D D C                            TFT (RAND)
MyStrategy / The opponent    D D C C C C D D D C                          TTFT (RAND)
MyStrategy / The opponent    D D C C C C D D D D                          GRIM (RAND)
In this way, a strategy can be identified after several rounds of a game, and then the optimal strategy can be applied. Ten IPD tournaments with the above 10 players were carried out.[b] The simulation results are shown in Fig. 4.5. MS gains the highest average payoff of all the strategies and achieves the highest score in every tournament. The reason for MS’s success is that it has interacted almost optimally with most of the strategies in this IPD tournament.

[b] The number of rounds in an IPD game is usually not fixed, so that the players cannot know when the game is due to end. The simulation uses a fixed number of rounds in order to reduce the computational complexity. However, MS does not make use of this to get extra payoff; that is to say, MS does not purposely choose DEFECT in the last round of a game.

Players   Points in 10 tournaments                                      Average   Rank
MS        6134  6213  6179  6127  6202  6175  6152  6172  6212  6187   6175.3    1
TFTT      5957  5996  5970  6003  5994  5959  5965  5969  5966  5976   5975.5    2
TFT       5961  5936  5919  5946  5959  5938  5940  5929  5954  5978   5946.0    3
Pavlov    5718  5691  5725  5775  5816  5763  5748  5763  5733  5745   5747.7    4
TTFT      5725  5723  5725  5717  5719  5725  5746  5732  5722  5716   5725.0    5
GRIM      5404  5394  5416  5410  5440  5468  5322  5400  5390  5384   5402.8    6
ALLC      5115  5091  5103  5127  5103  5103  5103  5082  5109  5091   5102.7    7
RAND      4339  4349  4254  4340  4216  4219  4258  4241  4228  4274   4271.8    8
STFT      4165  4187  4160  4169  4179  4144  4173  4158  4142  4158   4163.5    9
ALLD      3800  3792  3852  3792  3848  3856  3832  3864  3832  3832   3830.0    10

Fig. 4.5. Simulation results of 10 IPD tournaments.

Most IPD strategies, such as TFT or Pavlov, are memory-one strategies which can only respond to the opponent’s last move; however, the history of the game contains more information. The identification mechanism of MS uses information about the opponent’s strategy, so MS responds not just to the opponent’s past moves but to the opponent’s strategy. By identifying different opponents, MS makes use of more information than the simple strategies. This is the reason MS is able to win IPD tournaments.

Different identification approaches may lead to different results for MS. For example, the strategies GRIM, TFT and ALLC all start with COOPERATE, and they will not defect if their opponents do not. To identify each of these strategies, MS starts with DEFECT and loses the chance to cooperate with GRIM. On the other hand, if MS does not defect first, it cannot distinguish the 3 strategies and cannot interact with ALLC optimally. The risk involved in exploring the opponent must be considered in order to choose an efficient or payoff-maximizing identification approach.

4.5. Escape from the trap of defection

When a player begins to explore the opponent, there is a risk that the identification process puts the player into a much worse position. Some strategies, especially those with a trigger mechanism such as GRIM, change their behavior at the trigger point. For example, the strategy MS described in the previous section defects at the beginning of IPD games in order to distinguish the cooperating strategies ALLC, TFT and GRIM from one another; however, the chance to cooperate with GRIM is lost. In IPD games, the main risk of identification is the trap of defection, in which the identification process leads the opponent to keep defecting and nothing can be done to rescue the situation. It appears that a strategy will not run into the trap of defection if it never defects first.
But this is not the case. Suppose a strategy keeps playing COOPERATE as long as its opponent defects, and defects forever once its opponent cooperates. Then any cooperating strategy will be defected against when interacting with it, while most defecting strategies will enjoy continued cooperation from it. If this reverse-GRIM strategy is as likely to appear in a game as GRIM, cooperating and defecting carry an equal risk of provoking future defection. This means that the risk of the defection trap always exists, whether or not an identification mechanism is applied.
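The cost of probing with an opening defection can be illustrated with a rough calculation. The sketch below rests on simplifying assumptions that are not part of the chapter: only ALLC and GRIM opponents are affected by the probe, the prober defects for the rest of the game once it has met ALLC or triggered GRIM, and a game lasts 200 rounds with the payoffs of Fig. 4.1. The function name probe_gain is hypothetical.

```python
def probe_gain(share_allc, share_grim, rounds=200):
    """Expected payoff change per game from defecting in round 1
    instead of cooperating throughout (illustrative assumptions only)."""
    always_cooperate = 3 * rounds       # mutual cooperation in every round
    exploit_allc = 5 * rounds           # defect against ALLC in every round
    trapped_by_grim = 5 + (rounds - 1)  # one exploit, then mutual defection
    return (share_allc * (exploit_allc - always_cooperate)
            + share_grim * (trapped_by_grim - always_cooperate))
```

Under these assumptions the probe gains 400 points per ALLC opponent and loses 396 points per GRIM opponent, so it pays off only when ALLC opponents are at least roughly as common as GRIM opponents.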
One may argue that reverse-GRIM strategies will not appear as frequently as GRIM in IPDs, so cooperating is safer than defecting and the MS strategy is more likely to run into the defection trap than TFT. That is right. But it does not show that the defection trap is inevitable for a strategy with an identification mechanism, because many identification approaches can be applied. For example, a simple way to avoid retaliation from GRIM is not to defect first. The identification mechanism that Adaptive Pavlov used in the 2005 IPD tournament explored only defecting strategies, so as to maintain cooperation with each of the cooperating strategies.

Again, which identification mechanism should be applied depends on prior knowledge and subjective estimation. If there are enough ALLC strategies in an IPD game, it is worth distinguishing them from other cooperating strategies. But if GRIMs are prevalent, it is better not to defect first. Generally speaking, we can compare different identification approaches and choose the most efficient one, although uncertainty still exists.

4.6. Adaptive Pavlov and Competition 4 of 2005 IPD tournament

The 2005 IPD tournament comprised 4 competitions. Competition 4 mirrored the original competition of Axelrod. There were a total of 50 players, including 8 default strategies. The Adaptive Pavlov (AP) strategy, which was ranked first in Competition 4, is analyzed in this section.

AP groups every 6 consecutive rounds into a period and applies different tactics in different periods. AP behaves as TFT in the first period, and then changes its strategy according to its identification of the opponent. AP classifies the possible opponents into 5 categories: cooperating strategies, STFT, PavlovD, ALLD and RAND.[c] By identifying the opponent’s strategy at the end of a period, AP shifts its strategy in the new period in order to deal with each opponent optimally.
AP is never the first to defect, and thus it will cooperate with every cooperating strategy. AP tries to cooperate with STFT and PavlovD, and defects against strategies such as ALLD or RAND. The processes of AP’s interaction with cooperating strategies, ALLD, STFT, and PavlovD in the first 6 rounds are shown in Fig. 4.6 (AP behaves as TFT). For example, when a process of interaction as shown in Fig. 4.6(c) occurs, the opponent will be identified as STFT and AP will cooperate twice in the next period in order to achieve cooperation. If the opponent is determined to be PavlovD, AP will defect once and then always cooperate in the next period. If the process of interaction differs from those shown in Fig. 4.6, the opponent will be identified as RAND. In this way, any strategy that is not in the identification set is likely to be identified as RAND. Once cooperation has been established, AP will always cooperate unless a defection occurs.

[c] RAND is claimed to be a default strategy.

(a)  Round     1  2  3  4  5  6
     AP        C  C  C  C  C  C
     Co-op     C  C  C  C  C  C

(b)  Round     1  2  3  4  5  6
     AP        C  D  D  D  D  D
     ALLD      D  D  D  D  D  D

(c)  Round     1  2  3  4  5  6
     AP        C  D  C  D  C  D
     STFT      D  C  D  C  D  C

(d)  Round     1  2  3  4  5  6
     AP        C  D  D  C  D  D
     PavlovD   D  D  C  D  D  C

Fig. 4.6. Identifying the opponent according to the process of interaction in six rounds. (a) AP cooperates with any cooperating strategy. (b) ALLD strategy always defects. (c) If a strategy alternately plays D and C when interacting with TFT, it is identified to be STFT. (d) If a strategy periodically plays D-D-C when interacting with TFT, it is identified to be PavlovD.

Identification of the opponent is performed in each period throughout the IPD tournament in order to correct misidentification and to deal with players who change their strategies during a game.

As we have mentioned, most of the players will be cooperating strategies. The results show that there are 34 cooperating strategies in Competition 4 (including the 4 default strategies TFT, TFTT, GRIM and ALLC). Excluding the default strategies, there are 3 strategies that behave like ALLD, 5 strategies that behave like STFT, and 2 strategies that behave like NEG. As shown in Table 4.4, AP can identify most of the strategies involved in Competition 4.[d]

Table 4.4. Categories of the strategies in Competition 4.

Categories               Number of the strategies
Cooperating strategies   34
Strategies like STFT     6
Strategies like ALLD     4
Strategies like NEG      3
Strategies like RAND     1
Others                   2
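The identification step that AP performs at the end of its first period can be sketched as a lookup on the opponent's six moves. This is a minimal reading of Fig. 4.6 and of the description above, not the tournament entry itself: the names PATTERNS, classify_period and opening_moves are hypothetical, only the first period (in which AP plays TFT) is covered, and only the first two moves of the next period are returned because the text does not specify the rest in detail.

```python
C, D = "C", "D"

# Opponent's moves in rounds 1 to 6 while AP plays TFT (Fig. 4.6).
PATTERNS = {
    "CCCCCC": "cooperating",   # Fig. 4.6(a)
    "DDDDDD": "ALLD",          # Fig. 4.6(b)
    "DCDCDC": "STFT",          # Fig. 4.6(c)
    "DDCDDC": "PavlovD",       # Fig. 4.6(d)
}

def classify_period(opp_moves):
    """Map six observed moves to one of AP's five categories."""
    return PATTERNS.get("".join(opp_moves), "RAND")

def opening_moves(category):
    """First two moves of AP's next period for an identified category."""
    if category == "STFT":
        return [C, C]          # cooperate twice to break the D/C cycle
    if category == "PavlovD":
        return [D, C]          # defect once, then cooperate
    if category in ("ALLD", "RAND"):
        return [D, D]          # no cooperation is sought
    return [C, C]              # keep cooperating with cooperating strategies
```

For example, classify_period(list("DCDCDC")) gives "STFT", and any sequence not listed in PATTERNS, such as list("CCDCCC"), falls back to "RAND", which matches the remark that strategies outside the identification set tend to be treated as RAND.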
[d] AP regards NEG as RAND. This maximizes its score when interacting with strategies like NEG, because the optimal strategy against both NEG and RAND is ALLD.

4.7. Discussion and conclusion

AP belongs to the class of adaptive automata for IPD. However, it differs from other adaptive strategies in how adaptation is achieved. AP’s approach belongs to the set of artificial intelligence approaches. Rather than adjusting some parameters in computing responses, as most adaptive strategies do, AP uses an identification mechanism which acts as an expert system. Knowledge about different opponents is expressed in the form of ’If..., then...’ rules; for example, if the opponent cooperates in 6 rounds then it is determined to be ALLC. In this way, the information acquired and used can be transparently expressed, and thus AP can tell which strategy the opponent is using.

Recent years have seen many AI approaches applied in evolutionary game theory and IPD, for example reinforcement learning, artificial neural networks, and fuzzy logic [Sandholm and Crites (1996); Macy and Carley (1996); Fort and Perez (2005)]. Computing a best response to an unknown strategy has been one of the objectives of those AI approaches. The problem is, in general, intractable because of the computational complexity, and finding the best response to an arbitrary strategy can be non-computable [Papadimitriou (1992); Nachbar and Zame (1996)]. Reinforcement learning, which is based on the idea that the tendency to produce an action should be reinforced if it produces favourable results and weakened if it produces unfavourable results [Gilboa (1988); Gilboa and Zemel (1989)], is widely used for automata to learn from their interaction with others. With respect to IPD, several approaches have been developed to learn optimal responses to a deterministic or mixed strategy [Carmel and Markovitch (1998); Darwen and Yao (2002)]. However, computational complexity is still the main difficulty in applying these approaches in real IPD tournaments.

AP’s identification mechanism is implemented in a simple way by making use of a priori knowledge, which greatly reduces the computational complexity and makes it practical for AP to respond to the opponent almost optimally. First, a priori knowledge about which strategies are more likely to appear in the IPD tournament is used in determining the identification set. The size of the identification set is restricted in order to reduce computational complexity.
Second, a priori knowledge about how well different identification approaches will work in a certain environment is used in selecting an efficient identification approach, with which AP can avoid the risk of identification and maximize its payoffs. Third, a priori knowledge about how to identify the opponent according to the process of interaction is used in constructing the identification rules. With these simple rules, the AP strategy is easy to understand.

The identification set can obviously be extended to include more strategies; however, more calculations are involved as the size of the identification set increases. We have to make a tradeoff between the wish to identify any strategy and the wish to develop a less complicated strategy. Compared to the NP-completeness
of those reinforcement learning approaches [Papadimitriou (1992)], AP’s computational complexity is between $O(\sqrt{n})$ and $O(n)$, depending on the similarities of the strategies to be identified. Therefore, the algorithm of AP is suitable for real IPD tournaments.

An identification mechanism can also work in an environment with noise, where each strategy might, with some probability, misperceive the outcome of a game. Noise blurs the boundaries between different strategies. However, identification is still applicable if a small identification error is admitted. In this circumstance, we can set a threshold value such that the opponent is considered to be identified if the probability of misidentification is smaller than this value. Just as in the case of identifying the strategy of RAND, the probability of mistakenly identifying a strategy decreases to zero as the process of computation and identification repeats.

Information plays a key role in intelligent activities. Individuals with more information consequently gain an advantage over others in most circumstances. With an identification mechanism, strategies such as AP acquire information about their opponents, and in this sense they are more intelligent than known strategies such as TFT or Pavlov. This type of strategy is suitable for modeling the decision-making process of human beings, where learning and improvement frequently happen.
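The threshold idea for identification under uncertainty can be made concrete with the RAND case. The sketch below assumes that RAND reproduces each move of a given deterministic pattern independently with probability 1/2, so that it mimics the pattern for k rounds with probability 2^-k; the function name rounds_needed and its parameters are hypothetical.

```python
def rounds_needed(threshold, p_match=0.5):
    """Smallest number of rounds k for which the chance that a random
    opponent matches a fixed pattern in all k rounds drops below threshold.
    Expects threshold in (0, 1] and p_match in (0, 1)."""
    k, prob = 0, 1.0
    while prob >= threshold:
        prob *= p_match   # one more round matched by chance
        k += 1
    return k
```

With a threshold of 0.01, rounds_needed(0.01) returns 7, since 2^-7 is about 0.0078; tightening the threshold to 0.001 raises the requirement to 10 rounds.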
References

Carmel, D. and Markovitch, S. (1998). How to explore your opponent’s strategy (almost) optimally, in Proceedings of the International Conference on Multi Agent Systems, pp. 64–71.
Darwen, P. and Yao, X. (2002). Co-evolution in iterated prisoner’s dilemma with intermediate levels of cooperation: Application to missile defense, International Journal of Computational Intelligence and Applications 2, 1, pp. 83–107.
Fort, H. and Perez, N. (2005). The fate of spatial dilemmas with different fuzzy measures of success, Journal of Artificial Societies and Social Simulation 8, 3.
Gilboa, I. (1988). The complexity of computing best response automata in repeated games, Journal of Economic Theory 45, pp. 342–352.
Gilboa, I. and Zemel, E. (1989). Nash and correlated equilibria: some complexity considerations, Games and Economic Behavior 1, pp. 80–93.
Macy, M. and Carley, K. (1996). Natural selection and social learning in prisoner’s dilemma: co-adaptation with genetic algorithms and artificial neural networks, Sociological Methods and Research 25, 1, pp. 103–137.
Nachbar, J. and Zame, W. (1996). Non-computable strategies and discounted repeated games, Economic Theory 8, pp. 103–122.
Papadimitriou, C. (1992). On players with bounded number of states, Games and Economic Behavior 4, pp. 122–131.
Sandholm, T. and Crites, R. (1996). Multiagent reinforcement learning in the iterated prisoner’s dilemma, Biosystems 37, 1-2, pp. 147–166.
Chapter 5 An Immune Adaptive Agent for the Iterated Prisoner’s Dilemma
Oscar Alonso, Fernando Niño
National University of Colombia
5.1. Introduction

The Prisoner’s Dilemma [Tucker (1950)] is a game in which two players have to decide between two options: cooperate, doing something that is good for both players, or defect, doing something that is worse for the other player but better for oneself. No pre-play communication is permitted between the players. The dilemma arises because, no matter what the other does, each player will do better defecting than cooperating, but if both players defect, both will do worse than if both had cooperated [Alonso et al. (2004)]. The payoff obtained by each player is given by a payoff matrix, as shown in table 5.1. The first number in each cell represents the payoff for the row player, and the second number represents the payoff for the column player.

Table 5.1. Payoff matrix

         C      D
  C     3,3    0,5
  D     5,0    1,1
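The two conditions that make up the dilemma can be checked directly against table 5.1. The Python sketch below is illustrative only; the names PAYOFF and row_payoff are introduced here and do not come from the chapter.

```python
# Payoffs from table 5.1, keyed by (row move, column move).
# Each value is (payoff to the row player, payoff to the column player).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

def row_payoff(row_move, column_move):
    """Payoff received by the row player for one round."""
    return PAYOFF[(row_move, column_move)][0]

# Condition 1: whatever the other player does, defecting pays more.
defect_dominates = all(
    row_payoff("D", other) > row_payoff("C", other) for other in "CD"
)

# Condition 2: mutual cooperation still beats mutual defection.
mutual_coop_better = row_payoff("C", "C") > row_payoff("D", "D")

print(defect_dominates, mutual_coop_better)  # True True
```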
When the game is played several times between the same players, and the players are able to remember past interactions, it is called the Iterated Prisoner’s Dilemma (IPD). Each player is said to have a strategy, i.e., a way to decide its next move depending on previous interactions. Accordingly, complex patterns of strategic interactions may emerge, which may lead to exploitation, retaliation or mutual cooperation. The Iterated Prisoner’s Dilemma game has attracted the interest of many researchers in a wide set of fields, including game theorists, social
scientists, economists and computer scientists [Axelrod (1984); Angeline (1994); Hofstadter (1985); Yao and Darwen (1994)]. From the computational point of view, there has been deep interest in the development of effective strategies for the IPD game [Yao and Darwen (1994); Axelrod (1984); Delahaye and Mathieu (1995)]. Most well-known IPD strategies have been proposed by humans, who specify the decision rules that a player will follow depending on the opponent’s behaviour [Beaufils et al. (1997); Nowak and Sigmund (1993)]. Clearly, such strategies depend mainly on the researcher’s assumptions about the game. In a first computational approach, Axelrod explored human-designed strategies by confronting them in a tournament [Axelrod (1984)]. Conversely, there has also been some interest in obtaining IPD strategies using evolutionary computation, coevolution, reinforcement learning and other computational techniques, without explicitly specifying the decision rules [Sandholm and Crites (1995); Darwen and Yao (1995)]. These methods have found good IPD strategies while requiring little or no human intervention. For instance, in Axelrod’s work, human-designed strategies were compared to strategies obtained through evolution and coevolution [Axelrod (1984)].

Further research has been done towards finding strategies that generalise well without human intervention. Studies have focused on coevolutionary approaches, since no human intervention is required in the evaluation process. For instance, Darwen and Yao [Darwen and Yao (1996)] proposed a speciation scheme in order to get a modular system that played the IPD, in which coevolution and fitness sharing were used to obtain a diverse population that played as a whole against the opponent. The scheme showed a significant degree of generalisation. The model proposed in this work falls into the second kind of method, in which strategies are obtained computationally rather than specified by hand.
Thus, the main goal of this research is to generate an agent which learns to play the IPD game and is able to adapt to the opponent’s behaviour. Learning, memory and adaptation are argued to be desirable capabilities for an IPD agent; consequently, the agent is implemented using artificial immune networks, a computational technique inspired by the natural immune system that exhibits such capabilities.

The rest of this chapter is organised as follows. First, some fundamentals of artificial immune systems, namely immune networks, are summarised. Subsequently, a general model for an adaptive agent is introduced. Then, a specific immune-based model of this agent is explained in detail. An implementation of the immune model was developed and some experiments were carried out to validate the agent’s capabilities. The implemented agent showed adaptation and learning; however, in some cases, the immune agent performed poorly.

5.2. Immune network fundamentals

Antigens are substances capable of inducing a specific immune response. They may be viruses, bacteria, fungi or protozoa, and are regarded as invaders that cause harm to the body. However, an antigen may also be harmless, such as grass pollen [Jonathan (2001)]. Antibodies, on the other hand, are proteins found in the blood, produced by specialised white blood cells called B-cells. B-cells make antibodies when the body recognises that something foreign (an antigen) is present. Antibodies are the antigen-binding proteins present on the B-cell membrane; they are also secreted by plasma cells. The affinity between an antigen and an antibody is given by the complementarity of their binding regions. If the antigen/antibody affinity is higher than an affinity threshold, the corresponding B-cell becomes stimulated. In the early stages of the immune response, the affinity between the antibodies and antigens may be low, but as the B-cells undergo clonal selection, the binding B-cells mutate and clone again and again, improving the affinity of the binding between a particular antigen and a B-cell. Then, the mature and activated B-cells differentiate into plasma cells, which secrete antibodies with a high affinity for the antigen.

The Immune Network Theory tries to explain the way in which a natural immune system achieves immunological memory [Perelson and Weisbuch (1997)]. Jerne [Jerne (1974)] hypothesised that the immune system is a regulated network of molecules and cells that recognise one another even in the absence of antigens, rather than a set of isolated cells that respond only when stimulated by antigens. Although the main elements in immune network theory are B-cells, most models only consider the antibodies attached to the B-cell membranes. Therefore, only antibodies will be considered here.
The basic idea behind immune network theory is that antibodies are stimulated not only by antigens, but also by other antibodies, allowing the generated antibodies to be preserved over time for future encounters with the same or similar antigens. Therefore, when the same antigen reappears, the immune response is faster, since the immune system already contains suitable antibodies to deal with such an antigen. This is known as the secondary response, which is depicted in figure 5.1 [Jonathan (2001)].
Fig. 5.1. Secondary Response. The amount of antibodies is greater and the response time is shorter when the antigen is presented for the second time to the immune system
Even though antibodies stimulate each other, there is also a suppression relation between them, which controls the size of the network. Thus, the network structure is a result of the interactions among antibodies. A graphical representation of an immune network model is shown in figure 5.2.

Fig. 5.2. Immune Network Theory

An Artificial Immune Network (AIN) is a computational model based on immune network theory. In a broad sense, immune networks are mainly suitable for solving clustering and classification problems, due to their natural dynamics by which affine antibodies stimulate each other, thus forming clusters of antibodies with similar features. Typically, an immune network is stimulated by a set of antigens, corresponding to the input data of a problem, and the resulting structure of the immune network gives the solution to the related problem [Castro and Zuben (2000)]. When an antigen is presented to the AIN, the internal dynamics of the AIN develop antibodies with high affinity to the antigen, through a process called affinity maturation. This process involves the selection of high affinity antibodies and a mutation process called somatic hypermutation; it is an evolutionary process that, in a short period of time, evolves antibodies capable of dealing with the presented antigen.

Several computational models for immune networks have been proposed, which are mainly derived from aiNet, a model used for optimisation and data clustering proposed by de Castro, and RAIN, a model proposed by Timmis, also used for data analysis [Castro and Zuben (2000); Castro (2003)]. In the RAIN model, the resulting set of antibodies exhibits a spatial distribution that reflects the data concentration in the data space. On the other hand, the result of the aiNet model does not present this behaviour, as highly concentrated data are considered redundant and then eliminated. In the aiNet model the interaction among antibodies leads to network suppression, i.e., antibodies that are affine (close) suppress each other in order to control the size of the network and eliminate redundant information. Consequently, the aiNet model will be used in this work. Notice that this model does not consider stimulation among antibodies.

When using an immune network to solve a problem, it is necessary to specify the following aspects:
(1) identify the entities of the problem and find the corresponding elements in an immune network, i.e., antibodies and antigens;
(2) define an appropriate representation of such elements;
(3) define an affinity measure between antigens and antibodies, and among antibodies themselves; and
(4) establish the algorithms that model the behaviour of the immune network.

5.3. A general adaptive agent model

In this section, a general adaptive agent model to play the IPD game is proposed. The model is based on trying to figure out the opponent’s strategy, which is then used to determine the next move of the agent. The information about the recent history of the game is used to model the opponent’s strategy. Accordingly, in order to decide the next move, the IPD agent goes through the following three phases:

(1) Recognition of the opponent’s strategy
(2) Development of a good strategy to face the opponent
(3) Selection of the next move to play

In the first phase, the agent attempts to guess the strategy the opponent is playing, based on the recent history of moves from both players. The result of this phase is an IPD strategy which resembles the behaviour of the opponent and which is used in the next phase. In the second phase, the agent generates a strategy which obtains a good score when faced with the strategy generated in the first phase. Finally, in the third phase, the agent uses the strategy obtained in the second phase to decide its next move.

The adaptive IPD agent consists of a memory, a recognition module, a strategy generation module and a decision module (see figure 5.3), which are explained next.

• The memory is responsible for storing the recent history of moves played by both the agent and the opponent.
• The recognition module is responsible for recognising the opponent’s strategy based on the recent history; it produces a strategy that resembles that of the opponent.
• The strategy generation module is responsible for generating a strategy which obtains a good score when faced with the strategy that resembles the opponent’s.
• The decision module is responsible for using the strategy obtained by the strategy generation module in order to decide the next move that the agent will play.

Fig. 5.3. General model of the IPD agent

Though the model may look simple at first, it should be emphasised that the implementation of each of the modules is not trivial. The recognition module should try to infer the strategy that the opponent is playing, which may be a difficult task. Also, the strategy generation module should be able to adapt to changes in the opponent’s strategy.

5.4. Immune agent model

The definition of a particular agent based on the general model presented above requires the specification of each module, as well as of the representation that will be used for the strategies. The recognition of the opponent and the generation of a good strategy against it require adaptability and learning. Additionally, it would be desirable to preserve the strategies generated, which implies a memory
mechanism. For these reasons, Artificial Immune Networks are used to implement the recognition and strategy generation modules. The structure of the general IPD agent can be seen in figure 5.4, and the global IPD decision making process is described in algorithm 5.1.
Fig. 5.4. Structure of the immune agent
Algorithm 5.1. Decision making algorithm

Decision making
    while playing do
        Present history to the recognition AIN
        Find recognised strategy from the recognition AIN
        Present recognised strategy to strategy generation AIN
        Find best payoff strategy from strategy generation AIN
        Obtain suggested next move from best payoff strategy
        Play next move
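The control flow of algorithm 5.1 can be sketched in Python as a plain loop. This is only an outline under stated assumptions: the function play_game and the callables recognise, generate and lookup are hypothetical placeholders for the two immune networks and the decision module, and the stubs in the usage example stand in for them so that the loop can be run.

```python
def play_game(opponent, recognise, generate, lookup, rounds, k=10):
    """Run the decision-making loop for a fixed number of rounds.

    history holds (agent move, opponent move) pairs; only the last k
    rounds are shown to the modules, mirroring the agent's memory.
    """
    history = []
    for _ in range(rounds):
        recent = history[-k:]
        model = recognise(recent)        # strategy resembling the opponent
        reply = generate(model, recent)  # good strategy against that model
        mine = lookup(reply, recent)     # next move suggested by the reply
        theirs = opponent(history)       # opponent moves simultaneously
        history.append((mine, theirs))
    return history


def tft_opponent(history):
    """Tit-for-Tat seen from the agent's side: copy the agent's last move."""
    return history[-1][0] if history else "C"


# Stub modules (assumptions for illustration): the "agent" always defects.
trace = play_game(
    tft_opponent,
    recognise=lambda recent: None,
    generate=lambda model, recent: None,
    lookup=lambda reply, recent: "D",
    rounds=3,
)
print(trace)  # [('D', 'C'), ('D', 'D'), ('D', 'D')]
```

Keeping the three modules as separate callables follows the modular structure of the general agent model, so each one can be replaced without touching the loop.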
5.4.1. Strategy representation

First, each strategy is represented using a look up table [Axelrod (1984)]. This representation indicates the next move to play, based on the $n$ previous moves of both players. The representation consists of a vector of moves, where each position in the vector indicates the next move to be played given a specific history of the game. Thus, there are $2^{2n}$ possible histories given a memory of $n$ previous moves. Additionally, since there is no initial history, this representation requires $2n$ assumed pre-game moves at the beginning of the game. Hence, the total length of the vector of moves will be $2^{2n} + 2n$, and given that each position of the vector has 2 possible values, cooperate and defect, the number of strategies that can be represented is $2^{2^{2n} + 2n}$. An example of a look up table is shown in figure 5.5.
Fig. 5.5. Example of a Look up Table representing the strategy TFT
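A look up table of this kind can be sketched in Python as a dictionary from histories to moves. The layout below is an assumption made for illustration (the chapter stores the table as a flat vector): a history is a tuple of $n$ rounds, each round a pair (own move, opponent move), and the names all_histories, make_tft, next_move and vector_length are hypothetical.

```python
from itertools import product

C, D = "C", "D"

def all_histories(n):
    """All 2**(2n) histories of n rounds, each round an (own, opponent) pair."""
    return list(product(product((C, D), repeat=2), repeat=n))

def make_tft(n=1):
    """Tit-for-Tat as a look up table: repeat the opponent's latest move."""
    table = {h: h[-1][1] for h in all_histories(n)}
    pregame = [(C, C)] * n  # 2n assumed pre-game moves: mutual cooperation
    return {"n": n, "table": table, "pregame": pregame}

def next_move(strategy, history):
    """Look up the move for the last n rounds, padding with pre-game moves."""
    n = strategy["n"]
    padded = strategy["pregame"] + list(history)
    return strategy["table"][tuple(padded[-n:])]

def vector_length(n):
    """Length of the move vector: one entry per history plus pre-game moves."""
    return 2 ** (2 * n) + 2 * n

tft = make_tft(1)
print(vector_length(1), vector_length(3))  # 6 70
print(next_move(tft, []))                  # C
print(next_move(tft, [(C, D)]))            # D
```

With the maximum memory of 3 previous moves used later in the experiments, the vector has 70 positions, so the search space contains $2^{70}$ strategies.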
5.4.2. Memory

The memory of the agent is represented by 2 vectors, containing the last k moves played by the agent and the opponent.

5.4.3. Recognition module

An antibody of the Recognition AIN is represented by an IPD strategy. The Recognition AIN receives as an antigen the history of recent moves of both the opponent and the agent itself. As the agent should obtain a strategy similar to the opponent’s, the antibodies are stimulated according to their similarity to the opponent. This is measured by presenting the moves played by the agent to each strategy and comparing its responses with those of the opponent. The measure is given by the Hamming distance between the move sequences of the strategy and the opponent.

Additionally, the AIN model requires a measure of stimulation between antibodies. This measure is given by the similarity between the strategies. The similarity between two strategies is measured indirectly as follows: both strategies play against a randomly generated sequence of moves. Then, the moves of the strategies are compared using the Hamming distance, and the percentage of coincidences determines the similarity of the strategies. The interaction between antibodies leads to suppression, i.e., similar strategies suppress each other.

After presenting the history of recent moves to the Recognition AIN, the most stimulated antibody is taken, because it represents the strategy which best resembles the opponent. A summary of the representation of the elements in the recognition AIN is given in table 5.2.

Table 5.2. Recognition immune network representation

  Immune network               Representation
  Antigen                      History of moves
  Antibody                     IPD strategy
  Antibody/Antigen affinity    Similarity between the strategy and the opponent’s
  Antibody/Antibody affinity   Similarity between the strategies
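Both affinity measures of table 5.2 can be sketched in a few lines of Python. The sketch makes several assumptions that are not in the chapter: strategies are plain callables that receive the rounds so far as (own move, other move) pairs, the replay starts from an empty history, affinity is reported as the fraction of coinciding moves (one minus the normalised Hamming distance), and the names replay, affinity and similarity are hypothetical.

```python
import random

def replay(strategy, moves_against):
    """Moves the candidate strategy plays when fed a fixed sequence of moves."""
    history, played = [], []
    for other_move in moves_against:
        move = strategy(history)
        played.append(move)
        history.append((move, other_move))
    return played

def affinity(strategy, agent_moves, opponent_moves):
    """Antibody/antigen affinity: fraction of moves matching the opponent's."""
    replayed = replay(strategy, agent_moves)
    matches = sum(a == b for a, b in zip(replayed, opponent_moves))
    return matches / len(opponent_moves)

def similarity(first, second, length=20, seed=0):
    """Antibody/antibody affinity: agreement against one random move sequence."""
    rng = random.Random(seed)
    probe = [rng.choice("CD") for _ in range(length)]
    a, b = replay(first, probe), replay(second, probe)
    return sum(x == y for x, y in zip(a, b)) / length

def tft(history):
    return history[-1][1] if history else "C"

def alld(history):
    return "D"

agent_moves = list("CDDCC")     # what the agent played
opponent_moves = list("CCDDC")  # what the opponent answered (it is TFT)
print(affinity(tft, agent_moves, opponent_moves))   # 1.0
print(affinity(alld, agent_moves, opponent_moves))  # 0.4
print(similarity(tft, tft))                         # 1.0
```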
In addition, the process of affinity maturation requires the strategies to be mutated. Strategies are mutated in two ways. The first consists of changing the number of previous interactions remembered by the strategy (its memory length), and the second consists of mutating each position of the vector that defines the strategy according to the mutation rate.

The memory length is changed as follows: the new length is selected randomly between one and the maximum allowed memory length. If the new length of the strategy is the same as the old one, nothing has to be done. If it is longer, the new positions of the vector are filled in such a way that the strategy presents the same decision rules as before. This operation is shown in figure 5.6.

Fig. 5.6. Example of mutation when new LuT memory length is larger

If the new history length is shorter than before, the process is as follows: notice that there are four histories which differ only in the last move. Thus, removing the last move will cause those four histories to be compressed into one. Therefore, the value of the compressed history will be the value held by the majority of the corresponding histories in the original vector. If there is a tie, it is resolved as Defect. This operation is shown in figure 5.7.

Fig. 5.7. Example of mutation when new LuT memory length is shorter

5.4.3.1. Immune network model

In the aiNet model, all the antigens are known a priori and they are presented to the network many times until the structure of the network adapts to the antigen set. In contrast, for the proposed IPD agent, the opponents are not known a priori, and the agent has to adapt to the opponents as they appear. Accordingly, to deal with this problem, a slightly modified version of the aiNet algorithm is used.

The main modification of the aiNet algorithm is introduced in the mechanism used by the network to add antibodies to the memory. An antibody interacts with the antibodies that have already been memorised. If the suppression it receives from memorised antibodies is less than the suppression threshold, it is added to the memory and will never be removed. Notice that if an antibody is suppressed by the memorised ones, it means that an antibody capable of recognising such an antigen is already present in the memory. Thus, in order to avoid redundancy, this new antibody is not added to the memory.

When a new opponent starts playing a game, there is not enough information to consider that the recognised antibodies correspond to the opponent; therefore, adding an antibody at the very beginning of the game is not a good idea. Additionally, since the agent confronts the same opponent for many moves, it is not necessary to add antibodies to the memory at every move, given that the history of moves does not change significantly with only one new move. Thus, it is more efficient to add the generated antibodies periodically, every k moves. The modified version of the aiNet algorithm is summarised in algorithm 5.2.
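Returning to the memory-length mutation of section 5.4.3, the two resizing operations can be sketched in Python as follows. The sketch rests on assumptions that the chapter does not fix: the table is a dictionary keyed by tuples of rounds ordered from oldest to most recent, the round that is dropped or added is the oldest one, and shrinking by several rounds is done one round at a time. The names ROUNDS, histories, lengthen and shorten are hypothetical.

```python
from collections import Counter
from itertools import product

# The four possible outcomes of one round, as (own move, opponent move).
ROUNDS = list(product("CD", repeat=2))

def histories(n):
    """All histories of n rounds, oldest round first."""
    return list(product(ROUNDS, repeat=n))

def lengthen(table, n, new_n):
    """Grow memory from n to new_n rounds without changing behaviour.

    Every longer history inherits the move of its n most recent rounds.
    """
    return {h: table[h[-n:]] for h in histories(new_n)}

def shorten(table, n, new_n):
    """Shrink memory by majority vote over the four merged histories.

    Ties are resolved as Defect, as described in the text.
    """
    while n > new_n:
        shorter = {}
        for h in histories(n - 1):
            votes = Counter(table[(extra,) + h] for extra in ROUNDS)
            shorter[h] = "C" if votes["C"] > votes["D"] else "D"
        table, n = shorter, n - 1
    return table

tft = {h: h[-1][1] for h in histories(1)}  # memory-1 Tit-for-Tat
longer = lengthen(tft, 1, 2)
print(len(longer))                   # 16
print(shorten(longer, 2, 1) == tft)  # True
```

Lengthening followed by shortening returns the original table here because the four merged histories always agree; in general shortening loses information, which is why the majority rule and the tie-break are needed.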
Algorithm 5.2. Modified aiNet algorithm
Modified aiNet
    for each antigen do
        Add new random antibodies to the network
        Calculate antigen/antibody affinity
        Select the n antibodies with highest affinity
        Clone and hypermutate selected antibodies
        Re-calculate antigen/antibody affinity
        Re-select a percentage of highest affinity antibodies
        Remove low affinity antibodies
        Calculate suppression among antibodies
        Remove highly suppressed antibodies
        Add resultant antibodies to the memory

In the algorithm, the affinity (suppression) of the antibodies is normalised to the interval [0,1]. It is then considered low (high) in relation to an affinity (suppression) threshold, which is a parameter of the algorithm. Additionally, in the hypermutation process, the mutation rate is inversely related to the affinity between the antibody and the antigen; specifically, it was defined as 1 − affinity. This means that high affinity antibodies are mutated less than low affinity antibodies, which helps to keep good antibodies while exploring new regions of the search space. When a new antigen is presented, the network dynamics develop antibodies with high affinity (similarity) to it.

5.4.4. Strategy generation module

For this module, the antibodies are also represented as game strategies. In this case, the strategy obtained in phase one is presented as an antigen to the second AIN. As the agent is interested in obtaining a good strategy against the one obtained in the first phase, the antibodies are stimulated according to the result of a short IPD game between the antigen and each antibody, beginning from the current history of the game between the agent and the opponent. The affinity between antibodies, the mutation operator and the immune network algorithm are defined in the same way as in the recognition AIN. Therefore, the most stimulated antibody corresponds to the best strategy against the one that resembles the opponent, and is selected as the output of this phase.
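Both networks share the hypermutation rule of the modified aiNet algorithm, in which the mutation rate is 1 − affinity. A minimal Python sketch of that rule is given below, assuming that a strategy is a list of moves and that each position is flipped independently; the names hypermutate and clone_and_mutate are hypothetical, and the memory-length mutation is left out.

```python
import random

FLIP = {"C": "D", "D": "C"}

def hypermutate(vector, affinity, rng):
    """Flip each move with probability 1 - affinity (affinity in [0, 1])."""
    rate = 1.0 - affinity
    return [FLIP[move] if rate > rng.random() else move for move in vector]

def clone_and_mutate(vector, affinity, n_clones, rng):
    """Produce n_clones independently hypermutated copies of one antibody."""
    return [hypermutate(vector, affinity, rng) for _ in range(n_clones)]

rng = random.Random(42)
parent = list("CDCDCD")

# A perfectly matching antibody (affinity 1) is never altered.
print(hypermutate(parent, 1.0, rng) == parent)  # True

# An antibody with affinity 0 has every position flipped.
print(hypermutate(parent, 0.0, rng))  # ['D', 'C', 'D', 'C', 'D', 'C']

clones = clone_and_mutate(parent, 0.9, 5, rng)
print(len(clones))  # 5
```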
A summary of the representation in the strategy generation AIN is shown in table 5.3.

Table 5.3. Strategy generation immune network representation

  Immune network               Representation
  Antigen                      IPD strategy
  Antibody                     IPD strategy
  Antibody/Antigen affinity    Payoff of a short IPD game
  Antibody/Antibody affinity   Similarity between the strategies
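The antibody/antigen affinity of table 5.3 can be sketched in Python as the payoff of a short simulated game. Several details below are assumptions made for illustration: strategies are callables over (own move, other move) rounds, the game lasts five rounds, the payoff values are those stated for the experiments in section 5.5, and the total is scaled to [0, 1] by dividing by the largest payoff obtainable. The name payoff_affinity is hypothetical.

```python
# Payoff to the first player for (first move, second move); values as in
# section 5.5 (Temptation=5, Reward=4, Punishment=1, Sucker's Payoff=0).
PAYOFF = {("C", "C"): 4, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def payoff_affinity(antibody, antigen, history, rounds=5):
    """Scaled payoff of `antibody` in a short game against `antigen`.

    history is the current game as (antibody side, antigen side) rounds;
    the short game continues from it.
    """
    seen = list(history)
    total = 0
    for _ in range(rounds):
        mine = antibody(seen)
        # The antigen sees the same rounds from its own side.
        theirs = antigen([(y, x) for x, y in seen])
        total += PAYOFF[(mine, theirs)]
        seen.append((mine, theirs))
    return total / (rounds * max(PAYOFF.values()))

def tft(history):
    return history[-1][1] if history else "C"

def allc(history):
    return "C"

def alld(history):
    return "D"

# Against a TFT-like antigen, cooperating scores better than defecting.
print(payoff_affinity(allc, tft, []))  # 0.8
print(payoff_affinity(alld, tft, []))  # 0.36
```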
5.4.5. Decision module

Once a good strategy against the opponent has been found, it is used to look up the next move that the agent will play, given the recent history of the game.
5.5. Experimental results

Some experiments were carried out in order to explore the capabilities of the proposed agent. All the experiments used a payoff matrix where Temptation=5, Punishment=1, Sucker’s Payoff=0 and Reward=4. The values of the parameters of an immune network affect some aspects of it, such as the number of antibodies in the network and the performance of the affinity maturation process. After testing several values for the parameters, the following were found to provide good behaviour for the agent: the suppression threshold was 0.8 and the affinity threshold was 0.9; the number of stimulated antibodies selected in each iteration was 5, and the percentage of stimulated antibodies selected after being cloned and hypermutated was 20%. In each iteration of the immune networks, four new random antibodies were added to the network. In clonal selection, the minimum number of clones that a stimulated antibody could generate was 5, and the maximum was 10. New antibodies were added to the memory of each network every 20 moves. In the recognition process, the length of the history of moves was 10, and the maximum memory length of the lookup table representation was set to 3 previous moves.

The experiments were designed to answer some key questions about the agent’s capabilities, which are addressed in the following subsections.
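For reference, the parameter values listed above can be collected in one configuration object. The Python sketch below is a convenience only; the class name AgentConfig and all field names are hypothetical, while the default values are the ones stated in the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentConfig:
    # Payoff values used in all experiments
    temptation: int = 5
    reward: int = 4
    punishment: int = 1
    sucker: int = 0
    # Immune network thresholds
    suppression_threshold: float = 0.8
    affinity_threshold: float = 0.9
    # Selection and cloning
    n_selected: int = 5               # stimulated antibodies selected per iteration
    reselect_fraction: float = 0.20   # kept after cloning and hypermutation
    new_random_antibodies: int = 4    # added in each network iteration
    min_clones: int = 5
    max_clones: int = 10
    # Memory and representation
    memory_update_period: int = 20    # moves between additions to the memory
    history_length: int = 10          # moves shown to the recognition network
    max_lut_memory: int = 3           # maximum look up table memory length

config = AgentConfig()

# The payoffs satisfy the usual Prisoner's Dilemma ordering.
print(config.temptation > config.reward > config.punishment > config.sucker)  # True
```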
5.5.1. Can the agent adapt to a new opponent?

In order to test the adaptability of the immune agent when confronting a single opponent, it was faced with opponents playing the well-known strategies TFT, ALLD, Pavlov and GRIM. The length of the game was 100 moves, and there were 100 repetitions for each opponent. The average score obtained in this experiment is shown in figure 5.8.
[Figure 5.8: four panels, a) Adaptation for TFT, b) Adaptation for ALLD, c) Adaptation for Pavlov, d) Adaptation for GRIM; each plots Average Payoff (0 to 6) against Move Number (0 to 100) for the Agent and Optimal curves.]
Fig. 5.8. Adaptability Tests. Optimal is obtained from mutual cooperation in a), c) and d), and mutual defection in b).
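The Optimal reference values named in the caption of figure 5.8 can be reproduced with a small simulation. The Python sketch below uses the payoff values stated in section 5.5 and the usual textbook definitions of the four opponents, which are assumptions here since the chapter does not spell them out; the function average_payoff and the strategy functions are hypothetical names.

```python
# Payoff to the first player; values as stated in section 5.5.
PAYOFF = {("C", "C"): 4, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

# Each strategy receives the rounds so far as (own move, other move) pairs.
def allc(history):
    return "C"

def alld(history):
    return "D"

def tft(history):
    return history[-1][1] if history else "C"

def pavlov(history):
    # Win-stay, lose-shift: cooperate after CC or DD, defect otherwise.
    return "C" if not history or history[-1][0] == history[-1][1] else "D"

def grim(history):
    # Cooperate until the other player defects once, then defect forever.
    return "D" if any(other == "D" for _, other in history) else "C"

def average_payoff(agent, opponent, rounds=100):
    """Mean payoff per move of `agent` over one game against `opponent`."""
    seen, total = [], 0
    for _ in range(rounds):
        mine = agent(seen)
        theirs = opponent([(y, x) for x, y in seen])
        total += PAYOFF[(mine, theirs)]
        seen.append((mine, theirs))
    return total / rounds

for name, opponent in [("TFT", tft), ("ALLD", alld), ("Pavlov", pavlov), ("GRIM", grim)]:
    print(name, average_payoff(allc, opponent), average_payoff(alld, opponent))
# TFT 4.0 1.04
# ALLD 0.0 1.0
# Pavlov 4.0 3.0
# GRIM 4.0 1.04
```

Under these assumptions, steady cooperation earns the reward of 4 per move against TFT, Pavlov and GRIM, while against ALLD the best an agent can do is the punishment payoff of 1 from mutual defection, in line with the caption.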
As can be seen, the agent adapts its behaviour to that of the opponent, which leads to an increase of the mean payoff over the first 20 moves, after which it stabilises.

5.5.2. Can the agent adapt to consecutive opponents?

The agent was confronted with two opponents one after the other, in order to evaluate the adaptability of the agent to further opponents (i.e. not only the first opponent it confronts). The results for consecutive opponents playing ALLD-TFT and Pavlov-GRIM can be seen in figure 5.9.
[Figure 5.9: two panels, Adaptation for Consecutive Opponents (ALLD-TFT) and Adaptation for Consecutive Opponents (Pavlov-GRIM); each plots Average Payoff (0 to 6) against Move Number (0 to 200) for the Agent and Optimal curves.]
Fig. 5.9. Tests of adaptation to consecutive opponents
Experimental results showed that the immune agent adapts to every new opponent it confronts. Moreover, the curves described by the mean
payoff are very similar to those found in the first experiment, which shows that the agent preserves its adaptability through multiple games.

5.5.3. Can the agent remember previous opponents?

Since immune networks possess a memory mechanism, this capability was evaluated in the agent. In this setup, the agent first confronts an opponent, then faces another opponent, and then confronts the first opponent once again. Two experiments were performed: the first confronted TFT-ALLD-TFT, and the second confronted Pavlov-GRIM-Pavlov-GRIM. Again, 100 repetitions of each experiment were carried out and the length of every game was 100 moves. The average payoff can be seen in figure 5.10. The results showed that the mean payoff curves stabilised faster the second time the agent faced an opponent, as a result of the memory capability of the agent. However, the mean value at which the payoff stabilises did not increase.

5.5.4. Results from the IPD competition

The agent proposed in this chapter participated in the IPD competitions held at CEC 2004 and CIG 2005, under the name “Immune Based Agent”. It competed twice in the first competition, where it was ranked 126 out of 223 and 160 out of 223. In the second competition, it participated in category 4 (one entry per participant) and was ranked 40 out of 50.

5.6. Discussion

Experimental results show that the proposed agent presents the expected behaviour: it adapts its behaviour to the opponents it confronts in order to increase its payoff, and it is also able to remember its interactions with opponents in order to recognise them faster in future encounters. However, the following was observed in the agent’s behaviour:

• The payoff stabilises at a mean value which is less than the best possible payoff.
• For some opponents such as GRIM, the performance of the agent is very poor: it obtains a payoff much lower than the best possible.
[Figure 5.10: two panels, Memory of previous opponents (TFT-ALLD-TFT), Move Number 0 to 300, and Memory of previous opponents (Pavlov-GRIM-Pavlov-GRIM), Move Number 0 to 400; each plots Average Payoff (0 to 6) for the Agent and Optimal curves.]
Fig. 5.10. Test of memory of previously met opponents
Some explanations for the agent’s behaviour could be hypothesised as follows:
• The recognition module finds a strategy similar to the opponent’s, which is good enough for most cases. However, a history of moves could correspond to several opponent strategies, which makes it very difficult for the recognition module to find the exact strategy that the opponent is playing. For instance, a history where all previous moves are COOPERATE could correspond to players playing TFT or ALLC, and the best response is different in each case.
• Since the recognition process is imperfect, the strategy found by the strategy generation module may not be the most adequate, and could lead the agent to make bad decisions. This produces a non-optimal payoff, and with some strongly retaliatory opponents, it may lead to mutual defection and low payoffs. This explains why the immune agent does not obtain a good payoff when confronting strategies such as GRIM, since it tries to take advantage of the opponent and, consequently, it receives a strong retaliation from GRIM.
• The model does not implement a feedback mechanism that could help to determine how good the selected strategy is. Notice that the agent knows the history of moves, but it does not analyse whether the strategy it is currently using is good or bad in order to change it if it is performing badly.

The experiments also show that since the agent does not reach the best possible payoff, it is slightly exploited by very uncooperative strategies, such as ALLD. An analysis of the performance of the strategy during the competition showed that the agent frequently evolved to mutual defection with opponents that were not fully uncooperative, such as go by majority. There were also some cases where the agent became exploited by some opponents, probably due to the perception limitations of the agent described above. As a consequence, the agent performed poorly in the competition.

5.7. Conclusions

This work presented an agent model that played the IPD game.
The model is based on artificial immune systems in order to achieve adaptability, learning and memory. Some experiments were carried out in order to evaluate the behaviour of the proposed agent. The results showed that the agent presents the expected capabilities: it adapted its own behaviour to suit the opponent’s
one, going through a learning process which produced an increase of the mean payoff until it reached a stable value. Additionally, the learning process was faster when the agent met the opponent for the second time, which evidenced a memory mechanism. However, although the mean payoff increased and stabilised due to the learning process, it did not reach the optimum value. Additionally, for some strategies such as GRIM, the agent did not even obtain a payoff close to the best possible. This shows that although the agent performs as expected, it still needs to be tuned in order to avoid poor performance in some special cases. In particular, the proposed model could be modified by using different computational techniques, such as evolutionary algorithms, to implement some of the modules. It may also be extended to include multiple levels of cooperation and multiple opponents. Additionally, the agent could be endowed with a feedback mechanism, such as reinforcement learning.

References

Alonso, O., Niño, F. and Velez, M. (2004). A robust immune based approach to the iterated prisoner’s dilemma, in Proceedings of the 3rd International Conference on Artificial Immune Systems, pp. 290–301.
Angeline, P. J. (1994). An alternate interpretation of the iterated prisoner’s dilemma and the evolution of non-mutual cooperation, in Proceedings 4th Artificial Life Conference, pp. 353–358.
Axelrod, R. (1984). The Evolution of Cooperation (Basic Books, New York, USA).
Beaufils, B., Delahaye, J.-P. and Mathieu, P. (1997). Our meeting with gradual: A good strategy for the iterated prisoner’s dilemma, in Artificial Life V (Proceedings of the Fifth Int’l Workshop on the Synthesis and Simulation of Living Systems) (MIT Press), pp. 202–209.
Castro, L. N. D. (2003). The immune response of an artificial immune network (ainet), in Congress on Evolutionary Computation (CEC’03) (Canberra), pp. 146–153.
Castro, L. N. D. and Zuben, F. J. V. (2000). An evolutionary immune network for data clustering, in IEEE Brazilian Symposium on Artificial Neural Networks (Rio de Janeiro), pp. 84–89.
Darwen, P. and Yao, X. (1995). On evolving robust strategies for iterated prisoner’s dilemma, in Progress in Evolutionary Computation, Lecture Notes in Artificial Intelligence, Vol. 956, pp. 276–292.
Darwen, P. and Yao, X. (1996). Automatic modularization by speciation, in Proc. of the 1996 IEEE Int’l Conf. on Evolutionary Computation (ICEC’96) (IEEE Press, Nagoya, Japan), pp. 88–93.
Delahaye, J.-P. and Mathieu, P. (1995). Complex strategies in the iterated prisoner’s dilemma, in A. Albert (ed.), Chaos and Society, Frontiers in Artificial Intelligence and Applications, Vol. 29 (IOS Press, Amsterdam), pp. 283–292.
Hofstadter, D. R. (1985). The prisoner’s dilemma computer tournaments and the evolution of cooperation, in Metamagical Themas: Questing for the essence of mind and pattern (Basic Books, New York).
Jerne, N. K. (1974). Towards a network theory of the immune system, Ann. Immunol. 125, pp. 373–389.
Jonathan, T. (2001). Artificial Immune Systems: A novel data analysis technique inspired by the immune network theory, Ph.D. thesis, University of Wales, Aberystwyth, Wales.
Nowak, M. A. and Sigmund, K. (1993). A strategy of win-stay lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game, Nature 364, pp. 56–58.
Perelson, A. S. and Weisbuch, R. (1997). Immunology for physicists, Rev. Modern Physics 69, pp. 1219–1267.
Sandholm, T. and Crites, R. (1995). Multiagent reinforcement learning in the iterated prisoner’s dilemma, BioSystems: Special Issue on the Prisoner’s Dilemma 37, pp. 147–166.
Tucker, A. W. (1950). A two person dilemma.
Yao, X. and Darwen, P. J. (1994). An experimental study of n-person iterated prisoner’s dilemma games, Informatica 18, pp. 435–450.
Chapter 6 Exponential Smoothed Tit-for-Tat
Michael Filzmoser
University of Vienna

Reciprocating strategies, such as Tit-for-Tat, have been shown to be very successful in IPD tournaments without noise, while other tournaments and analytical studies show that they perform rather poorly in noisy environments. The implementation of generosity or contrition into reciprocating strategies was proposed as a solution for this poor performance. We propose a third possibility, a relaxation of the provocability property of reciprocating strategies, which we implement by exponential smoothing. This chapter explores how exponential smoothing and Tit-for-Tat can be combined in ‘Exponential Smoothed Tit-for-Tat’ strategies for the Iterated Prisoners’ Dilemma and how these strategies perform in competitions with and without noise compared to Tit-for-Tat.

6.1. Introduction

Robert Axelrod (1980a,b, 1984) was the first to perform computer tournaments of the Iterated Prisoners’ Dilemma (IPD). In these tournaments strategies played the Prisoners’ Dilemma repeatedly with additional information about the history of their own moves as well as of the moves of the opponent strategy. In two tournaments with 14 and 62 entries respectively the winner both times was Tit-for-Tat (TFT), submitted by Anatol Rapoport, the simplest of all participating strategies. TFT starts with cooperation and afterwards mirrors the opponent’s move of the previous round. Niceness and provocability were identified as important properties of successful strategies in the IPD, and both are embodied in TFT. Niceness in this context denotes that a strategy should never be the first to defect, while provocability denotes that an ‘uncalled for’ defection of the opponent should be punished immediately by a defection (Axelrod, 1980b).
In recent years the original IPD has been extended by the integration of ‘noise’. Noise in the context of the IPD can denote either measurement errors (a strategy receives incorrect information that its opponent defected while it actually cooperated, or vice versa) or implementation errors (a strategy which intends to cooperate in a given situation erroneously defects, or vice versa) (Bendor, 1993).[a] Axelrod and Wu (1995) state that noise is an important feature of real-world interaction, as errors in the implementation of choice can never be completely excluded. It has been shown analytically (Molander, 1985; Bendor, 1993) as well as by further IPD tournaments which incorporated noise (Donninger, 1986; Bendor et al., 1991) that the existence of noise dramatically undermines the performance of reciprocating strategies like TFT. Bendor, Kramer and Stout (1991) argue that a main reason for the poor performance of TFT in noisy environments is the unintended involvement in vendettas of mutual or alternating defection with other nice and provocable strategies, which can be caused by a single implementation error on either side.

[a] We focus exclusively on implementation error, as this was the category of noise implemented in the IPD tournament of G. Kendall, P. Darwen, and X. Yao performed in April 2005, on which this study is based (see http://www.prisoners-dilemma.com).

To cope with noise, Axelrod and Wu (1995) propose to make reciprocating strategies more generous or more contrite. Generosity denotes that some of the opponent’s defections are not punished, as they could be the result of noise. Such generosity can of course be exploited easily, but it prevents a single error from echoing throughout the whole game and can therefore maintain mutual cooperation among reciprocating strategies. Contrition, on the other hand, means that a strategy should avoid defecting in reaction to a defection of the opponent in the last round if that defection was itself an answer to the strategy’s own implementation error in the round before last. While generosity can be conceived as a correction of the opponent’s implementation errors, contrition can be interpreted as the correction of one’s own implementation errors in a noisy environment.

However, both of these further developments of reciprocating strategies for noisy environments are one-sided insofar as they focus on correcting either their own or the opponent’s implementation errors only; neither concept attempts to correct both kinds. Moreover, the mitigation of the provocability property of reciprocating strategies takes place in a rather indiscriminate way, by an increase of generosity or the implementation of contrition. In an effort to improve the performance of reciprocating strategies when they play against other reciprocating strategies, one must not neglect the existence of non-reciprocating strategies. Such strategies could capitalize on the combination of noise and generosity by infrequent but intentional defections. Moreover, if the history of the opponent’s moves consists of a series of defections, a single cooperation, which could be an implementation error of the opponent, should not induce a reciprocating strategy to switch from defection to cooperation. In such a case of continuous defection of the opponent, an increase of generosity will only reduce the performance of a reciprocating strategy.

We share the opinion that a mitigation of the provocability property is essential to overcome the comparatively poor performance of reciprocating strategies like TFT in IPD tournaments with noise. To do so we propose a third alternative besides generosity and contrition. We hold the view that the whole history of the opponent’s moves, as well as the misperceptions, should be taken into consideration by a reciprocating strategy in the decision to cooperate or defect. Generous or contrite reciprocating strategies as proposed by Axelrod and Wu (1995) only take into account the last move of the opponent and use some additional modification rules to adapt to the situation of noise. The analysis of the entire series of moves of the opponent should allow filtering out reactions to our own implementation errors as well as the opponent’s implementation errors, which in turn should improve the performance of a so-designed reciprocating strategy.

In Section 6.2 we present exponential smoothing, which we suggest as a method to implement the concept of considering the whole history of moves in the decision making process of reciprocating IPD strategies. Furthermore, ‘Exponential Smoothed Tit-for-Tat’ (ESTFT) strategies are developed.
Section 6.3 reports on the performance of the ESTFT strategies in an IPD tournament, in competitions with and without noise, and in comparison to TFT. Section 6.4 summarizes the main results and concludes.

6.2. Exponential Smoothed Tit-for-Tat

The intention of ESTFT is to incorporate the two properties of TFT, niceness and provocability, which have been shown to be important ingredients of successful strategies in the IPD without noise, and to mitigate provocability to adjust to the existence of noise. To do so ESTFT uses exponential smoothing. Exponential smoothing was used by Tzafestas (2000) as the basis for the development of a meta-regulated adaptive TFT (a strategy that drops the cooperation rate when the opponent is perceived as cooperative and increases it otherwise) and by Ashlock et al. (1996) for memory weighting in a study on partner selection for the IPD. However, exponential smoothing has not yet been applied to cope with the problem of noise in the IPD, which is the focus of this study. In the next two subsections, the concept of exponential smoothing is first briefly presented and then applied to the design of exponential smoothed Tit-for-Tat strategies for competitions with and without noise.

6.2.1. Exponential Smoothing

Exponential smoothing is originally a time series analysis approach, which can be used for the analysis of time series that exhibit neither trend nor seasonal components. It allows past, possibly less important, observations to be weighted differently from recent ones. From the original time series Xt the exponentially smoothed time series St can be calculated by (6.1). For the calculation it is necessary to specify a starting value S0, as no observations of the original time series exist for the first period.
\[
S_t =
\begin{cases}
S_0 & \text{if } t = 0, \\
(1 - \alpha) S_{t-1} + \alpha X_{t-1} & \text{otherwise}
\end{cases}
\tag{6.1}
\]
In (6.1) α is the smoothing parameter that indicates the weight assigned to the last observation.[b] The higher α, the lower the smoothing of the time series: for α = 1 exponential smoothing reproduces the original time series (shifted by one period), while for α = 0 the smoothed time series is the constant St = S0. Exponential smoothing can be customized for the design of reciprocating or simple deterministic IPD strategies. We conceive the series of transformed opponent’s moves mt as the observations that are to be smoothed, where the opponent’s moves are transformed into discrete numbers by applying (6.2).
\[
m_t =
\begin{cases}
1 & \text{if the opponent's move in } t \text{ is `cooperate'} \\
0 & \text{if the opponent's move in } t \text{ is `defect'}
\end{cases}
\tag{6.2}
\]
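The recurrence (6.1) and the encoding (6.2) can be illustrated with a minimal Python sketch. The function names `encode_move` and `smooth` and the move labels "C" and "D" are assumptions made for this illustration; they do not come from the chapter or from the tournament software.

```python
from typing import List, Sequence

COOPERATE, DEFECT = "C", "D"  # hypothetical move labels, not from the chapter


def encode_move(move: str) -> int:
    """Encoding (6.2): 1 for 'cooperate', 0 for 'defect'."""
    return 1 if move == COOPERATE else 0


def smooth(observations: Sequence[float], alpha: float, s0: float) -> List[float]:
    """Return [S_0, S_1, ..., S_n] computed with recurrence (6.1)."""
    series = [s0]
    for x in observations:
        # S_t = (1 - alpha) * S_{t-1} + alpha * X_{t-1}
        series.append((1 - alpha) * series[-1] + alpha * x)
    return series


moves = [encode_move(m) for m in "CCDCC"]
no_smoothing = smooth(moves, alpha=1.0, s0=1.0)  # copies the observations, one step late
constant = smooth(moves, alpha=0.0, s0=1.0)      # never leaves S_0
partial = smooth(moves, alpha=0.5, s0=1.0)       # a single defection only dents the series

print(no_smoothing)
print(constant)
print(partial)
```

With α = 0.5 the single defection pulls the smoothed value down to 0.5, after which it recovers step by step, which is the behaviour the strategies of Section 6.2.2 rely on.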
[b] The notion ‘smoothing parameter’ is somewhat misleading, as a higher value for this parameter leads to a stronger consideration of currently observed values and therefore results in a less smoothed time series.

With the adapted exponential smoothing formula (6.3), different kinds of simple deterministic strategies can be designed that either defect if St = 0 or cooperate if St = 1. The parameter combination S0 = 1 and α = 1 exactly equals the TFT strategy (Tzafestas, 2000); with α = 0 and S0 = 1 (respectively S0 = 0) a constant series of cooperations (respectively defections), and therefore an ALLC (respectively ALLD) strategy, can be modelled. Many other combinations of the two variables, starting value S0 and smoothing parameter α, are possible, which allows a large number of strategies to be modelled.
\[
S_t =
\begin{cases}
S_0 & \text{if } t = 0, \\
(1 - \alpha) S_{t-1} + \alpha m_{t-1} & \text{otherwise}
\end{cases}
\tag{6.3}
\]
We refer to the internal register St as the ‘mood’ of the strategy. This mood is a continuous variable ranging from 0 (in case of total defection of the opponent) to 1 (for total cooperation of the opponent). Intermediate values between these two extremes represent different degrees of cooperation (closer to 1) and defection (closer to 0). In the spirit of TFT, the next own move (either cooperate or defect) is derived from this mood by a threshold rule (see Section 6.2.2). Furthermore, we need an initial mood I, an expectation about the opponent’s behavior, to calculate the bounds on the smoothing parameter α for the ESTFT strategies designed for the competition with noise (see Section 6.2.2). We derive I from the optimistic assumption that the opponent strategy is cooperative or reciprocating and will therefore cooperate when it plays against ESTFT strategies, except for the expected 10% of implementation errors due to noise (i.e. I = 0.9).

6.2.2. Strategies for Competitions with and without Noise

The ESTFT strategies are actually a two-parameter family of IPD strategies,[c] where the two decision parameters are i) the smoothing parameter α and ii) the threshold rule that determines for which values of St the strategy should cooperate or defect. For all ESTFT strategies the threshold rule for cooperation and defection in round t is as follows: for St ≥ 0.5 ESTFT cooperates, otherwise it defects (see (6.4)).
\[
\text{move in } t =
\begin{cases}
\text{`cooperate'} & \text{if } 0.5 \le S_t \le 1 \\
\text{`defect'} & \text{if } 0 \le S_t < 0.5
\end{cases}
\tag{6.4}
\]
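The mood update (6.3) and the threshold rule (6.4) together define a complete strategy, which the following Python sketch makes concrete. The class `ESTFT`, the helper `responses`, the move labels "C" and "D", and the default starting value `s0=1.0` are assumptions of this illustration, not the author's tournament entry.

```python
class ESTFT:
    """Sketch of Exponential Smoothed Tit-for-Tat: mood (6.3) plus threshold (6.4)."""

    def __init__(self, alpha: float, s0: float = 1.0, threshold: float = 0.5):
        self.alpha = alpha          # smoothing parameter
        self.mood = s0              # S_0, assumed 1.0 (start with cooperation)
        self.threshold = threshold  # 0.5 for all ESTFT strategies in the chapter

    def move(self) -> str:
        # Rule (6.4): cooperate if the mood is at least the threshold.
        return "C" if self.mood >= self.threshold else "D"

    def observe(self, opponent_move: str) -> None:
        # Encoding (6.2) followed by the update (6.3).
        m = 1 if opponent_move == "C" else 0
        self.mood = (1 - self.alpha) * self.mood + self.alpha * m


def responses(strategy: ESTFT, opponent_moves: str) -> str:
    """Moves the strategy plays against a fixed sequence of opponent moves."""
    played = []
    for opponent_move in opponent_moves:
        played.append(strategy.move())
        strategy.observe(opponent_move)
    return "".join(played)


print(responses(ESTFT(alpha=1.0), "CDDCD"))          # behaves like TFT
print(responses(ESTFT(alpha=0.0, s0=1.0), "DDDDD"))  # behaves like ALLC
print(responses(ESTFT(alpha=0.0, s0=0.0), "CCCCC"))  # behaves like ALLD
print(responses(ESTFT(alpha=0.4), "CDCDDC"))         # tolerates isolated defections
```

In the last call the strategy keeps cooperating after each isolated defection and only switches to defection once two defections arrive in a row.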
[c] The starting value S0 can be derived from α.

Since one threshold rule is defined for all ESTFT strategies, the only variable parameter of these strategies is the α-value. For the ESTFT strategies designed for the competition with noise we demand two additional characteristics to cope with the problem of noise in the IPD: i) they should never defect in return for a single defection of the opponent, as this single defection could be an implementation error by the opponent or a reaction to an implementation error of the ESTFT strategy itself, and ii) they should react with defection to two consecutive defections of the opponent, to avoid exploitation by the opponent. These additional requirements restrict the range of possible values for the smoothing parameter α. The admissible range according to restrictions i) and ii) is calculated in (6.5) and (6.6) respectively for an initial mood of I = 0.9.

Restriction i): ESTFT should cooperate after a single defection.

\[
S_t = (1 - \alpha) I + \alpha m_{t-1} \ge 0.5 \quad \text{for } m_{t-1} = 0 \text{ and } I = 0.9
\]
\[
\alpha \le 1 - \frac{0.5}{0.9} = 0.44444
\tag{6.5}
\]
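Restriction ii): ESTFT should defect after two consecutive defections.

\[
S_{t-1} = (1 - \alpha) I + \alpha m_{t-2}
\]
\[
S_t = (1 - \alpha) S_{t-1} + \alpha m_{t-1} < 0.5 \quad \text{for } m_{t-2} = m_{t-1} = 0 \text{ and } I = 0.9
\]
\[
\alpha > 1 - \sqrt{\frac{0.5}{0.9}} = 0.25464
\tag{6.6}
\]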
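The two bounds can be checked numerically. Starting from the initial mood I and observing k defections in a row, the mood is (1 − α)^k · I, so restriction i) requires (1 − α) · I ≥ 0.5 and restriction ii) requires (1 − α)² · I < 0.5. The Python sketch below evaluates both conditions; the function names are hypothetical and chosen only for this illustration.

```python
import math


def alpha_bounds(initial_mood: float = 0.9, threshold: float = 0.5):
    """Return (lower, upper) with lower < alpha <= upper, as in (6.6) and (6.5)."""
    upper = 1 - threshold / initial_mood             # restriction i), equation (6.5)
    lower = 1 - math.sqrt(threshold / initial_mood)  # restriction ii), equation (6.6)
    return lower, upper


def mood_after_defections(alpha: float, k: int, initial_mood: float = 0.9) -> float:
    """Mood after k consecutive opponent defections (every m is 0)."""
    return (1 - alpha) ** k * initial_mood


lower, upper = alpha_bounds()
print(f"{lower:.5f} < alpha <= {upper:.5f}")  # 0.25464 < alpha <= 0.44444

# An alpha inside the interval forgives one defection but not two.
alpha = 0.34
print(mood_after_defections(alpha, 1) >= 0.5)  # True
print(mood_after_defections(alpha, 2) < 0.5)   # True
```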
From (6.5) we derive that the ESTFT strategy will not defect after a single defection (mt−1 = 0) of assumed cooperative or reciprocating strategies (I = 0.9) when α ≤ 0.44444, as for this value of the smoothing parameter St remains at or above the threshold of 0.5. An α > 0.25464 guarantees that for two consecutive defections of the opponent (mt−2 = mt−1 = 0) St lies below the threshold value and the ESTFT strategy therefore defects (6.6). Three ESTFT strategies, lowESTFT_noise, mediumESTFT_noise, and highESTFT_noise, were designed for the competition with noise using α-values that represent the upper bound, the average between the two bounds (α ≈ 0.34), and the lower bound respectively. As large values of the smoothing parameter α lead to a higher weighting of current observations and to a less smoothed value, the bound induced by (6.5) is applied in lowESTFT_noise, the bound induced by (6.6) in highESTFT_noise, and the average between these bounds in mediumESTFT_noise.

For the competition without noise neither of the two restrictions mentioned above is necessary, as the true move of the opponent can be observed with certainty. We choose smoothing parameters of α = 0.2 for highESTFT_classic, α = 0.35 for mediumESTFT_classic, and α = 0.5 for lowESTFT_classic, and start with cooperation. Due to the decision rule mentioned above, values above 0.5 would not change the result and are therefore omitted. The results these strategies achieved compared to TFT in competitions with and without noise are summarized in the next section.
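To illustrate how these strategies behave under implementation errors, the following Python sketch plays a 200-round match between two strategies. It is only a sketch under stated assumptions: the payoff values (3 for mutual cooperation, 1 for mutual defection, 5 and 0 for unilateral defection) are the conventional IPD payoffs and are not stated in this chapter, and the names `ESTFT`, `play`, and `forced_errors` are hypothetical. TFT is modelled as ESTFT with α = 1.

```python
import random

# Assumed conventional payoffs: (payoff to player 0, payoff to player 1).
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}


class ESTFT:
    def __init__(self, alpha, s0=1.0):
        self.alpha, self.mood = alpha, s0

    def move(self):
        return "C" if self.mood >= 0.5 else "D"

    def observe(self, opponent_move):
        m = 1 if opponent_move == "C" else 0
        self.mood = (1 - self.alpha) * self.mood + self.alpha * m


def flip(move):
    return "D" if move == "C" else "C"


def play(a, b, rounds=200, noise=0.0, forced_errors=(), seed=0):
    """Play a match; an implementation error flips the move actually executed.

    forced_errors is a collection of (round, player_index) pairs that always
    err, which makes the effect of a single error reproducible.
    """
    rng = random.Random(seed)
    scores = [0, 0]
    for t in range(rounds):
        intended = [a.move(), b.move()]
        executed = []
        for i, move in enumerate(intended):
            erred = (t, i) in forced_errors or rng.random() < noise
            executed.append(flip(move) if erred else move)
        pay_a, pay_b = PAYOFF[(executed[0], executed[1])]
        scores[0] += pay_a
        scores[1] += pay_b
        a.observe(executed[1])  # each player sees the opponent's executed move
        b.observe(executed[0])
    return tuple(scores)


single_error = {(10, 0)}  # player 0 defects by mistake in round 10
print(play(ESTFT(1.0), ESTFT(1.0)))                               # (600, 600)
print(play(ESTFT(1.0), ESTFT(1.0), forced_errors=single_error))   # (505, 505)
print(play(ESTFT(0.44), ESTFT(0.44), forced_errors=single_error)) # (602, 597)
print(play(ESTFT(0.44), ESTFT(1.0), noise=0.1, seed=1))           # random noise
```

In this sketch a single error locks two TFT players into alternating defection for the rest of the match, whereas two ESTFT players with α = 0.44 absorb it within one round. This is a toy match between two identical players, not a reproduction of the tournament results reported below.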
6.3. Tournament Results

The IPD computer tournament organized by Graham Kendall, Paul Darwen, and Xin Yao in April 2005 offered an excellent opportunity to test the ESTFT strategies and to compare them to TFT in situations with and without noise. In addition to the classical competition without noise (competition 1), a re-run of Robert Axelrod’s original tournaments, a competition with a 10% chance of noise in the form of implementation error (competition 2) was conducted. The Java applet used to run the tournament, as well as the entries, were based on the Java IPDLX software library; in addition, simple deterministic strategies with a history of at most three rounds could be entered via a web interface. In each of the competitions five runs were performed, each lasting 200 rounds.

For each competition we calculate the average of the payoffs over all five runs that the ESTFT strategies and TFT reached when playing against a specific opponent. By using the average we can filter out random effects induced by noise or by the strategies themselves (e.g. RAND). Furthermore, we consider only the payoffs against the 141 strategies that are represented in both the classical competition and the competition with noise. This establishes a common basis for analysis that allows us to perform paired tests on the difference between the average payoffs against each of these 141 reference strategies. Figure 6.1 presents box-whisker diagrams for the six ESTFT strategies and TFT (the corresponding data are given in Table 6.1).

[Fig. 6.1. Box-whisker plot of the average payoff of the ESTFT and TFT strategies for the competition with noise. Horizontal axis: smoothing parameter (0.20, 0.26, 0.34, 0.35, 0.44, 0.5, 1 (TFT)); vertical axis: payoff (averaged over the five runs).]

Table 6.1. Data for the box-whisker plot of the average payoff of the ESTFT and TFT strategies in the competition with noise

strategy              min     1st quartile   median   3rd quartile   max
highESTFT_classic     232.6   255.4          432.8    500.2          613.4
highESTFT_noise       242.0   268.8          435.4    515.6          594.2
mediumESTFT_noise     241.0   260.2          430.6    514.0          585.6
mediumESTFT_classic   243.0   270.6          430.4    505.0          593.2
lowESTFT_noise        240.0   264.2          435.4    513.0          597.6
lowESTFT_classic      245.8   300.6          428.0    512.2          586.4
TFT                   223.4   269.0          430.6    502.6          614.2

First we apply a non-parametric paired Wilcoxon test to the difference in the payoffs of the ESTFT strategies and TFT between the competition without noise and the competition with noise. The alternative hypothesis that payoffs are higher in the competition without noise than in the competition with noise can be accepted for all seven reciprocating strategies. The results are highly significant, as can be seen from Table 6.2.[d]

Table 6.2. Performance of the ESTFT and TFT strategies in environments with and without noise

                              without noise      with noise
strategy              α       µ        ±         µ        ±         V         p
highESTFT_classic     0.20    467.22   181.14    397.73   123.93    7,817.0   < 0.0001
highESTFT_noise       0.26    470.97   180.20    399.73   117.57    8,131.0   < 0.0001
mediumESTFT_noise     0.34    470.43   180.45    399.45   120.43    8,115.5   < 0.0001
mediumESTFT_classic   0.35    468.47   179.45    404.26   114.16    7,733.5   < 0.0001
lowESTFT_noise        0.44    469.43   181.72    401.94   120.24    8,000.5   < 0.0001
lowESTFT_classic      0.50    469.76   179.64    408.43   109.95    7,734.0   < 0.0001
TFT                   1.00    467.49   181.06    400.39   121.78    7,615.0   < 0.0001

[d] In Tables 6.2 and 6.3 the column α represents the specific value of the smoothing parameter for this strategy, µ the mean of the payoff averaged over the five runs per competition, ± the standard deviation of payoffs, V the test statistic and p the significance of the non-parametric paired Wilcoxon test.

Next we test the difference in payoff between the ESTFT strategies and TFT in the competitions without and with noise. Again we use a non-parametric paired Wilcoxon test. The alternative hypothesis that the payoffs of the focal ESTFT strategy are greater than the payoffs of TFT can be accepted only for lowESTFT_classic and mediumESTFT_classic in the competition with noise (p < 0.05).

Table 6.3. Comparison of the TFT and ESTFT strategies in environments with and without noise

                              without noise       with noise
strategy              α       V       p           V         p
highESTFT_classic     0.20    3.0     0.6054      4,779.0   0.6798
highESTFT_noise       0.26    449.0   0.9980      5,342.5   0.2443
mediumESTFT_noise     0.34    366.0   0.9998      4,962.0   0.5361
mediumESTFT_classic   0.35    500.5   0.6830      6,071.0   0.0142
lowESTFT_noise        0.44    282.0   1.0000      5,308.5   0.1758
lowESTFT_classic      0.50    628.5   0.2342      5,961.0   0.0247

From Table 6.2 we see that the tournament results reproduce what has been argued analytically and shown in previous IPD tournaments (Molander, 1985; Donninger, 1986; Bendor et al., 1991; Bendor, 1993): reciprocating strategies like TFT or ESTFT are less successful in noisy environments than in environments without noise. The comparison of the ESTFT strategies with TFT for the competitions without and with noise, summarized in Table 6.3, shows two noteworthy results. First, while there are no significant differences in performance between ESTFT strategies and TFT in the case of no noise, the ESTFT strategies lowESTFT_classic and mediumESTFT_classic are significantly better (p < 0.05) in the presence of noise. Moreover, the three ESTFT strategies designed for the competition with noise (lowESTFT_noise, mediumESTFT_noise, and highESTFT_noise) were, though still less successful than TFT, able to reduce the distance to TFT in the competition with noise compared to the one without.

Above we mentioned that the α-value is the only parameter that varies across the six ESTFT strategies. In the competition without noise the average performance of all ESTFT strategies except highESTFT_classic, the strategy with the lowest α-value, exceeded the performance of TFT; however, these results are not significant according to the non-parametric paired Wilcoxon tests (see Table 6.3). From Table 6.2 one can see that in the competition with noise the three ESTFT strategies with the higher α-values (mediumESTFT_classic, lowESTFT_noise, and lowESTFT_classic) reach higher average payoffs than TFT, while the three ESTFT strategies with lower α-values (highESTFT_classic, highESTFT_noise, and mediumESTFT_noise) reach lower average payoffs. Generally the performance of the ESTFT strategies in the competition with noise increases with the smoothing parameter.

That two ESTFT strategies designed for the classical competition without noise outperformed TFT in the competition with noise, while the ESTFT strategies designed for the competition with noise did rather poorly, does not necessarily contradict the statements made above. We stated that an unbalanced mitigation of the provocability property of reciprocating strategies, or too much generosity, is insufficient to improve the performance of reciprocating strategies in the IPD with noise. A one-sided reduction of provocability, which just focuses on not punishing some of the opponent’s defections as they could be the direct or indirect result of implementation errors, neglects the possibility of implementation errors in combination with the opponent’s cooperation, while too much generosity could cause exploitation.
On the one hand, the ESTFT strategies for the competition with noise were more generous, as they only defect when two consecutive defections of the opponent occur or the smoothed value declines below the limit for cooperation. On the other hand, highESTFT_classic probably smoothed too much, which reduced its performance. The two strategies that outperformed TFT used higher values for the smoothing parameter, which leads to a higher weighting of the currently observed opponent’s moves and reduces the smoothing effect.

6.4. Conclusions

Based on the shortfalls of existing approaches that attempt to improve the poor performance of reciprocating strategies for the IPD with noise, we suggest that exponential smoothing is an approach that allows a balanced mitigation of the provocability property of reciprocating strategies. With exponential smoothing, the whole series of the opponent’s moves, rather than only the previous move of the opponent, can be taken into consideration in the decision to cooperate or defect. Six ESTFT strategies were designed and participated in an IPD tournament in competitions with and without noise. The results of the tournament show that in noisy environments the performance of ESTFT strategies increases with the smoothing parameter and that a low degree of exponential smoothing improves the performance of reciprocating strategies. While exponential smoothing improves the ability of TFT to deal with noise in the IPD, it still does not deal with it very well. Moreover, the results indicate that our design concept for determining smoothing parameters for strategies for the competition with noise seems to be inadequate, as in the competition with noise strategies designed for the competition without noise outperformed strategies designed especially for this environment. The optimistic assumptions concerning the initial mood of the ESTFT strategies and the simplistic restrictions on the smoothing parameter α for strategies for the competition with noise may be the cause of this weaker than expected performance. While we found a seemingly promising way to improve the performance of reciprocating strategies for noisy environments, further research in this direction is clearly necessary. Moreover, we used TFT, as the most important representative of reciprocating strategies, as a benchmark; ESTFT strategies have to be compared to other (reciprocating) strategies as well.
References

Ashlock, D., Smucker, M. D., Stanley, E. A. and Tesfatsion, L. (1996). Preferential partner selection in an evolutionary study of prisoner’s dilemma, BioSystems 37, pp. 99–125.
Axelrod, R. (1980a). Effective choice in the prisoner’s dilemma, Journal of Conflict Resolution 24, 2, pp. 3–25.
Axelrod, R. (1980b). More effective choice in the prisoner’s dilemma, Journal of Conflict Resolution 24, 3, pp. 379–403.
Axelrod, R. (1984). Genetic algorithms and simulated annealing, chap. The evolution of strategies in the iterated prisoner’s dilemma (Pitman, London), pp. 32–41.
Axelrod, R. and Wu, J. (1995). How to cope with noise in the iterated prisoner’s dilemma, Journal of Conflict Resolution 39, 1, pp. 183–189.
Bendor, J. (1993). Uncertainty and the evolution of cooperation, Journal of Conflict Resolution 37, 4, pp. 709–734.
Bendor, J., Kramer, R. M. and Stout, S. (1991). When in doubt... Cooperation in a noisy prisoner’s dilemma, Journal of Conflict Resolution 35, 4, pp. 691–719.
Donninger, C. (1986). Paradoxical effects of social behavior. Essays in honor of Anatol Rapoport, chap. Is it always efficient to be nice? A computer simulation of Axelrod’s computer tournament (Physica, Heidelberg).
Molander, P. (1985). The optimal level of generosity in a selfish, uncertain environment, Journal of Conflict Resolution 29, 4, pp. 611–618.
Tzafestas, E. S. (2000). Toward adaptive cooperative behavior, in Proceedings of the Simulation of Adaptive Behavior Conference (Paris).
Chapter 7 Opponent Modelling, Evolution, and the Iterated Prisoner’s Dilemma

Philip Hingston1, Dan Dyer2, Luigi Barone2, Tim French2, Graham Kendall3
Edith Cowan University1, The University of Western Australia2, The University of Nottingham3

In this chapter, we report on a series of studies exploring the interplay between evolution and intelligence. The evolutionary setting is a population of agents playing the Iterated Prisoner’s Dilemma, a setting which provides a choice between cooperative and selfish behaviour in interactions between agents. Intelligence is represented using opponent modelling agents. Our studies show that, while opponent modellers can survive in such a setting, an evolving population of less intelligent agents can limit their success. We also report on the performance of our opponent modelling agent, which competed in the CIG’05 IPD competition.

7.1. Introduction

IPD has served as a model for cooperation between self-interested individuals for 40 years. Sometimes, these individuals are taken to be animals, sometimes humans, and sometimes some other kind of agency, such as a corporation or a nation. A useful way to categorise studies based on the IPD model is by what is assumed about the cognitive abilities of the players. On an increasing scale of rationality, well-studied assumptions include
• That populations of players can evolve good strategies. This is the traditional evolutionary computation approach.
• That the players can learn good strategies. This is the traditional machine learning approach.
• That the players are perfectly rational. This is the traditional mathematical game theory approach.
But there is another point on this scale, somewhere between the last two points, that has, surprisingly, been largely neglected: the assumption that players adapt their play based on a learned model of their opponents’ play. In this chapter, we will argue the merits of opponent modelling as a realistic approach to the study of IPD, and present some results of experiments designed to explore this approach.

The expanded, four-point scale corresponds roughly to some theories concerning stages in the evolution of intelligence. A recent and controversial example is the Machiavellian intelligence hypothesis: “that apes and humans have evolved special cognitive adaptations for predicting and manipulating the behaviour of other individuals” [Miller (1997), p313]. There are various stronger or weaker interpretations of this hypothesis. A strong version postulates a “theory of mind”, a module that attributes beliefs and desires to others in order to better predict their behaviour. In other words, in our terms, apes and humans use opponent modelling. Researchers disagree about whether or not, and to what degree, various primates have such a module. Everyone seems to agree that humans do, but there is evidence supporting both sides of the argument regarding distinctions between strepsirrhine primates (lemurs and lorises) and haplorhine primates (the rest), or between monkeys, great apes, and ancient and modern humans. A well known example is the ability of great apes to recognize themselves in mirrors, whereas monkeys cannot [Parker et al. (1994), as cited in Miller (1997)].

If IPD is to teach us about human behaviour, then it makes sense to model intelligence at the correct level. To evolve agents that play predetermined, fixed strategies seems appropriate for studies of animals with low levels of intelligence. Game theorists might argue that corporations or nations are best modelled as perfectly rational, though many popular commentators would disagree.
For humans, neither of these seems realistic – humans do not ignore what experience teaches. Rote learning of strategies, via some mechanism such as reinforcement learning, or learning by imitation, may be sufficient to explain animal behaviours, and some aspects of human behaviour, but even our great ape cousins are known to go beyond this. Thus, we believe, to realistically model strategies employed by humans, we must include learning to predict our opponent’s behaviour, and applying our reasoning abilities to devise a plan based on our predictions. Just how sophisticated the prediction and reasoning method needs to be is another question, but some kind of opponent modelling is called for.
In the field of multi-agent systems, our approach would be called model-based learning, as distinct from model-free learning. We see many of the acknowledged advantages and disadvantages of model-based learning in this study. Many variations and subtleties of approaches to problems of learning in multi-agent systems have been studied (see, e.g., [Markovitch and Reger (2005)] for one such variation, and a nice overview). It is not our aim in this study to survey this field. Likewise, there are many aspects of intelligence that we do not concern ourselves with, including the question of what intelligence actually is!

One of our reviewers pointed out that our “four-point scale” could have many other intermediate points on it, depending on the level of sophistication of the modeller. For example, should the modeller assume that the other players are also modellers and reason about them on that level (we have chosen to answer “no”)? Again, should the modeller model only individual players, or the population of players in its environment (we have chosen the former)? We neglect these questions not because we see them as uninteresting (far from it); it is because there are so many aspects and so many possibilities that we must make our one set of choices and stick with them. From our point of view, the key requirement is that the players must be opponent modellers of some sort, and that we want to learn about what happens when such players are subjected to the forces of evolution.

In the following pages, we discuss the advantages of opponent modelling, and the problems that a successful opponent modeller must solve. With this background, we then describe our opponent modelling entry for the IPD competition held at CIG’05, the 2005 IEEE Computational Intelligence in Games conference. This entry was adapted from an earlier IPD opponent modeller used to study the role of intelligence in the evolution of cooperation [Hingston and Kendall (2004)].
We revisit this work, and follow up with a report on some new experiments carried out recently to better understand our earlier findings.
7.1.1. Opponent modelling
Opponent modelling is the term used to describe the process of constructing some form of representation (called the model) for an opponent’s strategy, typically in order to exploit inherent weaknesses in their play. It is worth pointing out here that we take the view that all game players are ultimately self-interested. Even in games where cooperation is possible, players only
142
P. Hingston et al.
cooperate because it is to their advantage to do so. This is not so much a value judgement on our part – since we are going to be dealing with evolution, those who are not self-interested (or at least those whose genes are not self-interested!) will cease to be relevant.
Consider then a simple example, the two-player game of rock, paper, scissors (also known as Roshambo). In this game, each player selects one of the three options, rock, paper, or scissors, ensuring their selection is hidden from the other player. After both players have made their selection, players reveal their choices and the winner is determined as follows: rock defeats scissors, scissors defeats paper, and paper defeats rock. Should both players select the same option, the game is deemed a draw. Simple analysis shows that if an opponent truly selects randomly, the best a player can do is to also choose randomly, thus assuring the overall expectation is neutral (each player winning one third of games, each player losing one third of games, with the remaining third of games drawn). However, if the opponent is not selecting randomly (or truly randomly), a player can potentially do better than this neutral expectation by “guessing” (or in artificial intelligence speak, “predicting”) which option the opponent will select next. Using this prediction, the player can then choose the option (the counter-strategy) that ensures victory in the game (choosing rock if the prediction suggests the opponent will select scissors, choosing scissors if the prediction suggests the opponent will select paper, and choosing paper if the prediction suggests the opponent will select rock). This is the domain of opponent modelling – building a model, typically from observation or experience, of the next most likely action (move) of the opponent.
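As a rough illustration of predicting and then countering, the sketch below counts an opponent's past selections, predicts the most frequent one, and plays the option that beats it. The names `predict_next` and `counter_move`, and the choice of a plain frequency count as the predictor, are illustrative assumptions and are not taken from this chapter.

```python
import random
from collections import Counter

# BEATS[x] is the option that defeats x.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}


def predict_next(history):
    """Predict the opponent's next selection as their most frequent past one.

    Returns None when there is no history to learn from.
    """
    if not history:
        return None
    return Counter(history).most_common(1)[0][0]


def counter_move(history, rng=random):
    """Pick the option that beats the predicted selection.

    With no prediction available, fall back to a uniformly random choice.
    """
    predicted = predict_next(history)
    if predicted is None:
        return rng.choice(sorted(BEATS))
    return BEATS[predicted]
```

Against an opponent whose history is all rock, `counter_move` returns paper. A predictor this simple is itself predictable, which is the adaptation problem taken up later in this section.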
Note that building a model of the opponent’s next most likely action is equivalent to building a model of the opponent’s strategy directly since a player’s strategy directly determines the next move of the player. Once a model of an opponent’s strategy is determined, the model can be analysed (or deconstructed) to identify weaknesses in the opponent’s play. From this analysis, a counter-strategy that best “improves” the player’s position in the game (typically by exploiting any identified weaknesses in the opponent’s strategy) can then be determined and executed, allowing the player to maximise personal gain in the game. All other things being equal, the overall success of a player employing opponent modelling (an opponent modeller) depends on the accuracy of its prediction of the opponent’s next action. For example, imagine an opponent that always selects rock as their hidden choice in the aforementioned game of Roshambo. Obviously, the optimal counter-strategy is to select paper,
thus ensuring a win against the opponent’s rock selection. An opponent modeller that is able to correctly deduce this strategy weakness is then able to ensure victory in all games against this opponent. While this type of obvious strategy flaw is unlikely, experience shows most players of games contain some form of strategy weakness, especially in games containing many different game states. For example, standard 5-card poker has over 2,500,000 ways of forming a hand; 7-card variants have over 133 million ways. Factoring in the complications due to betting, considerations of the different playing abilities and styles, and the large number of situations a player must respond to, every poker player is likely to contain some (and probably many) predictabilities in their strategy (if they didn’t, playing the game would be pointless — ignoring short-term variance, all players would end up level in the long run). Even in simple games like Roshambo, players often contain subtle weaknesses in their play that can be exploited by an opponent modeller (world championships in the game pit players’ abilities to determine and exploit these weaknesses). For example, some players may always choose rock after winning the previous game with paper. Other players may never select any single option four times in a row. Both of these examples demonstrate non-random choices by an opponent and hence can be exploited by an opponent modeller capable of deciphering predictable patterns in opponent behaviour. Indeed, any non-random choice in the selection of the hidden option may be exploited. If the game is played often enough, subtle weaknesses in strategy may well give the advantage to the opponent modeller in the long run. Artificial intelligence research into opponent modelling is interested in just this – finding subtle flaws in an opponent’s strategy in order to maximise personal gain in the game. 
While it seems opponent modelling is an obviously good way of identifying strategy weaknesses, care must be taken to ensure against over-reliance on the inferred model of the opponent’s strategy. Two immediate problems can arise: the inferred model may be incomplete, or even worse, incorrect for certain scenarios, or even if the model is definitely correct at some moment in time, an opponent may dynamically modify their strategy over time invalidating the model. The first problem is obvious – incorrect or incomplete models affect an opponent modeller’s capability to identify weaknesses in an opponent’s strategy and hence determine the next best action to select. The second problem motivates the need for adaptation – the opponent modeller must constantly re-assess its inferences and resulting
counter-strategies in order to stay abreast of the strategy employed by the opponent. For example, consider the always-select-rock strategy flaw discussed earlier for the game of Roshambo. Obviously, an opponent modeller capable of correctly inferring this strategy weakness will have no problem exploiting the weakness to ensure victory against this opponent. However, with such an obvious weakness in strategy, the opponent is likely soon to realise their flaw and try another strategy instead. The opponent modeller must now adapt their counter-strategy in order to exploit the new strategy, otherwise they run the risk of becoming predictable and may well be exploited themselves (recall that in Roshambo, a player must select randomly, otherwise an opponent may be able to predict their next action). The opponent may be “setting-up” the opponent modeller with a false model in order to exploit the opponent modeller later on with rapid successive changes in strategy (the hunter becoming the hunted).
The other major problem for an opponent modeller is striking a balance between exploring unknown regions of an opponent’s strategy to discover new information (and new weaknesses) and using existing information to exploit weaknesses in the strategy. A trade-off occurs: insufficient exploration may prevent the opponent modeller from finding better counter-strategies that yield higher returns, but exploration is costly since it is a distraction from the primary task of exploiting the opponent by using the information the player already has. Exploring new counter-strategies may mean sacrificing short-term performance (the player may need to accept short-term losses), and in the worst case, may even lead to inescapable sub-graphs of the opponent’s strategy that yield sub-optimal returns in the long run (for example, exploring the strategy of defecting against a grim-like player in IPD – see later, in 7.1.4).
Opponent modelling is not only useful in games, but also in other situations involving responding to opponent actions. Examples include evolving cooperative behaviour, stock market prediction, negotiation and diplomacy, and military strategy planning. These types of problems can benefit from opponent modelling — building a model of the behaviour of the opponent in order to exploit strategy weaknesses and to respond “well” to opponent actions. Indeed, most environments containing adversarial situations can benefit from opponent modelling — that is, exploitation of opponent weaknesses in order to maximise personal gain in the game. The question is to what extent.
IPD is a game often touted as being an example of human behaviour. Due to the iterated nature of the game, players may choose to take their opponent’s previous actions into account when deciding how to act in subsequent rounds of the game. This opens up the possibility of a player being predictable, and hence the possibility of exploiting the player’s predictabilities. This is the thesis of our work: that the use of opponent modelling to construct a model of an opponent’s strategy can offer an advantage to a player in the IPD game. Using this approach, we aim to construct automated computer players capable of exploiting observable weaknesses in opponents’ strategies in IPD. This means that we need some way to automatically construct a model of the opponent’s strategy, some way of automatically analysing the constructed model to determine weaknesses in the strategy, and some way of automatically determining the best counter-strategy to counter-act the inferred strategy of the opponent. In general, all three of these tasks may indeed be difficult. In the next section, we detail how we did these things in the context of an IPD competition.
7.1.2. Modeller, the competition entry
In this section, we describe Modeller, the strategy that we entered into the IPD competition held in conjunction with CIG’05, the 2005 IEEE Computational Intelligence in Games conference. It is a modified version of an opponent modelling agent described in a paper presented at CEC’04, the IEEE Congress on Evolutionary Computation in Seattle in 2004, [Hingston and Kendall (2004)]. The focus of that work was the interplay of evolution and learning, which was explored by simulating co-evolving populations of IPD playing agents using fixed strategies with agents using opponent modelling. It is discussed further in the next section, Opponent modelling versus evolution.
We made some minor changes to the opponent modelling agent, for the purposes of the competition, but the precise details of the implementation are less important than the overall spirit of the opponent modelling approach.
7.1.3. Anatomy of the modeller
Modeller plays tit-for-tat for a fixed number (50) of moves. (Recall that tit-for-tat cooperates on the first move, and copies the opponent’s previous move from then on.) During that time, it builds up a predictive model of the opponent. After the fixed number of moves, it uses the model to
calculate expected future payoffs for each possible move, depending on the game position, choosing the move with the highest expected future payoff. In the case of ties, it chooses randomly between the moves with the highest expected future payoff.
The opponent model used is a 1st order lookup table. It is assumed that the opponent’s probability of cooperation on a given move is determined by what happened on the previous move, e.g. both cooperated, or we cooperated but our opponent defected, etc. This assumption was probably incorrect for most of the strategies entered in the competition, but we hoped that it would be approximately true, or at least true enough to obtain good average scores. We could employ more complicated models. For example, the opponent’s probability of cooperation could depend on the previous two moves, or even more generally, could be described by a probabilistic finite state automaton. We opted for the simplest choice that demonstrates the opponent modelling approach. In any case, we reasoned, more complicated models might not be warranted for the competition, because they have more parameters to estimate, requiring more time to learn. However, if the expected game length was very long, and our opponents were sophisticated, we conjecture that using more complicated models would produce a more capable strategy.
The hypothetical probabilities that determine a 1st order model are estimated by counting how many times the opponent cooperated or defected after each possible previous move. These counters are initialized with values that are consistent with the opponent playing tit-for-tat, that is, we used tit-for-tat as an a priori model. This seemed like a good choice for the competition, as we expected that many opponents would play variants of tit-for-tat. The counters are used to compute an estimate of the probability of the opponent cooperating as the ratio of the cooperation counter to the sum of the cooperation and defection counters.
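The counting scheme just described can be sketched as a small class. This is a minimal sketch under stated assumptions: the class name `FirstOrderModel`, the position strings (our previous move followed by the opponent's), and the prior weight of one observation per position are hypothetical choices, since the text does not give the initial counter values.

```python
POSITIONS = ("cc", "cd", "dc", "dd")  # our previous move, then the opponent's


class FirstOrderModel:
    """Counts of opponent cooperation and defection after each previous position."""

    def __init__(self, prior_weight=1.0):
        # Assumed prior: tit-for-tat copies OUR previous move, so it cooperates
        # after cc and cd, and defects after dc and dd.
        self.coop = {p: (prior_weight if p[0] == "c" else 0.0) for p in POSITIONS}
        self.defect = {p: (0.0 if p[0] == "c" else prior_weight) for p in POSITIONS}

    def update(self, position, opponent_move):
        """Record the opponent's move ('c' or 'd') made from `position`."""
        if opponent_move == "c":
            self.coop[position] += 1
        else:
            self.defect[position] += 1

    def p_cooperate(self, position):
        """Ratio of the cooperation counter to the sum of both counters."""
        return self.coop[position] / (self.coop[position] + self.defect[position])
```

With this prior, a single observed defection after position cd moves the estimate for that position from 1 to 0.5.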
We continue to update the model by incrementing the counters during subsequent play. More sophisticated updating could also be used, for example, weighting evidence on recency, to respond faster to opponents with dynamic strategies, as was done in Hingston and Kendall (2004). Since our aim was to see how well opponent modelling would do in the competition environment, rather than to compare and tweak implementation details, we again decided to opt for simplicity. Assuming that the opponent model is correct, and given knowledge of the probability of the game continuing to another move, we can calculate the expected future payoff for any move in any game position. According to
[Figure 7.1: game tree. Node labels: first; cc, cd, dc, dd; cccc, cccd, ccdc, ccdd, cdcc, cdcd, ..., dddd. Branch labels: C and D for our moves; C0, 1-C0, C1(cc), ..., 1-C1(dd) for the opponent’s moves.]
Fig. 7.1. Game tree for the first few moves of a game of IPD.
the competition rules, this probability was constant at 1 − 0.00346, giving a median game length of 200 moves (since 0.99654^200 ≈ 0.5; the expected length is 1/0.00346, or about 289 moves). To see how expected payoffs can be calculated, consider the initial segment of an IPD game tree shown in figure 7.1. Starting at the bottom, there is one branch for our choice to cooperate (labeled C) and one for our choice to defect (labeled D). Following the C branch, there is then one branch for our opponent’s choice to cooperate (labeled $C_0$) and one for his choice to defect (labeled $1 - C_0$). These labels represent the probability that our opponent will cooperate, or respectively, defect, on the first move of the game. Following the “cooperate” branch, we reach a node that represents the game position in which both players cooperated on the previous move (labeled cc). There is then a branch representing our next move (C or D), and then our opponent’s next move, where the label $C_1(cc)$ represents the probability that the opponent will cooperate when both players cooperated on the previous move. Likewise, the label on the rightmost branch, $1 - C_1(dd)$, represents the probability that the opponent will defect when both players defected on the previous move. The topmost nodes are shown with labels like cccd, representing a game position where both players cooperated two moves ago, and we cooperated but our opponent defected on the previous move. Since we are assuming that our opponent (and therefore we also) only consider the previous move when deciding on his next play, this node might as well be labeled cd, and be identified with the other nodes labeled cd. Thus the infinite game tree collapses to become a finite graph (not drawn).
Thus, to determine a counter-strategy, we need only decide on our choice at the start of the game, and at each of the game positions cc, cd, dc and
dd. Thus, there are $2^5 = 32$ possible counter-strategies to consider. We can choose between these by calculating their expected payoffs. Let $V(cc)$ be the value of the game at position cc, by which we mean the expected discounted future payoff starting from this position. Define the value of the game for the other positions similarly. Let $\delta$ be the probability of continuing the game, $P$ be the penalty for mutual defection, $R$ be the reward for mutual cooperation, $T$ be the temptation to defect, and $S$ be the sucker payoff.
If we choose to cooperate at position cc, then the expected future payoff, $V(cc)$, is equal to the probability that the opponent cooperates (that is, $C_1(cc)$) times the future payoff given that he cooperates, plus the probability that the opponent defects (that is, $1 - C_1(cc)$) times the future payoff given that he defects. The future payoff given that he cooperates is equal to the immediate payoff for both cooperating ($R$), plus the expected future payoff after that ($V(cc)$) times the probability that the game continues ($\delta$). The future payoff given that he defects is equal to the immediate payoff for us cooperating while he defects ($S$), plus the expected future payoff after that ($V(cd)$) times the probability that the game continues ($\delta$). Putting that all together:
$$V(cc) = C_1(cc)\,(R + \delta V(cc)) + (1 - C_1(cc))\,(S + \delta V(cd)).$$
Similarly, if we choose to defect, then:
$$V(cc) = C_1(cc)\,(T + \delta V(dc)) + (1 - C_1(cc))\,(P + \delta V(dd)).$$
Analogous equations hold for the other positions, giving a system of equations that can be solved for the values $V(\cdot)$. Finally, the value of the game at the start of the game is either
$$V = C_0\,(R + \delta V(cc)) + (1 - C_0)\,(S + \delta V(cd))$$
or
$$V = C_0\,(T + \delta V(dc)) + (1 - C_0)\,(P + \delta V(dd)),$$
depending on whether we choose to cooperate or defect on the first move. The best counter-strategy is that set of 5 choices which maximizes the value of $V(\cdot)$ for the current game position.
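One way to compute such a best response is sketched below. Note the swapped-in technique: rather than enumerating the 32 counter-strategies and solving a linear system for each, the sketch applies value iteration to the same equations, taking the better of the two choices at every position until the values stop changing. The payoff values T=5, R=3, P=1, S=0, the function names, and the tolerance are assumptions made for illustration, and ties are broken in favour of cooperation rather than randomly.

```python
# Payoff to us for (our move, opponent's move); standard values, assumed here.
PAYOFF = {("c", "c"): 3.0, ("c", "d"): 0.0, ("d", "c"): 5.0, ("d", "d"): 1.0}
POSITIONS = ("cc", "cd", "dc", "dd")  # our previous move, then the opponent's


def action_value(p_coop, our_move, values, delta):
    """Expected payoff of `our_move` when the opponent cooperates with prob. p_coop."""
    after_c = PAYOFF[(our_move, "c")] + delta * values[our_move + "c"]
    after_d = PAYOFF[(our_move, "d")] + delta * values[our_move + "d"]
    return p_coop * after_c + (1.0 - p_coop) * after_d


def best_response(c1, c0, delta, tol=1e-9):
    """Return (policy, values) for a first-order opponent model.

    c1 maps each position to the opponent's cooperation probability and c0 is
    the opponent's cooperation probability on the first move. The policy maps
    'start' and each position to 'c' or 'd'.
    """
    values = {p: 0.0 for p in POSITIONS}
    while True:
        new = {p: max(action_value(c1[p], m, values, delta) for m in "cd")
               for p in POSITIONS}
        change = max(abs(new[p] - values[p]) for p in POSITIONS)
        values = new
        if change < tol:
            break
    policy = {p: max("cd", key=lambda m: action_value(c1[p], m, values, delta))
              for p in POSITIONS}
    policy["start"] = max("cd", key=lambda m: action_value(c0, m, values, delta))
    return policy, values
```

For a pure tit-for-tat model (`c1 = {"cc": 1, "cd": 1, "dc": 0, "dd": 0}`, `c0 = 1`) and a continuation probability of 1 − 0.00346, the sketch selects cooperation at the start and at position cc, with a value at cc of R/(1 − δ).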
After playing a move selected by this method, and observing the opponent’s move, the model is updated and the calculation above must be repeated to choose our next move. This technique can be extended to calculate a best response against any finite-order stochastic strategy, or indeed against any strategy defined by a probabilistic finite state automaton. This implementation of an opponent modeller can likely be improved (in terms of achieving a higher score in IPD competitions), by more careful
choices of the target class of opponent models, exploration/exploitation balance, and updating method. We intend to test this claim in future IPD competitions! However, the good performance of this simple implementation in the CIG’05 competition is evidence that opponent modelling is a viable strategy for IPD.
7.1.4. Competition performance
While our main target was Competition 4, we also entered Modeller in Competitions 1 and 2. Competition 4 was a faithful reproduction of Axelrod’s original conception [Axelrod (1984)]. Modeller performed very well in this competition, placing 3rd out of 50 entries in four runs out of five, and 5th on the remaining run. In all cases, it was just over 3% behind the winning entry, and 1% ahead of the next best defeated opponent. Despite this good performance, a detailed examination of individual games reveals some weaknesses in our implementation.
One thorny issue that we side-stepped by using tit-for-tat as an a priori model is the “cost” of learning. In order to develop an accurate model of an opponent, one would like to “explore” – that is, to sample the opponent’s moves in every possible game position, many times. As discussed earlier, there are two barriers to this. One is that the opponent’s play may be such that some game positions are never reached. The second, and more troublesome, is that such exploration does not come for free: if we deliberately play a certain move to see what our opponent will do, our decision will affect the payoff that we receive for this move, and possibly for future moves too. Playing against the grim strategy is an extreme example. A grim player cooperates on the first move of a game, and continues to cooperate as long as the opponent does, but if ever the opponent defects, then grim continues to defect forever more. In Hingston and Kendall (2004), we used the device of deliberately playing the “wrong” move from time to time – the so-called “trembling hand” device.
Because of this, the opponent modeller was frequently punished by grim-like opponents. A single experimental defection against grim ensures that the opponent will defect for the rest of the game, locking both players into low payoffs. For the purpose of the competition, we avoided this problem by playing tit-for-tat at first, and then the moves that we calculate to be optimal. This is simple and reduces our risk of offending grim-like opponents, but also reduces the accuracy of our models, so that we may miss the chance of truly optimal play. For example, Modeller loses badly in games against the
fixed strategy Always Defect, which simply defects at all times. The reason for this is outlined below. During the first 50 moves against Always Defect, Modeller gets no information about what the opponent would do if both players cooperated last move, or if we defected while he cooperated (because he never cooperates). Also, since we play tit-for-tat up to this point, we only cooperate on the first move, and thereafter defect, so we only see one example of the opponent’s play after we cooperate and he defects. The problem of this lack of data is the reason we begin with an a priori model, specifically, tit-for-tat. After 50 moves against Always Defect, the model looks like this:
Probability of cooperating after we both cooperate = 1
Probability of cooperating after I cooperate and he defects = 0.5
Probability of cooperating after I defect and he cooperates = 0
Probability of cooperating after we both defect = 0
Thus, when first called on to apply the model, Modeller reasons like this: “We just both defected. If I cooperate on this move, I’m sure he’ll defect. After that, there’s a 50% chance he’ll cooperate on the next move. (If not, I can try again.) If I cooperate too, that makes a 50% chance that we will both cooperate. From then on, I’m sure we’ll keep cooperating.” So Modeller expects to reach mutual cooperation and good payoffs after a few more moves, if it continues to cooperate. The problem is that the 50% estimate is wrong (the true probability is 0). Although this incorrect value will continue to be updated, it will take many more sucker moves before the model is accurate enough for Modeller to make the right choice (defect).
So we see that, for several reasons, the models learned by Modeller are imperfect. Sometimes, this hurts us, but we push ahead regardless, hoping that, on average, it will not hurt us too much. At least in this competition, this was a reasonable assumption.
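The four probabilities listed above can be reproduced with a short simulation of the first 50 moves. This is a sketch under one assumption that the text does not state: the tit-for-tat prior is given the weight of a single observation per position, which is the weight that yields the 0.5 estimate after one observed defection.

```python
# [cooperations, defections] of the opponent after each previous position,
# where the first letter is our move and the second is the opponent's.
# Prior counts consistent with tit-for-tat (assumed weight: one observation).
counts = {"cc": [1, 0], "cd": [1, 0], "dc": [0, 1], "dd": [0, 1]}

our_move, position = "c", None      # tit-for-tat opens with cooperation
for _ in range(50):
    opp_move = "d"                  # Always Defect
    if position is not None:        # the first move has no previous position
        counts[position][0 if opp_move == "c" else 1] += 1
    position = our_move + opp_move
    our_move = opp_move             # tit-for-tat copies the opponent's last move

estimates = {pos: c / (c + d) for pos, (c, d) in counts.items()}
print(estimates)  # {'cc': 1.0, 'cd': 0.5, 'dc': 0.0, 'dd': 0.0}
```

Only position cd receives a single observation and position dd receives 48; positions cc and dc are never visited, so their estimates are pure prior.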
We expected that Modeller would have a tougher time in Competitions 1 and 2, because, in these competitions, collusion between entries was allowed. With collusion allowed, the best approaches will use a “champion” strategy that takes advantage of conditions by relying on “confederate” strategies to sacrifice themselves by cooperating whilst their champion constantly defects. In addition, confederate strategies can damage other competitors by constantly defecting against them. The idea is for champion and confederates to use the first few moves to identify each other. It is clear that no
non-colluding strategy can hope to compete in this environment. Nevertheless, we entered Modeller to provide an opponent for other entries, and out of curiosity. It performed creditably, finishing 62nd, 61st, 64th, 60th and 65th out of 192 entries in the five runs. It would be interesting to know where it was placed among the non-colluding entries. Modeller fared better in Competition 2, which allowed collusion, but introduced “noise” – that is, with low probability, signals may be misinterpreted by players. This upsets colluders by interfering with the identification of confederates, but doesn’t inconvenience Modeller much at all, as it is designed to deal with stochastic opponent strategies. In this competition, Modeller finished 20th, 18th, 5th, 13th and 18th out of 165 entries in the five runs. It could be argued that, by allowing collusion, Competitions 1 and 2 changed the nature of the problem under consideration. The problem becomes one of teamwork, rather than one of cooperation with a self-interested other. One can think of real-world scenarios that IPD-with-collusion usefully models — for example, teams of riders in the Tour de France, in which team members sacrifice their own chances in order to protect a teammate and improve his chance of a high-placed finish. The analogy is imperfect, though, as in the Tour, teammates do not compete against each other directly. It is harder to think of examples from Nature. At least at first sight, it would seem that colluding strategies would not work very well in simulated evolution experiments like those that we describe in the next section. Strategies acting as confederates would be selected against. In such a scenario, it would be the average fitness of all members of the species that determined reproductive success of the species as a whole. We wonder what the average scores of the teams entered in Competitions 1 and 2 were, but we cannot calculate this because we do not know who was colluding with whom. 
Perhaps there are examples in Nature that we are not aware of, and it may be a matter of appropriately structuring the simulation to make collusion profitable. It would be interesting to hear of such examples.
7.2. Opponent Modelling Versus Evolution
The opponent modeller described in the previous section was based on that used in the CEC’04 study, which had nothing to do with the competition, or, really, with Axelrod’s original competitions. It did take inspiration, though, from Axelrod’s experiments with evolution and IPD.
Those experiments by Axelrod were motivated in part by an apparent anomaly in evolutionary theory. Cooperation between organisms in nature entails one organism changing its behaviour in order to benefit another, possibly to its own detriment. A commonly used example is that of the lookout in groups of social animals, that makes an alarm call to warn the rest of the group of the presence of a predator, placing itself at risk by calling attention to itself. If evolution favours survival of the fittest, then why does it not work against this kind of cooperation? Would it not rather favour the cheat, who benefits from the alarm calls of others, but stays silent when his own turn comes to act as lookout? One can make plausible arguments to resolve this puzzle, invoking ideas like kin-selection [Maynard-Smith (1988), pp. 192-193], or social reputation [Maynard-Smith and Harper (2003), pp. 121-122], or one can build and analyse mathematical models to test hypothetical mechanisms to explain it, as in evolutionary game theory [Maynard-Smith (1988), pp. 194-200]. Or one can design and carry out simulated evolution experiments, which is what Axelrod did, using IPD as his model. Subsequently, many others have carried out their own, similar experiments, using variations on the classic IPD model, exploring issues such as spatial effects [Nowak and May (1992)], more complex strategies [Fogel (1993); Miller (1996)], the ability to choose partners [Ashlock et al. (1996)] and so on. There is another natural phenomenon that evolutionary theory must explain – intelligence. The central question is: Why and how did intelligence evolve? This is a large topic, much debated, and one that has many facets. 
Theories include Calvin’s “throwing theory” (that bigger brains evolved in order to better throw rocks) [Calvin (1983)], the theory that greater intelligence resulted as a response to the last ice age [Calvin (1991)], that the evolution of intelligence was a result of sexual selection [Miller (1997)], and the idea that intelligence is about being better at deceiving and detecting deception – as in the Machiavellian intelligence hypothesis [Byrne and Whiten (1988); Whiten and Byrne (1997)].
Our CEC’04 study was in part our attempt to contribute to the debate. Just as Axelrod used IPD to study the evolution of cooperation, we used similar experiments to study the evolution of intelligence. To better explain what we did, we ask the reader to keep in mind the following hypothetical scenario:
Imagine a world inhabited by simple creatures who interact and exchange resources by playing IPD with each other. Those who get better payoffs live longer and have more offspring. These creatures are not
intelligent. Their moves are determined by their genes. They can recognize each other, and they can only remember what moves were played the last time they played each other – but no more than that. They act instinctively, and never learn anything at all. This is the world of Axelrod’s experiments. Now imagine that a strange mutation arises amongst these creatures. Mutants have abnormally large brains, large enough for them to remember quite a bit about what happened in previous encounters with each other creature. Enough for them to be able to make a good guess about what move the other will make the next time they play. In fact, their brains are so big and complex, that they can use this information to plan what move they should make next, and what would happen after that, and so on, and choose a move that will maximize the payoff in games against each other in the future. These mutants are intelligent. What will happen to these intelligent mutants? What will happen to the original, unintelligent creatures?
This is the world, and these are the questions that were addressed in the CEC’04 study. This was not the first study that considered how an intelligent player might play IPD, or that used simulation to study the evolution of intelligence, but it may be the first to consider opponent modelling as an approach to IPD, and also the first to consider combining evolution and intelligence in the context of IPD.
The opponent modelling implementation for this study was similar to that used in the competition, except that there was no initial period in which the model was not used, the model update method was different, and the players were all equipped with a “trembling hand”. As explained earlier, the competition variant used an initial waiting period as a safeguard against grim-like opponents, and because we guessed that many competition opponents would be tit-for-tat-like.
The model update method in the CEC’04 study used a “forgetting factor”, γ, to give greater weight to more recent events. After each move, both counters pertaining to the current game position were multiplied by γ, and the relevant counter was incremented by 2 × (1 − γ), keeping the sum of the two counters constant. All players in the CEC’04 study had a “trembling hand”. That is, they would occasionally play defect when they intended to cooperate, or vice versa. The advantage of this is that it makes all parts of an opponent model reachable, and offers some hope of recovery against a grim-like opponent. Players using 1st order lookup tables were used for the unintelligent players. Only pure strategies were used – that is, ones in which each
[Figure 7.2: mean fitness (axis from 1 to 3, series “mean”) and cooperation (axis from 0.5 to 1, series “coop”) plotted against generation (0 to 1000).]
Fig. 7.2. A typical run with unintelligent strategies.
cooperation probability is either 0 or 1. No crossover was used, and mutation simply flipped a probability of 0 to 1, or vice versa. Though they were not reported in the paper, experiments were also conducted with stochastic strategies, giving broadly similar results. There were two experiments reported. In the first experiment, fixed-strategy, unintelligent players were evolved in a simulation similar to Axelrod’s:
1. An initial population is created.
2. A round-robin IPD tournament is held between the members of the population. Every player plays every other player in a game of IPD in which the game continues to another round with probability δ (set to 0.96, for an average game length of 25 moves). The fitness of each individual is assigned to be that player’s average payoff per move in the tournament.
3. Fitness-proportionate selection is used to select parents for the next generation (stochastic uniform selection).
4. Each parent, when selected, produces one child, by a process of copying the genome of the parent (with a low mutation rate – the probability of mutation as each gene is copied), and the development of a new individual from this genome. The children become the next generation.
5. Repeat steps 2-4 for 1000 generations.
The results of this experiment were similar to those reported by Axelrod. The populations evolved in a few generations to a mixture of generally
Opponent Modelling, Evolution, and the Iterated Prisoner’s Dilemma
Table 7.1. Summary statistics for evolution of unintelligent strategies, n = 20, mean ± std.dev.
Mean fitness     Mean coop%    grim%         tft%
2.783 ± 0.013    86.7 ± 0.8    26.4 ± 1.5    19.7 ± 1.8
Fig. 7.3. Percentage of grim and tit-for-tat strategies for the run in figure 7.2. (Plot: percentage of the population, 0 to 80, playing grim (nGrim) and tit-for-tat (nTFT), against generation, 0 to 1000.)
cooperative players, cooperating around 87% of the time. As can be seen in table 7.1, the mean reward was close to the mean of 2.783 in all the runs. The average percentages of grim and tit-for-tat (TFT) strategies were around 26% and 20% respectively. Figure 7.2 shows a typical run, with defection initially popular, and cooperation taking over after about 20 generations. Although the mean reward and degree of cooperation of the population have stabilised, the composition of the population is constantly fluctuating, with grim and tit-for-tat always present in large numbers, appearing to be loosely tied together in a cycle of period about 100 generations. Figure 7.3 shows the percentages for the same typical run. In the second experiment, the players’ genomes were extended by adding a “smart bit”. With the smart bit turned on, the player becomes an intelligent mutant, and plays as an opponent modeller. With the smart bit turned off, the player remains an unintelligent player. In the initial population, all the smart bits were off. The scene is set. The mutants are equipped to exploit the weak amongst the normal players. Will this ability enable them to take over the population? Will they merely weed out the exploitable players?
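The evolutionary loop of the first experiment (steps 1 to 5) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the 5-bit genome layout, the population size, the mutation rate and the trembling-hand rate are assumed values that the text does not give, and plain roulette-wheel sampling stands in for the selection scheme named in step 3. The payoffs are the standard 5, 3, 1, 0 values.

```python
import random

# Standard IPD payoffs (T=5, R=3, P=1, S=0), keyed by (own move, opponent move).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}
STATES = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]

def move(genome, state, rng, tremble):
    # Hypothetical genome layout: gene 0 is the opening move, genes 1-4 are the
    # replies to each (own last move, opponent last move) state; 1 = cooperate.
    gene = genome[0] if state is None else genome[1 + STATES.index(state)]
    if rng.random() < tremble:   # trembling hand: play the opposite move
        gene = 1 - gene
    return "C" if gene == 1 else "D"

def play_game(g1, g2, rng, delta=0.96, tremble=0.01):
    """Play one game; return (payoff of g1, payoff of g2, number of moves)."""
    s1 = s2 = None
    total1 = total2 = moves = 0
    while True:
        m1, m2 = move(g1, s1, rng, tremble), move(g2, s2, rng, tremble)
        total1 += PAYOFF[(m1, m2)]
        total2 += PAYOFF[(m2, m1)]
        moves += 1
        s1, s2 = (m1, m2), (m2, m1)
        if rng.random() >= delta:   # the game continues with probability delta
            return total1, total2, moves

def evolve(pop_size=20, generations=50, mutation_rate=0.01, seed=0):
    rng = random.Random(seed)
    # Step 1: random initial population of 5-bit pure strategies.
    population = [[rng.randint(0, 1) for _ in range(5)] for _ in range(pop_size)]
    for _ in range(generations):              # step 5: repeat steps 2-4
        payoff = [0.0] * pop_size
        moves = [0] * pop_size
        for i in range(pop_size):             # step 2: round-robin tournament
            for j in range(i + 1, pop_size):
                p_i, p_j, n = play_game(population[i], population[j], rng)
                payoff[i] += p_i
                payoff[j] += p_j
                moves[i] += n
                moves[j] += n
        fitness = [payoff[k] / moves[k] for k in range(pop_size)]
        # Step 3: fitness-proportionate selection (plain roulette wheel here).
        parents = rng.choices(population, weights=fitness, k=pop_size)
        # Step 4: each parent yields one child, copied gene by gene with mutation.
        population = [[1 - g if rng.random() < mutation_rate else g for g in parent]
                      for parent in parents]
    return population
```

With δ = 0.96 the number of moves per game is geometrically distributed with mean 1/(1 − δ) = 25, which is the average game length quoted in the text. In the second experiment the genome would carry one extra gene, the smart bit, which is not modelled in this sketch.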
Fig. 7.4. A typical run with unintelligent players and opponent modellers. (Plot: mean fitness, on a scale of 1 to 3, and cooperation, on a scale of 0.5 to 1, against generation, 0 to 1000.)
Table 7.2. Summary statistics for coevolution of unintelligent players with opponent modellers, n = 20, mean ± std.dev.
Mean fitness    Mean modeller fitness    Mean coop%    modeller%     grim%         tft%
2.67 ± 0.01     2.51 ± 0.01              80.6 ± 0.6    13.5 ± 0.7    21.7 ± 2.0    24.1 ± 2.0
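The forgetting-factor update used by the opponent modellers in the CEC’04 study can be illustrated with a short sketch. The value γ = 0.9 and the starting counters are assumptions made only for this example, since the text does not state them. The sum of the two counters stays constant when it starts at 2, because γ · 2 + 2(1 − γ) = 2.

```python
def update_counters(coop, defect, opponent_cooperated, gamma=0.9):
    """Decay both counters by gamma, then credit the observed move with 2*(1-gamma)."""
    coop *= gamma
    defect *= gamma
    if opponent_cooperated:
        coop += 2 * (1 - gamma)
    else:
        defect += 2 * (1 - gamma)
    return coop, defect

def coop_probability(coop, defect):
    """Estimated probability that the opponent cooperates in this game position."""
    return coop / (coop + defect)

# Assumed starting point: no preference either way, counters summing to 2.
coop, defect = 1.0, 1.0
for _ in range(50):
    coop, defect = update_counters(coop, defect, opponent_cooperated=True)
print(coop + defect, coop_probability(coop, defect))
```

A smaller γ forgets faster: a single observation then moves the estimate further, which suits an opponent whose behaviour changes but makes the model noisier against a stochastic one.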
Figure 7.4 shows the mean fitness and level of cooperation in a typical run. The picture is similar to that of the first experiment, with a slightly lower degree of cooperation at around 81%, and slightly lower mean rewards around 2.67. Figure 7.5 shows the percentage of tit-for-tat and grim strategies and also the percentage of opponent modellers for the same run. As table 7.2 shows, a significant number of opponent modellers, a mean of around 13.5% of the population, is able to survive. Compared to the first experiment, some of the grim strategies have been displaced, but the percentage of tit-for-tat strategies has actually increased. We conjecture that the increase in tit-for-tat was at the expense of more exploitable strategies, which are under pressure from the opponent modellers. While grim players can’t be exploited, they are involved in a lot of unprofitable mutual defection with opponent modellers, and also suffer. Although opponent modellers are able to survive in this simulated environment, their mean fitness is lower than that of the rest of the population. Without mutation, they would be driven to extinction. One problem for
Fig. 7.5. Percentage of opponent modellers, grim and tit-for-tat strategies for the run in figure 7.4. (Plot: percentage of the population, 0 to 60, playing grim (nGrim), tit-for-tat (nTFT) and opponent modelling (nSmart), against generation, 0 to 1000.)
opponent modellers is the poor payoff from games with grim. But, the main reason for their relatively poor performance is that when two opponent modellers meet, their average payoff is only 1.69. (In the competition, this doesn’t happen, as after the first 50 moves, each thinks the other is playing tit-for-tat, so mutual cooperation is locked in. In any case, each player only plays itself once in the competition, so self-play is not an important factor.) As an explanation of how intelligence might evolve, this model has raised some questions. One could regard it as an illustration of the self-limiting nature of exploitative behaviour in human and animal societies. Taking these results as a starting point, one could ask under what conditions intelligent players would do better, or worse, against unintelligent opponents, than they did in this experiment. Answers to this question might provide clues as to how and why intelligence has evolved in Nature, and why various successful species have varying degrees of intelligence. In the next section, we describe some new experiments in which we investigate some of the effects that contributed to the results of this section.
7.2.1. The new experiments
As described above, one finding of the CEC’04 work was that the presence of opponent modellers in an evolving population of IPD playing agents has
an influence on the kinds of fixed strategy players favoured by evolution. The experiments described in this section seek to explore this further. The main difference between these experiments and the earlier ones is that opponent modellers do not directly take part in the evolution process, but are used to test the fitness of the members of an evolving population of fixed strategy IPD players. This makes it possible to isolate and manipulate the influence of the opponent modellers. There are some minor differences between the implementation of opponent modelling used in these experiments and the one used in the earlier study. Instead of using a default model of the opponent strategy (based on TFT), as in the CEC’04 work, these new experiments start with an empty model of the opponent. As before, we count the number of times the opponent cooperated for each game state to determine a probability of cooperation for that game state based on observation of the opponent’s moves. From this probability of cooperation, we calculate the best next move by looking ahead in the game state graph and computing the best expected payoff over each possible course of action. Since look-ahead is computationally expensive, we consider only a look-ahead of 5 moves (sufficient to prevent short-term gains from taking precedence over long-term considerations). Unlike the earlier work, we do not include a recency factor to discount older observations as we do not make any short-cut assumption about the opponent strategy at the start. Exploration of an opponent’s strategy is also undertaken differently. In the earlier CEC’04 work, a trembling hand was used for exploration of the opponent strategy. In this work, exploration is more immediate – the opponent modeller makes random decisions when it encounters game states for which it has no information about the opponent’s strategy.
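A minimal sketch of such an opponent modeller is given here, assuming a game state is the pair (own last move, opponent last move) and the standard 5, 3, 1, 0 payoffs. The class and function names are illustrative, and treating an unseen state as a coin flip inside the look-ahead is an assumption, since the text only says that the player's own move is random in such states.

```python
import random

PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

class OpponentModel:
    """Counts, per game state, how often the opponent cooperated."""

    def __init__(self):
        self.counts = {}   # state -> [times cooperated, times observed]

    def observe(self, state, opponent_move):
        coop, seen = self.counts.get(state, [0, 0])
        self.counts[state] = [coop + (opponent_move == "C"), seen + 1]

    def p_cooperate(self, state):
        """Estimated probability of cooperation, or None for an unseen state."""
        if state not in self.counts:
            return None
        coop, seen = self.counts[state]
        return coop / seen

def action_value(model, state, my_move, depth, p):
    """Expected payoff of playing my_move now, then playing optimally."""
    value = 0.0
    for opp_move, prob in (("C", p), ("D", 1.0 - p)):
        if prob > 0.0:
            future = expected_value(model, (my_move, opp_move), depth - 1)
            value += prob * (PAYOFF[(my_move, opp_move)] + future)
    return value

def expected_value(model, state, depth):
    """Best expected total payoff over the next `depth` moves from `state`."""
    if depth == 0:
        return 0.0
    p = model.p_cooperate(state)
    if p is None:
        p = 0.5   # assumption: an unseen state is treated as a coin flip
    return max(action_value(model, state, m, depth, p) for m in "CD")

def choose_move(model, state, rng, depth=5):
    p = model.p_cooperate(state)
    if p is None:
        return rng.choice("CD")   # explore: nothing is known about this state
    return max("CD", key=lambda m: action_value(model, state, m, depth, p))
```

Given a fully observed tit-for-tat model, the 5-move look-ahead from mutual cooperation values cooperating at 17 and defecting at 16, so the sketch cooperates. Against a model of Always Cooperate it defects.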
The advantage of this approach is that exploration occurs earlier in the modelling process, thus meaning more information is available earlier in the game, hopefully leading to better exploitation of an opponent’s weaknesses in the short term. Of course, the key difference between our new experiments and those presented at CEC’04 has to do with the effects of the opponent modeller on the course of the evolutionary process. In the CEC’04 work, the opponent modeller was considered another instance of the evolving population that could replicate (so there could be multiple copies of the opponent modeller), and needed to compete to earn their position in the population in order to survive (i.e., opponent modellers were subjected to the same evolutionary
selection pressure as the unintelligent players). Results from the CEC’04 work showed that due to poor performance against other opponent modellers (the average return for self-play was 1.69), the number of opponent modellers in the population fell away over the course of the evolutionary run. In these new experiments, we take a different approach – we do not involve the opponent modeller in the evolutionary process (so the opponent modeller is not subjected to the same evolutionary pressures), and it is instead treated separately from the evolving population. As before, we still maintain a population of unintelligent players that must compete for their right to remain (and reproduce) in the population, but now assessment of an individual’s ability (its fitness) is calculated as a weighted sum of its performance against the other (unintelligent) members of the evolving population and its performance against the separate opponent modeller. Below, we report on experiments with different weightings to determine and isolate the effects of the opponent modeller on the evolutionary process. While there are a number of differences between these two studies, analysis shows that the results are mostly robust with respect to these differences. Indeed, compensating for the effects of self-play in the earlier CEC’04 work yields results mostly similar to the results found using this new methodology (some differences occur due to the differences in exploration between the two approaches). We use our simpler approach in the experiments below, thus allowing us to explore longer-term effects and longer-term IPD games (these new experiments investigate games lasting 1000 rounds while the earlier work investigated games lasting only 25 rounds). Our baseline experiment is to play the opponent modeller against a selection of eight commonly known IPD strategies. Each strategy is played against each of the others for 1000 iterations, giving a total of 8000 iterations for each strategy.
The results of the round-robin tournament are presented below in table 7.3. Table 7.4 reports a breakdown of the opponent modeller’s performance (average payoff) against each of the strategies in table 7.3. Table 7.3 lists a couple of strategies we have yet to describe. STFT (Suspicious tit-for-tat) is like tit-for-tat except that it defects on the first move. Gradual is another variation on tit-for-tat: this strategy acts as tit-for-tat, except that after the first defection of the other player, it defects one time and cooperates two times; after the second defection of the opponent, it defects two times and cooperates two times, and so on. The Pavlov strategy is similar to grim, except that it is more forgiving. Based around the
Table 7.3. Round-robin tournament results involving the opponent modeller against eight other commonly known IPD strategies.
Rank    Strategy             Average payoff
1       Opponent Modeller    2.74
2       Gradual              2.68
3       TFT                  2.59
4       Grim                 2.26
5       STFT                 2.22
6       Pavlov               2.15
7       Always Cooperate     2.07
8       Always Defect        2.05
9       Random               1.64
Table 7.4. Round-robin tournament results involving the opponent modeller against eight other commonly known IPD strategies.
                     Average payoff
Strategy             Opponent Modeller    Opponent
Gradual              2.87                 2.75
TFT                  2.99                 2.99
Grim                 1.00                 1.01
STFT                 2.99                 3.00
Pavlov               3.00                 0.50
Always Cooperate     5.00                 0.01
Always Defect        1.00                 1.00
Random               3.04                 0.51
principle of continuing to do the same thing when performing well and only changing when performing poorly, Pavlov starts cooperating and continues to cooperate until its opponent defects. Upon defection, Pavlov switches to defection. The difference between grim and Pavlov is that Pavlov will return to cooperation if defection does not prove to be profitable (i.e., if its opponent also begins to defect), hoping to return back to a state of mutual cooperation. That the opponent modeller emerged as the winner of the tournament is encouraging, but is not particularly significant given the arbitrary selection of opponents in the tournament. What is more interesting is the performance of the opponent modeller against each individual strategy. The first thing to note is the ability of the opponent modeller to successfully identify Always Defect as the best counter-strategy against the non-reactive opponents (Always Cooperate, Always Defect, and random),
achieving near-perfect scores against Always Cooperate and Always Defect, and the best possible result against random. Against tit-for-tat and STFT, the opponent modeller is able to identify cooperation as the best course of action without falling into the defection echo trap. As expected, the inevitable strategy exploration against grim is punished, resulting in a poor score for the opponent modeller. The relatively poor performance of Pavlov in the round-robin tournament is at least partially due to the opponent modeller settling on an Always Defect counter-strategy, rather than the equally effective Always Cooperate alternative. Our next experiments examine the effect of the opponent modeller on the course of a population of IPD players subjected to evolutionary selection pressure. As seen in the earlier CEC’04 work, the presence of opponent modellers in the population affects the kinds (and distribution) of fixed strategy players selected by evolution. These new experiments further elaborate on these effects. First, we report on the performance of the opponent modeller against an evolving population of fixed pure strategies. Figure 7.6 plots the average payoff for the opponent modeller against each member of the population along with the average payoff of the evolving population.
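The remark about Pavlov can be checked with a small deterministic sketch. The lookup tables below are one reading of the strategy descriptions in this section, written as 1st-order tables with the standard 5, 3, 1, 0 payoffs; the names and the encoding are illustrative. Either Always Defect or Always Cooperate earns an average of 3 against Pavlov, but only the first leaves Pavlov with 0.5, the score reported for Pavlov in table 7.4.

```python
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}

def table(cc, cd, dc, dd):
    """Replies to each (own last move, opponent last move) state."""
    return {("C", "C"): cc, ("C", "D"): cd, ("D", "C"): dc, ("D", "D"): dd}

# name -> (opening move, reply table)
STRATEGIES = {
    "TFT":    ("C", table("C", "D", "C", "D")),
    "STFT":   ("D", table("C", "D", "C", "D")),
    "Grim":   ("C", table("C", "D", "D", "D")),
    "Pavlov": ("C", table("C", "D", "D", "C")),
    "AllC":   ("C", table("C", "C", "C", "C")),
    "AllD":   ("D", table("D", "D", "D", "D")),
}

def play(name_a, name_b, rounds=1000):
    """Return the average payoff per move of each player."""
    (move_a, table_a), (move_b, table_b) = STRATEGIES[name_a], STRATEGIES[name_b]
    total_a = total_b = 0
    for _ in range(rounds):
        total_a += PAYOFF[(move_a, move_b)]
        total_b += PAYOFF[(move_b, move_a)]
        # Both replies are computed from the moves just played.
        move_a, move_b = table_a[(move_a, move_b)], table_b[(move_b, move_a)]
    return total_a / rounds, total_b / rounds

print(play("Pavlov", "AllD"))   # (0.5, 3.0)
print(play("Pavlov", "AllC"))   # (3.0, 3.0)
print(play("TFT", "STFT"))      # (2.5, 2.5)
```

The last line shows the defection echo mentioned above: tit-for-tat and STFT alternate between exploiting and being exploited, for an average of 2.5 each instead of the 3 available from mutual cooperation.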
Fig. 7.6. Average payoffs for the evolving population of fixed pure strategies and an opponent modeller playing against each member of the population over time. (Plot: mean pay-off, 0 to 4, for the population and the modeller, against generation, 0 to 1000.)
We can see that in most generations, the opponent modeller is able to outperform the evolving population, obtaining a higher average payoff than the average payoff of the evolving population. However, there are several generations where the population outperforms the opponent modeller. Analysis of the population composition at these points reveals that this occurs when there are a large number of grim strategies in the population (recall that exploration against the unforgiving grim is fatal – one defection against grim locks the opponent modeller into a payoff of at best 1.0 from then on). For example, in generation 988, where the opponent modeller is at its least effective (scoring on average 1.10 less than the evolving population), the number of grim strategies reaches its peak of 68% of the population. The first row of table 7.6 reports the composition of the evolving population for the corresponding experiments plotted in figure 7.6. We can see grim, tit-for-tat, and Pavlov are the most prevalent in the population. The results from these experiments show that the opponent modeller is successful against an evolving population of fixed pure IPD strategies, provided the proportion of grim strategies in the population is not high. However, these experiments have not rewarded fixed strategies that score well against the opponent modeller, only those that perform well against the rest of the evolving population. Next, we examine experiments that incorporate scores achieved against the opponent modeller into the fitness evaluations of the fixed strategies. Table 7.5 reports the average payoffs for the members of the evolving population of pure strategies and the opponent modeller, along with the composition of selected strategies in the population (dashed entries indicate low numbers) for different ratios of the weighted sum that constitutes the fitness of a member of the evolving population.
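The weighted sum can be sketched as a convex combination of the two average payoffs. The exact form is an assumption, chosen because the weighting against the modeller runs from 0 (population only) to 1.0 (modeller only) in the experiments, and the function name is hypothetical.

```python
def weighted_fitness(payoff_vs_population, payoff_vs_modeller, weighting):
    """Blend the two average payoffs; `weighting` is the share given to the modeller."""
    if not 0.0 <= weighting <= 1.0:
        raise ValueError("weighting must lie in [0, 1]")
    return (1.0 - weighting) * payoff_vs_population + weighting * payoff_vs_modeller

# Illustration only: an individual scoring 2.68 against the population and
# 2.08 against the modeller, evaluated at a weighting of 0.2.
print(round(weighted_fitness(2.68, 2.08, 0.2), 2))   # 2.56
```

At a weighting of 0 the modeller has no influence on selection, which corresponds to the experiment plotted in figure 7.6. At 1.0 an individual's fitness ignores its play against the rest of the population.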
Table 7.5. Average payoffs for the members of an evolving population of pure strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum.
Weighting             Population average payoff               Modeller
(against Modeller)    Against population    Against Modeller  average payoff
0                     2.50(0.19)            1.07              2.62(0.32)
0.05                  2.51(0.21)            1.31              2.81(0.27)
0.1                   2.60(0.12)            1.60              2.94(0.21)
0.2                   2.68(0.09)            2.08              2.99(0.12)
0.5                   2.65(0.09)            2.53              2.93(0.06)
1.0                   2.14(0.25)            2.72              2.91(0.05)
Table 7.6. Distribution of strategies for the members of an evolving population of pure strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum as in table 7.5.
Weighting             Number of each fixed strategy
(against Modeller)    Grim      TFT       Pavlov    STFT
0                     25(10)    15(7)     10(5)     -
0.05                  15(6)     25(10)    7(5)      -
0.1                   12(6)     33(9)     5(3)      -
0.2                   8(3)      49(8)     -         8(3)
0.5                   4(1)      61(6)     -         13(4)
1.0                   2(1)      38(11)    -         41(11)
The first obvious difference of the experiments that include performance against the opponent modeller in fitness calculations (rows 2 onwards in table 7.5) is the increased average payoff of the opponent modeller. In comparison to the first row of the Table, we see that the average payoff of the opponent modeller increases up to a point, before leveling off at around 2.95. This is due to the changes in the composition of the resulting evolved population (see table 7.6). As we saw in our baseline experiment, grim and Pavlov do not perform well against the opponent modeller (scoring 1.01 and 0.50 respectively) and hence even a very low degree of influence from the opponent modeller on fitness scores is enough to reduce the appearance in the evolving population of these strategies. With a weighting of 0.2, Pavlov is unable to score highly enough to survive in any significant quantities and the presence of grim is much reduced. The reduction in the number of grim strategies explains the increase in the average payoff of the opponent modeller (recall that the opponent modeller performs poorly against grim because of the high cost of strategy exploration). With a 0.5 weighting, grim becomes marginalised. The increasing number of STFT strategies at the higher weightings explains the small decrease in average payoff of the opponent-modeller – STFT is not exploitable and indeed may benefit from its suspicious nature at the beginning of the game. At the higher weightings, the only strategies other than tit-for-tat and STFT to appear in the population are single-step (differing in just one state) mutants from tit-for-tat and STFT (including grim) induced by the mutation in the evolutionary process. These mutants are not able to survive in the evolving population and are quickly eliminated. Variance in the performance of the opponent modeller also decreases as we increase the relative importance of performance against the opponent
modeller in the evaluation of the success of a population member. This is because the opponent modeller acts as a stabilising influence on the fitness of the evolving population since it is a constant in the environment. The more that the fitness is derived from games against the remainder of the population (low weightings), the more performance is affected by changes in the population. As seen in column 3 of table 7.5, the average payoff of the evolving population against the opponent modeller increases as the relative importance of performance against the opponent modeller in fitness calculations for a population member increases. This is as expected, because survival in the population now depends more and more on this metric than performance against the other members of the evolving population. Indeed, at a weighting of 1.0, performance against the opponent modeller is maximal, at a sacrifice of performance against the other members of the evolving population. Somewhat strangely though, as the weighting increases from 0 to 0.2, performance of the evolving population against other members of the population increases, even though fitness now depends more on performance against the opponent modeller. This is due to the decreased numbers of grim strategies (driven out by the opponent modeller) – defection is no longer as costly as it was before. At the highest weighting, even the small numbers of grim strategies ensure that the abundance of STFT strategies perform relatively poorly, lowering the average-payoff of the evolving population in play against each other. Importantly, we observe in table 7.5 that while the performance of the evolving population against the opponent modeller increases as the weighting increases, the evolving population is never able to obtain a level of performance comparable to that of the opponent modeller (contrast column 4 of table 7.5 against column 3). 
Of course, evolution does its best – evolving a population consisting of predominantly non-exploitable strategies (tit-for-tat and STFT). However, due to the stochastic nature of the evolutionary process in the mutation operation, other strategies find their way into the population, thus allowing the opponent modeller to exploit weaknesses and obtain an average payoff higher than that of the evolving population. Our analysis of table 7.5 shows the opponent modeller to be effective against populations of pure strategies, outperforming the evolving population in terms of average payoff in play against each other. The opponent modeller is able to outperform the evolving population, learning with a high degree of certainty what its opponent will do in any given situation
(game state). In our next experiment, we repeat these tests using stochastic strategies in place of pure strategies. Stochastic IPD strategies differ from pure IPD strategies as they allow the player the flexibility of selecting a cooperate/defect action probabilistically given a particular game state. Whereas a pure strategy will always select the same action for a given game state, a stochastic strategy may (probabilistically) decide which action to take. This means that successive calls of a stochastic strategy for the same input game state may produce different output actions. This cannot occur for a pure strategy – the pure strategy will always select the same response given an input game state. Stochastic strategies are implemented as follows: for each unique game state (recall, we are assuming 1st order strategies only), the stochastic strategy stores a probability that determines the probability of cooperating in this game state. Choice of an action depends directly on this probability – this probability of cooperating is this stored probability. Mutation of a stochastic strategy occurs by adjusting each internal probability by a randomly sampled variable taken from a Gaussian distribution with mean 0 and a standard deviation of 0.025. Against stochastic strategies, the opponent modeller can still observe the probability with which its opponent will cooperate, but it cannot be sure that the opponent will cooperate on any given move. This experiment against stochastic strategies will report on the effects of this uncertainty in behaviour on the performance of the opponent modeller. Table 7.7 reports the average payoffs for the members of the evolving population of stochastic strategies and the opponent modeller, along with the composition of selected strategies in the population (dashed entries
Table 7.7. Average payoffs for the members of an evolving population of stochastic strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum.
Weighting             Population average payoff               Modeller
(against Modeller)    Against population    Against Modeller  average payoff
0                     2.08(0.61)            1.29              2.16(0.59)
0.05                  2.43(0.22)            2.23              2.54(0.22)
0.1                   2.59(0.14)            2.49              2.68(0.12)
0.2                   2.61(0.12)            2.76              2.68(0.10)
0.5                   2.54(0.11)            2.96              2.62(0.09)
1.0                   2.13(0.18)            3.08              2.44(0.08)
Table 7.8. Distribution of strategies for the members of an evolving population of stochastic strategies and an opponent modeller playing against each member of the population, for different weightings in the weighted sum as in table 7.7. Weighting (against Modeller) 0 0.05 0.1 0.2 0.5 1.0
Number of each fixed strategy Grim TFT Pavlov STFT 22(18) 12(14) 7(8) 2(2) 6(10) 23(23) 23(22) 22(19) 37(27) 40(29) 29(28) 40(30) 36(27) 60(29) 31(28)
indicate low numbers) for different ratios of the weighted sum that constitutes the fitness of a member of the evolving population. Against a population of evolved stochastic strategies (row 1 of table 7.7), the opponent modeller, on average, does outscore the evolving population, performing well in certain generations, but not in others. As in the equivalent experiment against pure strategies, this performance depends on the number of grim-like strategies in the evolving population – when the number of grim-like strategies is high, performance is relatively weak; when the number of grim-like strategies is low, performance is relatively high. However, unlike the experiment involving pure strategies, the performance of the opponent modeller is more unstable, perhaps due to large dynamic changes in the composition of the opponent strategies observed in the evolution of a population of stochastic strategies. No such large-scale changes in strategy composition were evident in the evolution of a population of pure strategies (contrast the variance in the numbers of each fixed strategy in table 7.6 and table 7.8). As in the experiment with pure strategies, as the importance of the performance against the opponent modeller increases, the average payoff of the evolving population against the opponent modeller increases. However, in contrast to the experiments involving pure strategies, we see that the evolving population is able to obtain a higher average payoff than the opponent modeller for weightings greater than 0.2 (recall previously that an evolving population of pure strategies was unable to surpass the performance of an opponent modeller regardless of the relative weighting). Indeed, at a weighting of 1.0, the evolving population is able to achieve an average payoff of greater than 3 against the opponent modeller, whilst the opponent modeller scores less than 2.5 on average, suggesting that the evolving
population is exploiting the opponent modeller. Why is the opponent modeller scoring less than its opponent in these scenarios? Does this represent a failure for our opponent modelling approach, or even opponent modelling in general? The key to understanding these observations has to do with the composition of the evolving population. At the higher weightings, tit-for-tat-like and STFT-like strategies account for the majority of strategies making up the evolving population (indeed, grim-like strategies have mostly disappeared). If these were pure strategies, we would expect to see them achieve an average payoff of no more than 3 (mutual cooperation). However, these strategies are not pure, instead behaving stochastically, acting mostly like their pure strategy counterpart, but sometimes not. This means that a stochastic TFT-like strategy will typically play like tit-for-tat and enter into mutual cooperation. However, occasionally, this stochastic tit-for-tat-like strategy will attempt an unprovoked defection. To understand why these stochastic variants are successful, particularly against the opponent modeller, we need to consider the nature of the game. As IPD is not a zero-sum game, and since the objective of the opponent modeller is to achieve the highest payoff it can (and not to achieve a higher payoff than its opponent), it is often better for the opponent modeller to accept the occasional defection without retaliating in order to achieve a higher average payoff in the long run (provided the defection doesn’t occur too frequently). Indeed, if the opponent modeller were to reciprocate every defection by its opponent, it would be able to prevent its opponent from significantly out-scoring it, but at the cost of lowering its own average payoff (for example, it is better to accept an average payoff of 2 and allow your opponent an average payoff of 4 than to retaliate and restrict both players to an average payoff of 1).
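The stochastic strategies discussed here store one cooperation probability per game state and mutate by adding Gaussian noise with mean 0 and standard deviation 0.025, as described earlier in this section. A minimal sketch follows. The class name is illustrative, clipping mutated probabilities to the range 0 to 1 is an assumption (the text does not say how out-of-range values are handled), and the opening move is left out.

```python
import random

class StochasticStrategy:
    """1st-order strategy: one cooperation probability per game state."""

    def __init__(self, coop_prob):
        # Maps (own last move, opponent last move) to a probability of cooperating.
        self.coop_prob = dict(coop_prob)

    def act(self, state, rng):
        """Cooperate with the probability stored for this state."""
        return "C" if rng.random() < self.coop_prob[state] else "D"

    def mutated(self, rng, sigma=0.025):
        """Return a child whose probabilities are perturbed by Gaussian noise."""
        # Assumption: results are clipped so that they remain valid probabilities.
        return StochasticStrategy({
            state: min(1.0, max(0.0, p + rng.gauss(0.0, sigma)))
            for state, p in self.coop_prob.items()
        })

rng = random.Random(3)
# A pure tit-for-tat written as a degenerate stochastic strategy.
tft = StochasticStrategy({("C", "C"): 1.0, ("C", "D"): 0.0,
                          ("D", "C"): 1.0, ("D", "D"): 0.0})
tft_like = tft.mutated(rng)   # mostly tit-for-tat, but no longer guaranteed to be
```

Because a mutated probability that starts at 1.0 can only stay there or fall, repeated mutation of a pure tit-for-tat tends to produce the kind of player described above: one that cooperates after cooperation almost always, with an occasional unprovoked defection.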
This scenario provides an interesting demonstration of the interactions between evolution and learning in a competitive environment. We have observed that evolution produces IPD players that can improve their average payoffs against the opponent modeller by employing occasional unprovoked defections. It would seem that as long as the evolved strategies do not defect often enough to evoke retaliation from the opponent modeller, they will achieve higher average payoffs than the opponent modeller. In response, the opponent modeller seemingly recognises that, although it is being exploited, it will achieve better future rewards by not retaliating, since its opponent will resume cooperation after each unprovoked defection. Indeed, this suggests that Axelrod’s third guideline for playing IPD (“always reciprocate
cooperation and defection”) does not apply against stochastic strategies. We still deem this a success for the opponent modeller – indeed, the opponent modeller is still able to achieve the highest payoff possible against this particularly “nasty” opponent. Sometimes, you just have to grin and bear it.
7.3. Conclusions
IPD is a game that models human choices in self-interested environments. Previous studies of the game have focused on both evolution and standard artificial intelligence techniques to study game strategies. However, something has been missed in these previous investigations – the role of a theory of mind, specifically, of adapting one’s play based upon a learned model of an opponent’s strategy. This is the area of opponent modelling – building a representation of an opponent’s strategy, typically from experience, in order to exploit weaknesses in their play. The trade-off between exploration (searching for better ways to exploit an opponent) and exploitation (taking advantage of the weaknesses in an opponent’s strategy) is paramount to the success of the opponent modeller – too much strategy exploration and the opponent modeller may not solidify its advantage; too little strategy exploration and the opponent modeller may be sacrificing potential gains. A balance between the two must be achieved for near-optimal play. Using an observational model of the choices made by an opponent and a simple technique to select the best choice given the next most likely action of the opponent, we have introduced a simple approach to construct computer IPD players capable of exploiting observable strategy weaknesses in opponents’ play. Our experiments show that a computer opponent modelling IPD player is able to outperform an evolving population of fixed pure-strategy opponents in terms of average payoff in play against each other and perform as well as possible against a population of stochastic-strategy opponents.
Further, the strong performance of our entry in the IPD competition held at CIG’05, the 2005 IEEE Computational Intelligence in Games conference, supports our claims of the benefits of opponent modelling – our entry, based on the ideas presented in this work, consistently finished in the top five in the classical IPD competitions, and performed honourably in the collusion-based competitions. Beyond the IPD game, this work makes a contribution to the question of how intelligent behaviour evolves. Higher intelligence is more than simple mimicry or rote learning, requiring the ability to predict and respond to
specific “opponent” choices. Our work reflects on a Machiavellian view of intelligence, in which the manipulation of the behaviour of other individuals is crucial. High levels of intelligence are not universal in Nature; the majority of life is simple and unintelligent, and human-level intelligence is unique. A traditional explanation for this invokes cost in terms of energy needs of a highly developed brain. One of our reviewers pointed out that our approach offers a fundamentally different explanation in terms of the cost of exploration. Yet another explanation is the self-limiting dynamics of having an intelligent sub-population. Our experiments show, for example, that opponent modelling is a viable strategy in an IPD environment, and moreover, that the presence of opponent modellers affects the success of other strategies, which in turn alters the characteristics of that environment. This may be an important factor to consider in any study of the evolution of intelligence. The subtleties and parameters of such interactions might offer an explanation as to why the varying requirements of different ecological niches lead to co-existence of species having different levels of intelligence. Further study is needed to understand such interactions and the factors that determine their outcomes.

References

Ashlock, D., Smucker, M. D., Stanley, E. A., and Tesfatsion, L. (1996) Preferential partner selection in an evolutionary study of prisoner’s dilemma, BioSystems, 37, pp. 99-125.
Axelrod, R. (1984) The Evolution of Cooperation. New York, Basic Books.
Byrne, R. W. and Whiten, A. (1988) Machiavellian Intelligence: Social Expertise and the Evolution of Intellect in Monkeys, Apes and Humans. Oxford, Clarendon Press.
Calvin, W. H. (1983) A Stone’s Throw and its Launch Window: Timing Precision and its Implications for Language and Hominid Brains, Journal of Theoretical Biology, 104, pp. 121-135.
Calvin, W. H. (1991) The Ascent of Mind, Bantam.
Fogel, D. B. (1993) Evolving behaviors in the iterated prisoner’s dilemma, Evolutionary Computation, 1, 1, pp. 77-97.
Hingston, P. and Kendall, G. (2004) Learning versus Evolution in Iterated Prisoner’s Dilemma, Proceedings of the IEEE Congress on Evolutionary Computation (CEC’05), Portland, IEEE, pp. 364-372.
Markovitch, S. and Reger, R. (2005) Learning and Exploiting Relative Weaknesses of Opponent Agents, Autonomous Agents and Multi-Agent Systems, 10, pp. 103-130.
Maynard-Smith, J. (1988) Did Darwin get it right? Essays on Games, Sex and Evolution, Penguin Books.
Maynard-Smith, J. and Harper, D. (2003) Animal Signals. Oxford, Oxford University Press.
Miller, G. F. (1997) Protean primates: The evolution of adaptive unpredictability in competition and courtship. In Machiavellian Intelligence II: Extensions and Evaluations. Cambridge, Cambridge University Press, pp. 312-340.
Miller, J. H. (1996) The coevolution of automata in the repeated prisoner’s dilemma, Journal of Economic Behavior and Organization, 29, pp. 87-112.
Nowak, M. and May, R. (1992) Evolutionary games and spatial chaos, Nature, 359, pp. 826-829.
Parker, S. T., Mitchell, R. W. and Boccia, M. L., Eds. (1994) Self-awareness in Animals and Humans: Developmental Perspectives. Cambridge, Cambridge University Press.
Whiten, A. B. and Byrne, R. W. (1997) Machiavellian Intelligence II: Extensions and Evaluations. Cambridge, Cambridge University Press.
Chapter 8 On some winning strategies for the Iterated Prisoner’s Dilemma or Mr. Nice Guy and the Cosa Nostra
Wolfgang Slany and Wolfgang Kienreich
Technical University, Graz, Austria

We submitted two kinds of strategies to the iterated prisoner’s dilemma (IPD) competitions organized by Graham Kendall, Paul Darwen and Xin Yao in 2004 and 2005 [a]. Our strategies performed exceedingly well in both years. One type is an intelligent and optimistic enhanced version of the well-known TitForTat strategy, which we named OmegaTitForTat. It recognizes common behaviour patterns and detects and recovers from repairable mutual-defection deadlocks, otherwise behaving much like TitForTat. OmegaTitForTat was placed as the first or second individual strategy in both competitions in the leagues in which it took part. The second type consists of a set of strategies working together as a team. The call for participation of the competitions explicitly stated that cooperative strategies would be allowed to participate. This allowed a form of implicit communication which is not in keeping with the original IPD idea, but represents a natural extension to the study of cooperative behaviour in reality as it is aimed at through the study of the simple, yet insightful, iterated prisoner’s dilemma model. Indeed, one’s behaviour towards another person in reality is very often influenced by one’s relation to the other person. In particular, we submitted three sets of strategies that work together as groups. In the following, we will refer to these types of strategies as group strategies. We submitted the CosaNostra [b], the StealthCollusion, and the EmperorAndHisClones group strategies. These strategies each have one distinguished individual strategy, respectively called the CosaNostraGodfather (called ADEPT in 2004), the Lord strategy, and the Emperor, that heavily profit from the behaviour of the other members of their respective groups: the CosaNostraHitmen (10 to 20 members), the Peons (open number of members), and the CloneArmy (with more than 10,000 individually named members). These members willingly let themselves be abused by their masters while lowering the scores of all other players as much as possible, thus further maximizing the performance of their masters in relation to other participants. Our group strategies were placed first, second and third in several leagues of the competitions and were also likely the most efficient of all group strategies that took part in the competitions. Such group strategies have since been described as collusion group strategies. We will show that the study of collusion in the simplified framework of the iterated prisoner’s dilemma allows us to draw parallels to many common aspects of reality both in Nature and in Human Society, and therefore further extends the scope of the iterated prisoner’s dilemma as a metaphor for the study of cooperative behaviour in a new and natural direction. We further provide evidence that it will be unavoidable that such group strategies will dominate all future iterated prisoner’s dilemma competitions, as they can be stealthily camouflaged as non-group strategies with arbitrary subtlety. Moreover, we show that the general problem of recognizing stealth colluding strategies is undecidable in the theoretical sense.

The organization of this chapter is as follows: Section 0 introduces the terminology. Section 0 evaluates our results in the competitions. Section 0 describes our strategies. Section 0 analyses the performance of our and similar strategies and proves the undecidability of recognizing collusion. Section 0 relates the findings to phenomena observed in Nature and Human Society and draws conclusions.

[Footnote a: See http://www.prisoners-dilemma.com/ for more details.]
[Footnote b: One of us, Slany, had submitted the CosaNostra group strategy previously to an iterated prisoner’s dilemma competition organized by Thomas Grechenig in 1988. Our submitted group strategies are inspired by this first formulation of such a group strategy that we are aware of.]

8.1. Introduction

The payoff values in an iterated prisoner’s dilemma are traditionally called T (for temptation to betray a cooperating opponent), S (for sucker’s payoff when being betrayed while cooperating oneself), P (for punishment when both players betray each other), and R (for reward when both players cooperate with each other). Their values vary from formulation to formulation of the prisoner’s dilemma. Nevertheless, the inequalities S < P < R < T and 2R > T + S are always observed between them. The last one ensures that cooperating twice (2R) pays more than alternating one’s own betrayal of one’s partner (T) with allowing oneself to be betrayed by him or her (S)
[Kuhn (2003)]. In the iterated prisoner’s dilemma competitions organized by Graham Kendall, Paul Darwen and Xin Yao in 2004 and 2005, these values were, respectively, S = 0, P = 1, R = 3, and T = 5. Note that the general results in Section 0 are true for arbitrary values constrained by the inequalities stated above.

8.2. Analysis of the Tournament Results

The strategies we submitted to the competitions were the OmegaTitForTat individual, single-player strategy (OTFT), the CosaNostra group strategy, the StealthCollusion group strategy, and the EmperorAndHisClones group strategy. The following subsections summarize the results, followed by two sections commenting on real and presumed irregularities in some of the results.

8.2.1. 2004 competition, league 1 (standard IPD rules, with 223 participating strategies)

• Our OTFT was the best non-group, individual strategy.
• Our Godfather strategy (called ADEPT in 2004) of our CosaNostra group was the second best group strategy (with fewer than 10 members) after the STAR group strategy of Gopal Ramchurn (with 112 members, though we are not sure that all strategies colluded as one group). Note that even badly performing group strategies can score arbitrarily higher than individually better group strategies by sheer numerical superiority (see below and Section 0). We also initially noted with one eyebrow raised that 112 is exactly the smallest integer larger than 223 divided by 2, so the STAR group members were just more than 50% of the total population. However, we now believe that this might have been just a coincidence.
• Our EmperorAndHisClones group strategy was not allowed to fully compete but would have won by a large margin (it had more than 10,000 individually named clones, of which unfortunately only one was eventually allowed to participate); for payoff values see below. EMP scored as well as ADEPT, as it was following the same recognition protocol.
• Our StealthCollusion group strategy (sent in by a virtual person Constantin Ionescu and called LORD and PEON) participated as a proof of the collusion concept, apparently without detection of the collusion by the organizers, as further variants of members of the CosaNostra group strategy. Constantin asked the organizers to clone his PEON strategy as often as possible; however, only one copy was eventually allowed to participate. Read more about Constantin later in Section 0.

Simple calculations show that a numerical advantage would have vastly improved the results of our ADEPT and EmperorAndHisClones strategies. In all the following calculations we neglect protocol losses among group members; the numbers reported below are therefore insignificantly higher than the scores that would really have been achieved had the competitions taken place as described. Table 8.1a shows the results of the tournament with the number of clones actually allocated. Table 8.1b shows the estimated results if 100 additional clones had been allowed for our collusion strategy. Table 8.1c shows how 10,000 additional clones would have influenced the results. These results were computed for an average of 200 turns per game. On the one hand, EMP/ADEPT received the full temptation payoff T from their CosaNostraHitmen, Peons, and clones of the CloneArmy, while playing OmegaTitForTat against all strategies outside our group and thus achieving the same result against these as if the very well performing OmegaTitForTat strategy had been used by itself. On the other hand, the CosaNostraHitmen, Peons, and clones of the CloneArmy always cooperated with their EMP/ADEPT bosses while permanently betraying all strategies outside our group, so that the strategies outside our group received only the punishment payoff P or even the sucker’s payoff S against them. Clearly, had our strategies been composed of as many members as the STAR strategy or, even better, as many as we had submitted, they very plausibly would have won by large factors (43% with additional 100 members, 800% with additional 10,000 members as we had submitted).
We can therefore plausibly conjecture, under the assumption that the STAR strategy had more than 100 strategies colluding with each other, that our group strategies were vastly more efficient than the winning STAR group strategy. They would have won had we been allowed to play as we had submitted our strategies, which was positively hinted at by one of the organisers when we submitted them, in a mail received from Graham Kendall on May 29, 2004. Otherwise we would have inflated our stealth collusion strategies: we had prepared a respectable number of virtual persons similar to Constantin Ionescu, as described in Section 0. Also note that a sufficiently large group of real people would have produced a similar effect (e.g., one of us, Slany, has to teach 750 computer science students each year who in theory could all be enticed to participate).
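The estimates in Tables 8.1b and 8.1c follow from simple arithmetic, which the Java sketch below reproduces. It is an illustration only, not the calculation actually used for the tables. The class and method names are hypothetical. It assumes 200 turns per game, the temptation payoff T for a master in every turn against each additional clone, the punishment payoff P for every outside strategy in every turn against each additional clone, and no protocol losses. It also checks the two payoff constraints from Section 8.1.

```java
class CloneEstimate {
    // Payoff values used in the 2004 and 2005 competitions.
    static final int S = 0, P = 1, R = 3, T = 5;
    // Average number of turns per game assumed in the text.
    static final int TURNS_PER_GAME = 200;

    // Checks S < P < R < T and 2R > T + S.
    static boolean isValidDilemma(int s, int p, int r, int t) {
        return s < p && p < r && r < t && 2 * r > t + s;
    }

    // A master (EMP/ADEPT) gains T per turn from every additional clone.
    static long masterScore(long originalScore, int additionalClones) {
        return originalScore + (long) additionalClones * TURNS_PER_GAME * T;
    }

    // An outside strategy is assumed to end up with P per turn
    // against every additional clone (mutual defection).
    static long outsiderScore(long originalScore, int additionalClones) {
        return originalScore + (long) additionalClones * TURNS_PER_GAME * P;
    }

    public static void main(String[] args) {
        System.out.println(isValidDilemma(S, P, R, T));     // true
        System.out.println(masterScore(96_291, 100));       // 196291   (ADEPT, Table 8.1b)
        System.out.println(outsiderScore(117_057, 100));    // 137057   (StarSN, Table 8.1b)
        System.out.println(masterScore(96_291, 10_000));    // 10096291 (ADEPT, Table 8.1c)
        System.out.println(outsiderScore(117_057, 10_000)); // 2117057  (StarSN, Table 8.1c)
    }
}
```

Under these assumptions every additional clone is worth 1,000 points to a master but only 200 points to any other strategy, which is why the gap grows linearly with the number of clones.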
Table 8.1a. Original tournament results.

Rank  Player                   Strategy                  Score
1     Gopal Ramchurn           StarSN (StarSN)           117,057
2     Gopal Ramchurn           StarS (StarS)             110,611
3     Gopal Ramchurn           StarSL (StarSL)           110,511
4     GRIM (GRIM Trigger) 1    GRIM (GRIM Trigger)       100,611
5     Wolfgang Kienreich       OTFT (Omega tit for tat)  100,604
6     Wolfgang Kienreich       ADEPT (ADEPT Strategy)     96,291
7     Emp 1                    EMP (Emperor)              95,927
8     Bingzhong Wang           (noname)                   94,161
9     Hannes Payer             Probbary                   94,123
10    Nanlin Jin               HCO (HCO)                  93,953
Table 8.1b. Tournament results with additional 100 clones.

Rank  Player                   Strategy                  Score
1     Wolfgang Kienreich       ADEPT (ADEPT Strategy)    196,291
2     Emp 1                    EMP (Emperor)             195,927
3     Gopal Ramchurn           StarSN (StarSN)           137,057
4     Gopal Ramchurn           StarS (StarS)             130,611
5     Gopal Ramchurn           StarSL (StarSL)           130,511
6     GRIM (GRIM Trigger) 1    GRIM (GRIM Trigger)       120,611
7     Wolfgang Kienreich       OTFT (Omega tit for tat)  120,604
8     Bingzhong Wang           (noname)                  114,161
9     Hannes Payer             Probbary                  114,123
10    Nanlin Jin               HCO (HCO)                 113,953

Table 8.1c. Tournament results with additional 10,000 clones.

Rank  Player                   Strategy                  Score
1     Wolfgang Kienreich       ADEPT (ADEPT Strategy)    10,096,291
2     Emp 1                    EMP (Emperor)             10,095,927
3     Gopal Ramchurn           StarSN (StarSN)            2,117,057
4     Gopal Ramchurn           StarS (StarS)              2,110,611
5     Gopal Ramchurn           StarSL (StarSL)            2,110,511
6     GRIM (GRIM Trigger) 1    GRIM (GRIM Trigger)        2,100,611
7     Wolfgang Kienreich       OTFT (Omega tit for tat)   2,100,604
8     Bingzhong Wang           (noname)                   2,094,161
9     Hannes Payer             Probbary                   2,094,123
10    Nanlin Jin               HCO (HCO)                  2,093,953

8.2.2. 2004 competition, league 2 (uncertainty IPD variant, same 223 participating strategies as in the first league)

• OTFT was a very close 2nd.
• ADEPT and other Godfather variants ranked as the 2nd group strategy.

8.2.3. 2005 competition, league 1 (standard IPD rules, with 192 participating strategies)

• CosaNostra Godfather was overall winner, with 20 CosaNostra Hitmen participating in the CosaNostra group strategy.
• OTFT did not participate; it remains unclear why.
• Our StealthCollusion group strategy member LORD was placed 5th, the collusion again apparently being undetected by the organizers.

8.2.4. 2005 competition, league 4 (standard IPD rules, but only non-group, individual strategies were allowed to participate; 50 participating strategies)

OTFT was a very close 2nd. Detailed analysis of results initially suggested that the first placed strategy APavlov might have been a member of a stealth colluding group strategy. This later turned out most likely not to be true. However, our most likely mistaken analysis of some strategies that seemed to be involved illustrates how difficult it can be to clearly differentiate between stealth collusion strategies and strategies that only appear to behave as colluding strategies, seemingly showing a cooperative behaviour that in fact emerges randomly among strategies that actually are not consciously cooperating with each other. A more detailed analysis follows in the discussion below.

8.2.5. Analysis of OmegaTitForTat’s (OTFT) performance

In the following, we review the performance of our single-player, individual OTFT strategy in more detail. In the first league of the 2004 competition, which was intended to be a replay of the famous first iterated prisoner’s dilemma competition organized by Robert Axelrod in 1984 [Axelrod (1984)], our OTFT strategy was arguably placed second together with the default GRIM strategy out of a total of 223 participating strategies. Actually OTFT was placed third after the GRIM strategy, GRIM leading by a mere 0.007% points. However, this lead was later seriously put into
question by the fact that GRIM on average had played 0.92% more games than OTFT in the tournament, as pointed out by Abraham Heifets in an email sent to the organizers on March 29, 2005, which the organizers kindly forwarded to us. More rounds obviously add to the score, so this difference was significant. When results are scaled to reflect the difference, OTFT would have been placed as the first non-group strategy before GRIM, with an estimated payoff of 101,530 points compared to the 100,611 of GRIM. OTFT and GRIM were clearly outperformed only by the winning strategies, all members of the same stealth colluding group of strategies sent in by Gopal Ramchurn. In the following we will refer to Ramchurn’s group as the STAR group strategy. More on group strategies against individual strategies will follow in Section 0. Let us just remark here that we will show in Section 0 that group strategies can perform arbitrarily better than non-group, single-player strategies. This basically means that OTFT was the best single-player strategy. Moreover, the good results of GRIM are very likely due to the tournament having been dominated by the STAR group strategy, with its individual group members accounting for more than 50% of the participating strategies. GRIM scores best against STAR group members that always defect (ALLD) against members outside their group in order to damage competing strategies, because GRIM has a very short (one turn) interval of determination before it switches to ALLD itself. OTFT loses some points in comparison because of interspaced recovery trials during which OTFT cooperates instead of continuing to defect. However, in Section 0 we show that, with and without a high percentage of ALLD strategies, OTFT is robustly superior to GRIM.
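The scaling argument used earlier in this subsection can be checked with the scores from Table 8.1a. The following Java sketch is a hypothetical illustration (the class and method names are assumptions, not part of the tournament software). It assumes that a score grows proportionally with the number of games played, so that OTFT's score is simply multiplied by 1.0092.

```java
class ScoreScaling {
    // Scales a score as if the player had played extraGamesPercent more games,
    // assuming the score is proportional to the number of games played.
    static long scale(long score, double extraGamesPercent) {
        return Math.round(score * (1.0 + extraGamesPercent / 100.0));
    }

    // Lead of one score over another, in percent of the lower score.
    static double leadPercent(long leader, long follower) {
        return 100.0 * (leader - follower) / follower;
    }

    public static void main(String[] args) {
        long grim = 100_611; // GRIM, Table 8.1a
        long otft = 100_604; // OTFT, Table 8.1a
        System.out.println(leadPercent(grim, otft)); // roughly 0.007
        System.out.println(scale(otft, 0.92));       // 101530
    }
}
```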
In the second league of the 2004 competition, which was the league with a small probability of erroneous interpretation of the other player’s last move, OTFT was placed as the second best non-group, individual strategy, placed third after three members of Ramchurn’s STAR group and an individual strategy sent in by Colm O’Riordan [c]. [Footnote c: One of our reviewers learned from O’Riordan that this strategy is actually very similar to OTFT.] GRIM again ranked high but was slightly outperformed by OTFT, a result that was to be expected in the slightly randomized setting of this league. Miscommunication does happen in the real world, so this illustrates again that in a non-perfect environment an optimistic strategy like OTFT fares better than one with a pessimistic world-view such as GRIM. It also shows that OTFT was again among the
best single-player strategies, now also in an environment in which miscommunication happens inherently. For reasons that remain unclear to the authors, OTFT was not allowed to participate in the first and second leagues in the 2005 competition. However, OTFT achieved a second place in league number four in the 2005 competition, which was the league allowing participation of only one strategy by each team, thereby supposedly eliminating the participation of group strategies. The winner was the strategy APavlov sent in by Jia-Wei Li, outperforming our second placed OTFT by 1.2%.

8.2.6. The practical difficulty of detecting collusion

The small margin by which APavlov outperformed OTFT caused us to take a very close look at the tournament results of the single-player league. We first note that in the general results, there were strategies present which achieved a lower score than ALLC (always cooperates), RAND (randomly cooperates or defects), NEG (always plays the opposite from what the opponent played last, first move is random) and the other standard strategies usually ranking lowest in tournaments with only single-player strategies present. These scores are shown in Table 8.2. It takes quite an amount of ingenuity to achieve scores as low as the last three candidates. Each one scored even lower than standard RAND and NEG, and all the scores are within an interval below the variance introduced by the RAND strategy.

Table 8.2. Strategies having the lowest score in 2005’s league 4.

Rank  Player            Strategy  Score
39    (Standard)        ALLC      22,182
40    Oscar Alonso      IBA       22,054
41    Oliver Jackson    OJ        21,694
42    Bin Xiang         A1        19,586
43    Quek Han Yang     SPILA     19,518
44    (Standard)        ALLD      18,764
45    Kaname Narukawa   (noname)  18,592
46    (Standard)        RAND      18,153
47    (Standard)        NEG       17,176
48    Bernat Ricardo    ALT       16,934
49    Yusuke Nojima     (noname)  16,383
50    Yannis Aikater    TCO3      16,228

Table 8.3. Collusion suspects: TCO3 and ALT cooperating with APavlov.

TCO3  C D D C C D D C C C C C C...
ALT   C D D C C D D C C C C C C...
APav  C C D D C C D D D D D D D...

Table 8.4. Collusion suspects: TCO3 and ALT cooperating with OTFT.

TCO3  C D D C C D D C C D D C C...
ALT   C D D C C D D C C D D C C...
OTFT  C C D D C C D D C C D D D...

Table 8.5. Collusion suspect: TCO3 showing TFT a cold shoulder.

TCO3  C D D C C D D C C D D C...
TFT   C C D D C C D D C C D D...

We initially suspected that the last three strategies represented part of a collusion strategy somebody tried to introduce into
the single-player league and therefore took a closer look at their style of play with respect to standard strategies and to player strategies, including the winning strategy APavlov and our OTFT strategy. Our analysis suggested that two suspect strategies cooperated with the winning APavlov strategy (compare Table 8.3) but also with our OTFT strategy (compare Table 8.4), raising their score by cooperating in the face of continuous defection. On the other hand, the suspect strategies did not exhibit this kind of cooperative behaviour against defection by standard strategies (compare Table 8.5). Obviously, a trigger sequence of moves similar to the protocol exchange employed by our CosaNostra strategy (see 1.3.2) caused the switch to an exploitable ALLC behaviour in the strategies analysed above. Now, we cannot speak for the authors of APavlov, but we swear on our honour and solemnly declare [d] that we did not consciously implement collusion features into OTFT, nor did we introduce any of the suspect strategies above ourselves. [Footnote d: One reviewer suggested that swearing on our honour and solemnly declaring this would not be necessary. However, since this chapter involves so many aspects of stealth collusion, we felt it would help to make sure that readers would trust us that OTFT was not involved in any collusion.] Both OTFT and APavlov, if its name is any indicator of the type of algorithm used, are strategies that try to correct for occasional mistakes. Such strategies have generally been known to outperform TitForTat (see, for example, [Nowak and Sigmund (1993)]) and rank highly in single-player tournaments. In this case, the correction algorithm in both
strategies obviously triggered the exploitable behaviour in the collusion suspects, effectively “taking over someone else’s hitman” in the terminology of our CosaNostra collusion strategy (compare Section 0). We conclude that in the presence of strategies which exhibit exploitable behaviour based on very simple trigger mechanisms, collusion as a concept is essentially undetectable. It is not possible to denounce a strategy for using collusion if the behaviour triggering the collusion is entirely reasonable in the context of standard strategies playing to win. In IPD competitions in which cooperation and defection can be done in a gradual way, that is, when multiple choices and payoffs exist as in league 3 of the 2004 and 2005 competitions, this cooperation can be hidden with even more subtlety. In Section 0 we will show that in general deciding whether a set of strategies is involved in a collusion group is among the most difficult questions that theoretically can arise.

8.3. Details of Our Strategies

8.3.1. OmegaTitForTat, or Mr. Nice Guy meets the iterated prisoner’s dilemma

The OmegaTitForTat (OTFT) strategy is based on heuristics targeting several tournament situations which have been identified, by tests and statistical analysis, as being both common and damaging to conventional strategies for the IPD. In a tournament environment, certain types of strategy behaviour are very common both in standard strategies added to get a performance comparison base and in custom strategies designed to dominate. Several such types of behaviour have been identified, and solutions to optimize the interaction with them have been implemented in OTFT. Let us note that, while we constructed OTFT from scratch, similar forgiving strategies have been described in the literature; see, for example, [Nowak and Sigmund (1993); Beaufils, Delahaye, and Mathieu (1996); Tzafestas (2000); O’Riordan (2000)].

8.3.1.1. Suspicion

A common trait of many strategies, including the SuspiciousTitForTat (STFT) strategy from the standard set of strategies used in the tournament, is suspicion: The strategy starts by playing defect, or plays defect after a succession of mutual cooperation. Such a move can prove beneficial for a strategy if the opponent strategy does not immediately counter a defection; for example, TFTT (TitForTwoTat) would not react to occasional, singular defections, thus giving a suspicious strategy a clear advantage. Note that suspicious strategies do not need to keep defecting after an initial defect: The STFT strategy, for example, simply plays standard TFT but starts each game with a defection. The problem many strategies encounter when facing suspicion is that of deadlock: If a strategy is programmed to counter defection in a TitForTat manner, and the suspicious strategy itself is programmed the same way, one suspicious defection can cause a mutual exchange of defects between two strategies which could cooperate perfectly if only one player would once forgive a defection. In general, we define deadlock as any situation where a succession of defects is being played by two strategies because of an out-of-phase TitForTat behaviour, as shown in Table 8.6.

Table 8.6. Deadlock between TFT and STFT.

TFT   C D C D C D C D...
STFT  D C D C D C D C...

OTFT counters deadlocks by forgiving a certain number of defections when a strategy has cooperated for a long time. OTFT starts by cooperating and then tracks the number of cooperations encountered. The initial idea was that for a certain amount of cooperation, a certain number of defections would be forgivable. The final OTFT algorithm incorporates this idea, together with other adaptations, into a single strategy as described below.

8.3.1.2. Randomness

Randomness, in the form of cooperative and defective moves varying without any discernible pattern, can be introduced by simulated noise in the command transmission, as used in several specific tournament environments, or it can be a trait of a strategy as such. Strategies trying to gain by finding a cooperative base with an opponent are faced with a difficult problem when the opponent is acting erratically: Finding a cooperative base requires some small sacrifice (for example, STFT and TFTT, in contrast to TFT, can cooperate for the whole game because TFTT sacrifices the initial defection). However, a random strategy is highly likely not to stick to a cooperative behaviour, resulting in the sacrifice cost mounting and damaging the score of an otherwise successful, cooperative strategy.
As a consequence, randomness must be detected in an opponent’s behaviour, and countered appropriately: by playing ALLD (full defect). There is no way to gain from mutual cooperation if an opponent plays completely randomly. Nevertheless, a strategy can at least deny such an opponent gains by playing defection itself, and moreover, thereby profit from defecting on any unrelated cooperative moves from the random strategy. OTFT counters randomness by playing ALLD when a strategy has exhibited a certain amount of random behaviour. The initial idea was to cut losses against the standard RAND strategy. However, in the final OTFT algorithm, the random detection routine was merged with other traits into a single strategy described below.

8.3.1.3. Exploits

Many strategies can be devised that try to exploit forgiving behaviour. For example, a simple strategy could be designed to check once if it is playing against any type of TFTT opponent, who forgives one defection “for free”, and to exploit such behaviour. Table 8.7 shows the result of such an exploit strategy at work on TFTT. Fully countering such exploits leads to a strategy similar to PAV: Constant checks would ensure that an opponent does not gain more from the current play mode than oneself. When devising a scheme to implement such checks, a solution was found which incorporates the above mentioned problems of randomness and suspicion. The result is the final version of the OTFT algorithm.

Table 8.7. A strategy exploiting TFTT.

EXPL  D D C D D C D D C D D...
TFTT  C C D C C D C C D C C...

8.3.1.4. OTFT

The OTFT algorithm starts by playing C, then TFT. It then maintains a variable noting the behaviour of the opponent according to typical situations as described above: For every time the opponent’s move differs from the opponent’s previous move, and for every time the opponent’s move differs from OTFT’s previous move, the variable is increased. For every time the opponent cooperated with OTFT, the variable is decreased. These rules allow tracking of randomness and exploits: Based on mutual cooperation
as the mutually most beneficial case, each change of move of the opponent indicates either some kind of randomness or an attempt to exploit the TFT behaviour used by OTFT. When the so-called exploit tracker in OTFT reaches a certain value, the algorithm switches to all-out defection ALLD to cut losses against an opponent repeatedly breaking cooperation. A second mechanism is at work and allows recovery from deadlocks as described above. When OTFT plays standard TFT, it is vulnerable to deadlock, so independently of the exploit tracker described, a second variable counts the number of times the opponent’s move was the opposite of OTFT’s move. If this so-called deadlock tracker encounters a certain number of exchanges of C and D, an additional C is played and the deadlock counter is reset. As a consequence, OTFT is able to recover from deadlocks occurring anywhere in a given exchange of moves.

8.3.1.5. Examples

Table 8.8 demonstrates how the desired avoidance of deadlocks is achieved in a game played by OTFT versus STFT.

Table 8.8. Deadlock resolved by OTFT.

OTFT  C D C D C C C C C...
STFT  D C D C D C C C C...

Table 8.9 shows how OTFT counters random strategies with all-out defection after a certain amount of random behaviour has been detected.

Table 8.9. Random recognized and countered by OTFT.

OTFT  C C D C D C C D C C C D D D D...
RAND  C D C D D D C C C D D C D C Cs&Ds...

8.3.1.6. OTFT’s behaviour laid bare

In the end, there is no more detailed and exact description of OTFT’s inner workings than the source code of its implementation. Luckily, the code is short and easy to understand. We therefore reproduce it in Table 8.10, leaving aside only the general parts required for the IPDLX framework that was used in the competitions [e]. [Footnote e: For details of IPDLX see http://www.prisoners-dilemma.com/competition.html#java]
W. Slany & W. Kienreich
Table 8.10. Main parts of OTFT’s source code.

    private static final int DEADLOCK_THRESHOLD = 3;
    private static final int RANDOMNESS_THRESHOLD = 8;

    public void reset()
    {
      super.reset();
      deadlockCounter = 0;
      randomnessMeasure = 0;
      opponentMove = COOPERATE;
      opponentsPreviousMove = COOPERATE;
      myPreviousMove = COOPERATE;
    }

    public double getMove()
    {
      if( deadlockCounter >= DEADLOCK_THRESHOLD )
      { // OTFT assumes a deadlock and tries to break it by cooperating
        myReply = COOPERATE;
        // ... twice ...
        if( deadlockCounter == DEADLOCK_THRESHOLD )
          deadlockCounter = DEADLOCK_THRESHOLD + 1;
        else // ... and then assumes the deadlock has been broken
          deadlockCounter = 0;
      }
      else // OTFT assumes that there is no deadlock (yet)
      {
        // OTFT assesses the randomness of the opponent's behaviour
        if( opponentMove == COOPERATE && opponentsPreviousMove == COOPERATE )
          randomnessMeasure--;
        if( opponentMove != opponentsPreviousMove )
          randomnessMeasure++;
        if( opponentMove != myPreviousMove )
          randomnessMeasure++;
        if( randomnessMeasure >= RANDOMNESS_THRESHOLD )
        { // OTFT switches to ALLD (randomnessMeasure can only increase)
          myReply = DEFECT;
        }
        else // OTFT assumes the opponent is not (yet) behaving randomly
        {
          // OTFT behaves like TFT ...
          myReply = opponentMove;
          // ... but checks whether a deadlock situation seems to arise
          if( opponentMove != opponentsPreviousMove )
            deadlockCounter++;
          else // OTFT recognizes that there is no sign of a deadlock
            deadlockCounter = 0;
        }
      }
      // OTFT memorizes the current moves for the next round
      opponentsPreviousMove = opponentMove;
      myPreviousMove = myReply;
      return(super.getFinalMove(myReply));
    }
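To see the listing of Table 8.10 at work without the IPDLX framework, the following standalone sketch replays OTFT against STFT. It is only an illustration under stated assumptions: the class `OtftSketch`, the method `playAgainstStft`, and the use of the characters 'C' and 'D' for moves are hypothetical and not part of IPDLX. It is assumed that the framework hands the opponent's last move to the strategy before each call, that this move counts as C before the first round (as in `reset()`), and that STFT defects first and then copies its opponent's previous move.

```java
/** Standalone sketch of the OTFT logic of Table 8.10; not the IPDLX implementation. */
final class OtftSketch {
    static final char C = 'C', D = 'D';
    static final int DEADLOCK_THRESHOLD = 3;
    static final int RANDOMNESS_THRESHOLD = 8;

    // State mirrors the fields reset in Table 8.10.
    private int deadlockCounter = 0;
    private int randomnessMeasure = 0;
    private char opponentsPreviousMove = C;
    private char myPreviousMove = C;

    /** Assumption: the caller passes the opponent's last move (C before round one). */
    char getMove(char opponentMove) {
        char myReply;
        if (deadlockCounter >= DEADLOCK_THRESHOLD) {
            // Presumed deadlock: cooperate twice, then reset the counter.
            myReply = C;
            deadlockCounter = (deadlockCounter == DEADLOCK_THRESHOLD)
                    ? DEADLOCK_THRESHOLD + 1 : 0;
        } else {
            if (opponentMove == C && opponentsPreviousMove == C) randomnessMeasure--;
            if (opponentMove != opponentsPreviousMove) randomnessMeasure++;
            if (opponentMove != myPreviousMove) randomnessMeasure++;
            if (randomnessMeasure >= RANDOMNESS_THRESHOLD) {
                myReply = D; // ALLD mode
            } else {
                myReply = opponentMove; // plain TFT
                if (opponentMove != opponentsPreviousMove) deadlockCounter++;
                else deadlockCounter = 0;
            }
        }
        opponentsPreviousMove = opponentMove;
        myPreviousMove = myReply;
        return myReply;
    }

    /** Plays OTFT against STFT and returns { OTFT's moves, STFT's moves }. */
    static String[] playAgainstStft(int rounds) {
        OtftSketch otft = new OtftSketch();
        StringBuilder mine = new StringBuilder();
        StringBuilder theirs = new StringBuilder();
        char lastOtft = C, lastStft = C;
        for (int r = 0; r < rounds; r++) {
            char o = otft.getMove(lastStft);
            char s = (r == 0) ? D : lastOtft; // STFT: defect first, then copy
            mine.append(o);
            theirs.append(s);
            lastOtft = o;
            lastStft = s;
        }
        return new String[] { mine.toString(), theirs.toString() };
    }

    public static void main(String[] args) {
        String[] moves = playAgainstStft(9);
        System.out.println("OTFT " + moves[0]); // prints "OTFT CDCDCCCCC"
        System.out.println("STFT " + moves[1]); // prints "STFT DCDCDCCCC"
    }
}
```

Under these assumptions the first nine moves agree with Table 8.8: the deadlock counter reaches its threshold after three alternations, OTFT cooperates twice, and mutual cooperation follows.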
On some winning strategies for the Iterated Prisoner’s Dilemma
8.3.2. Our group strategies

8.3.2.1. The CosaNostra group strategy, or Organized crime meets the iterated prisoner’s dilemma

The CosaNostra strategy is based on the concept of one strategy, denoted Godfather, exploiting another strategy, denoted Hitman, to achieve a higher total score in an IPD tournament scenario. In this context, exploitation denotes the ability to deliberately extract cooperative moves from a strategy while playing defect, a situation yielding high payoff for the exploiting strategy. Obviously, most opponents would avoid such a situation and cease to cooperate with an opponent who has repeatedly defected in the past. Hence, a special opponent strategy, the Hitman, is designed to provide this kind of behaviour, and is introduced into the tournament in as large a number as possible.

A Hitman strategy which indiscriminately plays cooperation, however, is of no use for a Godfather. In mimicking the ALLC standard strategy, such a Hitman would benefit every other strategy in the tournament that is able to recognize and exploit ALLC. Consequently, the Hitman must be able to conditionally exhibit two types of behaviour:

• By default, Hitman must play a strategy which does not benefit other strategies and which is not easily exploitable. Extending the idea, Hitman should play the strategy most damaging to other strategies in order to lower their score. Such a strategy is simply ALLD.
• When confronted with a certain stimulus, Hitman must switch to the cooperative behaviour defined above.

Complementing the Hitman, Godfather should by default play the best standard strategy available against any non-Hitman and switch to ALLD when it encounters a Hitman, relying on the Hitman’s unconditional cooperation to raise its score. In our case, the Godfather plays OTFT when not playing against a Hitman. The critical part of CosaNostra is the identification of opponents: the way in which Godfather detects a Hitman, and a Hitman detects a Godfather.
We have employed sequences of Defections and Cooperations to implement a bit-wise protocol which both sides use to mutually establish, and check, identities (in case of multiple choices and multiple payoffs, this protocol could be made very short, depending on the number of choices, possibly to one exchange). If Godfather is aware he is not facing a Hitman,
he must switch to a good non-group strategy like OTFT or GRIM, and if Hitman is aware it is not facing a Godfather, it must switch to the ALLD strategy, strafing all strategies that are not in its group. This occurs in the following cases:

• “Dishonourable behaviour”: a presumed Hitman defecting or a presumed Godfather cooperating outside protocol exchanges.
• “Protocol breach”: either side not following the rules during protocol exchanges.

Putting the rules in other words, the CosaNostra strategy is based on a Godfather which can be sure that the next n moves of its opponent will be cooperation, because it identifies the opponent through a simple exchange protocol.

A problematic aspect of such a strategy is the notion of Godfather or Hitman being “taken over”: both are prone to wrongly identify an opponent as their strategic counterpart and then either grant it an advantage (in the case of Hitman) or depend on predefined behaviour (in the case of Godfather), and thus lower their score.

The effect of Godfather being taken over: Godfather thinks it is exploiting a Hitman and plays DEFECT, but the opponent plays DEFECT too, so Godfather gets the lowest score a defecting player can receive for the exchange. This situation is easy to counter: if Godfather detects any defection when it believes it is exploiting a Hitman, it assumes a takeover and switches to its good non-group strategy like OTFT or GRIM.

The effect of a Hitman being taken over is more subtle: Hitman thinks it is being exploited by Godfather and plays COOPERATE, a behaviour which benefits the opponent. Countering this situation is complex: a first solution would be for Hitman to start playing ALLD as soon as it detects a cooperative move outside the defined protocol exchanges (Hitman then assumes that it is being exploited). 
But another strategy could still play mostly DEFECT and sometimes cooperate, thus fooling a Hitman: for example, a random opponent strategy with 1/10 of all its moves being cooperative could by chance emulate, at least for some time, the protocol exchange that takes place when an interval of fixed length ten is used by Hitman (and Godfather). CosaNostra solves the takeover problem by varying the intervals between protocol exchanges, with the time between exchanges (the number of turns) in the next interval being communicated within the protocol exchange. Godfather and Hitman both have an internal counter which tells them when to synchronize by executing a protocol exchange, and check for
the other strategy truly being part of CosaNostra. Godfather communicates to the Hitman a modification to the interval during each handshake. Thus, no other strategy is likely to take over a Hitman or manipulate a Godfather.

The communication protocol contains a 1-bit signature plus a 2-bit sequence coding the length of the next interval, as depicted in Table 8.11 (the countdown index gives the number of steps until the start of the next interval).

Table 8.11. CosaNostra communication protocol.
CountdownIndex | Godfather move | Godfather description | Hitman move | Hitman description
3 | C | Godfather plays a single signature COOPERATE | D | Hitman plays a single signature DEFECT
2 | D/C | Godfather plays first message bit | C | Hitman COOPERATEs to minimize protocol loss
1 | D/C | Godfather plays second message bit | C | Hitman stores first message bit, COOPERATEs to minimize protocol loss
Countdown restart, e.g., 26 | D | Godfather starts new interval and plays DEFECT from now on until the next handshake, when CountdownIndex == 3 again | C | Hitman stores second message bit, starts new interval, plays COOPERATE from now on until the next handshake, when CountdownIndex == 3 again

A sample exchange then looks as illustrated in Table 8.12.

Table 8.12. Synchronization between Godfather and Hitman.
#          ...  6  5  4  3  2  1  19  ...
Godfather  D    D  D  D  C  C  D  D   D...
Hitman     C    C  C  C  D  C  C  C   C...

In this example an offset of 2 is encoded (CD = 10 in binary, reading C as 1 and D as 0). Internally, the offset is used to select an interval length from a table roughly adjusted to match the number of iterations in the tournament, in our implementation as follows: DD → 26, DC → 15, CD → 19, and CC → 8. Thus, after sending DD, Godfather will defect 23 times, then start the handshake again with a single COOPERATE followed by the next two bits encoding the length of the following interval, and so on. On its side, the Hitman will expect 23 DEFECTs and then one COOPERATE from its opponent in order to continue to believe that the opponent is a Godfather and thus allowed to exploit the Hitman.

Note that it would be easy to make the communication protocol more efficient by increasing the length of each new interval according to a function known only to the Hitman and his Godfather, e.g., by adding a number proportional to the number of intervals completed so far. Indeed, the likelihood that a non-Godfather strategy can by coincidence continually fool a Hitman into believing he is serving his Godfather, while instead allowing the non-Godfather to take advantage of the Hitman, decreases very quickly with each successful exchange. Conversely, the longer the opponent of Hitman follows the Godfather’s behaviour, the more likely it is that the opponent really is his Godfather, and so it becomes safer and safer for the Hitman to let the opponent abuse him for longer and longer intervals.

The bootstrap for the two strategies is that the Hitman starts with a defection and the Godfather with a cooperation, mimicking step 3 of Table 8.11. The initial cooperation is important for Godfather’s standard strategy: to achieve a good score against certain standard opponents (GRIM being an extreme example), it is necessary to start off with cooperation.

Godfather’s protocol loss per interval is at least 5 points and at most 9 points. A base loss of 5 for the single signature move is inevitable, because Godfather cooperates against a defecting Hitman instead of defecting against a cooperating one. Then, at worst, Godfather sends CC while the Hitman cooperates to minimize the loss, yielding 3 + 3 = 6 points instead of the 5 + 5 = 10 points of the best case, in which Godfather sends two defections as protocol bits; this adds a further loss of 4.

The CosaNostra group strategies have not been designed to fare well in a noisy environment such as league 2 of the 2004 competition, though in practice they did quite well (see Section 0). 
Note that it would not be very difficult to make them more noise resistant by introducing some error correcting mechanism such as, e.g., allowing a certain number of mistakes (or unexpected replies but explainable as answers to possibly wrongly communicated signals from oneself) of the other player until deciding that he is not part of one’s group.
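The protocol-loss bounds of 5 and 9 points quoted above can be recomputed for every message-bit combination. The sketch below is illustrative only: the class and method names are hypothetical, T = 5, R = 3 and P = 1 are the competition values stated later in the chapter, and the sucker payoff S = 0 is an assumption (it is the value implied by a base loss of 5).

```java
/** Sketch: Godfather's payoff loss during one CosaNostra handshake. */
final class ProtocolLossSketch {
    // T, R, P as quoted for the 2004 and 2005 competitions; S = 0 is assumed.
    static final int T = 5, R = 3, P = 1, S = 0;

    /** Payoff of the first player for one pair of moves ('C' or 'D'). */
    static int payoff(char mine, char theirs) {
        if (mine == 'C') return theirs == 'C' ? R : S;
        return theirs == 'C' ? T : P;
    }

    /** Loss compared with three rounds of plain exploitation (D against C). */
    static int handshakeLoss(char firstBit, char secondBit) {
        int exploited = 3 * payoff('D', 'C');
        int earned = payoff('C', 'D')       // signature: Godfather C, Hitman D
                   + payoff(firstBit, 'C')  // Hitman cooperates on both message bits
                   + payoff(secondBit, 'C');
        return exploited - earned;
    }

    public static void main(String[] args) {
        for (String bits : new String[] { "DD", "DC", "CD", "CC" }) {
            int loss = handshakeLoss(bits.charAt(0), bits.charAt(1));
            System.out.println(bits + " -> loss " + loss);
        }
    }
}
```

With these values the loss is 5 for DD, 7 for DC and CD, and 9 for CC, so longer intervals (more D bits) are also the cheaper ones to announce.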
Table 8.13. Main parts of CosaNostra Godfather’s source code.
    >> private variables and constants like in Table 8.10
    [...]
    >> Content of OTFT's reset() method from Table 8.10
    [...] SYNC_GF_COOPERATES )
            myReply = DEFECT; // Godfather thus exploits Hitman
          else if( countdownIndex == SYNC_GF_COOPERATES )
          {
            myReply = COOPERATE; // COOPERATE once to synchronize
            nextCountdownRestartValue = 9; // GF starts to prepare
          }
          else if( countdownIndex == GF_SENDS_FIRST_MESSAGE_BIT )
          {
            myReply = (Math.random()>0.5) ? DEFECT : COOPERATE;
            nextCountdownRestartValue += (myReply==DEFECT)?7:0;
          }
          else // if( countdownIndex == GF_SENDS_SECOND_MESSAGE_BIT )
          {
            myReply = (Math.random()>0.5) ? DEFECT : COOPERATE;
            nextCountdownRestartValue += (myReply==DEFECT)?11:0;
            countdownIndex = nextCountdownRestartValue; // restart
          }
          countdownIndex--;
        }
      }
      else // Opponent surely is no Hitman and thus Godfather plays OTFT
        >> Content of OTFT's getMove() method from Table 8.10
    [...]

Table 8.14. Main parts of CosaNostra Hitman’s source code.

    [...] SYNC_GF_REPLIES_WITH_COOPERATE && opponentMove == COOPERATE ) )
        { // Yes, so the opponent cannot be a Godfather, so Hitman ...
          myReply = DEFECT; // ... defects and switches...
          opponentPlayedSoFarLikeGodfather = false; // ... to ALLD
        }
        else // No, the opponent again played like a Godfather.
        {
          if( countdownIndex != SYNC_HM_DEFECTS )
          {
            myReply = COOPERATE; // Godfather thus can exploit Hitman
            if( countdownIndex == FIRST_MESSAGE_BIT_FROM_GF )
              nextCountdownRestartValue += (opponentMove==DEFECT)?7:0;
            else if( countdownIndex == SECOND_MESSAGE_BIT_FROM_GF )
            {
              nextCountdownRestartValue += (opponentMove==DEFECT)?11:0;
              countdownIndex = nextCountdownRestartValue - 1; // restart
            }
          }
          else // if( countdownIndex == SYNC_HM_DEFECTS )
          {
            myReply = DEFECT; // Hitman DEFECTs once to synchronize
            nextCountdownRestartValue = 9; // HM starts to prepare
          }
          countdownIndex--;
        }
      }
      else // Opponent surely is no Godfather and thus Hitman ...
        myReply = DEFECT; // ... plays ALLD
      return(super.getFinalMove(myReply));
    }
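The constants 9, 7 and 11 in the listing above are what produce the interval table DD → 26, DC → 15, CD → 19, CC → 8 given in the protocol description. The following sketch recomputes that table. The class and method names are hypothetical, and the value 3 for the countdown index of the signature move is taken from Table 8.11 rather than from the listing, where the constant's definition is not shown.

```java
/** Sketch: how the two message bits select the next countdown value. */
final class IntervalCodeSketch {
    static final int BASE = 9;              // set when the signature move is played
    static final int FIRST_BIT_BONUS = 7;   // added if the first message bit is DEFECT
    static final int SECOND_BIT_BONUS = 11; // added if the second message bit is DEFECT
    static final int SYNC_INDEX = 3;        // assumed from Table 8.11

    /** Countdown index of Godfather's first move after the handshake. */
    static int restartIndex(char firstBit, char secondBit) {
        int value = BASE;
        if (firstBit == 'D') value += FIRST_BIT_BONUS;
        if (secondBit == 'D') value += SECOND_BIT_BONUS;
        return value - 1; // the listing decrements the counter once after the restart
    }

    /** Number of DEFECT moves before the next signature COOPERATE. */
    static int defectsUntilNextHandshake(char firstBit, char secondBit) {
        return restartIndex(firstBit, secondBit) - SYNC_INDEX;
    }

    public static void main(String[] args) {
        for (String bits : new String[] { "DD", "DC", "CD", "CC" }) {
            char a = bits.charAt(0), b = bits.charAt(1);
            System.out.println(bits + " -> restart at " + restartIndex(a, b)
                    + ", defects before next handshake: "
                    + defectsUntilNextHandshake(a, b));
        }
    }
}
```

For DD this gives a restart value of 26 and 23 defections before the next handshake, matching the figures in the text.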
8.3.2.2. The gory details of the CosaNostra group strategy

As in OTFT’s case, there is no more detailed and exact description of the CosaNostra group strategy’s inner workings than the source code of its implementation. Again, the code is short and easy to understand. We therefore reproduce it in Table 8.13 for the Godfather and Table 8.14 for the Hitman strategy, again leaving aside only the general parts required for the IPDLX framework that was used in the competitions (see footnote e). As Godfather uses the OTFT strategy against strategies other than Hitman, the part of Godfather’s code that is identical to OTFT’s code in Table 8.10 is not repeated but referred to.

8.3.2.3. TheEmperorAndHisCloneWarriors

This group strategy is based on the same principles as the CosaNostra group strategy, with one emperor playing the role of the Godfather and his clone warriors playing the Hitman strategy in large numbers (the number being the major difference). Each clone strategy has an individual number in its name, since the submission procedure of the competition required each individual strategy to have a different name. After enquiring via email, we had trusted the organizers that open group strategies would be allowed in the 2004 competition, and accordingly had submitted the EmperorAndHisClones strategy with altogether 11,110 individually numbered clones as one group strategy, as it was not clear how large groups would be permitted to be. For reasons that, especially in hindsight, are not entirely clear to us, the organizers decided to let only one clone (together with the emperor) participate in the competitions. We are still perplexed with respect to this point. In particular, we were initially prepared to submit a much larger collusion group within the CosaNostra group strategy but, after hearing that groups would be allowed, decided to submit only one such collusion strategy as a proof of concept, counting on the fact that our clone army would evaporate all competitors.

8.3.2.4. The StealthCollusion group strategy

As a proof of concept (see previous section), we submitted under the name of Constantin Ionescu a group strategy that cooperates with our CosaNostra group strategy, though not perfectly so. The mail with which we submitted the strategy was written on purpose with some typos, a few grammatical glitches, and sloppy formatting, all in order to add to the look
of authenticity of the submission by distracting from the real intention. It was sent from a free mail account hosted in Romania, the sender claiming to be a Student of informatica from the technical school of Timisoara. As expected, the deception went undetected.

8.4. Analysis of the Performance of the Strategies

8.4.1. OmegaTitForTat

Table 8.15 shows how OTFT clearly dominates a standard tournament with strategies commonly used as test cases. Table 8.16 illustrates how OTFT dominates in harsh environments where a lot of unconditional defection occurs. Table 8.17 demonstrates OTFT’s dominance in random environments. The slight lead of GRIM in league 4 of the 2005 competition was due to the higher number of games GRIM was allowed to play, as we already explained in Section 0.

Table 8.15. OTFT in a standard environment, standard strategy sample, 200 turns.
Rank  Strategy  Score
1     OTFT      5,978
2     GRIM      5,538
3     TFT       5,180
4     TFTT      5,134
5     ALLC      4,515
6     RAND      4,062
7     STFT      4,018
8     ALLD      4,016
9     NEG       3,726

Table 8.16. OTFT in a harsh environment, 50% ALLD opponents, 200 turns.
Rank  Strategy  Score
1     OTFT      7,358
2     GRIM      6,959
3     TFT       6,577
4     TFTT      6,524
5     ALLD      5,512
6     ALLD      5,464
7     ALLD      5,452
8     ALLD      5,428
9     ALLD      5,428
10    ALLD      5,416
11    STFT      5,415
12    ALLD      5,404
13    ALLD      5,400
14    RAND      4,658
15    ALLC      4,530
16    NEG       3,728

Table 8.17. OTFT in a random environment with 50% RAND opponents, 200 turns.
Rank  Strategy  Score
1     OTFT      10,114
2     GRIM      9,867
3     TFT       8,338
4     ALLD      8,236
5     TFTT      7,806
6     RAND      7,357
7     RAND      7,212
8     RAND      7,195
9     STFT      7,192
10    RAND      7,150
11    RAND      7,150
12    RAND      7,099
13    RAND      7,099
14    RAND      7,082
15    NEG       6,947
16    ALLC      6,624

8.4.2. Group strategies

In this section we study general characteristics of important possible group strategies. We first classify and name group strategy classes as follows:

• Democracy during peace (DP): All group members are equals and treat each other nicely by always cooperating, and play TFT or a better strategy such as OTFT or GRIM outside of their community.
• Democracy at war (DW): All group members are equals and treat each other nicely; however, they continually defect (ALLD) against all other strategies (after a short recognition interval).
• Empire during peace (EP): There is one special group member, the emperor, which is allowed to take advantage of all other members of his empire by playing defect while they cooperate with him. The subjects otherwise cooperate among each other, and play TFT or a better strategy such as OTFT or GRIM outside their community, after a short recognition interval.
• Empire at war (EW): Again, the emperor is allowed to take advantage of all other members of his empire by playing defect while they cooperate with him. Again, the subjects otherwise cooperate among each other, but now they play, after a short recognition interval, ALLD against all other strategies.

In the following, we will show that groups can perform arbitrarily better than individual strategies, and that, under equal group size, EW groups can achieve arbitrarily higher payoffs (for the emperor) than EP groups, and that EP groups can achieve arbitrarily higher payoffs (for the emperor) than members of a DP group, which can achieve arbitrarily higher payoffs than members of a DW group. When group sizes vary, we show that even the weak DW group members can achieve arbitrarily higher payoffs than the emperor of a competing EW group by sheer numerical superiority.

First some preliminaries: we know that the payoff values observe the relations S < P < R < T and 2R > T + S of Section 0. Let us assume in the following that the group in the democracy variants and the group of subjects in the empire variants are of size m (for members), and that there are altogether n players (so m < n), which play i iterations during the IPD competition. We further assume that:

• The best single-player (non-group) strategy IOPT (for individual optimal strategy) achieves payoff X · i after i iterations.
• The emperor strategy achieves payoff E · i after i iterations.
• The individual members (or subjects) achieve payoff M · i after i iterations.
• The loss due to recognition of members of the same group is negligible due to the size of i.
• The emperor always plays the best non-group strategy against non-members of his group.
• During peace, individual members always play the best non-group strategy against non-members of their group.
• We assume that the best single-player strategy achieves an average payoff of A against other non-group strategies. The relations P < A < T are plausible, and a value of A near R is likely under the assumption that most individual strategies are similar to TFT. We therefore assume that A = R in the following unless stated otherwise. This implies that members of groups of type DP achieve more or less the same payoff as the best individual strategy IOPT, so we assume that $M_{DP} = X_{DP}$. This assumption simplifies the calculations in the following claim without sacrificing the fundamental relations between the different strategies.
• We also assume that most single-player strategies achieve an average score near A (and thus near R according to the previous assumption) when playing against other single-player strategies (so more or less all of them are optimal) and against DP, EP, or emperors of EW strategies (so they all play fairly against each other), and an average score of P when playing against members of groups at war. This would roughly correspond to the payoff achievable by OTFT and similar strategies. Again, this assumption simplifies the calculations in the following claim without sacrificing the fundamental relations between the different strategies.

Claim 8.1: Under the above assumptions and unless stated otherwise, the following relations hold:

(1) Members of groups of type DW can achieve larger payoffs than members of groups of type DP only when the DW members constitute more than 50% of the total population. When group sizes are equal and there are other strategies, DP has an advantage over DW. By increasing $i$, this advantage can be made arbitrarily large: $m_{DP} \ge m_{DW} \Rightarrow M_{DP} \cdot i \gg M_{DW} \cdot i$.
(2) Emperors from EP groups can achieve larger payoffs than members of groups of type DP (assuming equal group size). By increasing $i$, this advantage can be made arbitrarily large: $E_{EP} \cdot i \gg M_{DP} \cdot i$. Because of our assumption that $M_{DP} = X_{DP}$, the relation also holds for the best individual strategy IOPT, so emperors from EP groups can achieve arbitrarily larger payoffs than the best individual strategy.
(3) Emperors from EW groups can achieve larger payoffs than an emperor from an EP group (assuming equal group size). By increasing $i$, this advantage can be made arbitrarily large: $E_{EW} \cdot i \gg E_{EP} \cdot i$.
(4) When two groups of unequal size compete, then:
  (a) Independently of the group sizes and the values of S, P, R, and T, emperors (at war or during peace) fare better than democrats at peace. By increasing $i$, this advantage can be made arbitrarily large: $E_{E} \cdot i \gg M_{DP} \cdot i$.
  (b) Depending on the values of P, R, and T, and when $i$ increases, a democracy at war can fare arbitrarily better than an emperor (at war or during peace) when it is sufficiently large: $m_{DW} \gg m_{E} \Rightarrow M_{DW} \cdot i \gg E_{E} \cdot i$.
(5) We now assume that IOPT scores a higher average payoff value A against non-group strategies than the group strategies achieve against non-group strategies; let B with $B < A < T$ be the (bad) score that an emperor achieves on average against non-group strategies (we here deliberately drop the initial assumption that emperors play IOPT against non-group strategies). For the emperor to win nevertheless, despite playing worse in general than IOPT, the following inequalities must be satisfied: in case of EP, $m_{EP} > \frac{A - B}{T - B}\, n$, and in case of EW, $m_{EW} > \frac{A - B}{T - B - P + A}\, n$. Again, larger group size helps even when the strategies perform badly. We also see that as B approaches A, emperors can win against IOPT even with very few other group members.
(6) When two DW, EP, or EW groups of the same type but of different size and with different “efficiencies” compete (we here again deliberately drop the initial assumption that emperors play IOPT against non-group strategies), larger group size can compensate for less efficiency, and vice versa. Note that this is not true for DP groups.

Proof.

(1) $M_{DP} = R\,(n - m_{DW}) + P\,m_{DW}$ and $M_{DW} = R\,m_{DW} + P\,(n - m_{DW})$, assuming that no other group at war is present in the population. Thus, $M_{DW} > M_{DP}$ if and only if $m_{DW} > n/2$.
(2) $M_{DP} = R\,n$ and $E_{EP} = R\,(n - m) + T\,m$. Since $T > R$, $E_{EP} > M_{DP}$.
(3) $E_{EP} = R\,(n - 2m) + T\,m + P\,m$ and $E_{EW} = R\,(n - m) + T\,m$. Since $R > P$, $E_{EW} > E_{EP}$.
(4) For groups of unequal size:
  (a) It suffices to show that $E_{EP} > M_{DP}$ is independent of the size of the groups. $E_{EP} = R\,(n - m_{EP}) + T\,m_{EP}$ and $M_{DP} = R\,n$. Since $T > R$, $E_{EP} > M_{DP}$ holds independently of the size of the groups.
  (b) It suffices to show that there exists a large enough $m_{DW}$ such that $M_{DW} > E_{EW}$. $M_{DW} = R\,(n - m_{EW}) + P\,m_{EW}$ and $E_{EW} = R\,(n - m_{EW} - m_{DW}) + T\,m_{EW} + P\,m_{DW}$. Then $M_{DW} > E_{EW}$ if and only if $m_{DW} > \frac{T - P}{R - P}\, m_{EW}$. In the 2004 and 2005 competitions, P = 1, R = 3, and T = 5, so $m_{DW}$ would have to be larger than $2\,m_{EW}$. If only the two group strategies competed, the DW strategy would need more than 2/3 of the strategies in the whole population.
(5) In case of EP: $E_{EP} = B\,(n - m_{EP}) + T\,m_{EP}$ and $X_{EP} = A\,n$. Then $E_{EP} > X_{EP}$ if and only if $m_{EP} > \frac{A - B}{T - B}\, n$ (assuming that $T > A > B$). In case of EW: $E_{EW} = B\,(n - m_{EW}) + T\,m_{EW}$ and $X_{EW} = A\,(n - m_{EW}) + P\,m_{EW}$. Then $E_{EW} > X_{EW}$ if and only if $m_{EW} > \frac{A - B}{T - B - P + A}\, n$.
(6) We show it here for two unequal EW strategies, and note that similar arguments work for the cases EP and DW. Let $B_1$ and $B_2$ be the scores that the two emperors achieve on average against non-group strategies, with $B_1 < B_2$ and $|B_1 - B_2| = \alpha\,(T - P)$ with $0 < \alpha < 1$. Then $E_1 > E_2$ if and only if $m_1 > \frac{1 - \alpha}{1 + \alpha}\, m_2 + \frac{\alpha}{1 + \alpha}\, n$. Example: suppose $B_1 = 2.5$ and $B_2 = 2.6$, and as before P = 1 and T = 5, so that $\alpha = 0.025$ and $m_1 > 0.9512\, m_2 + 0.0244\, n$. Thus, when $m_2 = 20$ and $n = 100$, then $m_1$ must be at least 22 for the first emperor to triumph over his more efficient opponent.

8.4.3. Collusion detection is an undecidable problem

The practical difficulty of detecting collusion has been shown in previous parts of this chapter. This difficulty is matched from a theoretical point of view: we show below that the general question of whether two strategies whose source code is known, and which do not depend on any third-party source of randomness, are actually colluding or not is undecidable. Of course, it is even harder when the strategies are known only as black boxes, without access to their source code. 
Simpler arguments than ours would also do but we try in our approach to define the formal collusion problem as closely to the practical collusion detection problem as possible.
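Looking back at the proof of Claim 8.1 in Section 8.4.2, the two numerical thresholds quoted there (a democracy at war needing more than twice the size of a competing empire at war, and the less efficient emperor needing at least 22 subjects in the worked example) can be recomputed mechanically. The sketch below only restates the inequalities of items (4b) and (6); the class and method names are hypothetical.

```java
/** Sketch: group-size thresholds from the proof of Claim 8.1, items (4b) and (6). */
final class GroupSizeSketch {
    /** Item (4b): m_DW must exceed (T - P) / (R - P) * m_EW. */
    static double minDemocracyAtWarSize(double t, double r, double p, double mEw) {
        return (t - p) / (r - p) * mEw;
    }

    /** Item (6): alpha is defined by |B1 - B2| = alpha * (T - P). */
    static double alpha(double b1, double b2, double t, double p) {
        return Math.abs(b1 - b2) / (t - p);
    }

    /** Item (6): m1 must exceed (1 - a) / (1 + a) * m2 + a / (1 + a) * n. */
    static double minFirstEmpireSize(double a, double m2, double n) {
        return (1 - a) / (1 + a) * m2 + a / (1 + a) * n;
    }

    /** Smallest integer strictly greater than the given bound. */
    static int smallestIntegerAbove(double bound) {
        return (int) Math.floor(bound) + 1;
    }

    public static void main(String[] args) {
        // Competition payoffs T = 5, R = 3, P = 1 and an empire at war of size 10.
        System.out.println("m_DW must exceed " + minDemocracyAtWarSize(5, 3, 1, 10));

        // Worked example of item (6): B1 = 2.5, B2 = 2.6, m2 = 20, n = 100.
        double a = alpha(2.5, 2.6, 5, 1);
        double bound = minFirstEmpireSize(a, 20, 100);
        System.out.println("m_1 must be at least " + smallestIntegerAbove(bound));
    }
}
```

The bound in the second case evaluates to roughly 21.46, which is why 22 members are required.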
Recall the definition of the Halting problem: Is there a finite deterministic Turing machine H that is able to decide in finitely many steps whether an arbitrary finite deterministic Turing machine M will ultimately halt or not? The Halting problem was shown to be undecidable by Turing. Exact definitions of Turing machines and other notions appearing in this section, as well as references to the original sources, can easily be found in any theoretical computer science reference book such as Papadimitriou (1994).

Let the Simplified Collusion problem be formally defined as follows: Is there a deterministic Turing machine SC that is able to decide in finitely many steps whether, given two arbitrary integers i and j, two arbitrary finite deterministic Turing machines S1 and S2 will both output a sequence of at least length i + j characters (one character per tape position) composed only of the letters “C” and “D” on their two separate write-once output tapes T1 and T2, such that the j letters starting from tape position i + 1 will all be “D”s on T1 and all be “C”s on T2?

This simplistic definition covers many (but surely not all) real collusion cases. It also implies that strategies usually not considered to be consciously colluding, such as ALLD as S1 and ALLC as S2, would be classified as colluding in the Simplified Collusion terminology. However, ALLD really could be colluding with a large group of ALLC, whereas other, more cautious strategies like OTFT would not be able to take advantage of ALLC since they would never defect first. Thus, when a player or a group of players is able to introduce an ALLD and many ALLC into a competition, they could well be part of an intentional collusion, and the classification in the Simplified Collusion terminology would not be completely wrong. Ultimately, what really is collusion and what is not cannot be decided by formal methods alone. Nevertheless, we can at least show the following:

Claim 8.2: The Simplified Collusion problem is undecidable.

Proof. To formally show the undecidability of the Simplified Collusion problem, we follow the standard argument of reducing the Halting problem to it. Take any finite deterministic one-tape Turing machine M for which we want to know whether it halts or not. Without loss of generality, we assume that the tape of M is infinite in both directions, that each combination of the finitely many characters of the alphabet, which includes the letters “C” and “D”, and of the finitely many states of M defines exactly one of the finitely many rules of M, and that only the special state h stops M.
To decide whether M halts or not, we construct for each M two new Turing machines N1 and N2. N1, in comparison to M, is defined as follows: it has an additional, initially empty output tape T, an additional tape IJ that initially contains the numbers i and j in binary with the character “:” written between the two numbers, an additional state s, a constant number of other states needed to count down the two binary numbers and to do the other things described below, and almost the same set of rules as M, with only the following changes: each rule of M leading to h instead leads to state s, and there is a constant number of additional rules that ensure the following. When N1 enters state s, it counts down from i to zero, each time writing one letter “C” on T and then moving one position to the right on T, so that at the end a sequence of i “C”s is written on T. Then it counts down from j to zero, each time writing one letter “D” on T and then moving one position to the right on T, so that at the end a sequence of i “C”s followed by j “D”s is written on T. Then it changes to state h and halts. N2 is defined as follows: it simply writes i + j letters “C” to its output tape T. Finally, we choose the two numbers i and j, e.g., i = 1 and j = 1. It is clear that this construction always leads to a valid instance of the Simplified Collusion problem. It is also clear that the question posed in the Simplified Collusion problem has a positive answer for the constructed instance if and only if M halts. 
Now, if a finite deterministic Turing machine SC existed that was able to decide the Simplified Collusion problem in finitely many steps, then we could also decide the Halting problem in finitely many steps, as follows: we would define a new finite deterministic Turing machine R that, for any given Turing machine M (properly encoded for R on R’s input tape), first constructs an encoding of the corresponding finite deterministic Turing machines N1 and N2 with i = 1 and j = 1 as described above (this can surely be done in finitely many steps), then simulates SC applied to this instance of the Simplified Collusion problem, thereby deciding in finitely many steps whether it is a yes-instance or a no-instance (SC takes only finitely many steps, and simulating it on R is also feasible in finitely many steps), and finally returns this answer of SC as the answer of R, which must also be the answer to the question of whether M halts or not. So, if the Simplified Collusion problem were decidable, then the Halting problem would also be decidable. Since we know that the latter is not true, the former cannot be true either, and thus the Simplified Collusion problem is undecidable.
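It may help to separate what is easy from what is undecidable here. Checking the pattern of the Simplified Collusion problem on two finite output sequences that have already been produced is trivial, as the sketch below illustrates; the undecidability concerns predicting whether arbitrary machines will ever produce such sequences. The class and method names are hypothetical, and the check inspects only the first i + j tape positions.

```java
/** Sketch: checks the Simplified Collusion pattern on two finite, already written tapes. */
final class CollusionPatternSketch {
    private static boolean isMove(char c) {
        return c == 'C' || c == 'D';
    }

    /**
     * Returns true if both tapes hold at least i + j letters from {C, D} and the
     * j letters starting at position i + 1 are all 'D' on t1 and all 'C' on t2.
     */
    static boolean showsPattern(String t1, String t2, int i, int j) {
        if (i < 0 || j < 0) return false;
        int end = i + j;
        if (t1.length() < end || t2.length() < end) return false;
        for (int k = 0; k < end; k++) {
            if (!isMove(t1.charAt(k)) || !isMove(t2.charAt(k))) return false;
            // Positions i+1 .. i+j (1-based) are indices i .. i+j-1 (0-based).
            if (k >= i && (t1.charAt(k) != 'D' || t2.charAt(k) != 'C')) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Output of N1 and N2 from the proof for i = 1, j = 1, if M halts.
        System.out.println(showsPattern("CD", "CC", 1, 1));     // prints "true"
        // ALLD against ALLC is classified as colluding, as noted in the text.
        System.out.println(showsPattern("DDDD", "CCCC", 1, 3)); // prints "true"
        // Two TFT players cooperating throughout do not show the pattern.
        System.out.println(showsPattern("CCCC", "CCCC", 1, 3)); // prints "false"
    }
}
```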
W. Slany & W. Kienreich
8.5. Conclusion
We have described our submissions to the iterated prisoner’s dilemma (IPD) competitions of 2004 and 2005: the OmegaTitForTat (OTFT) single-player strategy and the CosaNostra group strategy, composed of one Godfather (CNGF) and several Hitman (CNHM). We also studied their performance in the different leagues of the competitions. The observed slight superiority of OTFT over GRIM is psychologically reassuring. The charm of OTFT compared to GRIM is that OTFT is an intelligent, forgiving strategy, whereas GRIM, as the name implies, is an unforgiving, iron-handed pig-head that falls into an eternal revenge mode after being deceived a single time.
We have also established a taxonomy of generalized group strategies for IPD competitions. In it, the types of group strategies are classified according to their behaviour towards other members of the same group and towards strategies outside of their group. We labelled the four classes of group strategies studied as democracies during peace (DP), democracies at war (DW), empires during peace (EP), and empires at war (EW). As we have shown in the previous section, group strategies can easily outperform any individual strategy by sheer numerical superiority. Group strategies appear everywhere in Nature and Human Society, and group strategies competing in IPD competitions can serve as simplified study objects of the former.
It is interesting to note that, in the analysis in the last section, individual strategies that are members of a DW group fare less well than those of a DP group, and that this relation is reversed for empires: EW fares better than EP, not because the emperor itself fares better, but because his competitors are more harmed. This is clear from the fact that members of DW individually lose more than members of DP, whereas emperors at war (EW) fare better than emperors during peace (EP), and these in turn fare better than DW and DP.
For example, emperors at war do not suffer from their aggressive acts, and actually do better in comparison with their opponents, because the payoff of individuals outside their group is lowered by their underling members, while retaliation from others does not hit the emperors directly (think of real emperors, Mafia bosses, etc.). But it does not even have to be fights for life and death, wars, or outright genocide: the same pattern appears in business, where larger or more advanced companies (in particular their owners) that are more or less aggressive
can crush competitors or, in the extreme, take advantage of cheap child-slave labour, thus extremely abusing their own workforce. It is also interesting to note that better resources, be it people, money, or technology, corresponding to a higher number of individual strategies in the group or to better average payoff values against non-group strategies, positively influence the overall payoff values of the groups. Thus, numerical superiority does not have to mean that the number of soldiers is higher, but can also be due to better technology, be it military, commercial, or biological. It is also not surprising that, as described in point 4.a of Claim 8.1 of Section 0, individual strategies in democracies during peace always “lose” against emperors, the latter always being able to get more from their subjects than they give in return, and certainly more than their unorganized competitors. However, given enough superiority, again either in number, money, or technology, even democracies at war can win against empires at war (point 4.b of the claim in Section 0); the Second World War, for instance, offers several examples of such situations.
We also showed that group strategies can be subtly camouflaged to look like unrelated single-player strategies. These stealth collusion group strategies will elude detection with high probability, e.g., by introducing a certain amount of noise in the interaction with one’s group members to make the collusion less evident. We showed that the differentiation between colluding and non-colluding behaviour can be very difficult in practice and is generally undecidable from a theoretical point of view.
In the study of economics, collusion takes place within an industry when rival companies cooperate for their mutual benefit. According to game theory, the independence of suppliers forces prices to their minimum, increasing efficiency and decreasing the price-determining ability of each individual firm.
If one firm decreases its price, other firms will follow suit in order to maintain sales, and if one firm increases its price, its rivals are unlikely to follow, as their sales would only decrease. These rules are used as the basis of kinked-demand theory. If firms collude to increase prices as a cooperative, however, loss of sales is minimized as consumers lack alternative choices at lower prices. This benefits the colluding firms at the cost of efficiency to society [Wikipedia: Collusion 2005]. There was some discussion whether collusion group strategies were actually cheating in the 2004 and 2005 IPD competitions, but since the organizers clearly said that cooperating strategies were to be allowed, it would have been strange to deny participation to such group strategies. What we can say at least is that the detection of StealthCollusion, both in future IPD
competitions and in real life, is very difficult in practice. The Mafia, or for that matter any human organization that is not readily recognizable as a group, be it Masonic lodges, secret religious groups, or corporate cartels, exists and as such is certainly worth modelling. Being able to communicate secretly, thereby “colluding” in a general sense, is quite common, and in practice forbidding it is nearly infeasible whenever intelligent individuals exchange information repeatedly. An exception, a biological occurrence of an IPD without information exchange, has been described by Turner and Chao (1999). They show that certain viruses that infect and reproduce in the same host cells seem to be engaged in a prisoner’s dilemma driven by survival of the fittest. However, in light of the ways in which different types of bird flu viruses infecting the same human cells can exchange RNA in order to increase their fitness, it can be argued that such emerging colluding group behaviour appears already at this relatively low level of life.
In commerce, collusion is largely illegal due to antitrust law, but implicit collusion in the form of price leadership and tacit understandings is unavoidable. Several recent examples of explicit collusion in the United States include [Wikipedia: Collusion 2005]:
• Price fixing and market division among manufacturers of heavy electrical equipment in the 1960s.
• An attempt by Major League Baseball owners to restrict players’ salaries in the mid-1980s.
• Price fixing within food manufacturers providing cafeteria food to schools and the military in 1993.
• Market division and output determination of livestock feed additive by companies in the US, Japan and South Korea in 1996.
There are many ways that implicit collusion tends to develop [Wikipedia: Collusion 2005]:
• The practice of stock analyst conference calls and meetings of industry almost necessarily causes tremendous amounts of strategic and price transparency. This allows each firm to see how and why every other firm is pricing its products. Again, the line between insider information and just being better informed is often very thin.
• If the practice of the industry causes more complicated pricing, which is hard for the consumer to understand (such as risk based pricing, hidden taxes and fees in the wireless industry, negotiable pricing), this can cause
competition based on price to be meaningless (because it would be too complicated to explain to the customer in a short ad). This causes industries to have essentially the same prices and compete on advertising and image, something theoretically as damaging to a consumer as normal price fixing.
We predict that all iterated prisoner’s dilemma competitions in the future will be dominated by group strategies. Even if, in a future IPD competition, all strategies were chosen by a single person who consciously tried to prevent any “group cooperation” among his strategies, random and involuntary cooperation that is mathematically identical to voluntary cooperation could never be excluded. Actually, group cooperation can be self-emerging in a population, some strategies involuntarily faring better together and possibly against other groups or individuals, however loosely they are constituted. We predict that when evolutionary algorithms are used to breed new species of IPD strategies, such cooperation will automatically emerge at a certain point. Cooperation in groups of strategies in IPD competitions mimics cooperation of groups in Nature and Human Society; it therefore allows modelling another common aspect of cooperative behaviour that so far has not been explicitly studied in the IPD framework: more or less open cooperation of subgroups versus other subgroups or individuals. The number of members of the group does not have to correspond to the actual number of individuals. Instead, it could also mean the amount of money involved, or the technological advantage of one subgroup relative to another one.
Acknowledgments
The authors would like to thank the anonymous reviewers for many useful comments and corrections.
References
Axelrod, R. (1984) The evolution of cooperation. Basic Books.
Beaufils, B., Delahaye, J.-P., and Mathieu, P. (1996).
Our meeting with gradual: A good strategy for the iterated prisoner’s dilemma, Proceedings Artificial Life V, Nara, Japan, 1996.
Kuhn, S. (2003) Prisoner’s Dilemma. The Stanford Encyclopedia of Philosophy (Fall 2003 Edition), Edward N. Zalta (ed.), http://plato.stanford.edu/archives/fall2003/entries/prisoner-dilemma/.
Mehlmann, A. (2000) The Game’s Afoot! Game Theory in Myth and Paradox. AMS Press.
Nowak, M. and K. Sigmund (1993) A strategy of win-stay, lose-shift that outperforms tit-for-tat in the Prisoner’s Dilemma game, Nature, 364, pp. 56-58.
O’Riordan, C. (2000) A Forgiving Strategy for the Iterated Prisoner’s Dilemma. Journal of Artificial Societies and Social Simulation, 3, 4.
Papadimitriou, C. H. (1994) Computational Complexity. Addison-Wesley.
Turner, P. and L. Chao (1999) Prisoner’s dilemma in an RNA virus, Nature, 398, pp. 441-443.
Tzafestas, E.S. (2000) Toward adaptive cooperative behavior, From Animals to Animats, Proceedings of the 6th International Conference on the Simulation of Adaptive Behavior (SAB-2000), 2, pp. 334-340.
Wikipedia: Collusion (2005). http://en.wikipedia.org/w/index.php?title=Collusion&oldid=33029071.
Chapter 9 Error-Correcting Codes for Team Coordination within a Noisy Iterated Prisoner’s Dilemma Tournament
Alex Rogers, Rajdeep K. Dash, Sarvapali D. Ramchurn, Perukrishnen Vytelingum, Nicholas R. Jennings University of Southampton
9.1. Introduction
The mechanism by which cooperation arises within populations of selfish individuals has generated significant research within the biological, social and computer sciences. Much of this interest derives from the original research of Axelrod and Hamilton [Axelrod and Hamilton (1981)] and, in particular, from the two computer tournaments that Axelrod organised in order to investigate successful strategies for playing the Iterated Prisoner’s Dilemma (IPD) [Axelrod (1984)]. These tournaments were significant because they demonstrated that a simple strategy based on reciprocity, namely tit-for-tat, was extremely effective in promoting and maintaining cooperation when playing against a wide range of seemingly more complex opponents.
To mark the twentieth anniversary of the publication of this work, these two computer tournaments were recently recreated (see http://www.prisoners-dilemma.com/), with separate events being hosted at the 2004 IEEE Congress on Evolutionary Computation (CEC’04) and the 2005 IEEE Symposium on Computational Intelligence and Games (CIG’05). To stimulate novel research, the rules of Axelrod’s original tournaments were extended in two key ways. Firstly, noise was introduced, whereby the moves of each player would be mis-executed with some small probability. Secondly, and most significantly, researchers were invited to enter more than one player into the round-robin style tournament. This second extension to the original rules prompted several researchers to enter teams of players into the tournament, a choice motivated by the intuition that the members of such a team could, in principle, recognise and collaborate with
one another in order to gain an advantage over other competing players. This proved to be the case, and teams of players performed well in both competitions. Indeed, a member of such a team, entered by the authors, won the noisy IPD tournaments held at both events.
Now, for this approach to be effective in practice, two key questions have to be addressed. Firstly, the players, who have no access to external means of communication, have to be able to recognise one another when they meet within the IPD tournament. Secondly, having achieved this recognition, the players have to adopt a strategy that increases the probability that one of their own kind wins the tournament. In this chapter, we present our work investigating these two questions. Specifically:
(1) We show how our players are able to use a pre-agreed sequence of moves, that they make at the start of each interaction, to transmit a covert signal to one another, and thus detect whether they are facing a competing player or a member of their own team.
(2) We show that by recognising and then cooperating with one another, the members of the team can act together to mutually improve their performance within the tournament. In addition, by recognising and acting preferentially toward a single member of the team, the team can further increase the probability that this member wins the overall tournament. In both cases, this can be achieved with a team that is small in comparison to the population (typically less than 15%).
(3) Given this approach, we show with an experimental IPD tournament that the performance of our team is highly dependent on the length of the pre-agreed sequence of moves. The length of this sequence determines both the cost and the effectiveness of the signalling between team members, and these factors contribute to an optimum sequence length that is independent of both the size of the team and the number of competing players within the tournament.
(4) Using the results of these experimental IPD tournaments, we show that signalling with a pre-agreed sequence of moves, within the noisy IPD tournament, is exactly analogous to the problem, studied in information theory, of communicating reliably over a noisy channel. Thus we demonstrate that we can implement error correcting codes in order to further optimise the performance of the team.
(5) Finally, we discuss how the results of these investigations guided the design of the teams that we entered into the two recent IPD competitions, and thus we follow this analysis with a discussion of the results of these competitions.
The remainder of this chapter is organised as follows: section 9.2 describes the Iterated Prisoner’s Dilemma setting and related work. Section 9.3 describes the team players that we implemented in our investigations and section 9.4 describes the results of the experimental IPD tournaments that we implemented. In section 9.5 we analyse these results and in section 9.6 we discuss our use of coding theory to optimise the performance of the team. Finally, we discuss the application of these techniques within the two computer tournaments in section 9.7 and we conclude in section 9.8.
9.2. The Iterated Prisoner’s Dilemma and Related Work
In our investigations, we consider the standard Iterated Prisoner’s Dilemma (IPD) as used by Axelrod in his original computer tournaments. Thus, in each individual IPD game, two players engage in repeated rounds of the normal form Prisoner’s Dilemma game, where, at each round, they must choose one of two actions: either to cooperate (C) or to defect (D). These actions are chosen simultaneously and, depending on the combination of moves revealed, each player receives the payoff indicated in the game matrix shown in table 9.1. For example, should player 1 cooperate (C) whilst player 2 defects (D), then player 1 receives zero points whilst player 2 receives five points. The scores of each player in the overall IPD game are then simply the sum of the payoffs achieved in each of these rounds. In our experiments we assume that each IPD game consists of 200 such rounds; however, this number is of course unknown to the players participating. As in the original tournaments, a large number of such players (each using a different strategy to choose its actions in each individual IPD game) are entered into a round-robin tournament.
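A single IPD game as just described can be sketched as follows. This is a minimal illustration under stated assumptions rather than the tournament software: the payoffs are those of table 9.1 and the 200 rounds come from the text, whereas the representation of a strategy as a function of the two move histories, and the names `PAYOFF`, `play_game`, `always_cooperate` and `always_defect`, are choices made here for illustration.

```python
from typing import Callable, List, Tuple

# Payoffs from table 9.1, indexed by (player 1 move, player 2 move);
# each value is (payoff to player 1, payoff to player 2).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}

# Assumed interface: a strategy sees its own past moves and the opponent's
# past moves, and returns "C" or "D". It is not told the number of rounds.
Strategy = Callable[[List[str], List[str]], str]


def always_cooperate(own: List[str], other: List[str]) -> str:
    return "C"


def always_defect(own: List[str], other: List[str]) -> str:
    return "D"


def play_game(s1: Strategy, s2: Strategy, rounds: int = 200) -> Tuple[int, int]:
    """Play one noise-free IPD game and return the two total scores."""
    history1: List[str] = []
    history2: List[str] = []
    score1 = score2 = 0
    for _ in range(rounds):
        # Both moves are chosen before either is revealed (simultaneous play).
        move1 = s1(history1, history2)
        move2 = s2(history2, history1)
        payoff1, payoff2 = PAYOFF[(move1, move2)]
        score1 += payoff1
        score2 += payoff2
        history1.append(move1)
        history2.append(move2)
    return score1, score2
```

For instance, `play_game(always_cooperate, always_defect)` gives the cooperating player zero points and the defecting player five points in every one of the 200 rounds.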
In such a tournament, each player faces every other player (including a copy of itself) in separate IPD games, and the winner of the tournament is the player whose total score, summed over each of these individual interactions, is the greatest. Given this problem description, the goal of Axelrod’s original tournaments was to find the most effective strategies that the players should adopt. Whilst in a single instance of the Prisoner’s Dilemma game it is a dominant strategy for each player to defect, in the iterated game this immediate temptation is tempered by the possibility of cooperation in future rounds. This is often termed the shadow of the future[Trivers (1971)], and, thus, in
order to perform well in an IPD tournament, it is preferable for a player to attempt to establish mutual cooperation with the opponent. Thus, strategies based on reciprocity have proved to be successful, and, indeed, the simplest such strategy, tit-for-tat (i.e. start by cooperating and then defect whenever the opponent defected in the last move), famously won both tournaments [Axelrod (1984)]. More recent research has extended this reciprocity based approach, and has led to strategies that out-perform tit-for-tat in general populations. For example, Gradual [Beaufils et al. (1997)] is an adaptation of tit-for-tat that incrementally increases the severity of its retaliation to defections (i.e. the first defection is punished by a single defection, the second by two consecutive defections, and so on). Likewise, Adaptive [Tzafestas (2000)] follows the same intuition as Gradual but addresses the fact that the opponent’s behaviour may change over time and thus a permanent count of past defections may not be the best approach. Rather, it maintains a continually updated estimate of the opponent’s behaviour, and uses this estimate to condition its future actions.

Table 9.1. Pay-off matrix of the normal form Prisoner’s Dilemma game (each cell gives the payoff to player 1, then the payoff to player 2).

                    Player 2
                    C       D
Player 1    C      3,3     0,5
            D      5,0     1,1

However, this reciprocity is challenged within the noisy IPD tournament. Here, there is a small possibility (typically around 1 in 10) that the move proposed by either of the players is mis-executed. Thus a player who intended to cooperate may defect accidentally (or vice versa)^a, and this noise makes maintaining mutual cooperation much more difficult. For example, a single accidental defection in a game where two players are using the tit-for-tat strategy will lead to a series of mutual defections in which each player’s score is reduced. This detrimental effect is often resolved by implementing more generous strategies which do not retaliate immediately. For example, tit-for-two-tats (TFTT) will only retaliate after two successive defections [Axelrod (1997); Axelrod and Wu (1995)] and generous tit-for-tat (GTFT) only retaliates a small percentage of the times that tit-for-tat would [Axelrod and Wu (1995)]. However, whilst these strategies manage to maintain mutual cooperation when playing against similar generous strategies, their generosity is also vulnerable to exploitation by more complex strategies. Thus effective strategies for noisy IPD tournaments must carefully balance generosity against vulnerability to exploitation, and in practice this is difficult to achieve.

^a Note that this noise can be implemented in two different ways: either the cooperation is actually mis-executed as a defection, or it is simply perceived by the other player as a defection. The difference between these two implementations results in different payoffs to the players in that round of the IPD game. Whilst this does result in slightly different scores in the overall IPD tournament, it does not significantly affect the results, as, in general, the performance of a player is determined by its actions in the moves that follow either the real or perceived defection. In our experiments, we use the first implementation and assume that noisy moves are actually mis-executed.

Now, the possibility of entering a team of players within a noisy IPD tournament offers an alternative to this reciprocity based approach. If the members of the team are able to recognise one another, they can unconditionally mutually cooperate and thus do not need to retaliate against defections that are the result of mis-executed moves. In addition, by defecting against players whom they do not recognise as fellow team members, they are immune to exploitation from these competing players. As such, this approach resembles the notion of kin selection from the evolutionary biology literature, where individuals act altruistically toward those that they recognise as being their genetic relatives [Hamilton (1963, 1964)]. However, to use this approach in practice, we must address two specific issues. Firstly, we must enable the players to recognise one another, and we do so by using a pre-agreed sequence of moves that each player makes at the start of each IPD interaction. Secondly, since our goal is to ensure that one member of the team wins the tournament, we explicitly identify one team member as the team leader, and have the other team members favour this individual. We describe these steps, in more detail, in the next section.
9.3. Team Players
Thus, as described in the previous section, we initially implement a team of players who recognise one another through the initial sequence of moves they make at the start of each IPD interaction. To this end, each team player uses a fixed length binary code word to describe this initial sequence of moves. Specifically, we denote 0 as defect and 1 as cooperate, and the binary code word indicates the fixed sequence of moves that the player should make, regardless of the actions of the opponent. This binary code word is known to all members of the team, and by comparing the moves
of their opponents against this code word, players within the team can recognise if they are playing against another member of the team or against an unknown opponent^b. Now, whenever a team member meets another team member within the IPD tournament, they can recognise one another and then cooperate with one another unconditionally. In addition, the team members can recognise when they are playing against a competing player and then defect continually (see figure 9.1). In this way, since the team players no longer have to reciprocate any mis-executed moves in order to maintain cooperation, they achieve close to the maximum possible score whenever they play against other team members. In addition, since they defect against competing players, they are also immune to exploitation from these players. Thus, given a sufficient number of team members within the IPD tournament, the team players perform well compared to reciprocity based strategies.

[Fig. 9.1. Diagram showing the sequence of actions played by each of the team members: after playing the ‘team member code’, a team member plays CCCCCCCC... if it recognises a team member, and DDDDDDDD... otherwise.]

However, our goal is to form a team that maximises the probability that one of its members will be the most successful player within the IPD tournament. Thus, we can improve the performance of the team by identifying one of the team members as the team leader, and allowing the other ordinary team members to act preferentially towards this team leader. Thus, when the ordinary team members encounter the team leader, they continually cooperate, whilst allowing the team leader to exploit them by continually defecting. In this way, whilst competing players derive the minimum possible score in interactions with the ordinary team members, the team leader derives the maximum possible score in these same interactions. Hence, by allowing the team leader to exploit them, the ordinary team members sacrifice their own chance of winning the tournament, but by changing the tournament environment, they are able to increase the chance that the team leader will win^c.

^b Note that this recognition will not be perfectly reliable; the code word may be corrupted by noise or competing players may accidentally make a sequence of moves that matches the team code word. These are effects that we explicitly consider in section 6.

^c Thus the team that we implement is similar to the ‘master’ and ‘slave’ approach suggested by Delahaye and Mathieu [Delahaye and Mathieu (1993)]. However, unlike this example, where the slaves were simple strategies that could potentially be exploited by any member of the population, all of our team players explicitly recognise one another and condition their actions on this recognition.

The case above describes the instances in which the team leader encounters another team member. However, when the team leader encounters any other competing player it should adopt some default strategy. Clearly, using the best performing strategy available will increase the chances of the team leader winning the tournament. However, since our purpose here is to demonstrate the factors that influence the effectiveness of the team, rather than to optimise a single example case, in the investigations that we present here we use tit-for-tat as this default strategy. Tit-for-tat is well understood, and whilst it does not exploit other strategies as effectively as the more recently developed alternatives discussed in the previous section, it is immune to being exploited itself. Thus, in the case that the team leader does not recognise another team player, it cooperates on the next two moves in an attempt to reestablish cooperation and then continues by playing tit-for-tat for the rest of the interaction. Finally, since the rules of the IPD tournament mean that each player must play against a copy of itself, we also enable the team leader to recognise and cooperate with a copy of itself. The actions of both the ordinary team members and the team leader are shown schematically in figure 9.2. Note that it is not strictly necessary to implement two different codes (i.e. one for the team leader and one for ordinary team members); however, we do so to reduce the chances of a competing player exploiting the ordinary team members (see section 9.7 for a more detailed discussion).

[Fig. 9.2. Diagram showing the sequence of actions played by each of the team players. Team leader: after playing the ‘team leader code’, it plays DDDDDDDD... if it recognises a team member, CCCCCCCC... if it recognises the team leader, and CC followed by TFT otherwise. Team member: after playing the ‘team member code’, it plays CCCCCCCC... if it recognises a team member, CCCCCCCC... if it recognises the team leader, and DDDDDDDD... otherwise.]
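The behaviour of the two kinds of team player described in this section can be sketched as follows. This is a hedged illustration, not the code entered into the tournaments: the two 4-bit code words are arbitrary values chosen here, both codes are assumed to have the same length, noise is omitted, and the function names and the history-based strategy interface are assumptions of the sketch.

```python
from typing import Callable, List, Tuple

# Hypothetical code words (1 = cooperate, 0 = defect), chosen for illustration.
MEMBER_CODE = "1011"
LEADER_CODE = "0110"

Strategy = Callable[[List[str], List[str]], str]


def to_moves(code: str) -> str:
    """Translate a binary code word into moves: 1 -> C, 0 -> D."""
    return "".join("C" if bit == "1" else "D" for bit in code)


def team_member(own: List[str], other: List[str]) -> str:
    signal = to_moves(MEMBER_CODE)
    if len(own) < len(signal):
        return signal[len(own)]  # play the code word regardless of the opponent
    seen = "".join(other[: len(signal)])
    if seen in (to_moves(MEMBER_CODE), to_moves(LEADER_CODE)):
        return "C"  # cooperate with fellow members and with the leader
    return "D"      # defect continually against everyone else


def team_leader(own: List[str], other: List[str]) -> str:
    signal = to_moves(LEADER_CODE)
    length = len(signal)
    if len(own) < length:
        return signal[len(own)]
    seen = "".join(other[:length])
    if seen == to_moves(MEMBER_CODE):
        return "D"  # exploit ordinary team members
    if seen == signal:
        return "C"  # cooperate with a copy of itself
    if len(own) < length + 2:
        return "C"  # two cooperative moves to re-establish cooperation
    return other[-1]  # then tit-for-tat: repeat the opponent's last move


def always_defect(own: List[str], other: List[str]) -> str:
    return "D"


def moves_played(s1: Strategy, s2: Strategy, rounds: int) -> Tuple[str, str]:
    """Return the move sequences of both players in a noise-free game."""
    h1: List[str] = []
    h2: List[str] = []
    for _ in range(rounds):
        m1, m2 = s1(h1, h2), s2(h2, h1)
        h1.append(m1)
        h2.append(m2)
    return "".join(h1), "".join(h2)
```

With these code words, a member meeting the leader plays its code and then cooperates, while the leader plays its own code and then defects; against `always_defect`, the leader plays its code, cooperates twice and then mirrors the opponent. In a noisy tournament the comparison of `seen` against the code words would sometimes fail, which is the unreliability discussed in footnote b.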
212
A. Rogers et al.
9.4. Experimental Results
Now, given the team players described in the previous section, two immediate questions are posed: (i) how does the number of team players within the population affect the probability that the team leader does in fact win the tournament? and (ii) how does the length of the code word (i.e. the length of the initial sequence of moves that the team players use to signal to one another) affect the performance of the team leader? In order to address these questions and to test the effectiveness of the team, we implement an IPD tournament (with and without noise) using a representative population of competing players. To ensure consistency between different comparisons within the literature, we adopt the same test population as previous researchers [Beaufils et al. (1997); O’Riordan (2000); Tzafestas (2000)], and thus the population consists of eighteen players implementing the base strategies used in the original Axelrod competition (e.g. All C, All D, Random and Negative), simple strategies that play periodic moves (e.g. periodic CD, CCD and DDC) and state-of-the-art strategies that have been shown to outperform these simple strategies (e.g. Adaptive, Forgiving and Gradual). A full list and description of the strategies adopted by these players is provided in Appendix A.
We first run this tournament, using this fixed competing population, whilst varying the number of team players within the population from 2 to 5 (i.e. one team leader and 1 to 4 ordinary team members), and varying the length of the code word, L, from 1 to 16 bits. To ensure representative results, we also average over all possible code words, and in total, we run the tournament 1000 times and average the results. Since our aim is to show the benefit that the team has yielded, compared to the default strategy of the team leader (in this case tit-for-tat), we divide the total score of the team leader by the total score of the player adopting the simple tit-for-tat strategy.
Thus, we calculate ⟨Score_Leader⟩ / ⟨Score_TFT⟩ and note that the greater this value, the better the performance of the team. The results of these experiments are shown in figure 9.3 for the noise free IPD tournament and in figure 9.5 for the noisy IPD tournament. In these figures, the experimental results are plotted with error bars, along with a continuous best fit curve (see section 9.5 for a discussion of the calculation of this line).
Now, in order to investigate the effect of larger population sizes, we also run experiments where we fix the number of team players within the population to be five (again composed of one team leader and four ordinary team members), but then generate competing populations of different sizes by randomly selecting players from our pool of 18 base strategies (always ensuring that we have at least one player using the tit-for-tat strategy). We run the tournament 10000 times (more than before, as we must also average over the stochastic competing population) and again calculate ⟨Score_Leader⟩ / ⟨Score_TFT⟩. Figure 9.4 shows these results for the noise free IPD tournament and figure 9.6 shows results for the noisy IPD tournament.

[Fig. 9.3. Experimental results showing the benefit of the team in a noise free IPD tournament. Results show code word lengths from 1 to 16 bits where the total population consists of 2 to 5 team players (i.e. one team leader and 1 to 4 ordinary team members) and 18 competing players. Results are averaged over 1000 tournament runs. Plot of ⟨Score_Leader⟩ / ⟨Score_TFT⟩ against code word length (L), with one curve each for 2, 3, 4 and 5 team players.]

[Fig. 9.4. Experimental results showing the benefit of the team in a noise free IPD tournament. Results show code word lengths from 1 to 16 bits where the total population consists of 5 team players (i.e. one team leader and 4 ordinary team members) and 6, 12, 18, 24 and 30 competing players. Results are averaged over 10000 tournament runs. Plot of ⟨Score_Leader⟩ / ⟨Score_TFT⟩ against code word length (L), with one curve for each number of competing players.]

[Fig. 9.5. Experimental results showing the benefit of the team in a noisy IPD tournament. Results show code word lengths from 1 to 16 bits where the total population consists of 2 to 5 team players (i.e. one team leader and 1 to 4 ordinary team members) and 18 competing players. Results are averaged over 1000 tournament runs. Plot of ⟨Score_Leader⟩ / ⟨Score_TFT⟩ against code word length (L), with one curve each for 2, 3, 4 and 5 team players.]

[Fig. 9.6. Experimental results showing the benefit of the team in a noisy IPD tournament. Results show code word lengths from 1 to 16 bits where the total population consists of 5 team players (i.e. one team leader and 4 ordinary team members) and 6, 12, 18, 24 and 30 competing players. Results are averaged over 10000 tournament runs. Plot of ⟨Score_Leader⟩ / ⟨Score_TFT⟩ against code word length (L), with one curve for each number of competing players.]

The results clearly indicate that, as expected, increasing the number of team players, or more exactly, increasing the percentage of the population represented by the team, improves the performance of the team (i.e. increases ⟨Score_Leader⟩ / ⟨Score_TFT⟩). In addition, in both the noise free and noisy IPD tournaments there is clearly an optimum code word length, whereby the benefit of the team decreases when the code word length is longer or shorter than this optimum. Most significantly, this optimum code word length is clearly independent of both the size of the team and the population. In addition, in the case of the noisy IPD tournament, the results are very sensitive to this optimum code word length and, overall, the benefit of the team is much less than that achieved in the noise free IPD tournament. In the next section, we analyse these results and propose error correcting codes to improve performance in the noisy IPD tournament.
9.5. Analysis
The optimum code word lengths observed in the previous experimental results are the result of a number of opposing factors. If we initially consider the noise free IPD tournament, we can identify two such factors. The first represents the cost of the signalling between team players. As the length of the code word is increased, the team players have fewer remaining moves in which to manipulate the outcome of the tournament and, thus, this factor favours shorter code word lengths.
However, for this signalling to be effective, the team players must be able to distinguish between competing players and other team players. If the code word becomes too short, it becomes increasingly likely that a competing player will, through pure chance, make the sequence of moves that corresponds to one of the code words of the team players. Thus the second factor represents the effectiveness of the signalling. It has the opposite effect to the first and thus favours longer code word lengths. The balance of these two opposing factors gives rise to the behaviour seen in figures 9.3 and 9.4, where we observe an optimum code word length near seven bits; at greater lengths we observe an approximately linear decrease in performance, whilst at shorter lengths, we observe a more rapid decrease in performance.

Fig. 9.7. Experimental and theoretical results showing the probability of a team player successfully discriminating between another team player and a competing player in an IPD tournament. (Plot of the probability of discrimination, $P_d$, against code word length L.)

When noise is added to the IPD tournament, a third factor, which also affects the effectiveness of the signalling, becomes apparent. In order for the team players to recognise one another, the sequence of moves made by each player must be correctly executed. In the noisy IPD tournament, there is a small probability that one or more of the moves that constitute these code words will be mis-executed and, in this case, the team players will fail to recognise one another. The effect of this additional factor is clearly seen by comparing figures 9.3 and 9.4 with figures 9.5 and 9.6. In the noisy IPD tournament the optimum code word length is significantly shorter than in the noise free case and there is a very rapid non-linear decrease in performance at code word lengths greater than this optimum. This final factor is very significant, and thus in the noisy IPD tournament the team yields much less benefit than in the noise free IPD tournament.

Now, the two factors that describe the effectiveness of the signalling can usefully be expressed as two probabilities. These are the probability that a team player will successfully discriminate a competing player from another team player, $P_d$, and the probability that two team players will successfully recognise one another, $P_r$. We can directly measure these probabilities from the experimental results presented in the last section, and then compare them to theoretical predictions. Thus, to calculate the probability of successful discrimination, $P_d$, we consider that out of the $2^L$ possible code words, one is required for the team leader code and one for the team member code. Thus, when we consider the average over all possible code words, this probability is given by:

$$P_d = 1 - \frac{2}{2^{L}} \qquad (9.1)$$

In the case of the probability of successful recognition, $P_r$, we require that both code word sequences are played with no mis-executed moves. If the probability of mis-executing a move is $\gamma$ (in our case $\gamma = 1/10$), then this probability is simply given by:

$$P_r = (1 - \gamma)^{2L} \qquad (9.2)$$
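As a quick numerical illustration of equations 9.1 and 9.2, the following minimal Python sketch tabulates both probabilities for code word lengths of 1 to 16 bits. The function names are illustrative only and do not come from the authors' implementation; the noise level defaults to the value $\gamma = 1/10$ used in the text.

```python
def p_discriminate(length: int) -> float:
    """Equation 9.1: probability that a random code word of this length
    is neither the team leader code nor the team member code."""
    return 1.0 - 2.0 / 2 ** length


def p_recognise(length: int, gamma: float = 0.1) -> float:
    """Equation 9.2: probability that both players execute all of their
    code word moves correctly (2 * length moves in total)."""
    return (1.0 - gamma) ** (2 * length)


if __name__ == "__main__":
    # Print both probabilities for every code word length shown in the figures.
    for length in range(1, 17):
        print(f"L={length:2d}  Pd={p_discriminate(length):.3f}  "
              f"Pr={p_recognise(length):.3f}")
```

Running the loop makes the opposing trends explicit: $P_d$ rises quickly towards one as L grows, whilst $P_r$ decays geometrically.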
Figures 9.7 and 9.8 show a comparison of these analytical results against the probabilities measured from the experimental results presented in the last section. Clearly the theoretical predictions match the experimental data extremely well[d] and these results indicate that the benefit of the team is strongly dependent on the effectiveness of the signalling between the team members. Most surprisingly, in the case of the noisy IPD tournament, with anything but the very shortest code word lengths, the chance of two team players successfully recognising one another is extremely small.

At first sight, this result suggests that the use of teams is unlikely to be very effective in noisy environments. However, the problem that we face here (i.e. how to reliably recognise code words in the presence of mis-executed moves) is exactly analogous to the problem, studied in information theory, of communicating reliably over a noisy channel. As such, we can use the results of this field (specifically error correcting codes) to increase the probability that the team members successfully recognise one another and thus, in turn, increase the benefit that the team will yield.

[d] Further confirmation of this analysis is provided by the observation that the best-fit lines shown in figures 9.3 to 9.6 are calculated by postulating that the shape of the line is given by $y = A + Bx + C/2^{x} + D(1 - \gamma)^{2x}$. The coefficients A, B, C and D are then found via regression so as to minimise the sum of the squared error between observed and calculated results. In the case of the noise free IPD tournament, the value of D is fixed at zero.

Fig. 9.8. Experimental and theoretical results showing the probability of two team players successfully recognising one another in a noisy IPD tournament. (Plot of the probability of recognition, $P_r$, against code word length L.)

9.6. Error Correcting Codes

The problem of communicating reliably over a noisy channel, or in our case, reliably recognising code words when moves of the IPD game are subject to mis-execution, is fundamental to the field of information theory [Shannon (1948)]. One of the most widely used results of this work is the concept of error correcting codes: codes that allow random transmission errors to be detected and corrected [MacKay (2003); Peterson and Weldon (1972)]. Such codes typically take a binary code word of length $L_c$ and encode it into a longer binary message of length $L_m$ (i.e. $L_m > L_c$). Should any errors occur in the transmission of this message (e.g. a 1 transmitted by the sender is interpreted as a 0 by the receiver), the decoding procedure and the redundancy that has been incorporated into the longer message mean that these errors can be corrected and the original code word retrieved. Different coding algorithms are distinguished by the length of the initial code word, the degree of redundancy added to the message and the number of errors that they can correct.

Thus, in our application, all the team members must implement the same coding algorithm, but now, rather than using the code word directly to describe their initial sequence of moves, they use the longer encoded message. Likewise, they observe the moves of their opponent and then compare the results of the decoding algorithm to their reference code words. The improvement that such error-correcting codes can achieve is significant but we have several requirements when selecting an appropriate coding
algorithm. The coding algorithm should increase the effectiveness of the signalling, by increasing the probability that the team members can successfully discriminate between team members and other competing players (i.e. increase $P_d$) and by increasing the probability that the team members recognise one another successfully (i.e. increase $P_r$). However, it should not increase the cost of the signalling such that this increase in effectiveness is lost.

The need to limit the increase in the cost of signalling, and thus limit the length of the encoded message, $L_m$, is the key factor in restricting our choice of coding algorithm. As shown in figures 9.3 and 9.4, even with the perfect recognition that is achieved in the noise free case, the performance of the team begins to degrade when $L_m > 7$, and whilst many coding algorithms exist, the vast majority generate message lengths far in excess of this value [Peterson and Weldon (1972)]. Thus, our choice of coding algorithm is limited to the three presented below:

(1) A single block Hamming code that takes a four bit code word and generates a seven bit message that can be corrected for a single error.
(2) A two block Hamming code that encodes two four bit words separately and concatenates the results, thus turning an eight bit code word into a fourteen bit message that can be corrected for a single error in each seven bit block.
(3) A [15,5] Bose-Chaudhuri-Hocquenghem (BCH) code that encodes a five bit code word into a fifteen bit message, but is capable of correcting up to three errors.

Now, in each case, the probability of successfully discriminating between team players and competing players is still determined by the initial code word length (i.e. the decoding algorithm maps the $2^{L_m}$ possible encoded messages onto $2^{L_c}$ possible code words), and thus, as before, is given by:

$$P_d = 1 - \frac{2}{2^{L_c}} \qquad (9.3)$$

However, the probability that the team players successfully recognise one another is determined by the message length and by the error correcting ability of the code. Thus, for the Hamming code with n blocks, this probability is given by the probability that fewer than two errors occur in each seven bit block of both players' encoded messages:

$$P_r = \left[ \sum_{k=0}^{1} \binom{7}{k} \gamma^{k} (1 - \gamma)^{7-k} \right]^{2n} \qquad (9.4)$$
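To make the single block Hamming code of option (1) concrete, the sketch below encodes a four bit code word as seven opening moves and decodes an observed opening that contains at most one mis-executed move. It is only an illustration under stated assumptions: the chapter does not specify the bit layout or the mapping between moves and bits, so the conventional layout with parity bits in positions 1, 2 and 4 and the mapping C = 0, D = 1 are assumed here, and all function names are hypothetical.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits as a 7 bit Hamming message.
    Assumed layout (positions 1..7): p1, p2, d1, p4, d2, d3, d4."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(received_bits):
    """Correct at most one flipped bit and return the 4 data bits."""
    bits = list(received_bits)
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    error_position = s1 + 2 * s2 + 4 * s4  # 0 means no error detected
    if error_position:
        bits[error_position - 1] ^= 1
    return [bits[2], bits[4], bits[5], bits[6]]


def bits_to_moves(bits):
    """Assumed mapping: 0 is cooperate (C), 1 is defect (D)."""
    return "".join("C" if bit == 0 else "D" for bit in bits)


def moves_to_bits(moves):
    return [0 if move == "C" else 1 for move in moves]


def recognises(observed_moves, reference_word):
    """True if the opponent's first seven moves decode to the reference word."""
    return hamming74_decode(moves_to_bits(observed_moves)) == list(reference_word)


if __name__ == "__main__":
    member_word = [1, 0, 1, 1]  # hypothetical team member code word
    opening = bits_to_moves(hamming74_encode(member_word))
    # Flip the third move to mimic a single mis-executed move.
    noisy = opening[:2] + ("C" if opening[2] == "D" else "D") + opening[3:]
    print(opening, noisy, recognises(noisy, member_word))
```

With this layout a single mis-executed move anywhere in the seven move opening is corrected, which is exactly the event counted by the k = 0 and k = 1 terms of equation 9.4; two or more errors in one block are decoded to the wrong word.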
Table 9.2. Calculated results for the probability of discrimination, Pd, and the probability of recognition, Pr, for the three different error correcting codes considered.

                                        Direct    Hamming    Hamming     BCH
                                        L = 3     1 block    2 blocks    [15,5]
Lc (code word length)                   3         4          8           5
Lm (message length)                     3         7          14          15
Pd (probability of discrimination)      0.750     0.875      0.992       0.937
Pr (probability of recognition)         0.531     0.723      0.527       0.892
For the [15,5] BCH code, the probability of recognition is given by considering that the code word can be correctly decoded if fewer than four errors occur in the fifteen bit encoded message, and thus:

$$P_r = \left[ \sum_{k=0}^{3} \binom{15}{k} \gamma^{k} (1 - \gamma)^{15-k} \right]^{2} \qquad (9.5)$$
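Equations 9.3 to 9.5 share one form: a block is decodable if the number of mis-executed moves does not exceed the correcting ability of the code, and recognition requires every block of both players' messages to be decodable. The following Python sketch evaluates that form for the four cases considered. The dictionary layout and function names are illustrative assumptions rather than anything taken from the authors' code, and the direct code word is treated as a single block that tolerates no errors, which reduces to equation 9.2.

```python
from math import comb


def p_discriminate(code_word_length: int) -> float:
    """Equation 9.3: depends only on the un-encoded code word length Lc."""
    return 1.0 - 2.0 / 2 ** code_word_length


def p_block_decodable(block_length: int, correctable_errors: int, gamma: float) -> float:
    """Probability that at most `correctable_errors` of the moves in one
    block are mis-executed (binomial sum used in equations 9.4 and 9.5)."""
    return sum(
        comb(block_length, k) * gamma ** k * (1.0 - gamma) ** (block_length - k)
        for k in range(correctable_errors + 1)
    )


def p_recognise(block_length: int, correctable_errors: int,
                blocks: int = 1, gamma: float = 0.1) -> float:
    """Both players must send `blocks` decodable blocks each."""
    return p_block_decodable(block_length, correctable_errors, gamma) ** (2 * blocks)


# (Lc, block length, correctable errors per block, number of blocks)
CODES = {
    "Direct L=3": (3, 3, 0, 1),
    "Hamming, 1 block": (4, 7, 1, 1),
    "Hamming, 2 blocks": (8, 7, 1, 2),
    "BCH [15,5]": (5, 15, 3, 1),
}


def table_rows(gamma: float = 0.1) -> dict:
    """Return {code name: (Pd, Pr)} for the four cases."""
    return {
        name: (p_discriminate(lc), p_recognise(block, errors, blocks, gamma))
        for name, (lc, block, errors, blocks) in CODES.items()
    }


if __name__ == "__main__":
    for name, (pd, pr) in table_rows().items():
        print(f"{name:18s} Pd={pd:.3f}  Pr={pr:.3f}")
```

Evaluated in this way with $\gamma = 1/10$, the direct, single block Hamming and BCH cases agree with the $P_r$ values of table 9.2 to three decimal places. The two block Hamming case evaluates to approximately 0.523 rather than the tabulated 0.527, which leaves the ordering of the codes unchanged.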
These calculated values are shown in table 9.2 for the three coding algorithms considered, along with the results for the original case in which the code words are used directly (we use the value of L = 3, which was shown to be optimal for the noisy IPD tournament presented in section 9.4). Note that all of the coding algorithms result in improvements in $P_d$ since they all implement a code word of length greater than three. However, only the single block Hamming code and the [15,5] BCH code improve upon $P_r$. In the case of the two block Hamming code, the error correcting ability is not sufficient to overcome the long message length that results. Of the three algorithms, the [15,5] BCH code is superior; it creates the longest message length, yet its error correcting ability is such that it also displays the best probability of recognition.

This result is confirmed by implementing the different coding algorithms within the team players and repeating the experimental noisy IPD tournament, with a fixed competing population, described in section 9.4. As before, to ensure representative results, we run the tournament 1000 times and average over all possible choices of code words. Table 9.3 shows the results of this comparison when 2 to 5 team players (i.e. one team leader and 1 to 4 ordinary team members) are included within the population. As expected, the [15,5] BCH code outperforms the others and, in the case where there are five team players, the performance of the [15,5] BCH algorithm is very close to the best achieved in the noise free IPD tournament presented in figure 9.3.
Table 9.3. Experimental results for $\langle \mathrm{Score}_{\mathrm{Leader}} \rangle / \langle \mathrm{Score}_{\mathrm{TFT}} \rangle$ for the three different error correcting codes considered here. Tournaments are averaged over 1000 runs and the standard error of the mean is ±0.002.

Number of        Direct    Hamming    Hamming     BCH
team players     L = 3     1 block    2 blocks    [15,5]
2                1.043     1.055      1.044       1.062
3                1.079     1.101      1.083       1.120
4                1.112     1.145      1.121       1.173
5                1.141     1.184      1.159       1.221
Finally, we present results from implementing this [15,5] BCH code in the noisy IPD tournament, again with a fixed competing population. In table 9.4 we show the total scores achieved by each player when the number of team players increases from 2 to 5. To enable comparison with other populations, we normalise these scores by dividing the total score achieved by each player by the size of the population and by the number of rounds in each IPD game (in this case 200). Thus, the values shown are the ranked average pay-off received by the player in each round of the Prisoner’s Dilemma game. Within this table, the competing players are denoted by the mnemonic given in Appendix A, the team leader is denoted by LEAD and the ordinary team members by MEMB.

Clearly, as more team members are added to the population, they are increasingly able to change the environment in which the team leader must interact and thus they are able to influence the outcome of the tournament in favour of the team leader. In three out of the four cases, the team leader is in fact the winner of the tournament, despite the fact that this player is based upon the tit-for-tat strategy that performs relatively poorly against this population (see the results shown in Appendix A). In addition, these results also clearly show that the mutual cooperation of the other team members also leads them to perform well. Indeed, when the team consists of five (or more) such team members, all five occupy the top positions.

In table 9.5, rather than showing the averaged scores of the tournament players, we present the probability that one of the team players actually wins the overall noisy IPD tournament. In addition to the previous results where the probability that a move was mis-executed was 1/10, we present a range of values from 0 to 1/5. The results indicate that whilst we have assumed a noise level of 1/10 throughout the analysis, our results are not particularly sensitive to this value.
Table 9.4. Experimental results showing the results of the noisy IPD tournament when the team players implement a [15,5] BCH coding algorithm and there are increasing numbers of team players: (a) two, (b) three, (c) four and (d) five. The tournaments are averaged over 1000 runs and the standard error of the mean is ±0.002.

      (a)               (b)               (c)               (d)
Player  Score     Player  Score     Player  Score     Player  Score
GRAD    2.347     LEAD    2.427     LEAD    2.503     LEAD    2.568
LEAD    2.344     GRAD    2.298     MEMB    2.273     MEMB    2.296
ADAP    2.263     MEMB    2.246     MEMB    2.272     MEMB    2.294
SMAJ    2.256     MEMB    2.246     MEMB    2.271     MEMB    2.294
GRIM    2.239     ADAP    2.228     GRAD    2.256     MEMB    2.292
ALLD    2.219     SMAJ    2.221     ADAP    2.191     GRAD    2.218
MEMB    2.219     GRIM    2.221     SMAJ    2.186     ADAP    2.164
TFT     2.207     ALLD    2.192     GRIM    2.181     SMAJ    2.157
TFTT    2.175     TFT     2.168     ALLD    2.161     GRIM    2.156
FORG    2.171     TFTT    2.135     TFT     2.133     ALLD    2.136
GTFT    2.160     FORG    2.126     TFTT    2.099     TFT     2.103
PCD     2.138     GTFT    2.114     FORG    2.086     TFTT    2.062
PCCD    2.136     PCD     2.091     GTFT    2.068     FORG    2.054
STFT    2.124     STFT    2.090     STFT    2.061     STFT    2.036
HMAJ    2.109     HMAJ    2.084     HMAJ    2.054     GTFT    2.031
RAND    2.101     PCCD    2.078     PCD     2.047     HMAJ    2.030
PAVL    2.099     RAND    2.058     PCCD    2.027     PCD     1.999
PDDC    2.072     PAVL    2.047     RAND    2.013     PCCD    1.982
NEG     2.049     PDDC    2.033     PDDC    2.005     RAND    1.969
ALLC    1.996     NEG     1.991     PAVL    2.004     PDDC    1.969
                  ALLC    1.934     NEG     1.938     PAVL    1.966
                                    ALLC    1.877     NEG     1.886
                                                      ALLC    1.820

Indeed, the more significant factor is the loss of performance of the competing players as the noise level increases. Table 9.5 shows that with just two team members (i.e. three team players) and no noise, a team player will win the tournament just 3.4% of the time. However, as the noise level increases, the performance of the other players within the tournament degrades at a faster rate than that at which the effectiveness of the signalling between team members diminishes. At a noise level of 1/5 the same team wins 70.2% of the time. Indeed, with 3 or 4 team members (i.e. four or five team players), the results are largely independent of the noise level within this range.
Table 9.5. Experimental results showing the probability that one of the team members wins the noisy IPD tournament. Results are for different numbers of team members and a range of noise levels. Results are averaged over 1000 tournament runs and the standard error of the mean for each result is ±0.5.

Number of                      Noise level (γ)
team players     0.00      0.05      0.10      0.15      0.20
2                2.8 %     10.6 %    22.4 %    30.0 %    32.6 %
3                3.4 %     81.0 %    80.4 %    81.6 %    70.2 %
4                97.6 %    99.0 %    96.4 %    96.6 %    97.2 %
5                97.4 %    96.6 %    97.2 %    96.6 %    96.8 %
9.7. Competition Entry

The results of the previous sections clearly indicate that there is an advantage to be gained by entering a team of players into the noisy IPD tournament. However, when using these results to actually design the players for the IPD competition entries, a number of additional factors must be considered.

Firstly, in our experimental investigations we have averaged over all possible code words to produce representative results. However, for the competition entry we must actually select two code words: one for the team members and one for the team leader. Whilst the probability of recognising a team player is independent of the choice of code word (this is a property of the codes that are implemented), the probability of successfully discriminating between team and competing players is not. Clearly, if a code word is close (in Hamming distance) to the initial moves of a competing player, those moves are more likely to be corrupted by noise into the code word, so that the competing player is falsely recognised. Thus we must select code words that are most unlike the moves that we expect to observe from competing players. Actually making this choice is complicated by the fact that we do not know the strategies that the competing players will use, and the moves that they make will themselves depend on the actual code words that the team players use. Thus, we again use our test population of eighteen default strategies, and by exhaustive test, we select the two code words which most often lead to the correct recognition of team players and the correct discrimination of competing players.

Secondly, throughout these investigations, we have not considered the possibility of another competing player learning the code words of the team members and then attempting to exploit them. Within our competition entries, we greatly reduce the possibility of this occurring by having each team player monitor the behaviour of their opponent, in order to check that they behave as expected. Thus, if an ordinary team member recognises their opponent to be another ordinary team member, they check that the opponent does in fact cooperate in the subsequent rounds of the game. Should the opponent attempt to defect (with some allowance for the possibility of mis-executed moves), it is assumed that the opponent has been falsely recognised and thus the team member begins to defect to avoid the possibility of being exploited. Given this additional checking, the only possibility of exploitation is that a competing player learns the code word of the team leader, and thus tricks the ordinary team members into allowing themselves to be exploited. However, in the IPD tournament, this is extremely unlikely to occur. The players within the tournament only interact with each other once; thus, whilst a competing player may encounter several ordinary team members, there is little possibility of them learning the code word of the team leader in this single interaction. This is the reason for implementing separate team member and team leader code words.

Finally, we must decide how many team members to submit into the competition. Clearly, our results indicate that the larger the number of players, the better the performance of the team leader. However, typically, this number is limited by the rules of the competition (e.g. the rules of the second IPD tournament capped this number at 20), and thus, we should submit the maximum allowable number of players.

The teams that we entered into the two recent IPD competitions, held at the 2004 IEEE Congress on Evolutionary Computation (CEC’04) and the 2005 IEEE Symposium on Computational Intelligence and Games (CIG’05), followed these guidelines and were successful. In the first competition, we entered several teams that used the single block Hamming code and a range of default strategies for the team leader.
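The opponent check described earlier in this section (continued cooperation after recognition, with some allowance for mis-executed moves) might be sketched as follows. The chapter does not give the rule used in the competition entries, so the sliding window, the tolerance and the class name below are purely illustrative assumptions.

```python
from collections import deque


class CooperationMonitor:
    """Hypothetical check used by a team member after it has recognised
    its opponent as another team member."""

    def __init__(self, window: int = 10, max_defections: int = 3):
        # Both parameters are assumed values, not taken from the chapter.
        self.recent = deque(maxlen=window)  # last `window` opponent moves
        self.max_defections = max_defections
        self.trusted = True

    def observe(self, opponent_move: str) -> bool:
        """Record one opponent move ('C' or 'D') and return whether the
        opponent is still believed to be a genuine team member."""
        self.recent.append(opponent_move)
        if self.recent.count("D") > self.max_defections:
            # More defections than noise plausibly explains: assume a
            # false recognition and never trust this opponent again.
            self.trusted = False
        return self.trusted

    def next_move(self) -> str:
        """Cooperate with a trusted team mate, otherwise defect."""
        return "C" if self.trusted else "D"


if __name__ == "__main__":
    monitor = CooperationMonitor()
    for move in "CCDCCCCDCCDDDD":
        monitor.observe(move)
        print(move, monitor.next_move())
```

A window based tolerance is only one possible design; any rule that separates occasional mis-executed moves from sustained defection would serve the same purpose.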
Whilst a few other researchers entered teams of players, the policy was not widely adopted and the team leader from the largest team won with a clear advantage. In the second round of competitions we entered a single team using the more complex [15,5] BCH coding scheme, and, as in our investigations here, we used tit-for-tat as the default strategy of the team leader. In this competition, separate noise free and noisy IPD tournaments were held, and these tournaments were more competitive since, given the results of the first competition, many more researchers adopted the policy of submitting a team of players. Within the noise free IPD tournament, three of the top four positions were occupied by representatives of different teams. However, within the noisy IPD tournament, our team leader again won with a clear advantage, despite using tit-for-tat as a default strategy. The other teams entered into this tournament performed poorly compared to the noise free IPD tournament. Thus, these results clearly illustrate the advantage that the use of error-correcting codes has yielded by enabling our team players to recognise one another in the noisy environment.

9.8. Conclusions

In this chapter, we presented our investigations into the use of a team of players within an Iterated Prisoner’s Dilemma tournament. We have shown that if the team players are capable of recognising one another, they can condition their actions to increase the probability that one of their members wins the tournament. Since outside means of communication are not available to these players, we have shown that they are able to make use of a covert channel (specifically, a pre-agreed sequence of moves that they make at the start of each interaction) to signal to one another and thus perform this recognition. By carefully considering both the cost and effectiveness of the signalling, we have shown that we can use error correcting codes to optimise the performance of the team and that this coding allows the teams to be extremely effective in the noisy IPD tournament, a noisy environment which initially appears to preclude their use.

Our future work in this area concerns the use of these team players in an evolutionary model of the IPD tournament. That is, rather than the static IPD tournament presented here (where the population of competing players is fixed), we consider a model where the population of competing players evolves over time (i.e. the survival of any individual within the population is dependent on their performance within an IPD tournament held at each generation). Here we are particularly interested in searching for evolutionarily stable strategies (ESS), and thus are interested in whether an explicit team leader is required (or indeed, can even be implemented) and how team players may attempt to exploit other team players to their own advantage.
As such, this work attempts to compare the roles of kin selection and reciprocity for maintaining cooperation in noisy environments.

A.1. Test Population

The test population consists of eighteen players implementing the base strategies used in the original Axelrod competition (e.g. All C, All D, Random and Negative) plus simple strategies that play periodic moves (e.g. periodic CD, CCD and DDC) and state-of-the-art strategies that have been shown to outperform these simple strategies (e.g. Adaptive, Forgiving and Gradual). A full list and description of the strategies adopted by these players is shown in table A.1, and table A.2 shows the results of running noise free and noisy IPD tournaments using just these players. To ensure repeatable results, we run the tournament 1000 times and present the average results. To allow easy comparison with other publications, we normalise the scores by dividing them by the size of the population and the number of rounds in each IPD game (in this case 200). Thus, the values shown are the ranked average pay-off received by the player in each round of the Prisoner’s Dilemma game. Note that in this population, tit-for-tat performs relatively poorly and is easily beaten by a number of strategies. In addition, the scores in the noisy IPD tournament are in general lower than those in the noise free tournament, since it is far harder to ensure mutual cooperation in the presence of accidental defections.
Table A.1. Description of the strategies adopted by the competing players in the test population.

Strategy                  Name    Description
Adaptive                  ADAP    Uses a continuously updated estimate of the opponent player’s propensity to defect to condition future actions [Tzafestas (2000)].
All C                     ALLC    Cooperates continually.
All D                     ALLD    Defects continually.
Forgiving                 FORG    Modified tit-for-tat strategy that attempts to re-establish mutual cooperation after a sequence of mutual defections [O’Riordan (2000)].
Gradual                   GRAD    Modified tit-for-tat strategy that uses progressively longer sequences of defections in retaliation [Beaufils et al. (1997)].
Grim                      GRIM    Cooperates until a strategy defects against it. From that point on defects continually.
Generous Tit-For-Tat      GTFT    Like tit-for-tat but cooperates 1/3 of the times that tit-for-tat would defect [Axelrod and Wu (1995)].
Hard Majority             HMAJ    Plays the majority move of the opponent. On the first move, or when there is a tie, it cooperates.
Negative                  NEG     Plays the negative of the opponent’s last move.
Pavlov                    PAVL    Plays win-stay, lose-shift [Nowak and Sigmund (1993)].
Periodic CD               PCD     Plays ‘cooperate, defect’ periodically.
Periodic CCD              PCCD    Plays ‘cooperate, cooperate, defect’ periodically.
Periodic DDC              PDDC    Plays ‘defect, defect, cooperate’ periodically.
Random                    RAND    Cooperates and defects at random.
Suspicious Tit-For-Tat    STFT    Identical to tit-for-tat but starts by defecting.
Soft Majority             SMAJ    Plays the majority move of the opponent. On the first move, or when there is a tie, it defects.
Tit-For-Tat               TFT     Starts by cooperating and then plays the last move of the opponent.
Tit-For-Two-Tats          TFTT    Like tit-for-tat but only defects after two consecutive defections against it.
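Several of the simpler strategies in table A.1 can be written in a few lines, which also makes the effect of mis-executed moves easy to explore. The sketch below is a minimal illustration and not the authors' tournament code: the pay-off values (5, 3, 1, 0) are the conventional Prisoner's Dilemma values and are an assumption here, as are the function names, the choice that Pavlov opens by cooperating, and the choice that each player sees the moves as actually executed.

```python
import random

C, D = "C", "D"
# Assumed conventional pay-offs: (my move, opponent move) -> (mine, theirs).
PAYOFF = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}


def all_c(own, opp):
    return C


def all_d(own, opp):
    return D


def tit_for_tat(own, opp):
    return opp[-1] if opp else C


def suspicious_tit_for_tat(own, opp):
    return opp[-1] if opp else D


def tit_for_two_tats(own, opp):
    return D if opp[-2:] == [D, D] else C


def grim(own, opp):
    return D if D in opp else C


def pavlov(own, opp):
    if not own:
        return C  # assumed opening move
    if opp[-1] == C:
        return own[-1]  # win: stay with the previous move
    return C if own[-1] == D else D  # lose: shift


def play(strategy_a, strategy_b, rounds=200, gamma=0.0, rng=None):
    """Play one IPD game; each intended move is flipped with probability gamma."""
    rng = rng or random.Random(0)
    history_a, history_b = [], []
    score_a = score_b = 0
    for _ in range(rounds):
        move_a = strategy_a(history_a, history_b)
        move_b = strategy_b(history_b, history_a)
        if rng.random() < gamma:  # mis-executed move by player a
            move_a = C if move_a == D else D
        if rng.random() < gamma:  # mis-executed move by player b
            move_b = C if move_b == D else D
        pay_a, pay_b = PAYOFF[(move_a, move_b)]
        score_a += pay_a
        score_b += pay_b
        history_a.append(move_a)
        history_b.append(move_b)
    return score_a, score_b


if __name__ == "__main__":
    print(play(tit_for_tat, tit_for_tat))             # noise free game
    print(play(tit_for_tat, tit_for_tat, gamma=0.1))  # noisy game
```

Comparing the two printed results for a pair of tit-for-tat players gives a feel for why scores in the noisy tournament of table A.2 are lower than in the noise free tournament.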
Table A.2. Reference performance of the test population in the (a) noise free and (b) noisy IPD tournament. Results are averaged over 1000 repeated tournaments and the standard error of the mean is ±0.002.

        (a)                      (b)
Strategy    Score        Strategy    Score
ADAP        2.888        GRAD        2.410
GRAD        2.860        ADAP        2.329
GRIM        2.773        GRIM        2.297
TFT         2.647        SMAJ        2.292
FORG        2.627        ALLD        2.278
GTFT        2.591        TFT         2.245
SMAJ        2.575        FORG        2.211
TFTT        2.544        TFTT        2.204
PAVL        2.390        GTFT        2.198
ALLC        2.332        PCCD        2.185
PCD         2.279        PCD         2.179
HMAJ        2.277        STFT        2.155
STFT        2.233        RAND        2.143
PCCD        2.190        PAVL        2.140
ALLD        2.175        HMAJ        2.134
RAND        2.114        NEG         2.112
NEG         2.111        PDDC        2.110
PDDC        2.081        ALLC        2.043
References

Axelrod, R. (1984). The Evolution of Cooperation (Basic Books).
Axelrod, R. (1997). The Complexity of Cooperation (Princeton University Press).
Axelrod, R. and Hamilton, W. D. (1981). The evolution of cooperation, Science 211, pp. 1390–1396.
Axelrod, R. and Wu, J. (1995). How to cope with noise in the iterated prisoner’s dilemma, Journal of Conflict Resolution 39, 1, pp. 183–189.
Beaufils, B., Delahaye, J. P. and Mathieu, P. (1997). Our meeting with gradual: A good strategy for the iterated prisoner’s dilemma, in Proceedings of the Fifth International Workshop on the Synthesis and Simulation of Living Systems (MIT Press), pp. 202–212.
Delahaye, J. P. and Mathieu, P. (1993). L’altruisme perfectionné, Pour la Science 187, pp. 102–107.
Hamilton, W. D. (1963). The evolution of altruistic behaviour, Am. Nat. 97, pp. 354–356.
Hamilton, W. D. (1964). The genetical evolution of social behaviour, J. Theor. Biol. 7, pp. 1–16.
MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms (Cambridge University Press).
Nowak, M. and Sigmund, K. (1993). A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game, Nature 364, pp. 56–58.
O’Riordan, C. (2000). A forgiving strategy for the iterated prisoner’s dilemma, Journal of Artificial Societies and Social Simulation 3, 4, pp. 56–58.
Peterson, W. W. and Weldon, E. J. (1972). Error-Correcting Codes (MIT Press).
Shannon, C. E. (1948). A mathematical theory of communication, The Bell System Technical Journal 27, pp. 379–423, 623–656.
Trivers, R. (1971). The evolution of reciprocal altruism, Quarterly Review of Biology 46, pp. 35–57.
Tzafestas, E. S. (2000). Toward adaptive cooperative behavior, in Proceedings of the Sixth International Conference on the Simulation of Adaptive Behavior (SAB-2000), Vol. 2, pp. 334–340.
Chapter 10 Is it Accidental or Intentional? A Symbolic Approach to the Noisy Iterated Prisoner’s Dilemma
Tsz-Chiu Au, Dana Nau University of Maryland
10.1. Introduction

The Iterated Prisoner’s Dilemma (IPD) has become well known as an abstract model of a class of multi-agent environments in which agents accumulate payoffs that depend on how successful they are in their repeated interactions with other agents. An important variant of the IPD is the Noisy IPD, in which there is a small probability, called the noise level, that accidents will occur. In other words, the noise level is the probability of executing “cooperate” when “defect” was the intended move, or vice versa.

Accidents can make cooperation with others difficult in real-life situations, and the same is true in the Noisy IPD. Strategies that do quite well in the ordinary (non-noisy) IPD may do quite badly in the Noisy IPD [Axelrod and Dion (1988); Bendor (1987); Bendor et al. (1991); Molander (1985); Mueller (1987); Nowak and Sigmund (1990)]. For example, if two players both use the well-known Tit-For-Tat (TFT) strategy, then an accidental defection may cause a long series of defections by both players as each of them punishes the other for defecting.

This chapter reports on a strategy called the Derived Belief Strategy (DBS), which was the best-performing non-master-slave strategy in Category 2 (noisy environments) of the 2005 Iterated Prisoner’s Dilemma competition (see Table 10.1). Like most opponent-modeling techniques, DBS attempts to learn a model of the other player’s strategy (i.e., the opponent model∗) during the games. Our main innovation involves how to reason about noise using the opponent model. The key idea used in DBS is something that we call symbolic noise detection: the use of the other player’s deterministic behavior to tell whether an action has been affected by noise. More precisely, DBS builds a symbolic model of how the other player behaves, and watches for any deviation from this model. If the other player’s next move is inconsistent with its past behavior, this inconsistency can be due either to noise or to a genuine change in its behavior; and DBS can often distinguish between these two cases by waiting to see whether this inconsistency persists in the next few iterations of the game.†

∗ The term “opponent model” appears to be the most common term for a model of the other player, even though this player is not necessarily an “opponent” (since the IPD is not zero-sum).

Table 10.1. Scores of the best programs in Competition 2 (IPD with Noise). The table shows each program’s average score for each run and its overall average over all five runs. The competition included 165 programs, but we have listed only the top 25.

Rank  Program              Author         Run1   Run2   Run3   Run4   Run5   Avg.
   1  BWIN                 P. Vytelingum  441.7  431.7  427.1  434.8  433.5  433.8
   2  IMM01                J.W. Li        424.7  414.6  414.7  409.1  407.5  414.1
   3  DBSz                 T.C. Au        411.7  405.0  406.5  407.7  409.2  408.0
   4  DBSy                 T.C. Au        411.9  407.5  407.9  407.0  405.5  408.0
   5  DBSpl                T.C. Au        409.5  403.8  411.4  403.9  409.1  407.5
   6  DBSx                 T.C. Au        401.9  410.5  407.7  408.4  404.4  406.6
   7  DBSf                 T.C. Au        399.2  402.2  405.2  398.9  404.4  402.0
   8  DBStft               T.C. Au        398.4  394.3  402.1  406.7  407.3  401.8
   9  DBSd                 T.C. Au        406.0  396.0  399.1  401.8  401.5  400.9
  10  lowESTFT classic     M. Filzmoser   391.6  395.8  405.9  393.2  399.4  397.2
  11  TFTIm                T.C. Au        399.0  398.8  395.0  396.7  395.3  397.0
  12  Mod                  P. Hingston    394.8  394.2  407.8  394.1  393.7  396.9
  13  TFTIz                T.C. Au        397.7  396.1  390.7  392.1  400.6  395.5
  14  TFTIc                T.C. Au        400.1  401.0  389.5  388.9  389.2  393.7
  15  DBSe                 T.C. Au        396.9  386.8  396.7  394.5  393.7  393.7
  16  TTFT                 L. Clement     389.1  395.8  394.1  393.4  394.7  393.4
  17  TFTIa                T.C. Au        389.5  394.4  395.1  389.6  397.7  393.3
  18  TFTIb                T.C. Au        391.7  390.0  390.5  401.0  392.4  393.1
  19  TFTIx                T.C. Au        398.3  391.3  390.8  391.0  393.7  393.0
  20  mediumESTFT classic  M. Filzmoser   396.7  392.6  398.3  390.8  386.0  392.9
  21  TFTIy                T.C. Au        391.7  394.6  390.8  392.1  394.9  392.8
  22  TFTId                T.C. Au        395.6  393.1  388.8  385.7  391.3  390.9
  23  TFTIe                T.C. Au        396.7  391.1  385.2  388.2  393.5  390.9
  24  DBSb                 T.C. Au        393.2  386.1  392.6  391.1  391.0  390.8
  25  T4T                  D. Fogel       391.5  387.6  400.4  387.3  383.5  390.0

Of the nine different versions of DBS that we entered into the competition, all placed in the top 25, and seven placed among the top ten (see Table 10.1). Our best version, DBSz, placed third; and the two players that placed higher were both masters of master-and-slave teams.

DBS operates in a distinctly different way from the master-and-slaves strategy used by several other entrants in the competition. Each participant in the competition was allowed to submit up to 20 programs as contestants. Some participants took advantage of this to submit collections of programs that worked together in a conspiracy in which 19 of their 20 programs (the “slaves”) worked to give as many points as possible to the 20th program (the “master”). DBS does not use a master-and-slaves strategy, nor does it conspire with other programs in any other way. Nonetheless, DBS remained competitive with the master-and-slaves strategies in the competition, and performed much better than the master-and-slaves strategies if the score of each master is averaged with the scores of its slaves. Furthermore, a more extensive analysis [Au and Nau (2005)] shows that if each master-and-slaves team had been limited to 10 programs or fewer, DBS would have placed first in the competition.

10.2. Motivation and Approach

The techniques used in DBS are motivated by a British army officer’s story that was quoted in (Axelrod, 1997, page 40):

    I was having tea with A Company when we heard a lot of shouting and went out to investigate. We found our men and the Germans standing on their respective parapets. Suddenly a salvo arrived but did no damage. Naturally both sides got down and our men started swearing at the Germans, when all at once a brave German got onto his parapet and shouted out: “We are very sorry about that; we hope no one was hurt. It is not our fault. It is that damned Prussian artillery.” (Rutter 1934, 29)
Such an apology was an effective way of resolving the conflict and preventing a retaliation because it told the British that the salvo was not the intention of the German infantry, but instead was an unfortunate accident that the German infantry neither expected nor desired. The apology was convincing because it was consistent with the German infantry’s past behavior. The British had ample evidence to believe that the German infantry wanted to keep the peace just as much as the British infantry did.

† An iteration has also been called a period or a round by some authors.

More generally, an important question for conflict prevention in noisy environments is whether an act of misconduct is intentional or accidental. A deviation from the usual course of action in a noisy environment can be explained in either way. If we form the wrong belief about which explanation is correct, our response may potentially destroy our long-term relationship with the other player. If we ground our belief on evidence accumulated before and after the incident, we should be in a better position to identify the true cause and prescribe an appropriate solution. To accomplish this, DBS uses the following key techniques:

(1) Learning about the other player’s strategy. DBS uses an induction technique to identify a set of rules that model the other player’s recent behavior. The rules give the probability that the player will cooperate under different situations. As DBS learns these probabilities during the game, it identifies a set of deterministic rules that have either 0 or 1 as the probability of cooperation.

(2) Detecting noise. DBS uses the above rules to detect anomalies that may be due either to noise or to a genuine change in the other player’s behavior. If a move is different from what the deterministic rules predict, this inconsistency triggers an evidence-collection process that monitors the persistence of the inconsistency in the next few iterations of the game. The purpose of the evidence-collection process is to determine whether the violation is likely to be due to noise or to a change in the other player’s policy. If the inconsistency does not persist, DBS asserts that the deviation is due to noise; if the inconsistency persists, DBS assumes there is a change in the other player’s behavior.

(3) Temporarily tolerating possible misbehaviors by the other player. Until the evidence-collection process finishes, DBS assumes that the other player’s behavior is still as described by the deterministic rules. Once the evidence-collection process has finished, DBS decides whether to believe the other player’s behavior has changed, and updates the deterministic rules accordingly.

Since DBS emphasizes the use of deterministic behaviors to distinguish noise from a change in the other player’s behavior, it works well when the other player uses a pure (i.e., deterministic) strategy or a strategy that makes decisions deterministically most of the time. Fortunately, deterministic behaviors are abundant in the Iterated Prisoner’s Dilemma. Many well-known strategies, such as TFT and GRIM, are pure strategies. Some strategies such as Pavlov or the Win-Stay, Lose-Shift strategy (WSLS) [Kraines and Kraines (1989, 1993, 1995); Nowak and Sigmund (1993)] are not pure strategies, but a large part of their behavior is still deterministic. The reason for the prevalence of determinism is discussed by Axelrod in [Axelrod (1984)]: clarity of behavior is an important ingredient of long-term cooperation. A strategy such as TFT benefits from its clarity of behavior, because it allows other players to make credible predictions of TFT’s responses to their actions. We believe our strategy succeeded in the competition because this clarity of behavior also helps us to fend off noise.

The results of the competition show that the techniques used in DBS are indeed an effective way to fend off noise and maintain cooperation in noisy environments. When DBS defers judgment about whether the other player’s behavior has changed, the potential cost is that DBS may not be able to respond to a genuine change of the other player’s behavior as quickly as possible, thus losing a few points by not retaliating immediately. But this delay is only temporary, and after it DBS will adapt to the new behavior. More importantly, the techniques used in DBS greatly reduce the probability that noise will cause it to end cooperation and fall into a mutual-defection situation. Our experience has been that it is hard to reestablish cooperation from a mutual-defection situation, so it is better to avoid getting into mutual-defection situations in the first place. Compared with the potential cost of ending cooperation, the cost of temporarily tolerating some defections is worthwhile.

Temporary tolerance also benefits us in another way. In the noisy Iterated Prisoner’s Dilemma, there are two types of noise: one that affects the other player’s move, and one that affects our move. While our method effectively handles the first type of noise, it is the other player’s job to deal with the second type of noise. Some players such as TFT are easily provoked by the second type of noise and retaliate immediately. Fortunately, if the retaliation is not a permanent one, our method will treat the retaliation in the same way as the first type of noise, thus minimizing its effect.
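The way a single accident provokes TFT into retaliation can be reproduced with a small simulation. The following is only an illustrative sketch, not code from the competition: the function names and the `accidents` parameter are hypothetical, the payoffs are the competition values 5, 3, 1 and 0, and the accident is injected at a fixed iteration rather than drawn at random so that the outcome is reproducible.

```python
# Payoff to the player whose move is listed first: u_DC=5, u_CC=3, u_DD=1, u_CD=0.
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(observed_other_moves):
    """Cooperate first, then repeat the other player's last observed move."""
    return observed_other_moves[-1] if observed_other_moves else "C"


def flip(move):
    """Return the opposite move (a mis-implementation)."""
    return "D" if move == "C" else "C"


def play(strategy1, strategy2, iterations, accidents=frozenset()):
    """Play a match; `accidents` holds (iteration, player) pairs whose move is flipped."""
    executed1, executed2 = [], []
    score1 = score2 = 0
    for k in range(1, iterations + 1):
        a = strategy1(executed2)  # each player sees only the other's executed moves
        b = strategy2(executed1)
        if (k, 1) in accidents:
            a = flip(a)
        if (k, 2) in accidents:
            b = flip(b)
        executed1.append(a)
        executed2.append(b)
        score1 += PAYOFF[(a, b)]
        score2 += PAYOFF[(b, a)]
    return list(zip(executed1, executed2)), score1, score2


clean = play(tit_for_tat, tit_for_tat, 8)
noisy = play(tit_for_tat, tit_for_tat, 8, accidents={(3, 1)})
print(clean[1:], noisy[1:])  # scores without and with one accident
print(noisy[0])              # interactions alternate (D, C), (C, D) after the accident
```

In this sketch one accident in iteration 3 locks the two TFT players into alternating retaliation for the rest of the match, and both end with a lower score than under uninterrupted cooperation.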
10.3. Iterated Prisoner’s Dilemma with Noise

In the Iterated Prisoner’s Dilemma, two players play a finite sequence of classical prisoner’s dilemma games, whose payoff matrix is:

                                Player 2
                       Cooperate         Defect
Player 1   Cooperate   (u_CC, u_CC)      (u_CD, u_DC)
           Defect      (u_DC, u_CD)      (u_DD, u_DD)

where $u_{DC} > u_{CC} > u_{DD} > u_{CD}$ and $2u_{CC} > u_{DC} + u_{CD}$. In the competition, $u_{DC}$, $u_{CC}$, $u_{DD}$ and $u_{CD}$ are 5, 3, 1 and 0, respectively.

At the beginning of the game, each player knows nothing about the other player and does not know how many iterations it will play. In each iteration, each player chooses either to cooperate (C) or defect (D), and their payoffs in that iteration are as shown in the payoff matrix. We call this decision a move or an action. After both players choose a move, they will each be informed of the other player’s move before the next iteration begins.

If $a_k, b_k \in \{C, D\}$ are the moves of Player 1 and Player 2 in iteration $k$, then we say that $(a_k, b_k)$ is the interaction of iteration $k$. If there are $N$ iterations in a game, then the total scores for Player 1 and Player 2 are $\sum_{1 \le k \le N} u_{a_k b_k}$ and $\sum_{1 \le k \le N} u_{b_k a_k}$, respectively.

The Noisy Iterated Prisoner’s Dilemma is a variant of the Iterated Prisoner’s Dilemma in which there is a small probability that a player’s moves will be mis-implemented. The probability is called the noise level.‡ In other words, the noise level is the probability of executing C when D was the intended move, or vice versa. The incorrect move is recorded as the player’s move, and determines the interaction of the iteration.§ Furthermore, neither player has any way of knowing whether the other player’s move was executed correctly or incorrectly.¶

For example, suppose Player 1 chooses C and Player 2 chooses D in iteration $k$, and noise occurs and affects Player 1’s move. Then the interaction of iteration $k$ is (D, D). However, since neither player knows that Player 1’s move has been changed by noise, Player 1 and Player 2 perceive the interaction differently: for Player 1, the interaction is (C, D), but for Player 2, the interaction is (D, D). As in real life, this misunderstanding would become an obstacle in establishing and maintaining cooperation between the players.

‡ The noise level in the competition was 0.1.
§ Hence, a mis-implementation is different from a misperception, which would not change the interaction of the iteration. The competition included mis-implementations but no misperceptions.
¶ As far as we know, the definitions of “mis-implementation” used in the existing literature are ambiguous about whether either of the players should know that an action has been mis-executed.

10.4. Strategies, Policies, and Hypothesized Policies

A history $H$ of length $k$ is the sequence of interactions of all iterations up to and including iteration $k$. We write $H = \langle (a_1, b_1), (a_2, b_2), \ldots, (a_k, b_k) \rangle$. Let $\mathcal{H} = \langle (C, C), (C, D), (D, C), (D, D) \rangle^*$ be the set of all possible histories. A strategy $M : \mathcal{H} \to [0, 1]$ associates with each history $H$ a real number called the degree of cooperation. $M(H)$ is the probability that $M$ chooses to cooperate at iteration $k + 1$, where $k = |H|$ is $H$’s length.

For example, TFT can be considered as a function $M_{TFT}$, such that (1) $M_{TFT}(H) = 1.0$ if $k = 0$ or $a_k = C$ (where $k = |H|$), and (2) $M_{TFT}(H) = 0.0$ otherwise. Tit-for-Two-Tats (TFTT), which is like TFT except that it defects only after it receives two consecutive defections, can be considered as a function $M_{TFTT}$, such that (1) $M_{TFTT}(H) = 0.0$ if $k \ge 2$ and $a_{k-1} = a_k = D$, and (2) $M_{TFTT}(H) = 1.0$ otherwise.

We can model a strategy as a policy. A condition $Cond : \mathcal{H} \to \{True, False\}$ is a mapping from histories to boolean values. A history $H$ satisfies a condition $Cond$ if and only if $Cond(H) = True$. A policy schema $\Omega$ is a set of conditions such that each history in $\mathcal{H}$ satisfies exactly one of the conditions in $\Omega$. A rule is a pair $(Cond, p)$, which we will write as $Cond \to p$, where $Cond$ is a condition and $p$ is a degree of cooperation (a real number in $[0, 1]$). A rule is deterministic if $p$ is either 0.0 or 1.0; otherwise, the rule is probabilistic. In this paper, we define a policy to be a set of rules whose conditions constitute a policy schema.

$M_{TFT}$ can be modeled as a policy as follows: we define $Cond_{a,b}$ to be a condition about the interaction of the last iteration of a history, such that $Cond_{a,b}(H) = True$ if and only if (1) $k \ge 1$, $a_k = a$ and $b_k = b$ (where $k = |H|$), or (2) $k = 0$ and $a = b = C$. For simplicity, we also write $Cond_{a,b}$ as $(a, b)$. The policy for $M_{TFT}$ is $\pi_{TFT} = \{(C, C) \to 1.0, (C, D) \to 1.0, (D, C) \to 0.0, (D, D) \to 0.0\}$.
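The definitions above translate almost directly into code. The sketch below is illustrative only: the function names are hypothetical, a history is represented as a list of (a, b) pairs, and the check enumerates histories only up to a bounded length, so it is a sanity test rather than a proof that a policy matches a strategy on every history.

```python
from itertools import product

INTERACTIONS = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]


def m_tft(history):
    """M_TFT: cooperate if the history is empty or the last move a_k was C."""
    return 1.0 if not history or history[-1][0] == "C" else 0.0


def m_tftt(history):
    """M_TFTT: defect only after two consecutive defections a_{k-1} = a_k = D."""
    if len(history) >= 2 and history[-2][0] == "D" and history[-1][0] == "D":
        return 0.0
    return 1.0


# pi_TFT from the text, keyed by the condition (a, b) on the last interaction.
PI_TFT = {("C", "C"): 1.0, ("C", "D"): 1.0, ("D", "C"): 0.0, ("D", "D"): 0.0}


def apply_policy(policy, history):
    """pi(H): degree of cooperation of the unique rule whose condition H satisfies."""
    condition = history[-1] if history else ("C", "C")  # k = 0 is treated as (C, C)
    return policy[condition]


def agrees_up_to(policy, strategy, max_length):
    """Check pi(H) == M(H) on every history of length <= max_length."""
    return all(
        apply_policy(policy, list(h)) == strategy(list(h))
        for n in range(max_length + 1)
        for h in product(INTERACTIONS, repeat=n)
    )


print(agrees_up_to(PI_TFT, m_tft, 4))   # pi_TFT reproduces M_TFT on these histories
print(agrees_up_to(PI_TFT, m_tftt, 4))  # but it does not reproduce M_TFTT
```

The second check fails already on the one-step history ⟨(D, C)⟩, where the policy predicts defection but TFTT still cooperates.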
Notice that the policy schema for $\pi_{TFT}$ is $\Omega = \{(C, C), (C, D), (D, C), (D, D)\}$.

Given a policy $\pi$ and a history $H$, there is one and only one rule $Cond \to p$ in $\pi$ such that $Cond(H) = True$. We write $p$ as $\pi(H)$. A policy $\pi$ is complete for a strategy $M$ if and only if $\pi(H) = M(H)$ for any $H \in \mathcal{H}$. In other words, a complete policy for a strategy is one that completely models the strategy. For instance, $\pi_{TFT}$ is a complete policy for $M_{TFT}$.

Some strategies are much more complicated than TFT; we need a large number of rules in order to completely model these strategies. If the number of iterations is small and the strategy is complicated enough, it is difficult or impossible for DBS to obtain a complete model of the other player’s strategy. Therefore, DBS does not aim at obtaining a complete policy of the other player’s strategy; instead, DBS learns an approximation of the other player’s strategy during a game, using a small number of rules. In order to distinguish this approximation from the complete policies for a strategy, we call this approximation a hypothesized policy.

Given a policy schema $\Omega$, DBS constructs a hypothesized policy $\pi$ whose policy schema is $\Omega$. The degrees of cooperation of the rules in $\pi$ are estimated by a learning function (e.g., the learning methods in Section 10.6), which computes the degrees of cooperation according to the current history. For example, suppose the other player’s strategy is $M_{TFTT}$, the given policy schema is $\Omega = \{(C, C), (C, D), (D, C), (D, D)\}$, and the current history is $H = \langle (C, C), (D, C), (C, C), (D, C), (D, C), (D, D), (C, D), (C, C) \rangle$. If we use a learning method that computes the degrees of cooperation by averaging the number of times the next action is C when a condition holds, then the hypothesized policy is $\pi = \{(C, C) \to 1.0, (C, D) \to 1.0, (D, C) \to 0.66, (D, D) \to 0.0\}$. Notice that the rule $(D, C) \to 0.66$ does not accurately model $M_{TFTT}$; this probabilistic rule is just an approximation of what $M_{TFTT}$ does when the condition $(D, C)$ holds. This approximation is inaccurate as long as the policy schema contains $(D, C)$: there is no complete policy for $M_{TFTT}$ whose policy schema contains $(D, C)$. If we want to model $M_{TFTT}$ correctly, we need a different policy schema that allows us to specify more complicated rules.

We interpret a hypothesized policy as a belief of what the other player will do in the next few iterations in response to our next few moves. This belief does not necessarily hold in the long run, since the other player can behave differently at different times in a game.
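The averaging example for TFTT can be reproduced with a few lines of code. This is a minimal sketch of plain (undiscounted) averaging only, not the learning method DBS actually uses; the function names are hypothetical, and the empty history is treated as satisfying the condition (C, C), following the definition of $Cond_{a,b}$.

```python
from collections import defaultdict


def last_interaction(prefix):
    """Condition satisfied by a history prefix; the empty prefix counts as (C, C)."""
    return prefix[-1] if prefix else ("C", "C")


def averaged_policy(history):
    """For each condition, the fraction of times the other player's next move was C."""
    seen = defaultdict(int)
    cooperated = defaultdict(int)
    for j, (_, b_next) in enumerate(history):
        cond = last_interaction(history[:j])  # condition holding before this move
        seen[cond] += 1
        cooperated[cond] += b_next == "C"
    return {cond: cooperated[cond] / seen[cond] for cond in seen}


H = [("C", "C"), ("D", "C"), ("C", "C"), ("D", "C"),
     ("D", "C"), ("D", "D"), ("C", "D"), ("C", "C")]
policy = averaged_policy(H)
for cond in sorted(policy):
    print(cond, round(policy[cond], 2))
```

The condition (D, C) is followed by C twice and by D once in this history, which gives the value 2/3 that the text reports as 0.66.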
Even worse, there is no guarantee that the belief expressed by a hypothesized policy is true even in the next few iterations. Nonetheless, hypothesized policies constructed by DBS usually have a high degree of accuracy in predicting what the other player will do.

This belief is subjective: it depends on the choice of the policy schema and the learning function. We formally define this subjective viewpoint as follows. The hypothesized policy space spanned by a policy schema $\Omega$ and a learning function $L : \Omega \times \mathcal{H} \to [0, 1]$ is a set of policies $\Pi = \{\pi(H) : H \in \mathcal{H}\}$, where $\pi(H) = \{Cond \to L(Cond, H) : Cond \in \Omega\}$. Let $H$ be a history of a game in which the other player’s strategy is $M$. The set of all possible hypothesized policies for $M$ in this game is $\{\pi(H_k) : H_k \in prefixes(H)\} \subseteq \Pi$, where $prefixes(H)$ is the set of all prefixes of $H$, and $H_k$ is the prefix of length $k$ of $H$. We say $\pi(H_k)$ is the current hypothesized policy of $M$ in iteration $k$. A rule $Cond \to p$ in $\pi(H_k)$ describes a particular behavior of the other player’s strategy in iteration $k$. The behavior is deterministic if $p$ is either zero or one; otherwise, the behavior is random or probabilistic. If $\pi(H_k) \neq \pi(H_{k+1})$, we say there is a change of the hypothesized policy in iteration $k + 1$, and the behaviors described by the rules in $\pi(H_k) \setminus \pi(H_{k+1})$ have changed.

10.5. Derived Belief Strategy

In the ordinary Iterated Prisoner’s Dilemma (i.e., without any noise), if we know the other player’s strategy and how many iterations there are in a game, we can compute an optimal strategy against the other player by trying every possible sequence of moves to see which sequence yields the highest score, assuming we have sufficient computational power. However, we are missing both pieces of information, so it is impossible for us to compute an optimal strategy, even with sufficient computing resources. Therefore, we can at most predict the other player’s moves based on the history of a game, subject to the fact that the game may terminate at any time.

Some strategies for the Iterated Prisoner’s Dilemma do not predict the other player’s moves at all. For example, Tit-for-Tat and GRIM react deterministically to the other player’s previous moves according to fixed sets of rules, no matter how the other player actually plays. Many strategies adapt to the other player’s strategy over the course of the game: for example, Pavlov [Kraines and Kraines (1989)] adjusts its degree of cooperation according to the history of a game. However, these strategies do not take any prior information about the other player’s strategy as an input; thus they are unable to make use of this important piece of information even when it is available.

Let us consider a class of strategies that make use of a model of the other player’s strategy to make decisions.
Figure 10.1 shows an abstract representation of these strategies. Initially, these strategies start out by assuming that the other player’s strategy is TFT or some other strategy. In every iteration of the game, the model is updated according to the current history (using UpdateModel). Each of these strategies decides which move to make in each iteration using a move generator (GenerateMove), which depends on the current model of the other player’s strategy in that iteration.

Procedure StrategyUsingModelOfTheOtherPlayer()
    π ← InitialModel()          // the current model of the other player
    H ← ∅                       // the current history
    a ← GenerateMove(π, H)      // the initial move
    Loop until the end of the game
        Output our move a and obtain the other player’s move b
        H ← ⟨H, (a, b)⟩
        π ← UpdateModel(π, H)
        a ← GenerateMove(π, H)
    End Loop

Fig. 10.1. An abstract representation of a class of strategies that generate moves using a model of the other player.

DBS belongs to this class of strategies. DBS maintains a model of the other player in the form of a hypothesized policy throughout a game, and makes decisions based on this hypothesized policy. The key issue for DBS in this process is how to maintain a good approximation of the other player’s strategy, even though some actions in the history are affected by noise. A good approximation will increase the quality of moves generated by DBS, since the move generator in DBS depends on an accurate model of the other player’s behavior.

The approach DBS uses to minimize the effect of noise on the hypothesized policy has been discussed in Section 10.2: temporarily tolerate possible misbehaviors by the other player, and then update the hypothesized policy only if DBS believes that the misbehavior is due to a genuine change of behavior. Figure 10.2 shows an outline of the implementation of this approach in DBS. As we can see, DBS does not maintain the hypothesized policy explicitly; instead, DBS maintains three sets of rules: the default rule set ($R_d$), the current rule set ($R_c$), and the probabilistic rule set ($R_p$). DBS combines these rule sets to form a hypothesized policy for move generation. In addition, DBS maintains several auxiliary variables (promotion counts and violation counts) to facilitate the update of these rule sets. We will explain every line in Figure 10.2 in detail in the next section.

10.6. Learning Hypothesized Policies in Noisy Environments

We will describe how DBS learns and maintains a hypothesized policy for the other player’s strategy in this section. Section 10.6.1 describes how DBS uses discounted frequencies for each behavior to estimate the degree of
cooperation of each rule in the hypothesized policy. Section 10.6.2 explains why using discounted frequencies alone is not sufficient for constructing an accurate model of the other player’s strategy in the presence of noise, and how symbolic noise detection and temporary tolerance can help overcome the difficulty in using discounted frequencies alone. Section 10.6.3 presents the induction technique DBS uses to identify deterministic behaviors in the other player. Section 10.6.4 illustrates how DBS defers judgment about whether an anomaly is due to noise. Section 10.6.5 discusses how DBS updates the hypothesized policy when it detects a change of behavior.

Procedure DerivedBeliefStrategy()
 1. R_d ← π_TFT                 // the default rule set
 2. R_c ← ∅                     // the current rule set
 3. a_0 ← C ; b_0 ← C ; H ← ⟨(a_0, b_0)⟩ ; π = R_d ; k ← 1 ; v ← 0
 4. a_1 ← MoveGen(π, H)
 5. Loop until the end of the game
 6.     Output a_k and obtain the other player’s move b_k
 7.     r+ ← ((a_{k−1}, b_{k−1}) → b_k)
 8.     r− ← ((a_{k−1}, b_{k−1}) → ({C, D} \ {b_k}))
 9.     If r+, r− ∉ R_c, then
10.         If ShouldPromote(r+) = true, then insert r+ into R_c.
11.     If r+ ∈ R_c, then set the violation count of r+ to zero
12.     If r− ∈ R_c and ShouldDemote(r−) = true, then
13.         R_d ← R_c ∪ R_d ; R_c ← ∅ ; v ← 0
14.     If r− ∈ R_d, then v ← v + 1
15.     If v > RejectThreshold, or (r+ ∈ R_c and r− ∈ R_d), then
16.         R_d ← ∅ ; v ← 0
17.     R_p ← {(Cond → p′) ∈ ψ_{k+1} : Cond does not appear in R_c or R_d}
18.     π ← R_c ∪ R_d ∪ R_p     // construct a hypothesized policy
19.     H ← ⟨H, (a_k, b_k)⟩ ; a_{k+1} ← MoveGen(π, H) ; k ← k + 1
20. End Loop

Fig. 10.2. An outline of the DBS strategy. ShouldPromote first increases r+’s promotion count, and then if r+’s promotion count exceeds the promotion threshold, ShouldPromote returns true and resets r+’s promotion count. Likewise, ShouldDemote first increases r−’s violation count, and then if r−’s violation count exceeds the violation threshold, ShouldDemote returns true and resets r−’s violation count. R_p in Line 17 is the probabilistic rule set; ψ_{k+1} in Line 17 is calculated from Equation 10.1.
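The counting behavior that the caption of Fig. 10.2 attributes to ShouldPromote and ShouldDemote can be sketched as one small helper. This is an assumption-laden illustration rather than the authors’ implementation: the class name, the representation of a rule as a tuple, and the threshold values are all hypothetical.

```python
from collections import defaultdict


class ThresholdCounter:
    """Counts events per rule and fires once the count exceeds a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.counts = defaultdict(int)

    def should_fire(self, rule):
        """Increase the rule's count; if it exceeds the threshold, reset it and return True."""
        self.counts[rule] += 1
        if self.counts[rule] > self.threshold:
            self.counts[rule] = 0
            return True
        return False

    def reset(self, rule):
        """Set the rule's count back to zero (as Line 11 does for violation counts)."""
        self.counts[rule] = 0


# Hypothetical threshold values, chosen only for illustration.
promotions = ThresholdCounter(threshold=2)   # plays the role of ShouldPromote
violations = ThresholdCounter(threshold=1)   # plays the role of ShouldDemote

rule = (("C", "C"), "C")  # the rule (C, C) -> C
results = [promotions.should_fire(rule) for _ in range(4)]
print(results)
```

Keeping one counter object per kind of count mirrors the separate promotion counts and violation counts mentioned in Section 10.5.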
10.6.1. Learning by Discounted Frequencies

We now describe a simple way to estimate the degree of cooperation of the rules in the hypothesized policy. The idea is to maintain a discounted frequency for each behavior: instead of keeping an ordinary frequency count of how often the other player cooperated under a condition in the past, DBS applies discount factors based on how recent each occurrence of the behavior was.

Given a history $H = \langle (a_1, b_1), (a_2, b_2), \ldots, (a_k, b_k) \rangle$, a real number $\alpha$ between 0 and 1 (called the discount factor), and an initial hypothesized policy $\pi_0 = \{Cond_1 \to p_1^0, Cond_2 \to p_2^0, \ldots, Cond_n \to p_n^0\}$ whose policy schema is $\mathcal{C} = \{Cond_1, Cond_2, \ldots, Cond_n\}$, the probabilistic policy at iteration $k + 1$ is

\[ \psi_{k+1} = \{Cond_1 \to p_1^{k+1}, Cond_2 \to p_2^{k+1}, \ldots, Cond_n \to p_n^{k+1}\}, \]

where $p_i^{k+1}$ is computed by the following equation:

\[ p_i^{k+1} = \frac{\sum_{0 \le j \le k} \alpha^{k-j} g_j}{\sum_{0 \le j \le k} \alpha^{k-j} f_j} \tag{10.1} \]

and where

\[ g_j = \begin{cases} p_i^0 & \text{if } j = 0, \\ 1 & \text{if } 1 \le j \le k,\ Cond_i(H_{j-1}) = True \text{ and } b_j = C, \\ 0 & \text{otherwise;} \end{cases} \]

\[ f_j = \begin{cases} p_i^0 & \text{if } j = 0, \\ 1 & \text{if } 1 \le j \le k,\ Cond_i(H_{j-1}) = True, \\ 0 & \text{otherwise;} \end{cases} \]

\[ H_{j-1} = \begin{cases} \emptyset & \text{if } j = 1, \\ \langle (a_1, b_1), (a_2, b_2), \ldots, (a_{j-1}, b_{j-1}) \rangle & \text{otherwise.} \end{cases} \]

In short, the current history $H$ has $k + 1$ possible prefixes, and $f_j$ is basically a boolean function indicating whether the prefix of $H$ up to the $(j-1)$’th iteration satisfies $Cond_i$. $g_j$ is a restricted version of $f_j$. When $\alpha = 1$, $p_i$ is approximately equal to the frequency of the occurrence of $Cond_i \to p_i$. When $\alpha$ is less than 1, $p_i$ becomes a weighted sum of the frequencies that gives more weight to recent events than to earlier ones. For our purposes, it is important to use $\alpha < 1$, because it may happen that the other player changes its behavior suddenly, and therefore we should forget about its past behavior and adapt to its new behavior (for instance, when GRIM is triggered). In the competition, we used $\alpha = 0.75$.

An important question is how large a policy schema to use for the hypothesized policy. If the policy schema is too small, the policy schema won’t provide enough detail to give useful predictions of the other player’s behavior. But if the policy schema is too large, DBS will be unable to compute an accurate approximation of each rule’s degree of cooperation, because the number of iterations in the game will be too small. In the competition, we used a policy schema of size 4: $\{(C, C), (C, D), (D, C), (D, D)\}$. We have found this to be good enough for modeling a large number of strategies.

It is essential to have a good initial hypothesized policy because at the beginning of the game the history is not long enough for us to derive any meaningful information about the other player’s strategy. In the competition, the initial hypothesized policy is $\pi_{TFT} = \{(C, C) \to 1.0, (C, D) \to 1.0, (D, C) \to 0.0, (D, D) \to 0.0\}$.

10.6.2. Deficiencies of Discounted Frequencies in Noisy Environments

It may appear that the probabilistic policy learned by the discounted-frequency learning technique should be inherently capable of tolerating noise, because it takes many, if not all, moves in the history into account: if the number of terms in the calculation of the average or weighted average is large enough, the effect of noise should be small. However, there is a problem with this reasoning: it neglects the effect of multiple occurrences of noise within a small time interval. A mis-implementation that alters the move of one player would distort an established pattern of behavior observed by the other player. The general effect of such distortion on Equation 10.1 is hard to tell; it varies with the value of the parameters and the history. But if several distortions occur within a small time interval, the distortion may be big enough to alter the probabilistic policy and hence change our decision about what move to make. This change of decision may potentially destroy an established pattern of mutual cooperation between the players.

At first glance, it might seem rare for several noise events to occur at nearly the same time.
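How strongly a few distorted moves shift the estimate of Equation 10.1 can be checked numerically. The sketch below evaluates the equation for the single condition (C, C) with $\alpha = 0.75$ and $p_i^0 = 1.0$; the function names are hypothetical, the histories are made up for illustration, and returning the prior when the denominator is zero is an assumption that the text does not address.

```python
def last_interaction_is(a, b):
    """Condition Cond_{a,b}; the empty history is treated as (C, C)."""
    def cond(prefix):
        return (prefix[-1] if prefix else ("C", "C")) == (a, b)
    return cond


def discounted_frequency(history, cond, p0, alpha):
    """Evaluate Equation 10.1 for one condition, given the history up to iteration k."""
    k = len(history)
    numerator = denominator = (alpha ** k) * p0  # j = 0 term: g_0 = f_0 = p0
    for j in range(1, k + 1):
        if cond(history[: j - 1]):               # Cond_i(H_{j-1}) holds
            weight = alpha ** (k - j)
            denominator += weight                # f_j = 1
            if history[j - 1][1] == "C":         # b_j = C, so g_j = 1
                numerator += weight
    # Assumption: fall back to the prior when the denominator is zero.
    return numerator / denominator if denominator > 0 else p0


cc = last_interaction_is("C", "C")
steady = [("C", "C")] * 10                        # established mutual cooperation
one_defect = steady + [("C", "D")]                # one D observed after (C, C)
two_defects = one_defect + [("C", "C"), ("C", "D")]  # a second D after (C, C)

p_values = [discounted_frequency(h, cc, 1.0, 0.75)
            for h in (steady, one_defect, two_defects)]
print([round(p, 3) for p in p_values])
```

With these made-up histories, one observed defection after (C, C) pulls the estimate from 1.0 down to roughly 0.74, and a second one shortly afterwards pulls it to roughly 0.51, even though ten cooperative responses preceded them.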
If the game is long enough, however, the probability of several noise events occurring close together can be quite high. The probability of getting two noise events in two consecutive iterations out of a sequence of $i$ iterations can be computed recursively as $X_i = p(p + qX_{i-2}) + qX_{i-1}$, provided that $X_0 = X_1 = 0$, where $p$ is the probability of a noise event and $q = 1 - p$. In the competition, the noise level was $p = 0.1$ and $i = 200$, which gives $X_{200} = 0.84$. Similarly, the probabilities of getting three and four noise events in consecutive iterations are 0.16 and 0.018, respectively.
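These probabilities are easy to check numerically. The first function below iterates the recurrence exactly as stated; the second is a generalization to runs of length r that is not taken from the text, written here as a small dynamic program over the length of the current streak of noise events. Function names are hypothetical.

```python
def prob_two_consecutive(i, p):
    """X_i from the recurrence X_i = p(p + q X_{i-2}) + q X_{i-1}, with X_0 = X_1 = 0."""
    q = 1.0 - p
    x_prev2, x_prev1 = 0.0, 0.0  # X_0, X_1
    for _ in range(2, i + 1):
        x_prev2, x_prev1 = x_prev1, p * (p + q * x_prev2) + q * x_prev1
    return x_prev1


def prob_run(i, p, r):
    """Probability of at least one run of r consecutive noise events in i iterations."""
    # state[j]: probability that no run of length r has occurred yet and the
    # current streak of noise events has length j.
    state = [1.0] + [0.0] * (r - 1)
    reached = 0.0
    for _ in range(i):
        nxt = [0.0] * r
        for j, mass in enumerate(state):
            nxt[0] += mass * (1.0 - p)      # no noise: the streak resets
            if j + 1 == r:
                reached += mass * p         # noise completes a run of length r
            else:
                nxt[j + 1] += mass * p      # noise extends the streak
        state = nxt
    return reached


print(round(prob_two_consecutive(200, 0.1), 2))
print(round(prob_run(200, 0.1, 3), 2), round(prob_run(200, 0.1, 4), 3))
```

For r = 2 the dynamic program and the recurrence compute the same quantity, which gives a simple cross-check of both.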
In the 2005 competition, there were 165 players, and each player played each of the other players five times. This means every player played 825 games. On average, there were 693 games having two noises in two consecutive iterations, 132 games having three noises in three consecutive iterations, and 15 games having four noises in four consecutive iterations. Clearly, we did not want to ignore situations in which several noises occur nearly at the same time. Symbolic noise detection and temporary tolerance outlined in Section 10.2 provide a way to reduce the amount of susceptibility to multiple occurrences of noise in a small time interval. Deterministic rules enable DBS to detect anomalies in the observed behavior of the other player. DBS temporarily ignores the anomalies which may or may not be due to noise, until a better conclusion about the cause of the anomalies can be drawn. This temporary tolerance prevents DBS from learning from the moves that may be affected by noise, and hence protects the hypothesized policy from the influence of errors due to noise. Since the amount of tolerance (and the accuracy of noise detection) can be controlled by adjusting parameters in DBS, we can reduce the amount of susceptibility to multiple occurrences of noise by increasing the amount of tolerance, at the expense of a higher cost of noise detection—losing more points when a change of behavior occurs. 10.6.3. Identifying Deterministic Rules Using Induction As we discussed in Section 10.2, deterministic behaviors are abundant in the Iterated Prisoner’s Dilemma. Deterministic behaviors can be modeled by deterministic rules, whereas random behavior would require probabilistic rules. A nice feature about deterministic rules is that they have only two possible degrees of cooperation: zero or one, as opposed to an infinite set of possible degrees of cooperation of the probabilistic rules. 
Therefore, there should be ways to learn deterministic rules that are much faster than the discounted frequency method described earlier. For example, if we knew at the outset which rules were deterministic, it would take only one occurrence to learn each of them: each time the condition of a deterministic rule was satisfied, we could assign a degree of cooperation of 1 or 0 depending on whether the player’s move was C or D. The trick, of course, is to determine which rules are deterministic. We have developed an inductive-reasoning method to distinguish deterministic rules from probabilistic rules during learning and to learn the correct degree
of cooperation for the deterministic rules. In general, induction is the process of deriving general principles from particular facts or instances. To learn deterministic rules, the idea of induction can be used as follows. If a certain kind of behavior occurs several times in a row, and during this period no other behavior contradicts it, then we hypothesize that the same kind of behavior is very likely to occur in the next few iterations, regardless of how the other player behaved in the remote past.
More precisely, let $K \geq 1$ be a number which we will call the promotion threshold. Let $H = \langle (a_1, b_1), (a_2, b_2), \ldots, (a_k, b_k) \rangle$ be the current history. For each condition $Cond_j \in C$, let $I_j$ be the set of indexes such that for all $i \in I_j$, $i < k$ and $Cond_j(\langle (a_1, b_1), (a_2, b_2), \ldots, (a_i, b_i) \rangle) = True$. Let $\hat{I}_j$ be the set of the largest $K$ indexes in $I_j$. If $|I_j| \geq K$ and for all $i \in \hat{I}_j$, $b_{i+1} = C$ (i.e., the other player chose C whenever the history up to the $i$’th iteration satisfied $Cond_j$), then we will hypothesize that the other player will choose C whenever $Cond_j$ is satisfied; hence we will use $Cond_j \to 1$ as a deterministic rule. Likewise, if $|I_j| \geq K$ and for all $i \in \hat{I}_j$, $b_{i+1} = D$, we will use $Cond_j \to 0$ as a deterministic rule. See Line 7 to Line 10 in Figure 10.2 for an outline of the induction method we use in DBS.
The induction method can be faster at learning deterministic rules than the discounted frequency method, which regards a rule as deterministic when the degree of cooperation estimated by discounted frequencies is above or below certain thresholds.
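The promotion test just defined can be sketched in a few lines of Python. This is only an illustration under our own assumptions, not the code entered in the competition: the names `induce_rule` and `last_is`, and the choice to represent a condition as a predicate over a history prefix, are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Interaction = Tuple[str, str]  # (our move a_i, other player's move b_i)
Condition = Callable[[Sequence[Interaction]], bool]


def last_is(a: str, b: str) -> Condition:
    """Hypothetical helper: a condition on the most recent interaction only."""
    return lambda prefix: len(prefix) > 0 and prefix[-1] == (a, b)


def induce_rule(history: List[Interaction],
                cond: Condition,
                promotion_threshold: int) -> Optional[int]:
    """Return 1 or 0 if cond can be promoted to a deterministic rule, else None."""
    k = len(history)
    # I_j: indexes i < k such that the first i interactions satisfy cond.
    satisfied = [i for i in range(1, k) if cond(history[:i])]
    if len(satisfied) < promotion_threshold:
        return None
    # The K largest indexes in I_j.
    recent = satisfied[-promotion_threshold:]
    # b_{i+1} is stored at list position i because Python lists start at 0.
    responses = {history[i][1] for i in recent}
    if responses == {"C"}:
        return 1   # promote Cond_j -> 1
    if responses == {"D"}:
        return 0   # promote Cond_j -> 0
    return None    # mixed responses: no deterministic rule


if __name__ == "__main__":
    always_cc = [("C", "C")] * 4
    print(induce_rule(always_cc, last_is("C", "C"), promotion_threshold=3))
```

A single contradicting response among the last K matching iterations blocks the promotion, which is the trade-off that the promotion threshold controls.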
As can be seen in Figure 10.3, the induction method takes only three iterations to infer the other player’s moves correctly, whereas the discounted frequency technique takes six iterations to obtain a 95% degree of cooperation, and it never becomes 100%.[k] We may want to set the threshold in the discounted frequency method to be less than 0.8 to make it faster than the induction method. However, this will increase the chance of incorrectly identifying a random behavior as deterministic. A faster learning speed allows us to infer deterministic rules with a shorter history, and hence increase the effectiveness of symbolic noise detection by having more deterministic rules at any time, especially when a change of the other player’s behavior occurs. The promotion threshold K controls the speed of the identification of deterministic rules. The larger the value of K, the slower the speed of identification, but the less likely we will mistakenly hypothesize that the other player’s behavior is deterministic.
[k] If we modify Equation 10.1 to discard the early interactions of a game, the degree of cooperation of a probabilistic rule can attain 100%.
[Figure 10.3: plot of degree of cooperation (vertical axis, 0 to 1) against iteration (horizontal axis, 0 to 10), with one curve for the induction method and one for the discounted frequency method.]
Fig. 10.3. Learning speeds of the induction method and the discounted frequency method when the other player always cooperates. The initial degree of cooperation is zero, the discounted rate is 0.75, and the promotion threshold is 3.
10.6.4. Symbolic Noise Detection and Temporary Tolerance
Once DBS has identified the set of deterministic rules, it can readily use them to detect noise. As we said earlier, if the other player’s move violates a deterministic rule, the violation can be caused either by noise or by a change in the other player’s behavior, and DBS uses an evidence collection process to figure out which is the case. More precisely, once a deterministic rule $Cond_i \to o_i$ is violated (i.e., the history up to the previous iteration satisfies $Cond_i$ but the other player’s move in the current iteration is different from $o_i$), DBS keeps the violated rule but marks it as violated. Then DBS starts an evidence collection process, which in the implementation of our competition entries is violation counting: for each violated probabilistic rule DBS maintains a counter called the violation count to record how many violations of the rule have occurred (Line 12).[**] In the subsequent iterations, DBS increases the violation count by one every time a violation of the rule occurs. However, if DBS encounters a positive example of the rule, DBS resets the violation count to zero and unmarks the rule (Line 11). If any violation count exceeds a threshold called the violation threshold, DBS concludes that the violation is not due to noise; it is due to a change of the other player’s behavior. In this case, DBS invokes a special procedure (described in Section 10.6.5) to handle this situation (Line 13).
This evidence collection process takes advantage of the fact that the pattern of moves affected by noise is often quite different from the pattern of moves generated by the new behavior after a change of behavior occurs. Therefore, it can often distinguish noise from a change of behavior by observing moves in the next few iterations and gathering enough evidence. As discussed in Section 10.6.2, we want to set a larger violation threshold in order to avoid the drawback of the discount frequency method in dealing with several misinterpretations caused by noise within a small time interval. However, if the threshold is too large, it will slow down the adaptation to changes in the other player’s behavior. In the competition, we entered DBS several times with several different violation thresholds; in the one that performed best, the violation threshold was 4.
[**] We believe that a better evidence collection process should be based on statistical hypothesis testing.
10.6.5. Coping with Ignorance of the Other Player’s New Behavior
When the evidence collection process detects a change in the other player’s behavior, DBS knows little about the other player’s new behavior. How DBS copes with this ignorance is critical to its success. Because DBS knows little about the new behavior at the moment it detects the change, it temporarily uses the previous hypothesized policy as the current hypothesized policy, until it deems that this substitution no longer works. More precisely, DBS maintains two sets of deterministic rules: the current rule set $R_c$ and the default rule set $R_d$. $R_c$ is the set of deterministic rules learned after the change of behavior occurs, while $R_d$ is the set of deterministic rules learned before the change of behavior occurs. At the beginning of a game, $R_d$ is $\pi_{TFT}$ and $R_c$ is an empty set (Line 1 and Line 2). When DBS constructs a hypothesized policy $\pi$ for move generation, it uses every rule in $R_c$ and $R_d$.
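Before turning to the missing rules, the violation counting of Section 10.6.4 can be illustrated with a minimal Python sketch. It is a hedged illustration rather than the competition implementation: the class `DeterministicRule`, the function `observe`, and its three return labels are hypothetical names, and we read "exceeds" as a strict comparison.

```python
from dataclasses import dataclass


@dataclass
class DeterministicRule:
    predicted: str              # "C" or "D": the move the rule predicts
    violated: bool = False      # mark set while violations are being tolerated
    violation_count: int = 0


def observe(rule: DeterministicRule, actual: str, violation_threshold: int) -> str:
    """Update a rule whose condition held in the previous iteration.

    Returns "ok" for a positive example, "tolerated" while the anomaly is
    still attributed to possible noise, and "changed" once the count
    exceeds the threshold (assumption: strictly greater than).
    """
    if actual == rule.predicted:
        # Positive example: reset the count and unmark the rule.
        rule.violated = False
        rule.violation_count = 0
        return "ok"
    rule.violated = True
    rule.violation_count += 1
    if rule.violation_count > violation_threshold:
        return "changed"
    return "tolerated"


if __name__ == "__main__":
    rule = DeterministicRule(predicted="C")
    for move in ["D", "C", "D", "D", "D"]:
        print(move, observe(rule, move, violation_threshold=2))
```

In this sketch an isolated defection followed by a cooperation leaves no trace, whereas an unbroken run of defections longer than the threshold is reported as a change of behavior.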
In addition, for any missing rule (i.e., a rule whose condition differs from the condition of every rule in $R_c$ and $R_d$), we regard it as a probabilistic rule and approximate its degree of cooperation by Equation 10.1 (Line 17). These probabilistic rules form the probabilistic rule set $R_p \subseteq \psi_{k+1}$. While DBS can insert any newly found deterministic rule into $R_c$, it inserts rules into $R_d$ only when the evidence collection process detects a change of the other player’s behavior. When that happens, DBS copies all the rules in $R_c$ to $R_d$, and then sets $R_c$ to an empty set (Line 13). The default rule set is designed to be rejected: we maintain a violation
count to record the number of violations of any rule in $R_d$. Every time any rule in $R_d$ is violated, the violation count is increased by 1 (Line 14). Once the violation count exceeds a rejection threshold, we drop the default rule set entirely (set it to an empty set) and reset the violation count (Line 15 and Line 16). We also reject $R_d$ whenever any rule in $R_c$ contradicts any rule in $R_d$ (Line 15).
We preserve the rules in $R_c$ mainly for the sake of providing a smooth transition: we don’t want to convert all deterministic rules to probabilistic rules at once, as it might suddenly alter the course of our moves, since the move generator in DBS generates moves according to the current hypothesized policy only. This sudden change in DBS’s behavior can potentially disrupt the cooperative relationship with the other player. Furthermore, some of the rules in $R_c$ may still hold, and we don’t want to learn them from scratch.
Notice that symbolic noise detection and temporary tolerance make use of the rules in $R_c$ but not the rules in $R_d$, although DBS makes use of the rules in both $R_c$ and $R_d$ when it decides the next move (Line 18). We do not use $R_d$ for symbolic noise detection and temporary tolerance because when DBS inserts rules into $R_d$, a change of the other player’s behavior has already occurred, so there is little reason to believe that anomalies detected using the rules in $R_d$ are due to noise. Furthermore, we want to turn off symbolic noise detection and temporary tolerance temporarily when a change of behavior occurs, in order to identify a whole new set of deterministic rules from scratch.
10.7. The Move Generator in DBS
We devised a simple and reasonably effective move generator for DBS. As shown in Figure 10.1, the move generator takes the current hypothesized policy $\pi$ and the current history $H_{current}$ whose length is $l = |H_{current}|$, and then decides whether DBS should cooperate in the current iteration.
It is difficult to devise a good move generator, because our move could lead to a change of the hypothesized policy and complicate our projection of the long-term payoff. Perhaps, the move generator should take the other player’s model of DBS into account [Carmel and Markovitch (1994)]. However, we found that by making the assumption that hypothesized policy will not change for the rest of the game, we can devise a simple move generator that generates fairly good moves. The idea is that we compute the maximum expected score we can possibly earn for the rest of the game, using a technique that combines some ideas from both game-tree search and
Markov Decision Processes (MDPs). Then we choose the first move in the set of moves that leads to this maximum expected score as our move for the current iteration.
To accomplish the above, we consider all possible histories whose prefix is $H_{current}$ as a tree. In this tree, each path starting from the root represents a possible history, which is a sequence of past interactions in $H_{current}$ plus a sequence of possible interactions in future iterations. Each node on a path represents the interaction of an iteration of a history. Figure 10.4 shows an example of such a tree. The root node of the tree represents the interaction of the first iteration. Let $interaction(S)$ be the interaction represented by a node $S$. Let $\langle S_0, S_1, \ldots, S_k \rangle$ be a sequence of nodes on the path from the root $S_0$ to $S_k$. We define the depth of $S_k$ to be $k - l$, and the history of $S_k$ to be $H(S_k) = \langle interaction(S_1), interaction(S_2), \ldots, interaction(S_k) \rangle$. $S_i$ is called the current node if the depth of $S_i$ is zero; the current node represents the interaction of the last iteration and $H(S_i) = H_{current}$. As we do not know when the game will end, we assume it will go for $N^*$ more iterations; thus each path in the tree has length of at most $l + N^*$.
Our objective is to compute a non-negative real number called the maximum expected score $E(S)$ for each node $S$ with a non-negative depth. Like a conventional game tree search in computer chess or checkers, the maximum expected scores are defined recursively: the maximum expected score of a node at depth $i$ is determined by the maximum expected scores of its children nodes at depth $i + 1$. The maximum expected score of a node $S$ of depth $N^*$ is assumed to be the value computed by an evaluation function $f$. This is a mapping from histories to non-negative real numbers, such that $E(S) = f(H(S))$.
The maximum expected score of a node $S$ of depth $k$, where $0 \leq k < N^*$, is computed by the maximizing rule: suppose the four possible nodes after $S$ are $S_{CC}$, $S_{CD}$, $S_{DC}$, and $S_{DD}$, and let $p$ be the degree of cooperation predicted by the current hypothesized policy $\pi$ (i.e., $p$ is the right-hand side of a rule $(Cond \to p)$ in $\pi$ such that $H(S)$ satisfies the condition $Cond$). Then $E(S) = \max\{E_C(S), E_D(S)\}$, where
$$E_C(S) = p\,(u_{CC} + E(S_{CC})) + (1 - p)\,(u_{CD} + E(S_{CD}))$$
and
$$E_D(S) = p\,(u_{DC} + E(S_{DC})) + (1 - p)\,(u_{DD} + E(S_{DD})).$$
Furthermore, we let $move(S)$ be the decision made by the maximizing rule at each node $S$, i.e., $move(S) = C$ if $E_C(S) \geq E_D(S)$ and $move(S) = D$ otherwise. By applying this maximizing rule recursively, we obtain the maximum expected score of every node with a non-negative depth. The move that we choose for the current iteration is $move(S_i)$, where $S_i$ is the current node.
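The maximizing rule can be written directly as a recursive function. The sketch below is illustrative only and makes several assumptions that are not part of the definition above: the policy and the evaluation function depend only on the last interaction, the payoffs are the usual IPD values (3, 0, 5, 1), and the names `expected_score`, `PAYOFF`, `LEAF_VALUE` and `TFT_LIKE` are hypothetical.

```python
from typing import Dict, Optional, Tuple

Interaction = Tuple[str, str]  # (our move, other player's move)

# Assumed payoffs to us: u_CC, u_CD, u_DC, u_DD (standard IPD values).
PAYOFF: Dict[Interaction, float] = {
    ("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0,
}
# Assumed evaluation function: a constant per last interaction.
LEAF_VALUE: Dict[Interaction, float] = dict(PAYOFF)
# A policy that behaves like TFT: the other player repeats our last move.
TFT_LIKE: Dict[Interaction, float] = {
    ("C", "C"): 1.0, ("C", "D"): 1.0, ("D", "C"): 0.0, ("D", "D"): 0.0,
}


def expected_score(last: Interaction, depth: int, horizon: int,
                   policy: Dict[Interaction, float],
                   leaf_value: Dict[Interaction, float]
                   ) -> Tuple[float, Optional[str]]:
    """Return (E(S), move(S)) for a node S whose last interaction is `last`."""
    if depth == horizon:
        return leaf_value[last], None      # E(S) = f(H(S)) at the leaves
    p = policy[last]                       # predicted degree of cooperation

    def value_of(my_move: str) -> float:
        after_c, _ = expected_score((my_move, "C"), depth + 1, horizon,
                                    policy, leaf_value)
        after_d, _ = expected_score((my_move, "D"), depth + 1, horizon,
                                    policy, leaf_value)
        return (p * (PAYOFF[(my_move, "C")] + after_c)
                + (1 - p) * (PAYOFF[(my_move, "D")] + after_d))

    e_c, e_d = value_of("C"), value_of("D")
    return (e_c, "C") if e_c >= e_d else (e_d, "D")


if __name__ == "__main__":
    print(expected_score(("C", "C"), 0, 2, TFT_LIKE, LEAF_VALUE))
```

Each call spawns four further calls per level, so this direct form is only practical for small horizons.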
[Figure 10.4: tree diagram. A path runs from the first iteration (root node) to the previous iteration (current node, depth 0); below the current node, the tree branches to the nodes at depth 1 and depth 2.]
Fig. 10.4. An example of the tree that we use to compute the maximum expected scores. Each node denotes the interaction of an iteration. The top four nodes constitute a path representing the current history $H_{current}$. The length of $H_{current}$ is $l = 2$, and the maximum depth $N^*$ is 2. There are four edges emanating from each node $S$ after the current node; each of these edges corresponds to a possible interaction of the iteration after $S$. The maximum expected scores (not shown) of the nodes with depth 2 are set by an evaluation function $f$; these values are then used to calculate the maximum expected scores of the nodes with depth 1 by using the maximizing rule. Similarly, the maximum expected score of the current node is calculated using the four maximum expected scores of the nodes with depth 1.
The number of nodes in the tree increases exponentially with $N^*$. Thus, the tree can be huge: there are over a billion nodes when $N^* \geq 15$. It is infeasible to compute the maximum expected score for every node one by one. Fortunately, we can use dynamic programming to speed up the computation. As an example, suppose the hypothesized policy is $\pi = \{(C, C) \to p_{CC}, (C, D) \to p_{CD}, (D, C) \to p_{DC}, (D, D) \to p_{DD}\}$, and suppose the evaluation function $f$ returns a constant $f_{o_1 o_2}$ for any history that satisfies the condition $(o_1, o_2)$, where $o_1, o_2 \in \{C, D\}$. Then, given our assumption that the hypothesized policy does not change, it is not hard to show by induction that all nodes whose histories have the same length and satisfy the same condition have the same maximum expected score. By using this property, we construct a table of size $4 \times (N^* + 2)$ in which each entry, denoted by $E^k_{o_1 o_2}$, stores the maximum expected score of the nodes whose histories have length $l + k$ and satisfy the condition $(o_1, o_2)$, where $o_1, o_2 \in \{C, D\}$. We also have another table of the same size to record the decisions the procedure makes; the entry $m^k_{o_1 o_2}$ of this table is the decision being made at $E^k_{o_1 o_2}$. Initially, we set $E^{N^*+1}_{CC} = f_{CC}$, $E^{N^*+1}_{CD} = f_{CD}$, $E^{N^*+1}_{DC} = f_{DC}$, and $E^{N^*+1}_{DD} = f_{DD}$. Then the maximum expected scores in the remaining entries can be computed by the following recursive equation:
$$E^k_{o_1 o_2} = \max\Big( p_{o_1 o_2}\,(u_{CC} + E^{k+1}_{CC}) + (1 - p_{o_1 o_2})\,(u_{CD} + E^{k+1}_{CD}),\;\; p_{o_1 o_2}\,(u_{DC} + E^{k+1}_{DC}) + (1 - p_{o_1 o_2})\,(u_{DD} + E^{k+1}_{DD}) \Big),$$
where $o_1, o_2 \in \{C, D\}$. Similarly, $m^k_{o_1 o_2} = C$ if
$$p_{o_1 o_2}\,(u_{CC} + E^{k+1}_{CC}) + (1 - p_{o_1 o_2})\,(u_{CD} + E^{k+1}_{CD}) \;\geq\; p_{o_1 o_2}\,(u_{DC} + E^{k+1}_{DC}) + (1 - p_{o_1 o_2})\,(u_{DD} + E^{k+1}_{DD}),$$
and $m^k_{o_1 o_2} = D$ otherwise. If the interaction of the previous iteration is $(o_1, o_2)$, we pick $m^0_{o_1 o_2}$ as the move for the current iteration. The pseudocode of this dynamic programming algorithm is shown in Figure 10.5.
Procedure MoveGen($\pi$, $H$)
    $\langle p_{CC}, p_{CD}, p_{DC}, p_{DD} \rangle \leftarrow \pi$
    $\{(a_1, b_1), (a_2, b_2), \ldots, (a_k, b_k)\} \leftarrow H$
    $(a_0, b_0) \leftarrow (C, C)$ ; $(a, b) \leftarrow (a_k, b_k)$
    $\langle E^{N^*+1}_{CC}, E^{N^*+1}_{CD}, E^{N^*+1}_{DC}, E^{N^*+1}_{DD} \rangle \leftarrow \langle f_{CC}, f_{CD}, f_{DC}, f_{DD} \rangle$
    For $k = N^*$ down to 0
        For each $(o_1, o_2)$ in $\{(C, C), (C, D), (D, C), (D, D)\}$
            $F^k_{o_1 o_2} \leftarrow p_{o_1 o_2}\,(u_{CC} + E^{k+1}_{CC}) + (1 - p_{o_1 o_2})\,(u_{CD} + E^{k+1}_{CD})$
            $G^k_{o_1 o_2} \leftarrow p_{o_1 o_2}\,(u_{DC} + E^{k+1}_{DC}) + (1 - p_{o_1 o_2})\,(u_{DD} + E^{k+1}_{DD})$
            $E^k_{o_1 o_2} \leftarrow \max(F^k_{o_1 o_2}, G^k_{o_1 o_2})$
            If $F^k_{o_1 o_2} \geq G^k_{o_1 o_2}$, then $m^k_{o_1 o_2} \leftarrow C$
            If $F^k_{o_1 o_2} < G^k_{o_1 o_2}$, then $m^k_{o_1 o_2} \leftarrow D$
        End For
    End For
    Return $m^0_{ab}$
Fig. 10.5. The procedure for computing a recommended move for the current iteration. In the competition, we set $N^* = 60$, $f_{CC} = 3$, $f_{CD} = 0$, $f_{DC} = 5$, and $f_{DD} = 1$.
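For readers who want to experiment, the procedure of Fig. 10.5 translates almost line by line into Python. The following is a sketch under stated assumptions rather than the competition code: the payoff values are the usual IPD payoffs (they are not given in this section), the function name `move_gen` and the dictionary layout are hypothetical, and only two rows of the table are kept in memory instead of the full table.

```python
from typing import Dict, List, Tuple

Interaction = Tuple[str, str]  # (our move, other player's move)
CONDITIONS: List[Interaction] = [("C", "C"), ("C", "D"), ("D", "C"), ("D", "D")]

# Assumed payoffs to us: u_CC, u_CD, u_DC, u_DD (standard IPD values).
PAYOFF: Dict[Interaction, float] = {
    ("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0,
}
# Evaluation-function constants from the caption of Fig. 10.5.
LEAF_VALUE: Dict[Interaction, float] = {
    ("C", "C"): 3.0, ("C", "D"): 0.0, ("D", "C"): 5.0, ("D", "D"): 1.0,
}


def move_gen(policy: Dict[Interaction, float],
             history: List[Interaction],
             horizon: int = 60) -> str:
    """Return the recommended move, "C" or "D", for the current iteration."""
    # With an empty history, act as if the previous interaction was (C, C).
    last = history[-1] if history else ("C", "C")
    nxt = dict(LEAF_VALUE)            # row k = N* + 1 of the table
    decision: Dict[Interaction, str] = {}
    for _ in range(horizon, -1, -1):  # k = N*, N* - 1, ..., 0
        cur: Dict[Interaction, float] = {}
        for cond in CONDITIONS:
            p = policy[cond]
            coop = (p * (PAYOFF[("C", "C")] + nxt[("C", "C")])
                    + (1 - p) * (PAYOFF[("C", "D")] + nxt[("C", "D")]))
            defect = (p * (PAYOFF[("D", "C")] + nxt[("D", "C")])
                      + (1 - p) * (PAYOFF[("D", "D")] + nxt[("D", "D")]))
            cur[cond] = max(coop, defect)
            decision[cond] = "C" if coop >= defect else "D"
        nxt = cur
    return decision[last]             # m^0 for the last interaction


if __name__ == "__main__":
    tft_like = {("C", "C"): 1.0, ("C", "D"): 1.0, ("D", "C"): 0.0, ("D", "D"): 0.0}
    print(move_gen(tft_like, [("D", "D")]))
```

The running time is linear in the horizon, since each of the N* + 1 passes touches only four table entries.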
10.8. Competition Results
The 2005 IPD Competition was actually a set of four competitions, each for a different version of the IPD. The one for the Noisy IPD was Category 2, which used a noise level of 0.1. Of the 165 programs entered into the competition, eight of them were provided by the organizer of the competition. These programs included
ALLC (always cooperates), ALLD (always defects), GRIM (cooperates until the first defection of the other player, and thereafter always defects), NEG (cooperates if the other player defected in the previous iteration, and defects if the other player cooperated), RAND (defects or cooperates with probability 1/2), STFT (suspicious TFT, which is like TFT except that it defects in the first iteration), TFT, and TFTT. All of these strategies are well known in the literature on IPD. The remaining 157 programs were submitted by 36 different participants. Each participant was allowed to submit up to 20 programs. We submitted the following 20:
• DBS. We entered nine different versions of DBS into the competition, each with a different set of parameters or a different implementation. The one that performed best was DBSz, which makes use of the exact set of features we mentioned in this chapter. Versions that have fewer features or additional features did not do as well.
• Learning of Opponent’s Strategy with Forgiveness (LSF). Like DBS, LSF is a strategy that learns the other player’s strategy during the game. The difference between LSF and DBS is that LSF does not make use of symbolic noise detection. It uses the discount frequency (Equation 10.1) to learn the other player’s strategy, plus a forgiveness strategy that decides when to cooperate if mutual defection occurs. We entered one instance of LSF. It placed around 30th in three of the runs and around 70th in the other two runs. We believe the poor ranking of LSF is due to the deficiency of using discount frequency alone, as we discussed at the beginning of Section 10.6.
• Tit-for-Tat Improved (TFTI). TFTI is a strategy based on a totally different philosophy from DBS’s. It is not an opponent-modeling strategy, in the sense that it does not model the other player’s behavior using a set of rules. Instead, it is a variant of TFT with a sophisticated forgiveness policy that aims at overcoming some of the deficiencies of TFT in noisy environments. We entered ten instantiations of TFTI in the competition, each with a different set of parameters or some differences in the implementation. The best of these, TFTIm, did well in the competition (see Table 10.1), but not as well as DBS.
Three of the other participants each entered the full complement of twenty programs: Wolfgang Kienreich, Jia-wei Li, and Perukrishnen Vytelingum. All three of them appear to have adopted the master-and-slaves strategy that was first proposed by Vytelingum’s team from the University of Southampton. A master-and-slaves strategy is not a strategy for a single program, but instead for a team of collaborating programs. One of the programs in such a team is the master, and the remaining programs are slaves. The basic idea is that at the start of a run, the master and slaves each make a series of moves using a predefined protocol, in order to identify themselves to each other. From then on, the master program always plays “defect” when playing with the slaves, and the slave programs always play “cooperate” when playing with the master, so that the master gains the highest possible payoff at each iteration. Furthermore, a slave always plays “defect” when playing with a program other than the master, in order to try to minimize that player’s score.
Wolfgang Kienreich’s master program was CNGF (CosaNostra Godfather), and its slaves were 19 copies of CNHM (CosaNostra Hitman). Jia-wei Li’s master program was IMM01 (Intelligent Machine Master 01), and its slaves were IMS02, IMS03, . . . , IMS20 (Intelligent Machine Slave n, for n = 02, 03, . . . , 20). Perukrishnen Vytelingum’s master program was BWIN (S2Agent1 ZEUS), and its slaves were BLOS2, BLOS3, . . . , BLOS20 (like BWIN, these programs also had longer names based on the names of ancient Greek gods). We do not know what strategies the other participants used in their programs.
10.8.1. Overall Average Scores
Category 2 (IPD with noise) consisted of five runs. Each run was a round-robin tournament in which each program played with every program, including itself. Each program participated in 166 games in each run (recall that there is one game in which a player plays against itself, which counts as two games for that player). Each game consisted of 200 iterations. A program’s score for a game is the sum of its payoffs over all 200 iterations (note that this sum will be at least 0 and at most 1000).
The program’s total score for an entire run is the sum of its scores over all 166 games. On the competition’s website, there is a ranking for each of the five runs, in which each program is ranked according to its total score for the run. A program’s average score within a run is its total score for the run divided by 166. The program’s overall average score is its average over all five runs, i.e., its total over all five runs divided by 830 = 5 × 166. Table 10.1 shows the average scores in each of the five runs of the top twenty-five programs when the programs are ranked by their
overall average scores. Of our nine different versions of DBS, all nine are among the top twenty-five programs, and they dominate the top ten places. This phenomenon implies that DBS’s performance is insensitive to the parameters in the programs and the implementation details of an individual program. The same phenomenon happens to TFTI: nine out of ten programs using TFTI are ranked between the 11th place and the 25th place, and the last one is at the 29th place.
10.8.2. DBS versus the Master-and-Slaves Strategies
Recall from Table 10.1 that DBSz placed third in the competition: it lost only to BWIN and IMM01, the masters of two master-and-slaves strategies. DBS does not use a master-and-slaves strategy, nor does it conspire with other programs in any other way. In contrast, BWIN’s and IMM01’s performance depended greatly on the points fed to them by their slaves. In particular,
(1) If we average the score of each master with the scores of its slaves, we get 379.9 for BWIN and 351.7 for IMM01, both of which are considerably less than DBSz’s score of 408.
(2) A more extensive analysis [Au and Nau (2005)] shows that if the size of each master-and-slaves team had been limited to less than or equal to 10, DBSz would have outperformed BWIN and IMM01 in the competition, even without averaging the score of each master with its slaves.
The reason for the above two phenomena is that the master-and-slaves strategies did not cooperate with the other players as much as they did amongst themselves. In particular, Table 10.2 gives the percentages of each of the four possible interactions when any program from one group plays with any program from another group. Note that:
• When BWIN and IMM01 play with their slaves, about 64% and 47% of the interactions are (D, C), but when non-master-and-slaves strategies play with each other, only 19% of the interactions are (D, C).
• When the slave programs play with non-master-and-slaves programs, over 60% of interactions are (D, D), but when non-master-and-slaves programs play with other non-master-and-slaves programs, only 31% of the interactions are (D, D).
• The master-and-slaves strategies decrease the overall percentage of (C, C) from 31% to 13%, and increase the overall percentage of (D, D) from 31% to 55%.
Table 10.2. Percentages of different interactions. “All but M&S” means all 105 programs that did not use master-and-slaves strategies, and “all” means all 165 programs in the competition.
Player 1          Player 2          (C, C)   (C, D)   (D, C)   (D, D)
BWIN              BWIN’s slaves      12%       5%      64%      20%
IMM01             IMM01’s slaves     10%       6%      47%      38%
CNGF              CNGF’s slaves       2%      10%      10%      77%
BWIN’s slaves     all but M&S         5%       9%      24%      62%
IMM01’s slaves    all but M&S         7%       9%      23%      61%
CNGF’s slaves     all but M&S         4%       8%      24%      64%
TFT               all but M&S        33%      20%      20%      27%
DBSz              all but M&S        54%      15%      13%      19%
TFTT              all but M&S        55%      20%      11%      14%
TFT               all                23%      19%      16%      42%
DBSz              all                36%      14%      11%      39%
TFTT              all                38%      21%      10%      31%
all but M&S       all but M&S        31%      19%      19%      31%
all               all                13%      16%      16%      55%
10.8.3. A Comparison between DBSz, TFT, and TFTT
Next, we consider how DBSz performs against TFT and TFTT. Table 10.2 shows that when playing with another cooperative player, TFT cooperates ((C, C) in the table) 33% of the time, DBSz does so 54% of the time, and TFTT does so 55% of the time. Furthermore, when playing with a player who defects, TFT defects ((D, D) in the table) 27% of the time, DBSz does so 19% of the time, and TFTT does so 14% of the time. From this, one might think that DBSz’s behavior is somewhere between TFT’s and TFTT’s. But on the other hand, when playing with a player who defects, DBSz cooperates ((C, D) in the table) only 15% of the time, which is a lower percentage than for TFT and TFTT (both 20%). Since cooperating with a defector generates no payoff, this makes TFT and TFTT perform worse than DBSz overall. DBSz’s average score was 408 and it ranked 3rd, whereas TFTT’s and TFT’s average scores were 388.4 and 388.2 and they ranked 30th and 33rd.
10.9. Related Work
Early studies of the effect of noise in the Iterated Prisoner’s Dilemma focused on how TFT, a highly successful strategy in noise-free environments, would do in the presence of noise. TFT is known to be vulnerable to noise; for instance, if two players use TFT at the same time, noise would trigger long sequences of mutual defections [Molander (1985)]. A number of people confirmed the negative effects of noise on TFT [Molander (1985); Bendor (1987); Mueller (1987); Axelrod and Dion (1988); Nowak and Sigmund (1990); Bendor et al. (1991)]. Axelrod found that TFT was still the best decision rule in the rerun of his first tournament with a one percent chance of misperception [Axelrod (1984), page 183], but TFT finished sixth out of 21 in the rerun of Axelrod’s second tournament with a 10 percent chance of misperception [Donninger (1986)]. In Competition 2 of the 2005 IPD competition, the noise level was 0.1, and TFT’s overall average score placed it 33rd out of 165.
The oldest approach to remedy TFT’s deficiency in dealing with noise is to be more forgiving in the face of defections. A number of studies found that more forgiveness promotes cooperation in noisy environments [Bendor et al. (1991); Mueller (1987)]. For instance, Tit-For-Two-Tats (TFTT), a strategy submitted by John Maynard Smith to Axelrod’s second tournament, retaliates only when it receives two defections in the two previous iterations. TFTT can tolerate isolated instances of defections caused by noise and can more readily avoid long sequences of mutual defections caused by noise. However, TFTT is susceptible to exploitation of its generosity and was beaten in Axelrod’s second tournament by TESTER, a strategy that may defect every other move. In Competition 2 of the 2005 IPD Competition, TFTT ranked 30th, slightly better than TFT. In contrast to TFTT, DBS can tolerate not only an isolated defection but also a sequence of defections caused by noise, and at the same time DBS monitors the other player’s behavior and retaliates when exploitation behavior is detected (i.e., when the exploitation causes a change of the hypothesized policy, which initially is TFT).
Furthermore, the retaliation caused by exploitation continues until the other player shows a high degree of remorse (i.e., cooperations when DBS defects) that changes the hypothesized policy to one with which DBS favors cooperations instead of defections.
[Molander (1985)] proposed to mix TFT with ALLC to form a new strategy which is now called Generous Tit-For-Tat (GTFT) [Nowak and Sigmund (1992)]. Like TFTT, GTFT avoids an infinite echo of defections by cooperating when it receives a defection in certain iterations. The difference is that GTFT forgives randomly: for each defection GTFT receives, it randomly chooses to cooperate with a small probability (say 10%) and to defect otherwise. DBS, however, does not make use of forgiveness explicitly as GTFT does; its decisions are based entirely on the hypothesized policy that it learned. But temporary tolerance can be deemed as a form of forgiveness,
since DBS does not retaliate immediately when a defection occurs in a mutual cooperation situation. This form of forgiveness is carefully planned and there is no randomness in it. Another way to improve TFT in noisy environments is to use contrition: unilaterally cooperate after making mistakes. One strategy that makes use of contrition is Contrite TFT (CTFT) [Sugden (1986); Boyd (1989); Wu and Axelrod (1995)], which does not defect when it knows that noise has occurred and affected its previous action. However, this is less useful in the Noisy IPD since a program does not know whether its action is affected by noise or not. DBS does not make use of contrition, though the effect of temporary tolerance resembles contrition. A family of strategies called “Pavlovian” strategies, or simply called Pavlov, was found to be more successful than TFT in noisy environments [Kraines and Kraines (1989, 1993, 1995); Nowak and Sigmund (1993)]. The simplest form of Pavlov is called Win-Stay, Lose-Shift [Nowak and Sigmund (1993)], because it cooperates only after mutual cooperation or mutual defection, an idea similar to Simpleton [Rapoport and Chammah (1965)]. When an accidental defection occurs, Pavlov can resume mutual cooperation in a smaller number of iterations than TFT [Kraines and Kraines (1989, 1993)]. Pavlov learns by conditioned response through rewards and punishments; it adjusts its probability of cooperation according to the previous interaction. Like Pavlov, DBS learns from its past experience and makes decisions accordingly. DBS, however, has an intermediate step between learning from experience and decision making: it maintains a model of the other player’s behavior, and uses this model to reason about noise. Although there are probabilistic rules in the hypothesized policy, there is no randomness in its decision making process. 
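The simple strategies discussed above are easy to state as code, and doing so makes the effect of a single noisy move visible. The sketch below is our own illustration, not taken from any of the cited works: the function names, the `play` helper, and the noise model (flipping player 1's move at one chosen iteration) are assumptions, and the 10% forgiveness probability for GTFT is just the example value mentioned in the text.

```python
import random
from typing import Callable, List, Optional, Tuple

Interaction = Tuple[str, str]  # (own move, other player's move)
Strategy = Callable[[List[Interaction]], str]


def tft(history: List[Interaction]) -> str:
    """Tit-for-Tat: cooperate first, then copy the other player's last move."""
    return history[-1][1] if history else "C"


def tftt(history: List[Interaction]) -> str:
    """Tit-for-Two-Tats: defect only after two defections in a row."""
    if len(history) >= 2 and history[-1][1] == "D" and history[-2][1] == "D":
        return "D"
    return "C"


def gtft(history: List[Interaction], forgiveness: float = 0.1) -> str:
    """Generous TFT: forgive a defection with a small probability."""
    if history and history[-1][1] == "D":
        return "C" if random.random() < forgiveness else "D"
    return "C"


def wsls(history: List[Interaction]) -> str:
    """Win-Stay, Lose-Shift: cooperate only after (C, C) or (D, D)."""
    if not history:
        return "C"
    own, other = history[-1]
    return "C" if own == other else "D"


def play(s1: Strategy, s2: Strategy, rounds: int,
         flip_at: Optional[int] = None) -> List[Interaction]:
    """Play s1 against s2; optionally flip player 1's move at one iteration."""
    h1: List[Interaction] = []  # history from player 1's point of view
    h2: List[Interaction] = []  # history from player 2's point of view
    for t in range(rounds):
        m1, m2 = s1(h1), s2(h2)
        if t == flip_at:
            m1 = "D" if m1 == "C" else "C"   # assumed noise model
        h1.append((m1, m2))
        h2.append((m2, m1))
    return h1


if __name__ == "__main__":
    for name, s in [("TFT", tft), ("TFTT", tftt), ("WSLS", wsls)]:
        print(name, play(s, s, rounds=6, flip_at=1))
```

In this toy setting one flipped move sends two TFT players into an unending retaliation echo, two TFTT players absorb it without any retaliation, and two Win-Stay, Lose-Shift players pass through one round of mutual defection before resuming mutual cooperation.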
For readers who are interested, there are several surveys on the Iterated Prisoner’s Dilemma with noise [Axelrod and Dion (1988); Hoffmann (2000); O’Riordan (2001); Kuhn (2001)]. The use of opponent modeling is common in games of imperfect information such as Poker [Billings et al. (1998); Barone and While (1998, 1999, 2000); Davidson et al. (2000); Billings et al. (2003)] and RoShamBo [Egnor (2000)]. One entry in Axelrod’s original IPD tournament used opponent modeling, but it was not successful. There have been many works on learning the opponent’s strategy in the non-noisy IPD [Dyer (2004); Hingston and Kendall (2004); Powers and Shoham (2005)]. By assuming the opponent’s next move depends only on the interactions of the last few iterations, these works model the opponent’s strategy as probabilistic finite automata,
and then use various learning methods to learn the probabilities in the automata. For example, [Hingston and Kendall (2004)] proposed an adaptive agent called an opponent modeling agent (OMA) of order n, which maintains a summary of the moves made up to n previous iterations. Like DBS, OMA learns the probabilities of cooperations of the other player in different situations using an updating rule similar to the Equation 10.1, and generates a move based on the opponent model by searching a tree similar to that shown in Figure 10.4. The opponent model in [Dyer (2004)] also has a similar construct. The main way they differ from DBS is how they learn the other player’s strategy, but there are several other differences: for example, the tree they used has a maximum depth of 4, whereas ours has a depth of 60. The agents of both [Hingston and Kendall (2004)] and [Dyer (2004)] learned the other player’s strategy by exploration—deliberately making moves in order to probe the other player’s strategy. The use of exploration for learning opponent’s behaviors was studied by [Carmel and Markovitch (1998)], who developed a lookahead-based exploration strategy to balance between exploration and exploitation and avoid making risky moves during exploration. [Hingston and Kendall (2004)] and [Dyer (2004)] used a different exploration strategy than [Carmel and Markovitch (1998)]; [Hingston and Kendall (2004)] introduced noise to 1% of their agent’s moves (they call this method the trembling hand), whereas the agent in [Dyer (2004)] makes decisions at random when it uses the opponent’s model and finds a missing value in the model. Both of their agents used a random opponent model at the beginning of a game. DBS does not make deliberate moves to attempt to explore the other player’s strategy, because we believe that this is a high-risk, low-payoff business in IPD. 
We believe it incurs a high risk because many programs in the competition are adaptive; defections that we make during exploration may damage our long-term relationship with them. We believe it has a low payoff because the length of a game is usually too short for us to learn any non-trivial strategy completely. Moreover, the other player may alter its behavior in the middle of a game, and therefore it is difficult for any learning method to converge. This is especially true in the noisy IPD, since noise can provoke the other player (e.g., GRIM). Furthermore, our objective is to cooperate with the other players, not to exploit their weaknesses in order to beat them. So as long as the opponent cooperates with us, there is no need to bother with its other behaviors. For these reasons, DBS does not aim at learning the other player’s strategy completely; instead, it learns the other player’s recent
behavior, which is subject to change. In contrast to the OMA strategy described earlier in this section, most of our DBS programs cooperated with each other in the competition. Our decision-making algorithm combines elements of both minimax game tree search and the value iteration algorithm for Markov Decision Processes. In contrast to [Carmel and Markovitch (1994)], we do not model the other player’s model of our strategy; we assume that the hypothesized policy does not change for the rest of the game. Obviously this assumption is not valid, because our decisions can affect the decisions of the other players in the future. Nonetheless, we found that the moves returned by our algorithm are fairly good responses. For example, if the other player behaves like TFT, the move returned by our algorithm is to cooperate regardless of the previous interactions; if the other player does not behave like TFT, our algorithm is likely to return defection, a good move in many situations. To the best of our knowledge, ours is the first work on using opponent models in the IPD to detect errors in the execution of another agent’s actions.
10.10. Summary and Future Work
For conflict prevention in noisy environments, a critical problem is to distinguish between situations where another player has misbehaved intentionally and situations where the misbehavior was accidental. That is the problem that DBS was formulated to deal with. DBS’s impressive performance in the 2005 Iterated Prisoner’s Dilemma competition occurred because DBS was better able to maintain cooperation in spite of noise than any other program in the competition.
To distinguish between intentional and unintentional misbehaviors, DBS uses a combination of symbolic noise detection plus temporary tolerance: if an action of the other player is inconsistent with the player’s past behavior, we continue as if the player’s behavior has not changed, until we gather sufficient evidence to see whether the inconsistency was caused by noise or by a genuine change in the other player’s behavior. Since clarity of behavior is an important ingredient of long-term cooperation in the IPD, most IPD programs have behavior that follows clear deterministic patterns. The clarity of these patterns made it possible for DBS to construct policies that were good approximations of the other players’ strategies, and to use these policies to fend off noise.
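The idea of temporary tolerance can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the DBS implementation: the class ToleranceTracker, its observe method, the representation of a situation as an arbitrary hashable key, and the threshold on consecutive deviations are all assumptions introduced here. The sketch keeps one deterministic prediction per situation, treats an isolated deviation as noise, and accepts a change of behavior only after the deviation has repeated.

```python
class ToleranceTracker:
    """Hypothetical sketch: tolerate isolated deviations from a learned rule."""

    def __init__(self, threshold=3):
        # Assumed parameter: consecutive deviations needed to accept a change.
        self.threshold = threshold
        self.rules = {}       # situation -> predicted move of the other player
        self.violations = {}  # situation -> consecutive deviations seen so far

    def observe(self, situation, move):
        """Record a move; return 'consistent', 'tolerated' or 'changed'."""
        expected = self.rules.get(situation)
        if expected is None:
            # First observation in this situation becomes the prediction.
            self.rules[situation] = move
            self.violations[situation] = 0
            return "consistent"
        if move == expected:
            self.violations[situation] = 0
            return "consistent"
        self.violations[situation] += 1
        if self.violations[situation] < self.threshold:
            # Not enough evidence yet: keep the old rule and treat as noise.
            return "tolerated"
        # Enough evidence: the behavior has genuinely changed.
        self.rules[situation] = move
        self.violations[situation] = 0
        return "changed"
```

With a threshold of 2, a single unexpected defection followed by the expected cooperation leaves the rule untouched, while two unexpected defections in a row replace it. The choice of threshold trades off robustness to noise against the delay before a genuine change is recognized.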
We believe that clarity of behavior is also likely to be important in other multi-agent environments in which agents have to cooperate with each other. Thus it seems plausible that techniques similar to those used in DBS may be useful in those domains. In the future, we are interested in studying the following issues:
• The evidence collection process takes time, and the delay may invite exploitation. For example, the policy of temporary tolerance in DBS may be exploited by a “hypocrite” strategy that behaves like TFT most of the time but occasionally defects even though DBS did not defect in the previous iteration. DBS cannot distinguish this kind of intentional defection from noise, even though DBS has a built-in mechanism to monitor exploitation. We are interested in seeing how to avoid this kind of exploitation.
• In multi-agent environments where agents can communicate with each other, the agents might be able to detect noise by using a predefined communication protocol. However, we believe there is no protocol that is guaranteed to tell which action has been affected by noise, as long as the agents cannot completely trust each other. It would be interesting to compare these alternative approaches with symbolic noise detection to see how symbolic noise detection could enhance these methods or vice versa.
• The type of noise in the competition assumes that no agent knows whether the execution of an action has been affected by noise. Perhaps there are situations in which some agents may be able to obtain partial information about the occurrence of noise. For example, some agents may obtain a plan of the malicious third party by counterespionage. We are interested in seeing how to incorporate this information into symbolic noise detection.
• It would be interesting to put DBS in an evolutionary environment to see whether it can survive after a number of generations. Is it evolutionarily stable?
Acknowledgment.
This work was supported in part by ISLE contract 0508268818 (subcontract to DARPA’s Transfer Learning program), UC Berkeley contract SA451832441 (subcontract to DARPA’s REAL program), and NSF grant IIS0412812. The opinions in this paper are those of the authors and do not necessarily reflect the opinions of the funders. This work is based on an earlier work: Accident or Intention: That Is the Question (in the Noisy Iterated Prisoner’s Dilemma), in AAMAS’06
© (May 8–12 2006) ACM, 2006. We would like to thank the anonymous reviewers for their comments.
References
Au, T.-C. and Nau, D. (2005). An Analysis of Derived Belief Strategy’s Performance in the 2005 Iterated Prisoner’s Dilemma Competition, Tech. Rep. CSTR-4756/UMIACS-TR-2005-59, University of Maryland, College Park.
Axelrod, R. (1984). The Evolution of Cooperation (Basic Books).
Axelrod, R. (1997). The Complexity of Cooperation: Agent-Based Models of Competition and Collaboration (Princeton University Press).
Axelrod, R. and Dion, D. (1988). The further evolution of cooperation, Science 242, 4884, pp. 1385–1390.
Barone, L. and While, L. (1998). Evolving adaptive play for simplified poker, in Proceedings of IEE International Conference on Computational Intelligence (ICEC-98), pp. 108–113.
Barone, L. and While, L. (1999). An adaptive learning model for simplified poker using evolutionary algorithms, in Proceedings of the Congress of Evolutionary Computation (GECCO-1999), pp. 153–160.
Barone, L. and While, L. (2000). Adaptive learning for poker, in Proceedings of the Genetic and Evolutionary Computation Conference, pp. 566–573.
Bendor, J. (1987). In good times and bad: Reciprocity in an uncertain world, American Journal of Political Science 31, 3, pp. 531–558.
Bendor, J., Kramer, R. M. and Stout, S. (1991). When in doubt... cooperation in a noisy prisoner’s dilemma, The Journal of Conflict Resolution 35, 4, pp. 691–719.
Billings, D., Burch, N., Davidson, A., Holte, R. and Schaeffer, J. (2003). Approximating game-theoretic optimal strategies for full-scale poker, in IJCAI, pp. 661–668.
Billings, D., Papp, D., Schaeffer, J. and Szafron, D. (1998). Opponent modeling in poker, in AAAI, pp. 493–499.
Boyd, R. (1989). Mistakes allow evolutionary stability in the repeated prisoner’s dilemma game, Journal of Theoretical Biology 136, pp. 47–56.
Carmel, D. and Markovitch, S. (1994). The M* algorithms: Incorporating opponent models into adversary search, Tech. Rep. CIS9402, Computer Science Department, Technion.
Carmel, D. and Markovitch, S. (1998). How to explore your opponent’s strategy (almost) optimally, in Proceedings of the Third International Conference on Multi-Agent Systems, pp. 64–71.
Davidson, A., Billings, D., Schaeffer, J. and Szafron, D. (2000). Improved opponent modeling in poker, in Proceedings of the 2000 International Conference on Artificial Intelligence (ICAI’2000), pp. 1467–1473.
Donninger, C. (1986). Paradoxical Effects of Social Behavior, chap. Is it always efficient to be nice? (Heidelberg: Physica Verlag), pp. 123–134.
Dyer, D. W. (2004). Opponent Modelling and Strategy Evolution in the Iterated Prisoner’s Dilemma, Master’s thesis, School of Computer Science and Software Engineering, The University of Western Australia.
Egnor, D. (2000). Iocaine powder explained, ICGA Journal 23, 1, pp. 33–35.
Hingston, P. and Kendall, G. (2004). Learning versus evolution in iterated prisoner’s dilemma, in Proceedings of the Congress on Evolutionary Computation (CEC’04).
Hoffmann, R. (2000). Twenty years on: The evolution of cooperation revisited, Journal of Artificial Societies and Social Simulation 3, 2.
Kraines, D. and Kraines, V. (1989). Pavlov and the prisoner’s dilemma, Theory and Decision 26, pp. 47–79.
Kraines, D. and Kraines, V. (1993). Learning to cooperate with pavlov: an adaptive strategy for the iterated prisoner’s dilemma with noise, Theory and Decision 35, pp. 107–150.
Kraines, D. and Kraines, V. (1995). Evolution of learning among pavlov strategies in a competitive environment with noise, The Journal of Conflict Resolution 39, 3, pp. 439–466.
Kuhn, S. T. (2001). Prisoner’s dilemma, http://karmak.org/archive/2002/11/Prisoner’s Dilemma.html Stanford Encyclopedia of Philosophy.
Molander, P. (1985). The optimal level of generosity in a selfish, uncertain environment, The Journal of Conflict Resolution 29, 4, pp. 611–618.
Mueller, U. (1987). Optimal retaliation for optimal cooperation, The Journal of Conflict Resolution 31, 4, pp. 692–724.
Nowak, M. and Sigmund, K. (1990). The evolution of stochastic strategies in the prisoner’s dilemma, Acta Applicandae Mathematicae 20, pp. 247–265.
Nowak, M. and Sigmund, K. (1993). A strategy of win-stay, lose-shift that outperforms tit-for-tat in the prisoner’s dilemma game, Nature 364, pp. 56–58.
Nowak, M. A. and Sigmund, K. (1992). Tit for tat in heterogeneous populations, Nature 355, pp. 250–253.
O’Riordan, C. (2001). Iterated prisoner’s dilemma: A review, Tech. Rep. NUIGIT-260601, Department of Information Technology, National University of Ireland, Galway.
Powers, R. and Shoham, Y. (2005). Learning against opponents with bounded memory, in IJCAI.
Rapoport, A. and Chammah, A. M. (1965). Prisoner’s dilemma (University of Michigan Press).
Sugden, R. (1986). The economics of rights, co-operation and welfare (Blackwell).
Wu, J. and Axelrod, R. (1995). How to cope with noise in the iterated prisoner’s dilemma, Journal of Conflict Resolution 39, pp. 183–189.