0001:cs0001008/clri.tex

1: %$Id: clri.tex,v 1.9 2000/10/23 16:02:36 jmvidal Exp jmvidal $

2: %

3: \input{shortcuts.tex}

4: %\documentclass{elsart}

5: \documentclass{article}

6: \usepackage{natbib,amsmath,psfrag,graphicx, subfigure, url, fancyhdr,hyperref}

7: \usepackage[all]{xy}

8:

9: \hypersetup{bookmarksopen,

10:   bookmarksnumbered,

11: % These appear on the File->Document Info

12:   pdftitle={Predicting the Expected Behavior of Agents that Learn About Agents: The CLRI Framework},

13:   pdfauthor={Jose M. Vidal and Edmund H. Durfee},

14:   pdfsubject={Learning in MASs},

15:   pdfkeywords={Learning, MAS},

16:   pdfview=FitH,

17:   pdfstartview=FitH

18: }

19:

20:

21: %\runauthor{Vidal and Durfee}

22: %\begin{frontmatter}

23:

24: \title{Predicting the Expected Behavior of Agents that Learn About

25:   Agents: The CLRI Framework}

26:

27: \author{Jos\'{e} M. Vidal and Edmund H. Durfee \\

28: Swearingen Engineering Center, University of South \\

29: Carolina, Columbia, SC, 29208 \\

30: Advanced Technology Laboratory, University of \\

31:   Michigan, Ann Arbor, MI, 48102}

32: \begin{document}

33: \maketitle

34: \begin{abstract}

35:   We describe a framework and equations used to model and predict the

36:   behavior of multi-agent systems (MASs) with learning agents. A

37:   difference equation is used for calculating the progression of an

38:   agent's error in its decision function, thereby telling us how the

39:   agent is expected to fare in the MAS. The equation relies on

40:   parameters which capture the agent's learning abilities, such as its

41:   change rate, learning rate and retention rate, as well as relevant

42:   aspects of the MAS such as the impact that agents have on each

43:   other. We validate the framework with experimental results using

44:   reinforcement learning agents in a market system, as well as with

45:   other experimental results gathered from the AI literature. Finally,

46:   we use PAC-theory to show how to calculate bounds on the values of

47:   the learning parameters.

48: \end{abstract}

49: %\begin{keyword}

50:   Multi-Agent Systems, Machine Learning, Complex Systems.

51: %\end{keyword}

52: %\end{frontmatter}

53:

54: \thispagestyle{fancy}

55: \lhead{}

56: \chead{Autonomous Agents and Multiagent Systems, January, 2003.}

57: \cfoot{\copyright{} Kluwer Academic Publishers 2002.}

58:

59: \section{Introduction}

60: \label{sec:Introduction}

61:

62: With the steady increase in the number of multi-agent systems (MASs)

63: with learning agents \cite{durfee:97,chavez:96,etzioni:96,stone97a},

64: the analysis of these systems is becoming increasingly important.

65: Some of the research in this area consists of experiments where a

66: number of learning agents are placed in a MAS, then different learning

67: or system parameters are varied and the results are gathered and

68: analyzed in an effort to determine how changes in the individual agent

69: behaviors will affect the system behavior.  We have learned about the

70: dynamics of market-based MASs using this approach \cite{vidal:98b}.

71: However, in this article we will take a step beyond these

72: observation-based experimental results and describe a framework that

73: can be used to model and predict the behavior of MASs with learning

74: agents.  We give a difference equation that can be used to calculate

75: the progression of an agent's error in its decision function.  The

76: equation relies on the values of parameters which capture the agents'

77: learning abilities and the relevant aspects of the MAS. We validate

78: the framework by comparing its predictions with our own experimental

79: results and with experimental results gathered from the AI literature.

80: Finally, we show how to use probably approximately correct (PAC)

81: theory to get bounds on the values of some of the parameters.

82:

83: The types of MAS we study are exemplified by the abstract

84: representation shown in Figure~\ref{fig:problem}. We assume that the

85: agents observe the physical state of the world (denoted by $w$ in the

86: figure) and take some action ($a$) based on their observation of the

87: world state. An agent's mapping from states to actions is denoted by

88: the decision function ($\delta$) inside the agent.  Notice that the

89: ``physical'' state of the world includes everything that is directly

90: observable by the agent using its sensors. It could include facts such

91: as a robot's position, or the set of outstanding bids in an auction,

92: depending on the domain. The agent does not know the decision

93: functions of the other agents.  After taking action an agent can

94: change its decision function as prescribed by whatever

95: machine-learning algorithm the agent is using.

96:

97: \begin{figure}

98:   \begin{center}

99:     \psfrag{d1}{$\delta_1$}

100:     \psfrag{d2}{$\delta_2$}

101:     \psfrag{d3}{$\delta_3$}

102:     \psfrag{d}{{\tiny $\delta$}}

103:     \includegraphics{problem}

104:     \caption{The agents in a MAS.}

105:     \label{fig:problem}

106:   \end{center}

107: \end{figure}

108:

109: We have a situation where agents are changing their decision function

110: based on the effectiveness of their actions. However, the

111: effectiveness of their actions depends on the other agents' decision

112: functions.  This scenario leads to an immediate problem: if all the

113: agents are changing their decision functions then it is not clear

114: what will happen to the system as a whole. Will the system settle to

115: some sort of equilibrium where all agents stop changing their decision

116: functions? How long will it take to converge? How do the agents'

117: learning abilities influence the system's behavior and possible

118: convergence? These are some of the questions we address in this

119: article.

120:

121: Section~\ref{sec:Fram-Model-MASs} presents our framework for

122: describing an agent's learning abilities and the error in its

123: behavior.  Section~\ref{sec:Calc-Agent's-Error} presents an equation

124: that can be used to predict an agent's expected error, as a function

125: of time, when the agent is in a MAS composed of other learning agents.

126: This equation is simplified in Section~\ref{sec:Assum-Cond-Indep} by

127: making some assumptions about the type of MAS being modeled.

128: Section~\ref{sec:Volatility} then defines the last few parameters used

129: in our framework---volatility and impact.

130: Section~\ref{sec:Example-with-2} gives an illustrative example of the

131: use of the framework.  The predictions made by our framework are

132: verified by our own experiments, as shown in

133: Section~\ref{sec:Simple-Application}, and with the experiments of

134: others, as shown in Section~\ref{sec:Appl-our-Theory}. The use of PAC

135: theory for determining bounds on the learning parameters is detailed

136: in Section~\ref{sec:Bound-Learn-Rate}. Finally,

137: Section~\ref{sec:Related-Work} describes some of the related work and

138: Section~\ref{sec:Summary} summarizes our claims.

139:

140:

141: \section{A Framework for Modeling MASs}

142: \label{sec:Fram-Model-MASs}

143:

144:

145: In order to analyze the behaviors of agents in MASs composed of

146: learning agents, we must first construct a formal framework for

147: describing these agents. The framework must state any assumptions and

148: simplifications it makes about the world, the agents, and the agents'

149: behaviors. It must also be mathematically precise, so as to allow us

150: to make quantitative predictions about expected behaviors. Finally,

151: the simplifications brought about because of the need for mathematical

152: precision should not be so constraining that they prevent the

153: applicability of the framework to a wide variety of learning

154: algorithms and different types of MASs.  We now describe our framework

155: and explain the types of MASs and learning behaviors that it can

156: capture.

157:

158: \subsection{The World and its Agents}

159: \label{sec:World-its-Agents}

160:

161: A MAS consists of a finite number of agents, actions, and world

162: states. We let $N$ denote the finite set of agents in the system.  $W$

163: denotes the finite set of world states.  Each agent is assumed to have

164: a set of perceptors (e.g., a camera, microphone, bid queue) with which

165: it can perceive the world. An agent uses its sensors to ``look'' at the

166: world and determine which world state $w$ it is in; the set of all

167: these states is $W$.  $A_i$, where $|A_i| \geq 2$, denotes the finite

168: set of actions agent $i \in N$ can take.

169:

170: We assume discrete time, indexed in the various functions by the

171: superscript $t$, where $t$ is an integer greater than or equal to 0.

172: The assumption of discrete time is made, for practical reasons, by a

173: number of learning algorithms. It means that, while the world might be

174: continuous, the agents perceive and learn in separate discrete

175: time steps.

176:

177: We also assume that there is only one world state $w$ at each time,

178: which all the agents can perceive in its completeness. That is, we

179: assume the environment is accessible (as defined in

180: \cite[p46]{ai:modern:approach}).  This assumption holds for market

181: systems in which all the actions of all the agents are perceived by

182: all the agents, and for software agent domains in which all the agents

183: have access to the same information. However, it might not hold for

184: robotic domains where one agent's view of the world might be obscured

185: by some physical obstacle. Even in such domains, it is possible that

186: there is a strong correlation between the states perceived by each

187: agent. These correlations could be used to create equivalence classes

188: over the agents' perceived states, and these classes could then be

189: used as the states in $W$.

190:

191: Finally, we assume the environment is deterministic

192: \cite[p46]{ai:modern:approach}. That is, the agents' combined actions

193: will always have the expected effect. Of course, agent $i$ might not

194: know what action agent $j$ will take so $i$ might not know the

195: eventual effect of its own individual action.

196:

197: \subsection{A Description of Agent Behavior}

198: \label{sec:Agents}

199:

200: In the types of MASs we are modeling, every agent $i$ perceives the

201: state of the world $w$ and takes an action $a_i$, at each time step.

202: We assume that every agent's \textbf{behavior}, at each moment in

203: time, can be described with a simple state-to-action mapping. That is,

204: an agent's choice of action is solely determined by its current

205: state-to-action mapping and the current world $w$.

206:

207: Formally, we say that agent $i$'s behavior is represented by a

208: \textbf{decision function} (also known as a ``policy'' in control

209: theory and a ``strategy'' in game theory), given by $\delta_{i}^t:W

210: \ra A_i$. This function maps each state $w \in W$ to the action $a_i

211: \in A_i$ that agent $i$ will take in that state, at time $t$. This

212: function can effectively describe any agent that deterministically

213: chooses its action based on the state of the world. Notice that the

214: decision function is indexed with the time $t$. This allows us to

215: represent agents that change their behavior.

216:

217: The action agent $i$ \emph{should} take in each state $w$ is given by

218: the \textbf{target function} $\Delta_{i}^t:W \ra A_i$, which also maps

219: each state $w \in W$ to an action $a_i \in A_i$. The agent does not

220: have direct access to its target function. The target function is used

221: to determine how well an agent is doing. That is, it represents the

222: ``perfect'' behavior for a given agent. An agent's learning task is to

223: get its decision function to match its target function as much as

224: possible.

225:

226: Since the choice of action for agent $i$ often depends on the actions

227: of other agents, the target function for $i$ needs to take these

228: actions into account. That is, in order to generate \Dit{}, one would

229: need to know \djtw{} for all $j \in N_{-i}$ and $w \in W$.  These

230: \djtw{} functions tell us the actions that all the other agents will

231: take in every state $w$.  For example, in order for one to determine

232: what an agent should bid in every world $w$ of an auction-based market

233: system, one will need to know what the other agents will bid in every

234: world $w$. One can use these actions, along with the state $w$, in

235: order to identify the best action for $i$ to take.

236:

237: \begin{figure}

238:   \centerline{

239:     \xymatrix{ *\txt{New world $w^t \in \D$} \ar[r] & *\txt{Perceive world $w^t$}

240:       \ar[r] & *\txt{Take action $\delta_i(w^t)$}\ar[d] \\

241:       & *\txt{Learn}\ar[lu]_{t \leftarrow t + 1} & *\txt{Receive payoff\\ or feedback.}\ar[l]

242:       }}

243:     \caption{Action/Learn loop for an agent.}

244:     \label{fig:action-learn}

245: \end{figure}

246:

247:

248: An agent's \ditw{} can change over time, so that $\ditt \neq \dit$.

249: These changes in an agent's decision function reflect its learned

250: knowledge. The agents in the MASs we consider are engaged in the

251: discrete action/learn loop shown in Figure~\ref{fig:action-learn}. The

252: loop works as follows: At time $t$ the agents perceive a world $w^t

253: \in W$ which is drawn from a fixed distribution \Dw{}.  They then each

254: take the action dictated by their \dit{} functions; all of these

255: actions are assumed to be taken effectively in parallel. Lastly, they

256: each receive a payoff which their respective learning algorithms use

257: to change the \dit{} so as to, hopefully, better match \Dit{}.  By

258: time $t+1$, the agents have new \ditt{} functions and are ready to

259: perceive the world again and repeat the loop.  Notice that, at time

260: $t$, an agent's \Dit{} is derived by taking into account the \djt{} of

261: all other agents $j \in N_{-i}$.

262:

263: We assume that \Dw{} is a fixed probability distribution from which we

264: take the worlds seen at each time. This assumption is not unreasonably

265: limiting.  For example, in an economic domain where the new state is

266: the new good being offered, or in an episodic domain where the agents

267: repeatedly engage in different games (e.g. a Prisoner's Dilemma

268: competition) there is no correlation between successive world states

269: or between these states and the agents' previous actions.  However, in

270: a robotic domain one could argue that the new state of the world will

271: depend on the current state of the world; after all, the agents

272: probably move very little each time step.

273:

274: Our measure of the correctness of an agent's behavior is given by our

275: \textbf{error} measure. We define the error of agent $i$'s decision

276: function \ditw{} as

277: \begin{equation}

278:   \label{eq:error}

279:   \begin{split}

280:     \edit &= \sum_{w \in W} \Dw \Pro[\ditw \neq \Ditw] \\

281:     &= \Pro_{w \in \D} [\ditw \neq \Ditw].

282:   \end{split}

283: \end{equation}

284:

285: \edit{} gives us the probability that agent $i$ will take an incorrect

286: action; it is in keeping with the error definition used in

287: computational learning theory \cite{intro:clt}.  We use it to gauge

288: how well agent $i$ is performing. An error of 0 means that the agent

289: is taking all the actions dictated by its target function. An error of

290: 1 means that the agent never takes an action as dictated by its target

291: function. Each action the agent takes is either correct or incorrect,

292: that is, it either matches the target function or it does not. We do

293: not model degrees of incorrectness. However, since the error is

294: defined as the average over all possible world states, an agent that

295: takes the correct action in most world states will have a small error.

296: Extending the theory to handle degrees of incorrectness is one of

297: the subjects of our continuing work, see

298: Section~\ref{sec:future-work}. All the notation from this section is

299: summarized in Figure~\ref{fig:summary-notation}.
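
As a small worked instance of Eq.~\eqref{eq:error}, consider a
hypothetical system (the numbers are chosen only for illustration and
are not taken from any experiment) with three world states, $W =
\{w_1, w_2, w_3\}$, where $\D(w_1) = 0.5$, $\D(w_2) = 0.3$, and
$\D(w_3) = 0.2$. Suppose that at time $t$ the decision function of
agent $i$ disagrees with its target function only in $w_2$. Then
\begin{equation*}
  e(\delta_i^t) = 0.5 \cdot 0 + 0.3 \cdot 1 + 0.2 \cdot 0 = 0.3,
\end{equation*}
that is, the agent takes an incorrect action with probability $0.3$.
Had the single incorrect mapping been for $w_3$ instead, the error
would have been $0.2$: the error weighs each incorrect mapping by how
often its world state is seen, not by how wrong the chosen action is.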

300:

301: \begin{figure}

302:   \begin{center}

303:     \fbox{ \parbox{4.5in}{

304:         \begin{description}

305:         \item[$N$] the set of all agents, where $i \in N$ is one

306:           particular agent.

307:         \item[$W$] the set of possible states of the world, where $w

308:           \in W$ is one particular state.

309:         \item[$A_i$] the set of all actions that agent $i$ can take.

310:         \item[$\delta_i^t: W \ra A_i$] the \textbf{decision} function

311:           for agent $i$ at time $t$. It tells which action agent $i$

312:           will take in each world.

313:         \item[$\Delta_i^t: W \ra A_i$] the \textbf{target} function

314:           for agent $i$ at time $t$. It tells us what action agent $i$

315:           should take. It takes into account the actions that other

316:           agents will take.

317:         \item[$e(\delta_i^t)$] $= \Pro[\delta_i^t(w) \neq

318:           \Delta_i^t(w) \,|\, w \in \D]$ the \textbf{error} of agent

319:           $i$ at time $t$. It is the probability that $i$ will take an

320:           incorrect action, given that the worlds $w$ are taken from

321:           the fixed probability distribution \D.

322:         \end{description}

323:         } }

324:     \caption{Summary of notation used for describing a MAS and the agents in it.}

325:     \label{fig:summary-notation}

326:   \end{center}

327: \end{figure}

328:

329:

330: \subsection{The Moving Target Function Problem}

331: \label{sec:Moving-Targ-Funct}

332:

333:

334: The learning problem the agent faces is to change its \ditw{} so that

335: it matches \Ditw{}.  If we imagine the space of all possible decision

336: functions, then agent $i$'s \dit{} and \Dit{} will be two points in

337: this space, as shown in Figure~\ref{fig:trad-learn}.  The agent's

338: learning problem can then be re-stated as the problem of moving its

339: decision function as close as possible to its target function, where

340: the distance between the two functions is given by the error \edit{}.

341: This is the traditional machine learning problem.

342:

343: \begin{figure}[thbp]

344:   \centerline{

345:     \xymatrix {

346:       & &  \ditt{}  \ar@{~}[rrrd]^{e(\ditt)}

347:       & & &  \\

348:       \dit{} \ar[rru]^{\txt{Learn}} \ar@{~}[rrrrr]_{e(\dit)} &&

349:       & & &  \Delta_i

350:       } }

351:   \caption{The traditional learning problem.}

352:   \label{fig:trad-learn}

353: \end{figure}

354:

355: However, once agents start to change their decision functions (i.e.,

356: change their behaviors) the problem of learning becomes more

357: complicated because these changes might cause changes in the other

358: agents' target functions. We end up with a moving target function, as

359: seen in Figure~\ref{fig:learn-mas}. In these systems, it is not clear

360: if the error will ever reach 0 or, more generally, what the expected

361: error will be as time goes to infinity. Determining what will happen

362: to an agent's error in such a system is what we call the

363: \textbf{moving target function problem}, which we address in this

364: article. However, we will first need to define some parameters that

365: describe the capabilities of an agent's learning algorithm.

366:

367: \begin{figure}[thbp]

368:   \centerline{

369:     \xymatrix{

370:       & &  \ditt{}  \ar@{~}[rrrrd]^{e(\ditt)}

371:       & & & \\

372:       \dit{} \ar[rru]^{\txt{Learn}} \ar@{~}[rrrrr]_{e(\dit)} &&

373:       & & & \Dit \ar[r]_{\txt{Move}} & \Ditt

374:       }}

375:   \caption{The learning problem in learning MASs.}

376:   \label{fig:learn-mas}

377: \end{figure}

378:

379: \subsection{A Model of Learning Algorithms}

380: \label{sec:Model-Learn-Algor}

381:

382: An agent's learning algorithm is responsible for changing \dit{} into

383: \ditt{} so that it is a better match of \Dit{}. Different machine

384: learning algorithms will achieve this match with different degrees of

385: success.  We have found a set of parameters that can be used to model

386: the effects of a wide range of learning algorithms. The parameters are:

387: Change rate, Learning rate, Retention rate, and Impact; and they will

388: be explained in this section, except for Impact which will be

389: introduced in Section~\ref{sec:Volatility}. These parameters, along

390: with the equations we provide, form the \textbf{CLRI} framework (the

391: letters correspond to the first letter of the parameters' names).

392:

393: After agent $i$ takes an action and receives some payoff, it activates

394: its learning algorithm, as we showed in Figure~\ref{fig:action-learn}.

395: The learning algorithm is responsible for using this payoff in order

396: to change \dit{} into \ditt{}, making \ditt{} match \Dit{} as much as

397: possible. We can expect that for some $w$ it was true that $\ditw =

398: \Ditw$, while for some other $w$ this was not the case. That is,

399: some of the $w \ra a_i$ mappings given by \ditw{} might have been

400: incorrect.  In general, a learning algorithm might affect both the

401: correct and incorrect mappings. We will treat these two cases

402: separately.

403:

404: We start by considering the incorrect mappings and define the

405: \textbf{change rate} of the agent as the probability that the agent

406: will change any given one of its incorrect mappings. Formally, we

407: define the change rate $c_i$ for agent $i$ as

408: \begin{equation}

409:   \label{eq:54}

410:   \forallb_{w} \; \Pro[\dittw \neq \ditw \, |\, \ditw \neq \Ditw] = c_i.

411: \end{equation}

412:

413: The change rate tells us the likelihood of the agent changing an

414: incorrect mapping into something else. This ``something else'' might

415: be the correct action, but it could also be another incorrect action.

416: The probability that the agent changes an incorrect mapping to the

417: correct action is called the \textbf{learning rate} of the agent.

418: It is defined as $l_i$ where

419: \begin{equation}

420:   \label{eq:51}

421:   \forallb_{w} \; \Pro[\dittw = \Ditw \, | \, \ditw \neq

422:   \Ditw] = l_{i}.

423: \end{equation}

424: %When determining the value of $l_i$, for a particular agent, one must

425: %remember that the worlds seen at each time step are taken from \Dw{}.

426: There are two constraints which must always be satisfied by these two

427: rates. Since changing to the correct mapping implies that a change was

428: made, the value of $l_i$ must be less than or equal to $c_i$, that is,

429: $l_i \leq c_i$ must always be true.  Also, if $|A_i| = 2$ then $c_i =

430: l_i$ since there are only two actions available, so the one that is

431: not wrong must be right.

432:

433: The complementary value for the learning rate is $1- l_i$ and refers

434: to the probability that an incorrect mapping does not get changed to a

435: correct one.  An example learning rate of $l_i = .5$ means that if

436: agent $i$ initially has all mappings wrong it will make half of them

437: match the original target function after the first iteration.
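
To make the relationship between the two rates concrete, consider a
hypothetical agent (the values are assumptions chosen for
illustration) with $|A_i| = 3$, $c_i = 0.6$, and $l_i = 0.4$. For a
mapping that is incorrect at time $t$, the three possible outcomes of
one learning step have probabilities
\begin{align*}
  \Pro[\text{changed to the correct action } \Delta_i^t(w)] &= l_i = 0.4, \\
  \Pro[\text{changed to another incorrect action}] &= c_i - l_i = 0.2, \\
  \Pro[\text{left unchanged}] &= 1 - c_i = 0.4,
\end{align*}
which sum to 1. The mapping therefore fails to match $\Delta_i^t(w)$
after the step with probability $1 - l_i = 0.6$. With $|A_i| = 2$ the
middle outcome is impossible, which is why $c_i = l_i$ in that case.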

438:

439: We now consider the agent's correct mappings and define the

440: \textbf{retention rate} as the probability that a correct mapping will

441: stay correct in the next iteration. The retention rate is given by

442: $r_i$ where

443: \begin{equation}

444:   \label{eq:52}

445:   \forallb_{w} \; \Pro[\dittw = \Ditw \, | \, \ditw = \Ditw]

446:  = r_i.

447: \end{equation}

448: We propose that the behavior of a wide variety of learning algorithms

449: can be captured (or at least approximated) using appropriate values

450: for $c_i$, $l_i$, and $r_i$. Notice, however, that these three rates

451: assume that the $w \ra a$ mappings that change are independent of the

452: $w$ that was just seen. We can justify this independence by noting

453: that most learning algorithms usually perform some form of

454: generalization.  That is, after observing one world state $w$ and the

455: payoff associated with it, a typical learning algorithm is able to

456: generalize what it learned to some other world states. This

457: generalization is reflected in the fact that the change, learning, and

458: retention rates apply to all $w$'s. However, a more precise model

459: would capture the fact that, in some learning algorithms, the mapping

460: for the world state that was just seen is more likely to change than

461: the mapping for any other world state.

462:

463: The rates are not time dependent because we assume that agents use one

464: learning algorithm during their lifetimes. The rates capture the

465: capabilities of this learning algorithm and, therefore, do not need to

466: vary over time.

467:

468: Finally, we define \textbf{volatility} to mean the probability that

469: the target function will change from time $t$ to time $t+1$. Formally,

470: volatility is given by $v_i$ where

471: \begin{equation}

472:   \label{eq:53}

473:   \forallb_{w} \; \Pro[\Dittw \neq \Ditw] = v_i.

474: \end{equation}

475: In Section~\ref{sec:Volatility}, we will show how to calculate $v_i$

476: in terms of the error of the other agents. We will then see that

477: volatility is not a constant but, instead, varies with time.

478:

479:

480: \section{Calculating the Agent's Error}

481: \label{sec:Calc-Agent's-Error}

482:

483: We now wish to write a difference equation that will let us calculate

484: the agent's expected error, as defined in Eq.~\eqref{eq:error}, at

485: time $t+1$ given the error at time $t$ and the other parameters we

486: have introduced. We can do this by observing that there are two

487: conditions that determine the new error: whether $\Dittw{} = \Ditw{}$

488: or not, and whether $\ditw{} = \Ditw{}$ or not.  If we define $a

489: \equiv \Dittw{} = \Ditw{}$, and $b \equiv \ditw{} = \Ditw{}$, we can

490: then say that we need to consider the four cases where: $a \wedge b$,

491: $a \wedge \neg b$, $\neg a \wedge b$, and $\neg a \wedge \neg

492: b$. Formally, this implies that

493: \begin{equation}

494:   \label{eq:1}

495:   \begin{split}

496:     \Pro&[\dittw \neq \Dittw] = \\

497:     & \Pro[\dittw \neq \Dittw \wedge a \wedge b] +

498:     \Pro[\dittw \neq \Dittw \wedge a \wedge \neg b] + \\

499:     & \Pro[\dittw \neq \Dittw \wedge \neg a \wedge b] +

500:     \Pro[\dittw \neq \Dittw \wedge \neg a \wedge \neg b],

501:   \end{split}

502: \end{equation}

503: since the four cases are exclusive of each other. Applying the chain

504: rule of probability, we can rewrite each of the four terms in order to

505: get

506: \begin{equation}

507:   \label{eq:2}

508:   \begin{split}

509:     \Pro[\dittw \neq \Dittw] &= \Pro[a \wedge b] \cdot \Pro[\dittw \neq \Dittw \,|\, a\wedge b] + \\

510:     & \Pro[a \wedge \neg b] \cdot \Pro[\dittw \neq \Dittw \,|\, a\wedge

511:   \neg b] + \\

512:   & \Pro[\neg a \wedge b] \cdot \Pro[\dittw \neq \Dittw \,|\, \neg

513:   a\wedge  b] + \\

514:   & \Pro[\neg a \wedge \neg b] \cdot \Pro[\dittw \neq \Dittw \,|\, \neg

515:   a\wedge \neg b].

516: \end{split}

517: \end{equation}

518: We can now find values for these conditional probabilities. We start

519: with the first term where, after replacing the values of $a$ and $b$,

520: we find that

521: \begin{equation}

522:   \label{eq:3}

523:   \Pro[\dittw \neq \Dittw \,|\, \Dittw = \Ditw \wedge \ditw = \Ditw] = 1 - r_i.

524: \end{equation}

525: Since the target function does not change from time $t$ to $t+1$ and

526: the agent was correct at time $t$, the agent will also be correct at

527: time $t+1$; \emph{unless} it changes its correct $w \ra a$ mapping. The

528: agent changes this mapping with probability $1 - r_i$.

529:

530: The value for the second conditional probability is

531: \begin{equation}

532:   \label{eq:4}

533:   \Pro[\dittw \neq \Dittw \,|\, \Dittw = \Ditw \wedge \ditw \neq \Ditw] = 1 - l_i.

534: \end{equation}

535: In this case the target function still stays the same but the agent

536: was incorrect. If the agent was incorrect then it will change its

537: decision function to match the target function with probability

538: $l_i$. Therefore, the probability that it will be incorrect next time

539: is the probability that it does not make this change, or $1 -

540: l_i$.

541:

542: The third probability has a value of

543: \begin{equation}

544:   \label{eq:5}

545:   \begin{split}

546:   \Pro&[\dittw \neq \Dittw \,|\, \Dittw \neq \Ditw \wedge \ditw =

547:   \Ditw] \\

548:   &= r_i + (1 - r_i) \cdot B.

549:   \end{split}

550: \end{equation}

551: In this case the agent was correct and the target function

552: changes. This means that if the agent retains the same mapping,

553: which it does with probability $r_i$, then the agent will definitely be

554: incorrect at time $t+1$. If it does not retain the same mapping, which

555: happens with probability $1-r_i$, then it will be incorrect with

556: probability $B$, where

557: \begin{equation}

558:   \begin{split}

559:   B = \Pro[ &\dittw \neq \Dittw | \ditw = \Ditw \wedge \Dittw \neq

560:   \Ditw \label{eq:b}\\

561:   & \wedge \dittw \neq \Ditw].

562:   \end{split}

563: \end{equation}

564: Finally, the fourth conditional probability has a value of

565: \begin{equation}

566:   \label{eq:10}

567:   \begin{split}

568:   \Pro[&\dittw \neq \Dittw \,|\, \Dittw \neq \Ditw \wedge \ditw \neq

569:   \Ditw] \\

570:   &= (1 - c_i)D + l_i +  (c_i - l_i)F,

571:   \end{split}

572: \end{equation}

573: where

574: \begin{align}

575:   D &= \Pro[ \ditw \neq \Dittw | \ditw \neq \Ditw \wedge \Dittw \neq

576:   \Ditw] \label{eq:d}\\

577:   F &= \Pro[ \dittw \neq \Dittw | \ditw \neq \Ditw \wedge \Dittw \neq

578:   \Ditw \label{eq:f} \\

579:   & \quad \quad \wedge \dittw \neq \Ditw \wedge \dittw \neq \ditw]. \nonumber

580: \end{align}

581: This is the case where the target function changes and the agent was

582: wrong. We have to consider three possibilities. The first possibility

583: is for the agent not to change its decision function, which happens

584: with probability $1 - c_i$. The probability that the agent will be

585: incorrect in this case is given by $D$. The second possibility, when

586: the agent changes its mapping to match the old target \Ditw{}, has a

587: probability of $l_i$ and ensures that the agent will be incorrect the

588: next time, because the target has since moved.  The third possibility happens, with probability $c_i -

589: l_i$, when the agent changes its mapping to an incorrect value. In this

590: case, the probability that it will be wrong next time is given by $F$.
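
As an illustrative check (a sketch under the assumption $|A_i| = 2$,
which is not the general case), the last two conditional probabilities
simplify considerably. With only two actions, $\delta_i^t(w) \neq
\Delta_i^t(w)$ together with $\Delta_i^{t+1}(w) \neq \Delta_i^t(w)$
forces $\Delta_i^{t+1}(w) = \delta_i^t(w)$, so an agent that keeps its
old mapping is now correct and $D = 0$. Likewise, an agent that was
correct and then changes its mapping must land on the new target, so
$B = 0$. Since $c_i = l_i$ when $|A_i| = 2$, the term in $F$ vanishes
and we are left with
\begin{align*}
  r_i + (1 - r_i) \cdot B &= r_i, \\
  (1 - c_i) D + l_i + (c_i - l_i) F &= l_i.
\end{align*}
In words: after the target moves, a two-action agent is wrong exactly
when it retained a previously correct mapping or ``learned'' the old
target.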

591:

592: We can substitute Eqs.~\eqref{eq:3}, \eqref{eq:4}, \eqref{eq:5}, and

593: \eqref{eq:10} into Eq.~\eqref{eq:2}, substitute the values of $a$ and

594: $b$, and expand $\Pro[a \wedge b]$ into $\Pro[a \,|\,b] \cdot

595: \Pro[b]$, in order to get

596: \begin{equation}

597:   \label{eq:main:general}

598:   \begin{split}

599:   E&[\editt] = E[\sum_{w \in W} \Dw \Pro[\dittw \neq \Dittw]] = \sum_{w \in W} \Dw ( \\

600:   &\quad \Pro[\Dittw = \Ditw | \ditw = \Ditw] \cdot \Pro[\ditw = \Ditw] \cdot ( 1 -r_i) \\

601:   &+ \Pro[\Dittw = \Ditw | \ditw \neq \Ditw]\cdot \Pro[\ditw \neq \Ditw] \cdot (1 - l_i) \\

602:   &+ \Pro[\Dittw \neq \Ditw | \ditw = \Ditw ] \\

603:   & \quad \cdot

604:     \Pro[\ditw = \Ditw] \cdot

605:     \left(

606:       r_i + (1 - r_i) \cdot B

607: %      \left(

608: %        \frac{|A_i| - 2}{|A_i| - 1}

609: %      \right)

610:     \right) \\

 &+ \Pro[\Dittw \neq \Ditw | \ditw \neq \Ditw ]  \cdot \Pro[\ditw \neq
 \Ditw] \\
 & \quad \cdot  \left( (1 - c_i)D + l_i +  (c_i - l_i)F \right) ).

614:   \end{split}

615: \end{equation}

616: Equation~\eqref{eq:main:general} will model any MAS whose agent

learning can be described with the parameters presented in
Section~\ref{sec:Model-Learn-Algor} and whose action/learn loop is the

619: same as we have described. We can use Eq.~\eqref{eq:main:general} to

620: calculate the successive expected errors for agent $i$, given values

621: for all the parameters and probabilities. In the next section we show

622: how this is done in a simple example game.

623:

624: \subsection{The Matching game}

625: \label{sec:Matching-game}

626:

627: In this matching game we have two agents $i$ and $j$ each of whom, in

628: every world $w$, wants to play the same action as the other one. Their

629: set of actions is $A_i = A_j$, where we assume $|A_i| > 2$ (for $|A_i|

630: = 2$ the equation is simpler).  After every time step, the agents both

learn and change their decision functions in accordance with their

632: learning rates, retention rates, and change rates.  Since the agents

633: are trying to match each other, in this game it is always true that

634: $\Delta_i^t(w)= \delta_j^t(w)$ and $\Delta_j^t(w) = \delta_i^t(w)$.

Given all this information, we can find values for some of the
probabilities in Eq.~\eqref{eq:main:general} (including values
for Eqs.~\eqref{eq:b}, \eqref{eq:d}, and \eqref{eq:f}) and rewrite
it (see Appendix~\ref{sec:Deriving-C-matching} for the derivation) as:

639: \begin{equation}

640:   \label{eq:main:matching}

641:   \begin{split}

642:   E&[\editt] = \sum_{w \in W} \Dw \{  r_j \cdot \Pro[\ditw = \Ditw] \cdot ( 1 -r_i) \\

643:   &+ (1 -c_j) \cdot \Pro[\ditw \neq \Ditw] \cdot (1 - l_i) \\

644:   &+ (1-r_j) \cdot

645:     \Pro[\ditw = \Ditw] \cdot

646:   \left(

647:     r_i + (1 - r_i) \cdot

648:     \left(

649:       \frac{|A_i| - 2}{|A_i| - 1}

650:     \right)

651:   \right)\\

652:   &+ c_j \cdot \Pro[\ditw \neq \Ditw] \cdot

653:   \left(

654:  1 - l_j +

655:     \frac{c_i l_j(|A_i| -1) + l_i(1 - l_j) - c_i}{|A_i|-2}

656:   \right) \}\\

657:   \end{split}

658: \end{equation}

We can better understand this equation by plugging in some values and
simplifying. For example, let us assume that $r_i = r_j = 1$ and $l_i =
l_j = 1$, which implies that $c_i = c_j = 1$. This is the case where

662: the two agents always change all their incorrect mappings so as to

663: match their respective target functions at time $t$. That is, if we

664: had $\delta_i^t(w_1)= x$ and $\delta_j^t(w_1) = y$, then at time $t+1$

665: we will have $\delta_i^{t+1}(w_1) = y$ and $\delta_j^{t+1}(w_1) = x$.

666: This means that agent $i$ changes all its incorrect mappings to match

667: $j$, while $j$ changes to match $i$, so all the mappings stay wrong

668: after all (i.e., $i$ ends up doing what $j$ did before, while $j$ does

669: what $i$ did before). The error, therefore, stays the same.  We can

670: see this by plugging the values into Eq.~\eqref{eq:main:matching}.

671: The first three terms will become 0 and the fourth term will simplify

672: to the definition of error, as given by Eq.~\eqref{eq:error}. Since

673: the fourth term is the only one that is non-zero, we end up with

674: $E[\editt] = \edit$.
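
As a quick check of this substitution (a sketch that uses only
Eq.~\eqref{eq:main:matching} and the values $r_i = r_j = l_i = l_j =
c_i = c_j = 1$), the bracket in the fourth term evaluates to
\begin{equation*}
  1 - l_j + \frac{c_i l_j(|A_i| -1) + l_i(1 - l_j) - c_i}{|A_i|-2}
  = 0 + \frac{(|A_i| - 1) + 0 - 1}{|A_i| - 2} = 1,
\end{equation*}
while the factors $(1 - r_i)$, $(1 - c_j)$, and $(1 - r_j)$ set the
first three terms to zero. What remains is
\begin{equation*}
  E[\editt] = \sum_{w \in W} \Dw \Pro[\ditw \neq \Ditw] = \edit .
\end{equation*}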

675:

We can also let $c_i$ and $l_i$ be arbitrary numbers (keeping $r_i =
r_j = 1$ and $c_j = l_j = 1$), which gives us $ E[\editt] = c_i \edit$.  This

678: tells us that the error will drop faster for a smaller change rate

679: $c_i$. The reason is that $i$'s learning (remember $l_i \leq c_i$) in

680: this game is counter-productive because it is always made invalid by

681: $j$'s learning rate of $1$. That is, since $j$ is changing all its

682: mappings to match $i$'s actions, $i$'s best strategy is to keep its

683: actions the same (i.e., $c_i = 0$).
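
The factor $c_i$ can be read off Eq.~\eqref{eq:main:matching} directly.
The following is a sketch that assumes $r_i = r_j = 1$ is kept from the
previous case, together with $c_j = l_j = 1$. The first three terms
again vanish, and the bracket in the fourth term reduces to
\begin{equation*}
  1 - 1 + \frac{c_i (|A_i| - 1) + l_i \cdot 0 - c_i}{|A_i| - 2}
  = \frac{c_i (|A_i| - 2)}{|A_i| - 2} = c_i ,
\end{equation*}
so that $E[\editt] = c_i \edit$ regardless of $l_i$. If we treat the
expected error as the error at the next step, then after $k$ steps the
error is $c_i^{k}$ times its initial value. For instance, $c_i = .5$
would halve the expected error at every step, while $c_i = 1$ leaves it
unchanged.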

684:

685:

686: \section{Further Simplification}

687: \label{sec:Assum-Cond-Indep}

688:

689: We can further simplify Eq.~\eqref{eq:main:general} if we are willing to

690: make two assumptions. The first assumption is that the new actions

691: chosen when either \ditw{} changes (and does not match the target), or

692: when \Ditw{} changes, are both taken from flat probability

693: distributions over $A_i$. By making this assumption we can find

694: values for $B$, $D$, and $F$, namely:

695: \begin{alignat}{2}

696:   \label{eq:23}

697:   B = D & = \frac{|A_i| -2}{|A_i| - 1} & \qquad F & = \frac{|A_i| -3}{|A_i| - 2}

698: \end{alignat}

699:

700: %which makes

701: %\begin{equation}

702: %  \label{eq:27}

703: %  C =  \frac{|A_i| -2 - c_i + 2l_i}{|A_i| - 1}

704: %\end{equation}

705:

706: The second assumption we make is that the probability of \Ditw{}

707: changing, for a particular $w$, is independent of the

708: probability that \ditw{} was correct. In

709: Section~\ref{sec:Matching-game} we saw that in the matching game the

710: probabilities of \Ditw{} and \ditw{} changing were correlated since,

711: if \ditw{} was wrong then \djtw{} was also wrong, which meant $j$

712: would probably change \djtw{}, which would change \Ditw{}.

713:

714: However, the matching game is a degenerate example in exhibiting such

715: tight coupling between the agents' target functions. In general, we

716: can expect that there will be a number of MASs where the probability

717: that any two agents $i$ and $j$ are correct is uncorrelated (or

718: loosely correlated). For example, in a market system all sellers try

719: to bid what the buyer wants, so the fact that one seller bids the

720: correct amount says nothing about another seller's bid. Their bids are

721: all uncorrelated. In fact, the Distributed Artificial Intelligence

722: literature is full of systems that try to make the agents' decisions

723: as loosely-coupled as possible \cite{lesser:81,Liu:95}.

724:

725: This second assumption we are trying to make can be formally

726: represented by having Eq.~\eqref{eq:cond:indep} be true for all pairs of

727: agents $i$ and $j$ in the system.

728: \begin{equation}

729:   \label{eq:cond:indep} %

730:   \begin{split}

731:   \Pro[&\ditw = \Ditw \wedge \djtw = \Djtw] \\

732:   & = \Pro[\ditw = \Ditw] \cdot \Pro[\djtw = \Djtw]

733:   \end{split}

734: \end{equation}

735:

736:

737: Once we make these two assumptions we can

738: rewrite Eq.~\eqref{eq:main:general} as:

739: \begin{equation}

740:   \label{eq:main}

741:   \begin{split}

742:   E&[\editt]  = \sum_{w \in W} \Dw (  \Pro[\Dittw = \Ditw] \cdot ( \Pro[\ditw = \Ditw] \cdot ( 1 -r_i) \\

743:   & \qquad \qquad \qquad \qquad \qquad + \Pro[\ditw \neq \Ditw] \cdot (1 - l_i)) \\

744:   & \quad + \Pro[\Dittw \neq \Ditw] \cdot

745:     (\Pro[\ditw = \Ditw] \cdot

746:   \left(

747:     r_i + (1 - r_i) \cdot

748:     \left(

749:       \frac{|A_i| - 2}{|A_i| - 1}

750:     \right)

751:   \right)\\

752:   & \qquad \qquad \qquad \qquad \qquad + \Pro[\ditw \neq \Ditw] \cdot

753:   \left(

754:     \frac{|A_i| -2 - c_i + 2l_i}{|A_i| - 1}

755:   \right))) \\

756:   \end{split}

757: \end{equation}

758: Some of the probabilities in this equation are just the definition of

759: $v_i$, and others simplify to the agent's error. This means that we

760: can simplify Eq.~\eqref{eq:main} to:

761: \begin{multline}

762:   \label{eq:main:simp}

763:     E[\editt] = 1 - r_i + v_i

764:     \left(

765:       \frac{|A_i|r_i - 1}{|A_i| -1}

766:     \right) \\

767:     + \edit

768:     \left(

769:       r_i -l_i + v_i

770:       \left(

771:         \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| -1}

772:       \right)

773:     \right)

774: \end{multline}

775:

776: Eq.~\eqref{eq:main:simp} is a difference equation that can be used to

777: determine the expected error of the agent at any time by simply using

778: $E[\editt]$ as the \edit{} for the next iteration. While it might look

779: complicated, it is just the function for a line $y = mx+b$ where $x =

780: \edit$ and $y= \editt$. Using this observation, and the fact that

781: \editt{} will always be between $0$ and $1$, we can determine that the

782: final convergence point for the error is the point

783: where Eq.~\eqref{eq:main:simp} intersects the line $y=x$. The only

784: exception is if the slope equals $-1$, in which case we will see the

785: error oscillating between two points.
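
To make the convergence point explicit, we can write
Eq.~\eqref{eq:main:simp} as $y = mx + b$ with
\begin{align*}
  b &= 1 - r_i + v_i \left( \frac{|A_i| r_i - 1}{|A_i| - 1} \right), \\
  m &= r_i - l_i + v_i \left( \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| - 1} \right).
\end{align*}
The symbols $m$, $b$, $x^{*}$, and $x_k$ are notation introduced only
for this sketch. Setting $y = x$ gives the fixed point $x^{*} = b/(1 -
m)$ (for $m \neq 1$), and iterating the line from a starting error
$x_0$ gives $x_k = x^{*} + m^{k}(x_0 - x^{*})$, which converges to
$x^{*}$ whenever $|m| < 1$. As a worked instance, take the parameter
values $v_i = .2$, $c_i = 1$, $l_i = .3$, $r_i = 1$, $|A_i| = 20$:
\begin{align*}
  b &= 0 + .2 \cdot \frac{19}{19} = .2, \\
  m &= .7 + .2 \cdot \frac{20(-.7) + .3 - 1}{19}
     = .7 - .2 \cdot \frac{14.7}{19} \approx .545, \\
  x^{*} &= \frac{.2}{1 - .545} \approx .44 .
\end{align*}
Starting from $x_0 = .95$, the first step gives $x_1 \approx .2 + .545
\cdot .95 \approx .72$, and later steps approach $.44$ geometrically.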

786: \begin{figure}

787:   \psfrag{edit}{{\small \edit}}

788:   \psfrag{editt}{{\small \editt}}

789:   \psfrag{editt2}[B]{{\tiny \editt}}

790:   \psfrag{learning}[B]{{\tiny learning}}

791:   \psfrag{volatility}[B]{{\tiny volatility}}

792:   \begin{center}

793:     \includegraphics[width=3in]{errorprog}

794:     \caption{Error progression for agent $i$, assuming a fixed

795:       volatility $v_i = .2$, $c_i =1$, $l_i = .3$, $r_i = 1$, $|A_i| =

796:       20$. We show the error function (\editt{}), as well as its two

      components: learning and volatility. The line $y=x$ allows us to

798:       trace the agent's error as it starts at $.95$ and converges to

799:       $.44$.}

800:     \label{fig:errorprog}

801:   \end{center}

802: \end{figure}

803:

804: By looking at Eq.~\eqref{eq:main:simp} we can also determine that

805: there are two ``forces'' acting on the agent's error: volatility and

806: the agent's learning abilities. The volatility tends to increase the

807: agent's error past its current value while the learning reduces it. We

808: can better appreciate this effect by separating the $v_i$ terms in

809: Eq.~\eqref{eq:main:simp} and plotting the $v_i$ terms (volatility) and

810: the rest of the terms (learning) as two separate lines. By definition,

811: these will add up to the line given by Eq.~\eqref{eq:main:simp}. We

812: have plotted these three lines and traced a sample error progression

813: in Figure~\ref{fig:errorprog}. The error starts at .95 and then

814: decreases to eventually converge to .44.  We notice the learning curve

815: always tries to reduce the agent's error, as confirmed by the fact

816: that its line always falls below $y=x$.  Meanwhile, the volatility

817: adds an extra error. This extra error is bigger when the agent's error

is small, since any change in the target function is then likely to

819: increase the agent's error.
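
The split into the two lines can be written out explicitly. As a
sketch, grouping the terms of Eq.~\eqref{eq:main:simp} by whether they
contain $v_i$ gives
\begin{align*}
  \text{learning:} \quad & 1 - r_i + \edit \, (r_i - l_i), \\
  \text{volatility:} \quad & v_i \left( \frac{|A_i| r_i - 1}{|A_i| - 1}
    + \edit \, \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| - 1} \right).
\end{align*}
For the values $v_i = .2$, $c_i = 1$, $l_i = .3$, $r_i = 1$, $|A_i| =
20$, the learning line is $.7\,\edit$, which lies below $y = x$ for any
positive error, and the volatility line is approximately $.2 - .155\,
\edit$. The volatility contribution therefore falls from $.2$ at zero
error to about $.045$ at an error of $1$, which is the sense in which
the extra error is larger when the agent's error is small.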

820:

821:

822: \section{Volatility and Impact}

823: \label{sec:Volatility}

824: Equation~\eqref{eq:main:simp} is useful for determining the agent's error

825: when we know the volatility of the system.  However, it is likely that

826: this value is not available to us (if we knew it we would already know

827: a lot about the dynamics of the system).  In this section we determine

828: the value of $v_i$ in terms of the other agents' changes in their

829: decision functions. That is, in terms of $\Pro[\djtt \neq \djt]$, for

830: all other agents $j$.

831:

832: In order to do this we first need to define the \textbf{impact} \Iji{}

833: that agent $j$'s changes in its decision function have on $i$'s target

834: function.

835: \begin{equation}

836:   \label{eq:14}

837:  \forall_{w \in W} \;  I_{ji} = \Pro[\Dittw \neq \Ditw \,|\, \djttw \neq \djtw]

838: \end{equation}

839:

840: We can now start to define volatility by first determining that, for

841: two agents $i$ and $j$

842: \begin{equation}

843:   \label{eq:15}

844:   \begin{split}

845:      \forall_{w \in W} \; v_{i}^{t} &= \Pro[\Dittw \neq \Ditw] \\

846:      &= \Pro[\Dittw \neq \Ditw \,|\, \djttw \neq \djtw] \cdot \Pro[\djttw

847:      \neq \djtw] \\

848:      &+ \Pro[\Dittw \neq \Ditw \,|\, \djttw = \djtw] \cdot \Pro[\djttw

849:      = \djtw]. \\

850:   \end{split}

851: \end{equation}

852:

853: The reader should notice that volatility is no longer constant; it

854: varies with time (as recorded by the superscript). The first

855: conditional probability in Eq.~\eqref{eq:15} is just $I_{ji}$. The

856: second one we will set to $0$, since we are specifically interested in

857: MASs where the volatility arises \emph{only} as a side-effect of the

858: other agents' learning. That is, we assume that agent $i$'s target

859: function changes only when $j$'s decision function changes. For cases

860: with more than two agents, we similarly assume that one agent's target

861: function changes only when some other agent's decision function

862: changes.  That is, we ignore the possibility that outside influences

863: might change an agent's target function.

864:

865: We can simplify Eq.~\eqref{eq:15} and generalize it to $N$ agents,

866: under the assumption that the other agents' changes in their decision

867: functions will not cancel each other out, making \Dit{} stay the same

868: as a consequence. $v_i^t$ then becomes

869: \begin{equation}

870:   \label{eq:16}

871:   \begin{split}

872:        \forall_{w \in W} \; v_{i}^{t} &= \Pro[\Dittw \neq \Ditw] \\

873:        &= 1 - \prod_{j \in N_{-i}}(1 - \Iji \Pro[\djttw \neq \djtw]). \\

874:   \end{split}

875: \end{equation}
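
As a consistency check on Eq.~\eqref{eq:16}, when there is a single
other agent $j$ the product has one factor and
\begin{equation*}
  1 - (1 - \Iji \Pro[\djttw \neq \djtw]) = \Iji \Pro[\djttw \neq \djtw],
\end{equation*}
which is Eq.~\eqref{eq:15} with its second conditional probability set
to $0$. For a purely hypothetical three-agent illustration (the numbers
are ours and are not taken from any experiment), suppose the two other
agents each have impact $.17$ on $i$ and each change their decision
function with probability $.5$. Then
\begin{equation*}
  v_i^{t} = 1 - (1 - .17 \cdot .5)^{2} = 1 - (.915)^{2} \approx .163,
\end{equation*}
which is slightly less than the sum $.085 + .085 = .17$, because the
product form counts the case where both agents change only once.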

876:

877: We now need to determine the expected value of $\Pro[\djttw \neq

878: \djtw]$ for any agent. Using $i$ instead of $j$ we have

879: \begin{equation}

880:   \label{eq:55}

881:   \begin{split}

882:   \forall_{w \in W} \; \Pro[&\dittw \neq \ditw] \\

    &= \Pro[\ditw \neq \Ditw] \cdot \Pro[\dittw \neq \ditw \,|\, \ditw
    \neq \Ditw] \\
    &+ \Pro[\ditw = \Ditw] \cdot \Pro[\dittw \neq \ditw \,|\, \ditw
    = \Ditw], \\

887:   \end{split}

888: \end{equation}

889: where the expected value is:

890: \begin{equation}

891:   \label{eq:30}

892:   E[\Pro[\dittw \neq \ditw]] = c_i \edit + (1- r_i)\cdot(1 -

893:   \edit).

894: \end{equation}
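
The step to Eq.~\eqref{eq:30} can be sketched as follows. We assume
the definitions of Section~\ref{sec:Model-Learn-Algor}, under which the
change rate $c_i$ is the probability that the agent changes a mapping
that was incorrect, and the retention rate $r_i$ is the probability
that it keeps a mapping that was correct. The probability of a change
is then $c_i$ when the agent was wrong and $1 - r_i$ when it was
right. Weighting each world by \Dw{} and summing gives
\begin{align*}
  E[\Pro[\dittw \neq \ditw]]
  &= \sum_{w \in W} \Dw \left( c_i \Pro[\ditw \neq \Ditw]
     + (1 - r_i) \Pro[\ditw = \Ditw] \right) \\
  &= c_i \edit + (1 - r_i)(1 - \edit),
\end{align*}
where the last line uses the definition of the error and the fact that
the weights \Dw{} sum to one. For example, with $c_i = 1$ and $r_i = 1$
the expression reduces to \edit{}: such an agent changes a mapping
exactly when that mapping was wrong.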

895:

896: We can then plug Eq.~\eqref{eq:30} into Eq.~\eqref{eq:16} in order to get the

897: expected volatility

898: \begin{equation}

899:   \label{eq:6}

 E[ v_{i}^{t}] =  1 - \prod_{j \in N_{-i}} \left( 1 - \Iji (c_j \edjt + (1- r_j)\cdot(1 -
  \edjt)) \right).

902: \end{equation}

903:

904: We can use this expected value of $v_{i}^{t}$ in Eq.~\eqref{eq:main:simp}

905: in order to find out how the other agents' learning will affect agent

906: $i$. In MASs that have identical learning agents (i.e., their $c$, $l$,

907: $r$, and $I$ rates are all the same and they start with the same

initial error) we can replace the product in Eq.~\eqref{eq:6} with an

909: exponent of $|N| -1$. We use this simplification later in

910: Section~\ref{sec:Shoham-Tennenholtz}.
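
Written out, and using $I$, $c$, and $r$ without subscripts for the
common values shared by all agents (notation introduced only for this
sketch), the expected volatility for identical agents is
\begin{equation*}
  E[v_i^{t}] = 1 - \left( 1 - I \left( c\, \edit + (1 - r)(1 - \edit)
  \right) \right)^{|N| - 1},
\end{equation*}
since every factor of the product is the same and every agent has the
same error \edit{}. For $|N| = 2$ the exponent is $1$ and the expression
reduces to $I ( c\, \edit + (1 - r)(1 - \edit))$.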

911:

912:

913: \section{An Example with Two Agents}

914: \label{sec:Example-with-2}

915:

916: In a MAS with just two agents $i$ and $j$, we can use Eq.~\eqref{eq:6} to

917: rewrite Eq.~\eqref{eq:main:simp} as

918: \begin{equation}

919:   \label{eq:7}

920:   \begin{split}

921:     E&[\editt] = 1 - r_i + \Iji(c_j \edjt + (1- r_j)\cdot(1 - \edjt))

922:     \left(

923:       \frac{|A_i|r_i - 1}{|A_i| -1}

924:     \right) \\

925:     &+ \edit

926:     \{ r_i -l_i + \Iji(c_j \edjt + (1- r_j)\cdot(1 - \edjt))

927:     \\

928:     &

929:       \qquad \qquad \cdot

930:     \left(

931:       \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| -1}

932:     \right)

933:   \}.

934: \end{split}

935: \end{equation}

936:

937:

938: \begin{figure}

939:   \psfrag{Iij}{{\tiny $\Iij$}}

940:   \psfrag{Iji}{{\tiny $\Iji$}}

941:   \psfrag{Error}[B]{{\tiny Error}}

942:   \psfrag{Final Error for i}[B]{{\small Final Error for $i$}}

943:   \begin{center}

944:     \includegraphics[width=3in]{ii-error}

945:     \caption{Plot of Final Error for agent $i$, given $l_i = l_j = .2$,

946:       $r_i = r_j = 1$, $c_i = c_j =1$, $|A_j| = |A_i| = 20$.}

947:     \label{fig:ii-error}

948:   \end{center}

949: \end{figure}

950:

951: We can now use Eq.~\eqref{eq:7} to plot values for one particular example.

952: Let us say that $l_i = l_j =.2$, $c_i = c_j = 1$, $r_i = r_j = 1$,

953: $|A_j| = |A_i| = 20$ and we let the impacts $\Iij$ and $\Iji$ vary

954: between zero and one. Figure~\ref{fig:ii-error} shows the final error,

955: after convergence, for this situation. It shows an area where the

956: error is expected to be below $.1$, corresponding to low values for

957: either \Iij{}, \Iji{} or both. This area represents MASs that are

958: loosely coupled, i.e., one agent's change in behavior does not

959: significantly affect the other's target function. In these systems we

960: can expect that the error will eventually\footnote{Notice that we are

961:   not representing how long it takes for the error to converge. This

962:   can easily be done and is just one more of the parameters our theory

963:   allows us to explore.}  reach a value close to zero. We see that as

964: the impact increases the final error also increases, with a fairly

965: abrupt transition between a final error of 0 and bigger final errors.

966: This abrupt transition is characteristic of these types of systems

967: where there are tendencies for the system to either converge or

968: diverge, and both of them are self-enforcing behaviors. Notice also

969: that the graph is not symmetric---\Iij{} has more weight in

970: determining $i$'s final error than \Iji. This result seems

971: counterintuitive, until we realize that it is $j$'s error that makes

it hard for $i$ to converge to a small error. If \Iij{} is high and
$i$ has a large error, then $j$'s error will increase, which will

974: make $j$ change its decision function often and make it hard for $i$

975: to reduce its error. If \Iij{} is low then, even if \Iji{} is high,

976: $j$ will probably settle down to a low error and as it does $i$ will

977: also be able to settle down to a low error.
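
To see where the abrupt transition may come from, we can substitute
the values $l_i = l_j = .2$, $c_i = c_j = 1$, $r_i = r_j = 1$, $|A_i| =
|A_j| = 20$ into Eq.~\eqref{eq:7}. The following is only a sketch: it
assumes that the equation for $j$ is Eq.~\eqref{eq:7} with the roles of
$i$ and $j$ exchanged, and that expected errors can be iterated as if
they were the actual errors. With these values $(|A_i| r_i - 1)/(|A_i|
- 1) = 1$ and $(|A_i|(l_i - r_i) + l_i - c_i)/(|A_i| - 1) = -16.8/19
\approx -.884$, so that
\begin{align*}
  E[\editt] &= \Iji \edjt + \edit \left( .8 - \tfrac{16.8}{19}\, \Iji \edjt \right), \\
  E[\edjtt] &= \Iij \edit + \edjt \left( .8 - \tfrac{16.8}{19}\, \Iij \edit \right).
\end{align*}
Both errors equal to zero is always a fixed point. Close to it the
quadratic terms are negligible and the update is approximately linear,
with coefficient matrix
\begin{equation*}
  \begin{pmatrix} .8 & \Iji \\ \Iij & .8 \end{pmatrix},
\end{equation*}
whose eigenvalues are $.8 \pm \sqrt{\Iij \Iji}$. Under this
linearization the zero-error point attracts nearby errors only when
$\sqrt{\Iij \Iji} < .2$, that is, when $\Iij \Iji < .04$. This is
consistent with a region of near-zero final error for low values of
either impact, bounded by a sharp transition.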

978:

If we were designing a MAS, we would try to build it so that it

980: lies in the area where the final error is zero. This way we can expect

981: all agents to eventually have the correct behavior. We note that a

982: substantial percentage of the research in DAI and MAS deals with

983: taking systems that are not inherently in this area of near-zero error

984: and designing protocols and rules of encounter so as to move them into

985: this area, as in \cite{rules:of:encounter}.

986:

987: The fact that the final error is 1 for the case with $\Iij = \Iji = 1$

988: can seem non-intuitive to readers familiar with game theory. In game

989: theory there are many games, such as the ``matching game'' from

990: Section~\ref{sec:Matching-game}, where two agents have an impact of 1

991: on each other. However, it is known \cite{binmore2} that, in these

992: games, two learning agents will eventually converge to one of the

993: equilibria (if there are any), making their final error equal to 0.

994: This is certainly true, and it is exactly what we showed in

995: Section~\ref{sec:Matching-game}. The same result is not seen in

996: Figure~\ref{fig:ii-error} because the figure was plotted using our

simplified Eq.~\eqref{eq:main:simp}, which makes the simplifying
independence assumption given by Eq.~\eqref{eq:cond:indep}. This
assumption cannot be made in games such as the matching game because,
in these games, there is a correlation between the correctness of each
of the agents' actions. Specifically, in the matching game it is always

1002: true that both agents are either correct, or incorrect, but it is

1003: never true that one of them is correct while the other one is

1004: incorrect, i.e., either they matched, or they did not match.

1005:

1006:

1007: \begin{figure}

1008:   \psfrag{edit}{{\small \edit}}

1009:   \psfrag{edjt}{{\small \edjt}}

1010:   \begin{center}

1011:     \includegraphics[width=3in]{vector}

1012:     \caption{Vector plot for \edit{} and \edjt{}, where $|A_i| = |A_j|

1013:       = 20$, $l_i=l_j=.2$, $r_i = r_j=1$, $c_i=.5$, $c_j=1$,

1014:       $\Iij=.1$, $\Iji=.3$. It shows the error progression for a pair

      of agents $i$ and $j$. For each pair of errors $(\edit{},\edjt{})$,

1016:       the arrows indicate the expected $(\editt{},\edjtt{})$.

1017:       }

1018:     \label{fig:vector}

1019:   \end{center}

1020: \end{figure}

1021:

1022: Another view of the system is given by Figure~\ref{fig:vector} which

1023: shows a vector plot of the agents' errors. We can see how the bigger

1024: errors are quickly reduced but the pace of learning decreases as the

1025: errors get closer to the convergence point. Notice also that an

1026: agent's error need not change in a monotonic fashion. That is, an

1027: agent's error can get bigger for a while before it starts to get

1028: smaller.

1029:

1030:

1031: \section{A Simple Application}

1032: \label{sec:Simple-Application}

1033:

1034: In order to demonstrate how our theory can be used, we tested it on a

1035: simple market-based MAS. The game consists of three agents, one buyer

1036: and two seller agents $i$ and $j$. The buyer will always buy at the

1037: cheapest price---but the sellers do not know this fact. In each time

1038: step the sellers post a price and the buyer decides which of the

1039: sellers to buy from, namely, the one with the lowest bid.  The sellers

1040: can bid any one of 20 prices in an effort to maximize their profits.

1041: The sellers use a reinforcement learning algorithm with their

1042: reinforcements being the profit the agent achieved in each round, or 0

1043: if it did not sell the good at the time. In this system we had one

good being sold ($|W| = 1$). As predicted by economic theory, the price

1045: in this system settles to the sellers' marginal cost, but it takes

1046: time to get there due to the learning inefficiencies.

1047:

1048: We experimented with different $\alpha_j$ rates\footnote{$\alpha$ is

1049:   the relative weight the algorithm gives to the most recent payoff.

1050:   $\alpha =1$ means that it will forget all previous experience and

1051:   use only the latest payoff to determine what action to take.} for

1052: the reinforcement learning of agent $j$, while keeping $\alpha_i = .1$

1053: fixed, and plotted the running average of the error of agent $i$.

1054: \begin{figure}

1055:   \psfrag{l=.1=}[B]{{\tiny $\alpha_j = .1$}}

1056:   \psfrag{l=.3=}[B]{{\tiny $\alpha_j = .3$}}

1057:   \psfrag{l=.9=}[]{{\tiny $\alpha_j = .9$}}

1058:   \psfrag{l=.005=}[B]{{\tiny $l_j = .005$}}

1059:   \psfrag{l=.04=}[B]{{\tiny $l_j = .04$}}

1060:   \psfrag{l=.055=}[]{{\tiny $l_j = .055$}}

1061:   \psfrag{time}{{\tiny time}}

1062:   \psfrag{error}{{\tiny error}}

1063:   \begin{center}

1064:     \mbox{\subfigure[Experiment]{\label{fig:lre-exp}\charts{simie}} \quad

1065:       \subfigure[Theory]{\label{fig:lre-theory}\charts{lre}}}

1066:     \caption{Comparison of observed and predicted error.}

1067:     \label{fig:application}

1068:   \end{center}

1069: \end{figure}

1070: A comparison is shown in Figure~\ref{fig:application}.

1071: Figure~\ref{fig:lre-exp} gives the experimental results for three

1072: different values of $\alpha_j$. It shows $i$'s average error, over 100

1073: runs, as a function of time.  Since both sellers start with no

1074: knowledge, their initial actions are completely random which makes

1075: their error equal to $.5$. Then, depending on $\alpha_j$, $i$'s error

1076: will either start to go down from there or will first go up some and

1077: then down. Eventually, $i$'s error gets very close to 0, as the system

1078: reaches a market equilibrium.

1079:

1080: We can predict this behavior using Eq.~\eqref{eq:7}. Based on the game

1081: description, we set $|A_i| = |A_j| = 20$, since there were $20$

1082: possible actions. We let $r_i = r_j = 1$ because reinforcement

1083: learning with fixed payoffs enforces the condition that once an agent

1084: is taking the correct action it will never change its decision

1085: function to take a different action.  The agent might, however, still

1086: take a wrong action but only when its exploration rate dictates it.

1087:

1088: We then let $\Iij = \Iji = .17$ based on the rough calculation that

1089: each agent has an equal probability of bidding any one of the 20

1090: prices.  If $\Dit = 20$ then \Iji{} for this situation is the

1091: probability that $j$ was also bidding 20 or above, i.e., $1/20$, times

the probability that $j$'s new price is lower than 20, i.e., $19/20$.

1093: Similarly, if $\Dit = 19$ then \Iji{} is equal to $2/20$ times

1094: $18/20$.  The average of all of these probabilities is $.17$. A more

1095: precise calculation of the impact would require us to find it via

1096: experimentation by actually running the system.
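
The rough calculation can be written out as follows. This is a sketch
in which the index $k$ is our own notation, with $k = 1$ corresponding
to $\Dit = 20$, $k = 2$ to $\Dit = 19$, and so on, so that the
probability for the $k$-th case is $(k/20) \cdot ((20 - k)/20)$.
Averaging over all 20 cases gives
\begin{equation*}
  \Iji \approx \frac{1}{20} \sum_{k=1}^{20} \frac{k}{20} \cdot \frac{20 - k}{20}
  = \frac{20 \cdot 210 - 2870}{8000} = \frac{1330}{8000} \approx .166,
\end{equation*}
using $\sum_{k=1}^{20} k = 210$ and $\sum_{k=1}^{20} k^2 = 2870$. If
the average is instead taken over only the 19 non-zero cases, the
result is approximately $.175$. Either way the estimate is close to the
value of $.17$ used here.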

1097:

1098: Finally, we chose $l_i = l_j = c_i = c_j = .005$ for the first curve

1099: (i.e., the one that compares with $\alpha_j = .1$). We knew that for

1100: such a low $\alpha_j$ the learning and change rate should be the same.

1101: The actual value was chosen via experimentation. The resulting curve

1102: is shown in Figure~\ref{fig:lre-theory}. At this moment, we do not

1103: possess a formal way of deriving learning and change rates from

1104: $\alpha$-rates.

1105:

1106: For the second curve ($\alpha_j = .3$) we knew that, since only

1107: $\alpha_j$ had changed from the first experiment, we should only

1108: change $l_j$ and $c_j$. In fact, these two values should only be

1109: increased. We found their exact values, again by experimentation, to

1110: be $l_j = .04$, $c_j =.4$. For the third curve we found the values to

1111: be $l_j = .055$, $c_j = .8$.

1112:

1113: One difference we notice between the experimental and the theoretical

1114: results is that the experimental results show a longer delay before

1115: the error starts to decrease. We attribute this delay to the agent's

1116: initially high exploration rate. That is, the agents initially start

1117: by taking all random actions but progressively reduce this rate of

1118: exploration. As the exploration rate decreases the discrepancy between

1119: our theoretical predictions and experimental results is reduced.

1120:

1121: In summary, while it is true that we found $l_j$ and $c_j$ by

1122: experimentation, all the other values were calculated from the

1123: description of the problem. Even the relative values of $l_j$ and

1124: $c_j$ follow the intuitive relation with $\alpha_j$ that, as

1125: $\alpha_j$ increases so does $l_j$ and (even more) $c_j$.

1126: Section~\ref{sec:Bound-Learn-Rate} shows how to calculate lower bounds

1127: on the learning rate. We believe that this experiment provides solid

1128: evidence that our theory can be used to approximately determine the

1129: quantitative behaviors of MASs with learning agents.

1130:

1131: \section{Application of our Theory to Experiments in the Literature}

1132: \label{sec:Appl-our-Theory}

1133:

1134: In this section we show how we can apply our theory to experimental

1135: results found in the AI and MAS literature. While we will often not be

1136: able to completely reproduce the authors' results exactly, we believe

1137: that being able to reproduce the flavor and the main quantitative

1138: characteristics of experimental results in the literature shows that

1139: our theory can be widely applied and used by practitioners in this

1140: area of research.

1141:

1142: \subsection{Claus and Boutilier}

1143: \begin{figure}

1144:   \psfrag{time}{{\tiny time}}

1145:   \psfrag{1 - error}{{\tiny 1 - error}}

1146:   \psfrag{l=.1}[B]{{\tiny $l_i=.1$}}

1147:   \begin{center}

1148:     \mbox{\subfigure[Experiment]{\label{fig:claus-exp}\charts{claus-figure}} \quad

1149:       \subfigure[Theory]{\label{fig:claus-theory}\charts{claus-theory}}}

1150:     \caption{Comparing theory (b) with results from \cite{claus:97} (a).}

1151:     \label{fig:claus}

1152:   \end{center}

1153: \end{figure}

1154:

1155: Claus and Boutilier \cite{claus:97} study the dynamics of a system that

1156: contains two reinforcement learning agents. Their first experiment

1157: puts the two agents in a matching game exactly like the one we

1158: describe in Section~\ref{sec:Matching-game} with $|A_i| = |A_j| = 2$.

1159: Their results show the probability that both agents matched (i.e., 1 -

1160: \edit) as time progressed. Since they were using two reinforcement

1161: learning agents, it was not surprising that the curve they saw, seen

1162: in Figure~\ref{fig:claus-exp}, was nearly identical to the curve we

saw in our experiments with the two selling agents

1164: (Figure~\ref{fig:lre-exp} with $\alpha_j = \alpha_i = .1$, except

1165: upside-down).

1166:

1167: We can reproduce their curve using our equation for the matching

game, Eq.~\eqref{eq:main:matching}. The results can be seen in

1169: Figure~\ref{fig:claus-theory}. Our theory again fails to account for

1170: the initial exploration rate. We can, however, confirm that by time 15

1171: their Boltzmann temperature (the authors used Boltzmann exploration)

1172: had been reduced from an initial value of 16 to $3.29$ and would keep

1173: decreasing by a factor of .9 each time step. This means that by time

1174: 15 the agents were, indeed, starting to do more exploitation (i.e.,

1175: reduce their error) while doing little exploration.

1176: \label{sec:Claus-Boutilier}
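
The quoted temperature can be checked with a short calculation. As a
sketch, write $T_k$ for the temperature at time $k$ (our notation) and
assume that the factor of $.9$ is applied once per time step starting
from $T_0 = 16$:
\begin{equation*}
  T_{15} = 16 \cdot (.9)^{15} \approx 16 \cdot .2059 \approx 3.29 .
\end{equation*}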

1177:

1178: \subsection{Shoham and Tennenholtz}

1179: \label{sec:Shoham-Tennenholtz}

1180:

Shoham and Tennenholtz \cite{shoham:97} investigate how learning

1182: agents might arrive at social conventions. The authors introduce a

1183: simple learning algorithm (strategy-selection rule) called

1184: \emph{highest cumulative reward} (HCR) which their agents use for

1185: learning these conventions. Shoham and Tennenholtz also provide the

1186: results of a series of experiments using populations of learning

1187: agents. We try to reproduce the results they present in their

1188: Section~4.1 where they study the ``coordination game'' which is

1189: similar to our matching game, but with only two actions.

1190:

1191: The experiment in question involves 100 agents, all of them identical

1192: and all of them using HCR. At each time instant the agents take one of

1193: two available actions. The aim is for every pair of chosen agents to

1194: take the same action as each other. Agents are randomly made to form

1195: pairs. The agents update their behavior (i.e., apply HCR) after a

1196: given delay. The authors try a series of delays (from 0 to 200) and

1197: show that increasing the update delay decreases the percentage of

1198: trials where, after 1600 iterations, at least 95\% of the agents

1199: reached a convention. The authors show surprise at finding this

1200: phenomenon. Their results are reproduced in

1201: Figure~\ref{fig:shoham-exp} (cf. Figure~1 in their article).

1202: \label{sec:Appl-to-Results}

1203: \begin{figure}

1204:   \psfrag{l}{{\small $l_i$}}

1205:   \psfrag{final error}{{\small final error}}

1206:   \psfrag{theory}[B]{{\tiny theory}}

1207:   \psfrag{experiment}[B]{{\tiny experiment}}

1208:   \begin{center}

1209:     \mbox{\subfigure[Original Experiment]{\label{fig:shoham-exp}\charts{shoham-figure}} \quad

1210:       \subfigure[Theory and Experiment]{\label{fig:shoham-theory}\charts{shoham}}}

1211:     \caption{Comparing theory (b) with results from \cite{shoham:97} (a).}

1212:     \label{fig:shoham}

1213:   \end{center}

1214: \end{figure}

1215: The number of actions for all agents is easily set to $|A_i| =2$,

1216: which implies that we must have $l_i = c_i$.  By examining HCR, it is

easy to determine that $r_i =1$ (i.e., if an agent took the right

1218: action, it will only get more support for it).  At first intuition,

1219: one's impulse is to set $\Iij = 1$ for every pair of agents $i$ and

1220: $j$.  However, since there are 100 of them and only pairs of them

1221: interact at every time instant, the real impact is $\Iij = 1/99$.

1222:

1223: We will now convert from their units of measurement into ours.  In

1224: Figure~\ref{fig:shoham-exp} we can see that their x-axis is called the

1225: \emph{update delay}, which we will refer to as $d$. This value is the

1226: number of time units that pass before the agent is allowed to learn.

1227: For $d=0$ the agent learns after every interaction (i.e., on every time

1228: $t$), while for $d=200$ the agent takes the same action for 200 time

instants and only learns after every 200 iterations. This means that

1230: we must set $l_i = \frac{1}{p(d +1)}$ where $p>0$. The value of $p$

1231: depends on their learning algorithm's performance, but we know that it

1232: must be a small number ($< 50$) greater than 0. Through some

1233: experimentation we settled on $p=6$ (other values close to this one

1234: give similar results). Since in their graph they look at $0 \leq d

1235: \leq 200$, we must then look at $l_i$ where $\frac{1}{1206} \leq l_i

\leq \frac{1}{6}$. Finally, we find the value of $d$ in terms of $l_i$

1237: to be

1238: \begin{equation}

1239:   \label{eq:9}

1240:   d = \frac{1}{pl_i} - 1

1241: \end{equation}
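As a quick check of this conversion (a sketch using the value $p = 6$
chosen above), the endpoints of the range of update delays map to
\begin{align*}
  d = 0:   \quad l_i &= \frac{1}{6(0 + 1)} = \frac{1}{6}, \\
  d = 200: \quad l_i &= \frac{1}{6(200 + 1)} = \frac{1}{1206},
\end{align*}
and substituting $l_i = 1/1206$ back into Eq.~\eqref{eq:9} recovers
$d = 1206/6 - 1 = 200$.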

1242:

1243:

1244: The y-axis of Figure~\ref{fig:shoham-exp} is the \emph{success}, i.e.,

the number of trials, out of 4000, in which at least 95\% of the agents

1246: reached a convention. We will refer to this value as $s$.  We know

1247: that in $s/4000$ of the trials \emph{at least} 95\% of the

1248: agents have error close to 0 (i.e., reaching a convention means that

1249: the agents take the right action almost all the time), and for the

1250: rest of the trials the error was greater.  We can approximately map

1251: this to an error by saying that in $s/4000$ of the trials the error

1252: was 0 (a slight underestimate), while in $1 - s/4000$ of the trials

1253: the error was 1 (a slight overestimate).  We add these two up (the 0

1254: makes the first term disappear) and arrive at an equation that maps

1255: $s$ to \edit.

1256: \begin{equation}

1257:   \label{eq:26}

1258:   \edit \approx

1259:   \left(

1260:     \frac{4000 - s}{4000}

1261:   \right)

1262: \end{equation}

1263:

1264:

1265: The mapping from $d$ to $s$ is given by their actual data. Their data

1266: can be fit by the following function:

1267: \begin{equation}

1268:   \label{eq:28}

1269:   s = 3900 - 4d - \frac{(d -100)^2}{100}

1270: \end{equation}

1271:

1272:

1273: Plugging Eq.~\eqref{eq:9} into Eq.~\eqref{eq:28}, and the result

1274: into Eq.~\eqref{eq:26}, we finally arrive at a function that maps their

1275: experimental results into our units:

1276: \begin{equation}

1277:   \label{eq:29}

1278:   \mbox{Final error} = \frac{4000 -

1279:     \left(

1280:       3900 - 4(1/pl_i - 1) -

1281:       \left(

1282:         \frac{(1/pl_i - 1 -100)^2}{100}

1283:       \right)

1284:     \right)}{4000}

1285: \end{equation}

1286: for the range $\frac{1}{1206} \leq l_i \leq \frac{1}{6}$.
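As a sanity check on Eq.~\eqref{eq:29} (our own arithmetic, again
assuming $p = 6$), we can evaluate it at the two endpoints of this
range. At $l_i = 1/6$ we have $d = 0$, and at $l_i = 1/1206$ we have
$d = 200$, so that
\begin{align*}
  d = 0:   \quad s &= 3900 - 0 - \frac{(-100)^2}{100} = 3800,
    & \mbox{Final error} &= \frac{4000 - 3800}{4000} = 0.05, \\
  d = 200: \quad s &= 3900 - 800 - \frac{100^2}{100} = 3000,
    & \mbox{Final error} &= \frac{4000 - 3000}{4000} = 0.25.
\end{align*}
These are values of the fitted function in Eq.~\eqref{eq:28}, not raw
data points, so they should be read as approximate.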

1287:

1288: Now that we have values for $c_i$, $l_i$, $r_i$, \Iij, $|A_i|$, a

1289: range for $l_i$ and an equation that maps their experimental results

1290: into our units, we can plot both functions, as seen in

1291: Figure~\ref{fig:shoham-theory}. The x-axis was plotted on a log-scale

1292: in order to better show the shape of the experiment curve, otherwise

1293: it would appear mostly as a straight line. For our theory curve we

1294: used Equations~\eqref{eq:main:simp} and \eqref{eq:6}, and iterated for 1600 time

1295: units, just like in the experiment, and plotted the error at that

1296: point. For the experiment curve we used Eq.~\eqref{eq:29}. We plotted both

1297: of these curves in the specified range for $l_i$.  The reader will

1298: notice that our theory was able to make precise quantitative

1299: predictions. The maximum distance from our theory curve to the

experimental curve is $0.05$, which means that our predictions for the
final error were, at worst, within 5 percentage points of the experimental values.

1302: Also, an error of about 5\% was introduced when mapping from their

1303: success percentage $s$ to our error.

1304:

1305:

1306: \subsection{Others}

1307: \label{sec:Others}

1308:

1309: There are several other examples in the literature where we believe

our theory can be successfully applied. Ishida~\cite[chapter 3.7]{ishida:97}

1311: gives results of an experiment where two agents try to find each other

1312: in a 100 by 100 grid. He shows that if the grid has few obstacles it

1313: is faster if both agents move towards each other, while if there are

1314: many obstacles it is faster if one of the agents stays still while the

1315: other one searches for it. We believe that the number of obstacles is

1316: proportional to the change rate that the agents experience and,

1317: perhaps, to the impact that they have on each other.  When there are

1318: no obstacles the agents never change their decision functions (because

1319: their initial Manhattan heuristics lead them in the correct path). As

1320: the number of obstacles increases, the agents will start to change

1321: their decision functions as they move, which will have an impact on

1322: the other agent's target function.  If, however, one of them stays

1323: put, this means that his change rate is 0 so the other agent's target

1324: function will stay still and he will be able to reach his target

1325: (i.e., error 0) quicker.

1326:

1327: Notice that the problem of a moving target that Ishida studies is

1328: different from the problem of a moving target function which we study.

1329: It is, however, interesting to note their similarities and how our

1330: theory can be applied to some aspects of that domain.

1331:

1332: Another possible example is given by \cite{sen:94b}.  They show two

1333: Q-learning agents trying to cooperate in order to move a block.  The

1334: authors show how different $\alpha$ rates ($\beta$ in their article)

1335: affect the quality of the result that the agents converge to. This

1336: quality roughly corresponds to our error, except for the fact that

1337: their measurements implicitly consider some actions to be better than

1338: others, while we consider an action to be either correct or incorrect.

1339: This discrepancy would make it harder to apply our theory to their

1340: results but we still believe that a rough approximation is possible.

1341: Our future work includes the extension of the CLRI framework to handle

1342: a more general definition of error---one that attaches a utility to

1343: each state-action pair, rather than the simple correct/incorrect

1344: categorization we use.

1345:

1346:

1347: \section{Bounding the Learning Rate with Sample Complexity}

1348: \label{sec:Bound-Learn-Rate}

1349:

1350: In the previous examples we have used our knowledge of the learning

1351: algorithms to determine the values of the agent's $c_i$, $l_i$, and

1352: $r_i$ parameters. However, there might be cases where this is not

1353: possible---the learning algorithm might be too complicated or unknown.

1354: It would be useful, in these cases, to have some other measure of the

1355: agent's learning abilities, which could be used to determine some

1356: bounds on the values of these parameters.

1357:

1358: One popular measure of the complexity of learning is given by Probably

1359: Approximately Correct (PAC) theory \cite{intro:clt}, in the form of a

1360: measure called the \emph{sample complexity}. The sample complexity

1361: gives us a loose upper bound on the number of examples that a

1362: consistent learning agent must observe before arriving at a PAC

1363: hypothesis.

1364:

1365: There are two important assumptions made by PAC-theory. The first

1366: assumption is that the agents are consistent learners\footnote{See

  \cite[p.~162]{machine:learning} for a formal definition of a

1368:   consistent learner.}. Using our notation, a consistent learner is one

1369: who, once it has learned a correct $w \ra a$ mapping does not forget

1370: it. This simply means that the agent must have $r_i = 1$. The second

1371: assumption is that the agent is trying to learn a fixed concept. This

1372: assumption makes $\Ditt = \Dit$ true for all $t$.

1373:

1374: The sample complexity $m$ of an agent's learning problem is given by

1375: \begin{equation}

1376:   \label{eq:11}

1377:   m \geq \frac{1}{\epsilon}

1378:   \left(

1379:     \ln \frac{|H|}{\gamma}

1380:   \right),

1381: \end{equation}

1382: where $|H|$ is the size of the hypothesis space for the agent. In

1383: other words, $|H|$ is the total number of different \diw{} functions

1384: that the agent will consider. For an agent with no previous knowledge

1385: we have $|H| = |A_i|^{|W|}$. However, agents with previous knowledge

1386: might have smaller $|H|$, since this knowledge might be used to

1387: eliminate impossible mappings. If a consistent learning agent has seen

1388: $m$ examples then, with probability at least $(1 - \gamma)$, it has

1389: error at most $\epsilon$.
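To give a feel for the magnitudes involved, consider a purely
illustrative case (the numbers are our assumptions, not taken from any
experiment): an agent with no previous knowledge, $|A_i| = 2$ actions
and $|W| = 10$ world states, so that $|H| = 2^{10} = 1024$, together
with $\epsilon = 0.1$ and $\gamma = 0.05$. Eq.~\eqref{eq:11} then gives
\begin{equation*}
  m \geq \frac{1}{0.1} \ln \frac{1024}{0.05}
    = 10 \ln 20480 \approx 99.3,
\end{equation*}
so about 100 examples would suffice for this hypothetical agent.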

1390:

1391: While we cannot map the sample complexity $m$ to a particular learning

1392: rate $l_i$, we can use it to put a lower bound on the learning rate

1393: for a consistent learning agent. That is, we can find a lower bound

1394: for the learning rate of an agent who does not forget anything it has

1395: seen, and who is trying to learn a fixed target function.  Since the

1396: agent does not forget anything it has seen, we can deduce that its

1397: retention rate must be $r_i = 1$. Since the target function is not

1398: changing, we know that $\Pr[\Dittw \neq \Ditw] = 0$ and $\Pr[\Dittw =

1399: \Ditw] = 1$. We can plug these values into Eq.~\eqref{eq:main} and

1400: simplify in order to get:

1401: \begin{equation}

1402:   \label{eq:13}

1403:   E[\editt] = \edit \cdot (1 - l_i).

1404: \end{equation}

1405: We can solve the difference Eq.~\eqref{eq:13}, for any time $n$, in

1406: order to get:

1407: \begin{equation}

1408:   \label{eq:17}

1409:   E[e(\delta_i^n)] = e(\delta_i^0) \cdot (1 - l_i)^n.

1410: \end{equation}
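The step from Eq.~\eqref{eq:13} to Eq.~\eqref{eq:17} can be sketched
as follows, reading Eq.~\eqref{eq:13} as an expectation conditioned on
the current error. Taking expectations on both sides and unrolling the
recurrence gives
\begin{align*}
  E[e(\delta_i^{n})] &= (1 - l_i) \, E[e(\delta_i^{n-1})] \\
    &= (1 - l_i)^2 \, E[e(\delta_i^{n-2})] \\
    &\;\;\vdots \\
    &= (1 - l_i)^n \, e(\delta_i^0),
\end{align*}
where the initial error $e(\delta_i^0)$ is assumed to be known.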

1411: We now remember that after $m$ time steps we expect, with probability

1412: $(1 - \gamma)$, the error to be less than $\epsilon$.  Since

1413: Eq.~\eqref{eq:17} only gives us an expected error, not a probability

1414: distribution over errors, we cannot use it to calculate the likelihood

1415: of the agent having that expected error. That is, we cannot calculate

1416: the ``probably'' ($\gamma$) part of probably approximately correct. We

1417: will, therefore, assume that the $\gamma$ chosen for $m$ is small

1418: enough so that it will be safe to say that, after $m$ time steps, the

1419: error is less than $\epsilon$.  In a typical application one uses a

1420: small $\gamma$ because it guarantees a high degree of certainty on the

1421: upper bound of the error.

1422:

1423: Since we can now safely say that, after $m$ time steps, the error is

1424: less than $\epsilon$, we can then deduce that the $l_i$ for this agent

should be large enough such that, if $n = m$, then $E[e(\delta_i^n)]

1426: \leq \epsilon$.  This is expressed mathematically as:

1427: \begin{equation}

1428:   \label{eq:18}

1429:   e(\delta_i^0) \cdot (1 - l_i)^m \leq \epsilon.

1430: \end{equation}

1431:

We solve this inequality for $l_i$ in order to get:

1433: \begin{equation}

1434:   \label{eq:19}

1435:   l_i \geq 1 -

1436:   \left(

1437:     \frac{\epsilon}{e(\delta_i^0)}

1438:   \right)^{1/m}.

1439: \end{equation}

1440: This equation is not defined for $e(\delta_i^0) = 0$. However, given

1441: our assumption of a fixed target function and $r_i = 1$, we already

1442: know, from Eq.~\eqref{eq:13}, that if an agent starts with an error of

1443: $0$ it will maintain this error of $0$ for any future time $t > 0$.

1444: Therefore, in this case, the choice of a learning rate has no bearing

1445: on the agent's error, which will always be $0$.
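For completeness, the algebra behind Eq.~\eqref{eq:19} is short.
Assuming $e(\delta_i^0) > 0$ and $0 \leq l_i \leq 1$, both sides of
Eq.~\eqref{eq:18} are non-negative, so we can divide by
$e(\delta_i^0)$ and take $m$-th roots without changing the direction
of the inequality:
\begin{equation*}
  (1 - l_i)^m \leq \frac{\epsilon}{e(\delta_i^0)}
  \quad \Longrightarrow \quad
  1 - l_i \leq \left( \frac{\epsilon}{e(\delta_i^0)} \right)^{1/m}
  \quad \Longrightarrow \quad
  l_i \geq 1 - \left( \frac{\epsilon}{e(\delta_i^0)} \right)^{1/m}.
\end{equation*}
As a purely illustrative instance (the numbers are assumptions chosen
for convenience), $e(\delta_i^0) = 0.5$, $\epsilon = 0.1$ and $m = 100$
give
\begin{equation*}
  l_i \geq 1 - (0.2)^{1/100} \approx 0.016.
\end{equation*}
Note that if $\epsilon \geq e(\delta_i^0)$ the bound is non-positive
and therefore places no constraint on $l_i$.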

1446:

1447: Equation~\eqref{eq:19} gives us a lower bound on the learning rate

1448: that a consistent learner must have, given that it has sample

1449: complexity $m$, and based on an error $\epsilon$ and a sufficiently

1450: small $\gamma$. A designer of an agent that uses a reasonable learning

1451: algorithm can expect that, if his agent has sample complexity $m$ (for

1452: $\epsilon$ error), then his agent will have a learning rate of at

1453: least $l_i$, as given by Eq.~\eqref{eq:19}. Furthermore, if a designer

1454: is comparing two possible agent designs, each with a different sample

1455: complexity but both with similarly powerful learning algorithms, he

1456: can calculate bounds on the learning rates of both agents and

1457: compare their relative performance.

1458:

1459:

1460: \section{Related Work}

1461: \label{sec:Related-Work}

1462:

1463: The topic of agents learning about agents arises often in the studies

1464: of complexity \cite{complexity}. In fact, systems where the agents try

1465: to adapt to endogenously created dynamics are being widely studied

1466: \cite{hubler94, arthur97b}. In these systems, like in ours, the agents

1467: co-create their expectations as they learn and change their behaviors.

1468: Complexity research uses simulated agents in an effort to understand

1469: the complex behaviors of these systems as observed in the real world.

1470:

One example is the work of Arthur \emph{et al.} \cite{arthur97b}, who

1472: arrive at the conclusion that systems of adaptive agents, where the

1473: agents are allowed to change the complexity of their learning

1474: algorithms, end up in one of two regimes: a stable/simple regime where

1475: it is trivial to predict an agent's future behavior, and a complex

1476: regime where the agents' behaviors are very complex. It is this second

1477: regime that interests complexity researchers the most. In it, the

1478: agents are able to reach some kind of ``equilibrium'' point in model

1479: building complexity.  These same results are echoed by Darley and

1480: Kauffman \cite{darley97a} in a similar experiment. In this article we

1481: have not allowed the agents to dynamically change the complexity of

1482: their learning algorithms.  Therefore, our dynamics are simpler.

1483: Allowing the agents to change their complexity amounts to allowing

1484: them to change the values of their $c$, $l$, and $r$ parameters while

1485: learning.

1486:

1487: However, while complexity research is very important and inspiring, it

1488: is only partially relevant to our work. Our emphasis is on finding

1489: ways to predict the behavior of MASs composed of machine-learning

1490: agents.  We are only concerned with the behavior of simpler artificial

1491: programmable agents, rather than the complex behavior of humans or the

1492: unpredictable behavior of animals.

1493:

The dynamics of MASs have also been studied by Kephart \emph{et al.}

1495: \cite{kephart:90}.  In this work the authors show how simple

1496: predictive agents can lead to globally cyclic or chaotic behaviors. As

1497: the authors explain, the chaotic behaviors were a result of the simple

1498: predictive strategies used by the agents. Unlike our agents, most of

their agents are not engaged in learning; instead, they use simple

1500: fixed predictive strategies, such as ``if the state of the world was

1501: $x$ ten time units before, then it will be $x$ next time so take

1502: action $a$''. The authors later show how learning can be used to

1503: eliminate these chaotic global fluctuations.

1504:

1505: Matari\'{c} \cite{mataric97a} has studied reinforcement learning in

1506: multi-robot domains. She notes, for example, how learning can give

1507: rise to social behaviors \cite{mataric97b}. The work shows how robots

1508: can be individually programmed to produce certain group behaviors. It

1509: represents a good example of the usefulness and flexibility of

1510: learning agents in multi-agent domains. However, the author does not

1511: offer a mathematical justification for the chosen individual learning

1512: algorithms, nor does she explain why the agents were able to converge

1513: to the global behaviors. Our research hopes to provide the first steps

1514: in this direction.

1515:

1516: One particularly interesting approach is taken by Carmel and

1517: Markovitch \cite{carmel97}.  They work on model-based learning, that

1518: is, agents build models of other agents via observations. They use

1519: models based on finite state machines.  The authors show how some of

1520: these models can be effectively learned via observation of the other

1521: agent's actions. The authors concentrate on the development of

1522: learning algorithms that would let one agent learn a finite-state

1523: machine model of another agent. They have not considered the case

1524: where two or more agents are simultaneously learning about each other,

1525: which we study in this article. However, their work is more general in

1526: the sense that they model agents as state machines, rather than the

1527: state-action pairs we use.

1528:

1529: Finally, a lot of experimental work has been done in the area of

1530: agents learning about agents \cite{acl:96, weib:97}.  For example, Sen

1531: and Sekaran \cite{sen:98a} show how learning agents in simple MAS

1532: converge to system-wide optimal behavior. Their agents use Q-learning

1533: or modified classifier systems in order to learn. The authors

1534: implement these agents and compare the performance of the different

1535: learning algorithms for developing agent coordination. Hu and Wellman

1536: \cite{hu:98a, hu:96} have studied reinforcement learning in

market-based MASs, showing how certain initial learning biases can be

1538: self-fulfilling, and how learning can be useful but is affected by an

1539: agent's models of other agents.  Claus and Boutilier \cite{claus:97}

1540: have also carried out experimental studies of the behavior of

1541: reinforcement learning agents.  We have been able to use the CLRI

1542: framework to predict some of their experimental results

1543: \cite{vidal:thesis}. Other researchers \cite{stone99a, littman94a,

1544:   hu:98b} have extended the basic Q-learning \cite{watkins:92}

1545: algorithm for use with MASs in an effort to either improve or prove

1546: convergence to the optimal behavior.

1547:

1548: We have also successfully experimented with reinforcement learning

1549: simulations \cite{vidal:98b}, but we believe that the formal

treatment elucidated in these pages will shed more light on the real

1551: nature of the problem and the relative importance of the various

1552: parameters that describe the capabilities of an agent's learning

1553: algorithm.

1554:

1555:

1556: \section{Limitations and Future Work}

1557: \label{sec:future-work}

1558:

1559: The CLRI framework places some constraints on the type of systems it

1560: can model, which limits its usability. However, it is important to

1561: understand that, as we remove the limitations from the CLRI framework,

1562: the dynamics of the system become much harder to predict.  In the

1563: extreme, without any limitations on the agents' abilities, the system

1564: becomes a complex adaptive system, as studied by Holland

1565: \cite{holland95a} and others in the field of complexity.  The dynamic

1566: behavior of these systems continues to be studied by complexity

1567: researchers with only modest progress.  It is only by placing

1568: limitations on the system that we were able to predict the expected

1569: error of agents in the systems modeled by the CLRI framework.

1570:

1571: Our ongoing work involves the relaxation of some of the constraints

1572: made by the CLRI framework so that it may become more easily and

1573: widely applicable, without making the system dynamics impossible to

1574: analyze. We are targeting three specific constraints.

1575: \begin{enumerate}

1576: \item The values of $c_i$, $l_i$, $r_i$, and $I_{ij}$ cannot, in all

1577:   situations, be mathematically determined from the system's

1578:   description. We have found that bounds for the $c_i$, $l_i$, and

1579:   $r_i$ values can often be determined when using reinforcement

1580:   learning or supervised learning. However, the bounds are often very

1581:   loose. The values of the $I_{ij}$ parameter depend on the particular

1582:   system. Sometimes it is trivial to calculate the impact, sometimes

1583:   it requires extensive simulation.

1584: \item The CLRI framework assumes that an agent's action is either

1585:   correct or incorrect. The framework does not allow degrees of

1586:   correctness.  Specifically, in many systems the agents can often

1587:   take several actions, any one of which is equally good. When

1588:   modeling these systems, the CLRI framework requires the user to

1589:   designate one of those actions as the correct one, thereby ignoring

1590:   some possibly useful information.

1591: \item The world states are taken from a uniform probability

1592:   distribution which does not change over time. The environment is

1593:   assumed to be episodic. As such, the framework is limited in the

1594:   type of domains it can effectively describe.

1595: \end{enumerate}

1596:

1597: We are attacking these challenges with some of the same tools used by

1598: researchers in complex adaptive systems, namely, agent-based

1599: simulations and co-evolving utility landscapes. We believe we can gain

1600: some insight into the dynamics of adaptive MASs by constructing and

1601: analyzing various types of MASs. We also believe that the next step

1602: for the CLRI framework is the replacement of the current error

1603: definition with a utility function. The agents can then be seen as

1604: searching for the maximum value in the changing utility landscape

1605: defined by their utility function. The degree to which the agents are

1606: successful on their climb to the landscape peaks depends on the

1607: abilities of their learning algorithm (change rate, learning rate, and

1608: retention rate), and the speed at which the landscape changes as the

1609: other agents change their behavior (impact).

1610:

1611: The use of utility landscapes will allow us to consider an agent's

1612: utility for any particular action, rather than simply considering

1613: whether an action is correct or incorrect. The landscapes will also

1614: allow us to consider systems where agents cannot travel between any

1615: two world states in one time step. That is, the agents' moves on the

1616: landscape will be constrained in the same manner as their actions or

behaviors are constrained in the actual system. Finally, the new theory

1618: will likely need to redefine the CLRI parameters. We hope the new

1619: parameters will be easy to derive directly from the values that govern

1620: the machine-learning algorithms' behavior. These extensions will make

1621: the new theory applicable to a much wider set of domains.

1622:

1623:

1624: \section{Summary}

1625: \label{sec:Summary}

1626:

1627: We have presented a framework for studying and predicting the behavior

1628: of MASs composed of learning agents. We believe that this framework

captures the most important parameters that describe an agent's

1630: learning and the system's rules of encounter. Various comparisons

1631: between the framework's predictions and experimental results were

1632: given.  These comparisons showed that the theoretical predictions

1633: closely match our experimental results and the experimental results

1634: published by others.  Our success in reproducing these results allows

1635: us to confidently state the effectiveness and accuracy of our theory

1636: in predicting the expected error of machine learning agents in MASs.

1637:

Since our theory describes an agent's behavior at a high level (i.e.,

1639: the agent's error), it is not capable of making system-specific

1640: predictions (e.g., predicting the particular actions that are

1641: favored). These types of system-specific predictions can only be

1642: arrived at by the traditional method of implementing populations of

1643: such agents and testing their behaviors. However, we expect that there

1644: will be times when the predictions from our theory will be enough to

1645: answer a designer's questions. A MAS designer that only needs to

1646: determine how ``good'' the agent's behavior will be could probably use

1647: the CLRI framework. A designer that needs to know which particular

1648: emergent behaviors will be favored by his agents will need to

1649: implement the agents.

1650:

1651: Finally, while we have given some examples as to how learning rates

1652: can be determined for particular machine learning implementations, we

1653: do not have any general method for determining these rates. However,

1654: we showed how to use the sample complexity of a learning problem to

1655: determine a lower bound on the learning rate of a consistent learning

1656: agent. This bound is useful for quickly ruling out the possibility of

having agents with high expected errors and for stating that an agent's

1658: expected error will be, at most, a certain constant value. Still, if

1659: the agent's learning algorithm is much better than the one assumed by

1660: a consistent learner (e.g., the agent is very good at generalizing

1661: from one world state to many others), then these lower bounds could be

1662: significantly inaccurate.

1663:

1664: \appendix{}

1665:

1666: \section{Derivation for Matching Game}

1667: \label{sec:Deriving-C-matching}

1668:

1669: If we can assume that the action chosen when an agent changes \ditw{}

1670: and the result does not match \Ditw{} (for some specific $w$) is taken

1671: from a flat probability distribution, then we can say that:

1672:

1673: \begin{equation}

1674:   \label{eq:34}

1675:   B = \frac{|A_i| - 2}{|A_i| - 1}.

1676: \end{equation}

1677:

1678: We will now show how to calculate the fourth term

1679: in~\eqref{eq:main:matching}.  For the matching game we find that we

1680: can set:

1681: %\begin{equation}

1682: %  C = (1 -c_i) D + l_i + (c_i - l_i)F   \tag{\ref{eq:c}}

1683: %\end{equation}

1684: \begin{align}

1685:   D &= 1 - l_j \\

1686:   F &= l_j + (1 - l_j)

1687:   \left(

1688:     \frac{|A_i| - 3}{|A_i| - 2}

1689:   \right).

1690: \end{align}

Having $|A_i| =2$ implies that $c_i = l_i$, so the term containing $F$
vanishes and for this case we have

1693: \begin{equation}

1694:   \label{eq:8}

1695:   (1 -c_i) D + l_i + (c_i - l_i)F    = l_i + (1 -c_i)( 1 - l_j).

1696: \end{equation}

1697:

1698: For the case where $|A_i|> 2$, which is the case we are interested in,

1699: we can plug in the values for $D$ and $F$ and simplify, in order to

1700: get the fourth term:

1701: \begin{equation}

1702:   \label{eq:33}

1703:   (1 -c_i) D + l_i + (c_i - l_i)F  = 1 - l_j +

1704:     \frac{c_i l_j(|A_i| -1) + l_i(1 - l_j) - c_i}{|A_i|-2}.

1705: \end{equation}
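The simplification leading to Eq.~\eqref{eq:33} can be checked step by
step. Writing $\frac{|A_i| - 3}{|A_i| - 2} = 1 - \frac{1}{|A_i| - 2}$,
we have $F = 1 - \frac{1 - l_j}{|A_i| - 2}$, and therefore
\begin{align*}
  (1 - c_i) D + l_i + (c_i - l_i) F
  &= (1 - c_i)(1 - l_j) + c_i - \frac{(c_i - l_i)(1 - l_j)}{|A_i| - 2} \\
  &= 1 - l_j + c_i l_j - \frac{(c_i - l_i)(1 - l_j)}{|A_i| - 2} \\
  &= 1 - l_j + \frac{c_i l_j (|A_i| - 2) - (c_i - l_i)(1 - l_j)}{|A_i| - 2} \\
  &= 1 - l_j + \frac{c_i l_j (|A_i| - 1) + l_i (1 - l_j) - c_i}{|A_i| - 2}.
\end{align*}
As a spot check, for $|A_i| = 3$ and $c_i = l_i$ both the first and
the last line reduce to $1 - l_j + c_i l_j$.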

1706:

1707: %\bibliographystyle{plain}

1708: \bibliographystyle{plain}

1709: \bibliography{../annobib,../vidal}

1710: %\bibliography{clri-bib}

1711:

1712: \end{document}

1713: