1: %$Id: clri.tex,v 1.9 2000/10/23 16:02:36 jmvidal Exp jmvidal $
2: %
3: \input{shortcuts.tex}
4: %\documentclass{elsart}
5: \documentclass{article}
6: \usepackage{natbib,amsmath,psfrag,graphicx, subfigure, url, fancyhdr,hyperref}
7: \usepackage[all]{xy}
8: 
9: \hypersetup{bookmarksopen,
10:   bookmarksnumbered, 
11: % These appear on the File->Document Info 
12:   pdftitle={Predicting the Expected Behavior of Agents that Learn About Agents: The CLRI Framework},
13:   pdfauthor={Jose M. Vidal and Edmund H. Durfee},
14:   pdfsubject={Learning in MASs},
15:   pdfkeywords={Learning, MAS}, 
16:   pdfview=FitH,
17:   pdfstartview=FitH 
18: }
19: 
20: 
21: %\runauthor{Vidal and Durfee}
22: %\begin{frontmatter}
23: 
24: \title{Predicting the Expected Behavior of Agents that Learn About
25:   Agents: The CLRI Framework}
26: 
27: \author{Jos\'{e} M. Vidal and Edmund H. Durfee \\
28: Swearingen Engineering Center, University of South \\
29: Carolina, Columbia, SC, 29208 \\
30: Advanced Technology Laboratory, University of \\
31:   Michigan, Ann Arbor, MI, 48102}
32: \begin{document}
33: \maketitle
34: \begin{abstract}
35:   We describe a framework and equations used to model and predict the
36:   behavior of multi-agent systems (MASs) with learning agents. A
37:   difference equation is used for calculating the progression of an
38:   agent's error in its decision function, thereby telling us how the
39:   agent is expected to fare in the MAS. The equation relies on
40:   parameters which capture the agent's learning abilities, such as its
41:   change rate, learning rate and retention rate, as well as relevant
42:   aspects of the MAS such as the impact that agents have on each
43:   other. We validate the framework with experimental results using
44:   reinforcement learning agents in a market system, as well as with
45:   other experimental results gathered from the AI literature. Finally,
46:   we use PAC-theory to show how to calculate bounds on the values of
47:   the learning parameters.
48: \end{abstract}
49: %\begin{keyword}
50:   Multi-Agent Systems, Machine Learning, Complex Systems.
51: %\end{keyword}
52: %\end{frontmatter}
53: 
54: \thispagestyle{fancy}
55: \lhead{}
56: \chead{Autonomous Agents and Multiagent Systems, January, 2003.}
57: \cfoot{\copyright{} Kluwer Academic Publishers 2002.}
58: 
59: \section{Introduction}
60: \label{sec:Introduction}
61: 
62: With the steady increase in the number of multi-agent systems (MASs)
63: with learning agents \cite{durfee:97,chavez:96,etzioni:96,stone97a},
64: the analysis of these systems is becoming increasingly important.
65: Some of the research in this area consists of experiments where a
66: number of learning agents are placed in a MAS, then different learning
67: or system parameters are varied and the results are gathered and
68: analyzed in an effort to determine how changes in the individual agent
69: behaviors will affect the system behavior.  We have learned about the
70: dynamics of market-based MASs using this approach \cite{vidal:98b}.
71: However, in this article we will take a step beyond these
72: observation-based experimental results and describe a framework that
73: can be used to model and predict the behavior of MASs with learning
74: agents.  We give a difference equation that can be used to calculate
75: the progression of an agent's error in its decision function.  The
76: equation relies on the values of parameters which capture the agents'
77: learning abilities and the relevant aspects of the MAS. We validate
78: the framework by comparing its predictions with our own experimental
79: results and with experimental results gathered from the AI literature.
80: Finally, we show how to use probably approximately correct (PAC)
81: theory to get bounds on the values of some of the parameters.
82: 
83: The types of MAS we study are exemplified by the abstract
84: representation shown in Figure~\ref{fig:problem}. We assume that the
85: agents observe the physical state of the world (denoted by $w$ in the
86: figure) and take some action ($a$) based on their observation of the
87: world state. An agent's mapping from states to actions is denoted by
88: the decision function ($\delta$) inside the agent.  Notice that the
89: ``physical'' state of the world includes everything that is directly
90: observable by the agent using its sensors. It could include facts such
91: as a robot's position, or the set of outstanding bids in an auction,
92: depending on the domain. The agent does not know the decision
93: functions of the other agents.  After taking action an agent can
94: change its decision function as prescribed by whatever
95: machine-learning algorithm the agent is using.
96: 
97: \begin{figure}
98:   \begin{center}
99:     \psfrag{d1}{$\delta_1$}
100:     \psfrag{d2}{$\delta_2$}
101:     \psfrag{d3}{$\delta_3$}
102:     \psfrag{d}{{\tiny $\delta$}}
103:     \includegraphics{problem}
104:     \caption{The agents in a MAS.}
105:     \label{fig:problem}
106:   \end{center}
107: \end{figure}
108: 
109: We have a situation where agents are changing their decision function
110: based on the effectiveness of their actions. However, the
111: effectiveness of their actions depends on the other agents' decision
112: functions.  This scenario leads to the immediate problem: if all the
113: agents are changing their decision functions then it is not clear
114: what will happen to the system as a whole. Will the system settle to
115: some sort of equilibrium where all agents stop changing their decision
116: functions? How long will it take to converge? How do the agents'
117: learning abilities influence the system's behavior and possible
118: convergence? These are some of the questions we address in this
119: article.
120: 
121: Section~\ref{sec:Fram-Model-MASs} presents our framework for
122: describing an agent's learning abilities and the error in its
123: behavior.  Section~\ref{sec:Calc-Agent's-Error} presents an equation
124: that can be used to predict an agent's expected error, as a function
125: of time, when the agent is in a MAS composed of other learning agents.
126: This equation is simplified in Section~\ref{sec:Assum-Cond-Indep} by
127: making some assumptions about the type of MAS being modeled.
128: Section~\ref{sec:Volatility} then defines the last few parameters used
129: in our framework---volatility and impact.
130: Section~\ref{sec:Example-with-2} gives an illustrative example of the
131: use of the framework.  The predictions made by our framework are
132: verified by our own experiments, as shown in
133: Section~\ref{sec:Simple-Application}, and with the experiments of
134: others, as shown in Section~\ref{sec:Appl-our-Theory}. The use of PAC
135: theory for determining bounds on the learning parameters is detailed
136: in Section~\ref{sec:Bound-Learn-Rate}. Finally,
137: Section~\ref{sec:Related-Work} describes some of the related work and
138: Section~\ref{sec:Summary} summarizes our claims.
139: 
140: 
141: \section{A Framework for Modeling MASs}
142: \label{sec:Fram-Model-MASs}
143: 
144: 
145: In order to analyze the behaviors of agents in MASs composed of
146: learning agents, we must first construct a formal framework for
147: describing these agents. The framework must state any assumptions and
148: simplifications it makes about the world, the agents, and the agents'
149: behaviors. It must also be mathematically precise, so as to allow us
150: to make quantitative predictions about expected behaviors. Finally,
151: the simplifications brought about because of the need for mathematical
152: precision should not be so constraining that they prevent the
153: applicability of the framework to a wide variety of learning
154: algorithms and different types of MASs.  We now describe our framework
155: and explain the types of MASs and learning behaviors that it can
156: capture.
157: 
158: \subsection{The World and its Agents}
159: \label{sec:World-its-Agents}
160: 
161: A MAS consists of a finite number of agents, actions, and world
162: states. We let $N$ denote the finite set of agents in the system.  $W$
163: denotes the finite set of world states.  Each agent is assumed to have
164: a set of perceptors (e.g., a camera, microphone, bid queue) with which
165: it can perceive the world. An agent uses its sensors to ``look'' at the
166: world and determine which world state $w$ it is in; the set of all
167: these states is $W$.  $A_i$, where $|A_i| \geq 2$, denotes the finite
168: set of actions agent $i \in N$ can take.
169: 
170: We assume discrete time, indexed in the various functions by the
171: superscript $t$, where $t$ is an integer greater than or equal to 0.
172: The assumption of discrete time is made, for practical reasons, by a
173: number of learning algorithms. It means that, while the world might be
174: continuous, the agents perceive and learn in separate discrete
175: time steps.
176: 
177: We also assume that there is only one world state $w$ at each time,
178: which all the agents can perceive in its completeness. That is, we
179: assume the environment is accessible (as defined in
180: \cite[p.~46]{ai:modern:approach}).  This assumption holds for market
181: systems in which all the actions of all the agents are perceived by
182: all the agents, and for software agent domains in which all the agents
183: have access to the same information. However, it might not hold for
184: robotic domains where one agent's view of the world might be obscured
185: by some physical obstacle. Even in such domains, it is possible that
186: there is a strong correlation between the states perceived by each
187: agent. These correlations could be used to create equivalency classes
188: over the agents' perceived states, and these classes could then be
189: used as the states in $W$.
190: 
191: Finally, we assume the environment is deterministic
192: \cite[p.~46]{ai:modern:approach}. That is, the agents' combined actions
193: will always have the expected effect. Of course, agent $i$ might not
194: know what action agent $j$ will take so $i$ might not know the
195: eventual effect of its own individual action.
196: 
197: \subsection{A Description of Agent Behavior}
198: \label{sec:Agents}
199: 
200: In the types of MASs we are modeling, every agent $i$ perceives the
201: state of the world $w$ and takes an action $a_i$, at each time step.
202: We assume that every agent's \textbf{behavior}, at each moment in
203: time, can be described with a simple state-to-action mapping. That is,
204: an agent's choice of action is solely determined by its current
205: state-to-action mapping and the current world $w$.
206: 
207: Formally, we say that agent $i$'s behavior is represented by a
208: \textbf{decision function} (also known as a ``policy'' in control
209: theory and a ``strategy'' in game theory), given by $\delta_{i}^t:W
210: \ra A_i$. This function maps each state $w \in W$ to the action $a_i
211: \in A_i$ that agent $i$ will take in that state, at time $t$. This
212: function can effectively describe any agent that deterministically
213: chooses its action based on the state of the world. Notice that the
214: decision function is indexed with the time $t$. This allows us to
215: represent agents that change their behavior.
216: 
217: The action agent $i$ \emph{should} take in each state $w$ is given by
218: the \textbf{target function} $\Delta_{i}^t:W \ra A_i$, which also maps
219: each state $w \in W$ to an action $a_i \in A_i$. The agent does not
220: have direct access to its target function. The target function is used
221: to determine how well an agent is doing. That is, it represents the
222: ``perfect'' behavior for a given agent. An agent's learning task is to
223: get its decision function to match its target function as much as
224: possible.
225: 
226: Since the choice of action for agent $i$ often depends on the actions
227: of other agents, the target function for $i$ needs to take these
228: actions into account. That is, in order to generate \Dit{}, one would
229: need to know \djtw{} for all $j \in N_{-i}$ and $w \in W$.  These
230: \djtw{} functions tell us the actions that all the other agents will
231: take in every state $w$.  For example, in order for one to determine
232: what an agent should bid in every world $w$ of an auction-based market
233: system, one will need to know what the other agents will bid in every
234: world $w$. One can use these actions, along with the state $w$, in
235: order to identify the best action for $i$ to take.
236: 
237: \begin{figure}
238:   \centerline{ 
239:     \xymatrix{ *\txt{New world $w^t \in \D$} \ar[r] & *\txt{Perceive world $w^t$} 
240:       \ar[r] & *\txt{Take action $\delta_i(w^t)$}\ar[d] \\
241:       & *\txt{Learn}\ar[lu]_{t \leftarrow t + 1} & *\txt{Receive payoff\\ or feedback.}\ar[l]
242:       }}
243:     \caption{Action/Learn loop for an agent.}
244:     \label{fig:action-learn}
245: \end{figure}
246: 
247: 
248: An agent's \ditw{} can change over time, so that $\ditt \neq \dit$.
249: These changes in an agent's decision function reflect its learned
250: knowledge. The agents in the MASs we consider are engaged in the
251: discrete action/learn loop shown in Figure~\ref{fig:action-learn}. The
252: loop works as follows: At time $t$ the agents perceive a world $w^t
253: \in W$ which is drawn from a fixed distribution \Dw{}.  They then each
254: take the action dictated by their \dit{} functions; all of these
255: actions are assumed to be taken effectively in parallel. Lastly, they
256: each receive a payoff which their respective learning algorithms use
257: to change the \dit{} so as to, hopefully, better match \Dit{}.  By
258: time $t+1$, the agents have new \ditt{} functions and are ready to
259: perceive the world again and repeat the loop.  Notice that, at time
260: $t$, an agent's \Dit{} is derived by taking into account the \djt{} of
261: all other agents $j \in N_{-i}$.
262: 
263: We assume that \Dw{} is a fixed probability distribution from which we
264: take the worlds seen at each time. This assumption is not unreasonably
265: limiting.  For example, in an economic domain where the new state is
266: the new good being offered, or in an episodic domain where the agents
267: repeatedly engage in different games (e.g. a Prisoner's Dilemma
268: competition) there is no correlation between successive world states
269: or between these states and the agents' previous actions.  However, in
270: a robotic domain one could argue that the new state of the world will
271: depend on the current state of the world; after all, the agents
272: probably move very little each time step.
273: 
274: Our measure of the correctness of an agent's behavior is given by our
275: \textbf{error} measure. We define the error of agent $i$'s decision
276: function \ditw{} as
277: \begin{equation}
278:   \label{eq:error}
279:   \begin{split}
280:     \edit &= \sum_{w \in W} \Dw \Pro[\ditw \neq \Ditw] \\
281:     &= \Pro_{w \in \D} [\ditw \neq \Ditw].
282:   \end{split}
283: \end{equation}
284: 
285: \edit{} gives us the probability that agent $i$ will take an incorrect
286: action; it is in keeping with the error definition used in
287: computational learning theory \cite{intro:clt}.  We use it to gauge
288: how well agent $i$ is performing. An error of 0 means that the agent
289: is taking all the actions dictated by its target function. An error of
290: 1 means that the agent never takes an action as dictated by its target
291: function. Each action the agent takes is either correct or incorrect,
292: that is, it either matches the target function or it does not. We do
293: not model degrees of incorrectness. However, since the error is
294: defined as the average over all possible world states, an agent that
295: takes the correct action in most world states will have a small error.
296: Extending the theory to handle degrees of incorrectness is one of
297: the subjects of our continuing work, see
298: Section~\ref{sec:future-work}. All the notation from this section is
299: summarized in Figure~\ref{fig:summary-notation}.
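
As a purely illustrative instance of Eq.~\eqref{eq:error}, suppose that
$W = \{w_1, w_2, w_3\}$, that these worlds are drawn with probabilities
$0.5$, $0.3$, and $0.2$ respectively, and that agent $i$ currently
matches its target function in $w_1$ and $w_2$ but not in $w_3$. These
values are hypothetical, chosen only for the example, and are not taken
from any experiment. The error is then
\begin{equation*}
  \edit = 0.5 \cdot 0 + 0.3 \cdot 0 + 0.2 \cdot 1 = 0.2,
\end{equation*}
so the agent is expected to act incorrectly in one time step out of
five, even though it is wrong in one of its three world states. The
error weights each state by how often it is seen, not by how many
states are wrong.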
300: 
301: \begin{figure}
302:   \begin{center}
303:     \fbox{ \parbox{4.5in}{
304:         \begin{description}
305:         \item[$N$] the set of all agents, where $i \in N$ is one
306:           particular agent.
307:         \item[$W$] the set of possible states of the world, where $w
308:           \in W$ is one particular state.
309:         \item[$A_i$] the set of all actions that agent $i$ can take.
310:         \item[$\delta_i^t: W \ra A_i$] the \textbf{decision} function
311:           for agent $i$ at time $t$. It tells which action agent $i$
312:           will take in each world.
313:         \item[$\Delta_i^t: W \ra A_i$] the \textbf{target} function
314:           for agent $i$ at time $t$. It tells us what action agent $i$
315:           should take. It takes into account the actions that other
316:           agents will take.
317:         \item[$e(\delta_i^t)$] $= \Pro[\delta_i^t(w) \neq
318:           \Delta_i^t(w) \,|\, w \in \D]$ the \textbf{error} of agent
319:           $i$ at time $t$. It is the probability that $i$ will take an
320:           incorrect action, given that the worlds $w$ are taken from
321:           the fixed probability distribution \D.
322:         \end{description}
323:         } }
324:     \caption{Summary of notation used for describing a MAS and the agents in it.}
325:     \label{fig:summary-notation}
326:   \end{center}
327: \end{figure}
328: 
329: 
330: \subsection{The Moving Target Function Problem}
331: \label{sec:Moving-Targ-Funct}
332: 
333: 
334: The learning problem the agent faces is to change its \ditw{} so that
335: it matches \Ditw{}.  If we imagine the space of all possible decision
336: functions, then agent $i$'s \dit{} and \Dit{} will be two points in
337: this space, as shown in Figure~\ref{fig:trad-learn}.  The agent's
338: learning problem can then be re-stated as the problem of moving its
339: decision function as close as possible to its target function, where
340: the distance between the two functions is given by the error \edit{}.
341: This is the traditional machine learning problem.
342: 
343: \begin {figure}[thbp]
344:   \centerline{
345:     \xymatrix {
346:       & &  \ditt{}  \ar@{~}[rrrd]^{e(\ditt)}
347:       & & &  \\
348:       \dit{} \ar[rru]^{\txt{Learn}} \ar@{~}[rrrrr]_{e(\dit)} &&
349:       & & &  \Delta_i
350:       } }
351:   \caption{The traditional learning problem.}
352:   \label{fig:trad-learn}
353: \end{figure}
354: 
355: However, once agents start to change their decision functions (i.e.,
356: change their behaviors) the problem of learning becomes more
357: complicated because these changes might cause changes in the other
358: agents' target functions. We end up with a moving target function, as
359: seen in Figure~\ref{fig:learn-mas}. In these systems, it is not clear
360: if the error will ever reach 0 or, more generally, what the expected
361: error will be as time goes to infinity. Determining what will happen
362: to an agent's error in such a system is what we call the
363: \textbf{moving target function problem}, which we address in this
364: article. However, we will first need to define some parameters that
365: describe the capabilities of an agent's learning algorithm.
366: 
367: \begin{figure}[thbp]
368:   \centerline{
369:     \xymatrix{
370:       & &  \ditt{}  \ar@{~}[rrrrd]^{e(\ditt)}
371:       & & & \\
372:       \dit{} \ar[rru]^{\txt{Learn}} \ar@{~}[rrrrr]_{e(\dit)} &&
373:       & & & \Dit \ar[r]_{\txt{Move}} & \Ditt
374:       }}
375:   \caption{The learning problem in learning MASs.}
376:   \label{fig:learn-mas}
377: \end{figure}
378: 
379: \subsection{A Model of Learning Algorithms}
380: \label{sec:Model-Learn-Algor}
381: 
382: An agent's learning algorithm is responsible for changing \dit{} into
383: \ditt{} so that it is a better match of \Dit{}. Different machine
384: learning algorithms will achieve this match with different degrees of
385: success.  We have found a set of parameters that can be used to model
386: the effects of a wide range of learning algorithms. The parameters are:
387: Change rate, Learning rate, Retention rate, and Impact. They are
388: explained in this section, except for Impact, which is
389: introduced in Section~\ref{sec:Volatility}. These parameters, along
390: with the equations we provide, form the \textbf{CLRI} framework (the
391: letters correspond to the first letter of the parameters' names).
392: 
393: After agent $i$ takes an action and receives some payoff, it activates
394: its learning algorithm, as we showed in Figure~\ref{fig:action-learn}.
395: The learning algorithm is responsible for using this payoff in order
396: to change \dit{} into \ditt{}, making \ditt{} match \Dit{} as much as
397: possible. We can expect that for some $w$ it was true that $\ditw =
398: \Ditw$, while for some other $w$ this was not the case. That is,
399: some of the $w \ra a_i$ mappings given by \ditw{} might have been
400: incorrect.  In general, a learning algorithm might affect both the
401: correct and incorrect mappings. We will treat these two cases
402: separately.
403: 
404: We start by considering the incorrect mappings and define the
405: \textbf{change rate} of the agent as the probability that the agent
406: will change at least one of its incorrect mappings. Formally, we
407: define the change rate $c_i$ for agent $i$ as
408: \begin{equation}
409:   \label{eq:54} 
410:   \forallb_{w} \; \Pro[\dittw \neq \ditw \, |\, \ditw \neq \Ditw] = c_i.
411: \end{equation}
412: 
413: The change rate tells us the likelihood of the agent changing an
414: incorrect mapping into something else. This ``something else'' might
415: be the correct action, but it could also be another incorrect action.
416: The probability that the agent changes an incorrect mapping to the
417: correct action is called the \textbf{learning rate} of the agent.
418: It is defined as $l_i$ where
419: \begin{equation}
420:   \label{eq:51}
421:   \forallb_{w} \; \Pro[\dittw = \Ditw \, | \, \ditw \neq
422:   \Ditw] = l_{i}.
423: \end{equation}
424: %When determining the value of $l_i$, for a particular agent, one must
425: %remember that the worlds seen at each time step are taken from \Dw{}.
426: There are two constraints which must always be satisfied by these two
427: rates. Since changing to the correct mapping implies that a change was
428: made, the value of $l_i$ must be less than or equal to $c_i$, that is,
429: $l_i \leq c_i$ must always be true.  Also, if $|A_i| = 2$ then $c_i =
430: l_i$ since there are only two actions available, so the one that is
431: not wrong must be right. 
432: 
433: The complementary value for the learning rate is $1- l_i$ and refers
434: to the probability that an incorrect mapping does not get changed to a
435: correct one.  For example, a learning rate of $l_i = 0.5$ means that if
436: agent $i$ initially has all mappings wrong, it is expected to make half of them
437: match the original target function after the first iteration.
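
To make the relation between the change rate and the learning rate
concrete, consider a hypothetical agent with $|A_i| = 3$, $c_i = 0.6$,
and $l_i = 0.4$ (values assumed for illustration only). For any mapping
that is incorrect at time $t$, the three possible outcomes at time
$t+1$, judged against the target function at time $t$, have
probabilities
\begin{align*}
  \Pro[\text{mapping left unchanged}] &= 1 - c_i = 0.4, \\
  \Pro[\text{changed to the correct action}] &= l_i = 0.4, \\
  \Pro[\text{changed to another incorrect action}] &= c_i - l_i = 0.2,
\end{align*}
which sum to one. The constraint $l_i \leq c_i$ is what keeps the third
probability from becoming negative.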
438: 
439: We now consider the agent's correct mappings and define the
440: \textbf{retention rate} as the probability that a correct mapping will
441: stay correct in the next iteration. The retention rate is given by
442: $r_i$ where 
443: \begin{equation}
444:   \label{eq:52}
445:   \forallb_{w} \; \Pro[\dittw = \Ditw \, | \, \ditw = \Ditw]
446:  = r_i.
447: \end{equation}
448: We propose that the behavior of a wide variety of learning algorithms
449: can be captured (or at least approximated) using appropriate values
450: for $c_i$, $l_i$, and $r_i$. Notice, however, that these three rates
451: assume that the $w \ra a$ mappings that change are independent of the
452: $w$ that was just seen. We can justify this independence by noting
453: that most learning algorithms usually perform some form of
454: generalization.  That is, after observing one world state $w$ and the
455: payoff associated with it, a typical learning algorithm is able to
456: generalize what it learned to some other world states. This
457: generalization is reflected in the fact that the change, learning, and
458: retention rates apply to all $w$'s. However, a more precise model
459: would capture the fact that, in some learning algorithms, the mapping
460: for the world state that was just seen is more likely to change than
461: the mapping for any other world state.
462: 
463: The rates are not time dependent because we assume that agents use one
464: learning algorithm during their lifetimes. The rates capture the
465: capabilities of this learning algorithm and, therefore, do not need to
466: vary over time. 
467: 
468: Finally, we define \textbf{volatility} to mean the probability that
469: the target function will change from time $t$ to time $t+1$. Formally,
470: volatility is given by $v_i$ where
471: \begin{equation}
472:   \label{eq:53}
473:   \forallb_{w} \; \Pro[\Dittw \neq \Ditw] = v_i.
474: \end{equation}
475: In Section~\ref{sec:Volatility}, we will show how to calculate $v_i$
476: in terms of the error of the other agents. We will then see that
477: volatility is not a constant but, instead, varies with time.
478: 
479: 
480: \section{Calculating the Agent's Error}
481: \label{sec:Calc-Agent's-Error}
482: 
483: We now wish to write a difference equation that will let us calculate
484: the agent's expected error, as defined in Eq.~\eqref{eq:error}, at
485: time $t+1$ given the error at time $t$ and the other parameters we
486: have introduced. We can do this by observing that there are two
487: conditions that determine the new error: whether $\Dittw{} = \Ditw{}$
488: or not, and whether $\ditw{} = \Ditw{}$ or not.  If we define $a
489: \equiv \Dittw{} = \Ditw{}$, and $b \equiv \ditw{} = \Ditw{}$, we can
490: then say that we need to consider the four cases where: $a \wedge b$,
491: $a \wedge \neg b$, $\neg a \wedge b$, and $\neg a \wedge \neg
492: b$. Formally, this implies that
493: \begin{equation}
494:   \label{eq:1}
495:   \begin{split}
496:     \Pro&[\dittw \neq \Dittw] = \\
497:     & \Pro[\dittw \neq \Dittw \wedge a \wedge b] + 
498:     \Pro[\dittw \neq \Dittw \wedge a \wedge \neg b] + \\
499:     & \Pro[\dittw \neq \Dittw \wedge \neg a \wedge b] + 
500:     \Pro[\dittw \neq \Dittw \wedge \neg a \wedge \neg b],
501:   \end{split}
502: \end{equation}
503: since the four cases are mutually exclusive and exhaustive. Applying the chain
504: rule of probability, we can rewrite each of the four terms in order to
505: get
506: \begin{equation}
507:   \label{eq:2}
508:   \begin{split}
509:     \Pro[\dittw \neq \Dittw] &= \Pro[a \wedge b] \cdot \Pro[\dittw \neq \Dittw \,|\, a\wedge b] + \\
510:     & \Pro[a \wedge \neg b] \cdot \Pro[\dittw \neq \Dittw \,|\, a\wedge
511:   \neg b] + \\
512:   & \Pro[\neg a \wedge b] \cdot \Pro[\dittw \neq \Dittw \,|\, \neg
513:   a\wedge  b] + \\
514:   & \Pro[\neg a \wedge \neg b] \cdot \Pro[\dittw \neq \Dittw \,|\, \neg
515:   a\wedge \neg b].
516: \end{split}
517: \end{equation}
518: We can now find values for these conditional probabilities. We start
519: with the first term where, after replacing the values of $a$ and $b$,
520: we find that
521: \begin{equation}
522:   \label{eq:3}
523:   \Pro[\dittw \neq \Dittw \,|\, \Dittw = \Ditw \wedge \ditw = \Ditw] = 1 - r_i.
524: \end{equation}
525: Since the target function does not change from time $t$ to $t+1$ and
526: the agent was correct at time $t$, the agent will also be correct at
527: time $t+1$; \emph{unless} it changes its correct $w \ra a$ mapping. The 
528: agent changes this mapping with probability $1 - r_i$. 
529: 
530: The value for the second conditional probability is
531: \begin{equation}
532:   \label{eq:4}
533:   \Pro[\dittw \neq \Dittw \,|\, \Dittw = \Ditw \wedge \ditw \neq \Ditw] = 1 - l_i.
534: \end{equation}
535: In this case the target function still stays the same but the agent
536: was incorrect. If the agent was incorrect then it will change its
537: decision function to match the target function with probability
538: $l_i$. Therefore, the probability that it will be incorrect next time
539: is the probability that it does not make this change, or $1 -
540: l_i$. 
541: 
542: The third probability has a value of
543: \begin{equation}
544:   \label{eq:5}
545:   \begin{split}
546:   \Pro&[\dittw \neq \Dittw \,|\, \Dittw \neq \Ditw \wedge \ditw =
547:   \Ditw] \\
548:   &= ( r_i + (1 - r_i) \cdot B).
549:   \end{split}
550: \end{equation}
551: In this case the agent was correct and the target function
552: changes. This means that if the agent retains the same mapping,
553: which it does with probability $r_i$, then the agent will definitely be
554: incorrect at time $t+1$. If it does not retain the same mapping, which
555: happens with probability $1-r_i$, then it will be incorrect with
556: probability $B$, where
557: \begin{equation}
558:   \begin{split}
559:   B = \Pro[ &\dittw \neq \Dittw | \ditw = \Ditw \wedge \Dittw \neq
560:   \Ditw \label{eq:b}\\
561:   & \wedge \dittw \neq \Ditw].
562:   \end{split}
563: \end{equation}
564: Finally, the fourth conditional probability has a value of
565: \begin{equation}
566:   \label{eq:10}
567:   \begin{split}
568:   \Pro[&\dittw \neq \Dittw \,|\, \Dittw \neq \Ditw \wedge \ditw \neq
569:   \Ditw] \\
570:   &= (1 - c_i)D + l_i +  (c_i - l_i)F,    
571:   \end{split}
572: \end{equation}
573: where
574: \begin{align}
575:   D &= \Pro[ \ditw \neq \Dittw | \ditw \neq \Ditw \wedge \Dittw \neq
576:   \Ditw] \label{eq:d}\\
577:   F &= \Pro[ \dittw \neq \Dittw | \ditw \neq \Ditw \wedge \Dittw \neq
578:   \Ditw \label{eq:f} \\
579:   & \quad \quad \wedge \dittw \neq \Ditw \wedge \dittw \neq \ditw]. \nonumber
580: \end{align}
581: This is the case where the target function changes and the agent was
582: wrong. We have to consider three possibilities. The first possibility
583: is for the agent not to change its decision function, which happens
584: with probability $1 - c_i$. The probability that the agent will be
585: incorrect in this case is given by $D$. The second possibility, when
586: the agent changes its mapping to match the old target \Ditw{}, has a
587: probability of $l_i$ and, because the target has moved, ensures that the agent will be incorrect the
588: next time.  The third possibility happens, with probability $c_i -
589: l_i$, when the agent changes its mapping to an incorrect value. In this
590: case, the probability that it will be wrong next time is given by $F$.
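
As a rough numerical illustration of Eq.~\eqref{eq:10}, take a
hypothetical agent with $|A_i| = 3$, $c_i = 0.6$, and $l_i = 0.4$.
Assume, purely for the sake of the example, that the new target action
is equally likely to be either of the two actions that differ from the
old target, independently of what the agent does. Under that assumption
$D = F = 1/2$, and
\begin{equation*}
  (1 - c_i)D + l_i + (c_i - l_i)F
  = 0.4 \cdot 0.5 + 0.4 + 0.2 \cdot 0.5 = 0.7.
\end{equation*}
The uniformity assumption and the parameter values are made only for
this sketch. In general, $D$ and $F$ depend on the particular MAS being
modeled.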
591: 
592: We can substitute Eqs.~\eqref{eq:3}, \eqref{eq:4}, \eqref{eq:5}, and
593: \eqref{eq:10} into Eq.~\eqref{eq:2}, substitute the values of $a$ and
594: $b$, and expand $\Pro[a \wedge b]$ into $\Pro[a \,|\,b] \cdot
595: \Pro[b]$, in order to get
596: \begin{equation} 
597:   \label{eq:main:general}
598:   \begin{split}
599:   E&[\editt] = E[\sum_{w \in W} \Dw \Pro[\dittw \neq \Dittw]] = \sum_{w \in W} \Dw ( \\
600:   &\quad \Pro[\Dittw = \Ditw | \ditw = \Ditw] \cdot \Pro[\ditw = \Ditw] \cdot ( 1 -r_i) \\
601:   &+ \Pro[\Dittw = \Ditw | \ditw \neq \Ditw]\cdot \Pro[\ditw \neq \Ditw] \cdot (1 - l_i) \\
602:   &+ \Pro[\Dittw \neq \Ditw | \ditw = \Ditw ] \\
603:   & \quad \cdot  
604:     \Pro[\ditw = \Ditw] \cdot
605:     \left(
606:       r_i + (1 - r_i) \cdot B
607: %      \left(
608: %        \frac{|A_i| - 2}{|A_i| - 1}
609: %      \right)
610:     \right) \\
611:  &+ \Pro[\Dittw \neq \Ditw | \ditw \neq \Ditw ]  \cdot \Pro[\ditw \neq
612:  \Ditw] \\
613:  & \quad \cdot  \left( (1 - c_i)D + l_i +  (c_i - l_i)F \right) ).
614:   \end{split}
615: \end{equation}
616: Equation~\eqref{eq:main:general} will model any MAS whose agent
617: learning can be described with the parameters presented in
618: Section~\ref{sec:Model-Learn-Algor} and whose action/learn loop is the
619: same as we have described. We can use Eq.~\eqref{eq:main:general} to
620: calculate the successive expected errors for agent $i$, given values
621: for all the parameters and probabilities. In the next section we show
622: how this is done in a simple example game.
623: 
624: \subsection{The Matching game}
625: \label{sec:Matching-game}
626: 
627: In this matching game we have two agents $i$ and $j$ each of whom, in
628: every world $w$, wants to play the same action as the other one. Their
629: set of actions is $A_i = A_j$, where we assume $|A_i| > 2$ (for $|A_i|
630: = 2$ the equation is simpler).  After every time step, the agents both
631: learn and change their decision functions in accordance with their
632: learning rates, retention rates, and change rates.  Since the agents
633: are trying to match each other, in this game it is always true that
634: $\Delta_i^t(w)= \delta_j^t(w)$ and $\Delta_j^t(w) = \delta_i^t(w)$.
635: Given all this information, we can find values for some of the
636: probabilities in Eq.~\eqref{eq:main:general} (including values
637: for Eqs.~\eqref{eq:b}, \eqref{eq:d}, and \eqref{eq:f}) and rewrite it
638: (see Appendix~\ref{sec:Deriving-C-matching} for the derivation) as:
639: \begin{equation} 
640:   \label{eq:main:matching}
641:   \begin{split}
642:   E&[\editt] = \sum_{w \in W} \Dw \{  r_j \cdot \Pro[\ditw = \Ditw] \cdot ( 1 -r_i) \\
643:   &+ (1 -c_j) \cdot \Pro[\ditw \neq \Ditw] \cdot (1 - l_i) \\
644:   &+ (1-r_j) \cdot  
645:     \Pro[\ditw = \Ditw] \cdot
646:   \left(
647:     r_i + (1 - r_i) \cdot
648:     \left(
649:       \frac{|A_i| - 2}{|A_i| - 1}
650:     \right)
651:   \right)\\
652:   &+ c_j \cdot \Pro[\ditw \neq \Ditw] \cdot 
653:   \left(
654:  1 - l_j + 
655:     \frac{c_i l_j(|A_i| -1) + l_i(1 - l_j) - c_i}{|A_i|-2} 
656:   \right) \}\\
657:   \end{split}
658: \end{equation} 
659: We can better understand this equation by plugging in some values and
660: simplifying. For example, let us assume that $r_i = r_j = 1$ and $l_i =
661: l_j = 1$, which implies that $c_i = c_j = 1$. This is the case where
662: the two agents always change all their incorrect mappings so as to
663: match their respective target functions at time $t$. That is, if we
664: had $\delta_i^t(w_1)= x$ and $\delta_j^t(w_1) = y$, then at time $t+1$
665: we will have $\delta_i^{t+1}(w_1) = y$ and $\delta_j^{t+1}(w_1) = x$.
666: This means that agent $i$ changes all its incorrect mappings to match
667: $j$, while $j$ changes to match $i$, so all the mappings stay wrong
668: after all (i.e., $i$ ends up doing what $j$ did before, while $j$ does
669: what $i$ did before). The error, therefore, stays the same.  We can
670: see this by plugging the values into Eq.~\eqref{eq:main:matching}.
671: The first three terms will become 0 and the fourth term will simplify
672: to the definition of error, as given by Eq.~\eqref{eq:error}. Since
673: the fourth term is the only one that is non-zero, we end up with
674: $E[\editt] = \edit$.
675: 
676: We can also let $c_i$ and $l_i$ (keeping $c_j = l_j = 1$) be 
677: arbitrary numbers, which gives us $ E[\editt] = c_i \edit$.  This
678: tells us that the error will drop faster for a smaller change rate
679: $c_i$. The reason is that $i$'s learning (remember $l_i \leq c_i$) in
680: this game is counter-productive because it is always made invalid by
681: $j$'s learning rate of $1$. That is, since $j$ is changing all its
682: mappings to match $i$'s actions, $i$'s best strategy is to keep its
683: actions the same (i.e., $c_i = 0$).
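
As a check on this last claim, the substitution can be written out. This is a sketch that assumes, as in the previous example, that $r_i = r_j = 1$ still holds; only $c_j = l_j = 1$ is stated explicitly above. Under these values the first three terms of Eq.~\eqref{eq:main:matching} vanish, and the factor multiplying $\Pro[\ditw \neq \Ditw]$ in the fourth term reduces to
\begin{equation*}
  c_j \left( 1 - l_j +
    \frac{c_i l_j(|A_i| -1) + l_i(1 - l_j) - c_i}{|A_i|-2} \right)
  = \frac{c_i(|A_i| - 1) - c_i}{|A_i| - 2} = c_i ,
\end{equation*}
so that $E[\editt] = c_i \sum_{w \in W} \Dw \Pro[\ditw \neq \Ditw] = c_i \edit$, independently of the value of $l_i$.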
684: 
685: 
686: \section{Further Simplification}
687: \label{sec:Assum-Cond-Indep}
688: 
689: We can further simplify Eq.~\eqref{eq:main:general} if we are willing to
690: make two assumptions. The first assumption is that the new actions
691: chosen when either \ditw{} changes (and does not match the target), or
692: when \Ditw{} changes, are both taken from flat probability
693: distributions over $A_i$. By making this assumption we can find
694: values for $B$, $D$, and $F$, namely:
695: \begin{alignat}{2}
696:   \label{eq:23}
697:   B = D & = \frac{|A_i| -2}{|A_i| - 1} & \qquad F & = \frac{|A_i| -3}{|A_i| - 2}
698: \end{alignat}
699: 
700: %which makes
701: %\begin{equation}
702: %  \label{eq:27}
703: %  C =  \frac{|A_i| -2 - c_i + 2l_i}{|A_i| - 1}
704: %\end{equation}
705: 
706: The second assumption we make is that the probability of \Ditw{}
707: changing, for a particular $w$, is independent of the
708: probability that \ditw{} was correct. In
709: Section~\ref{sec:Matching-game} we saw that in the matching game the
710: probabilities of \Ditw{} and \ditw{} changing were correlated since,
711: if \ditw{} was wrong then \djtw{} was also wrong, which meant $j$
712: would probably change \djtw{}, which would change \Ditw{}. 
713: 
714: However, the matching game is a degenerate example in exhibiting such
715: tight coupling between the agents' target functions. In general, we
716: can expect that there will be a number of MASs where the probability
717: that any two agents $i$ and $j$ are correct is uncorrelated (or
718: loosely correlated). For example, in a market system all sellers try
719: to bid what the buyer wants, so the fact that one seller bids the
720: correct amount says nothing about another seller's bid. Their bids are
721: all uncorrelated. In fact, the Distributed Artificial Intelligence
722: literature is full of systems that try to make the agents' decisions
723: as loosely-coupled as possible \cite{lesser:81,Liu:95}.
724: 
725: This second assumption we are trying to make can be formally
726: represented by having Eq.~\eqref{eq:cond:indep} be true for all pairs of
727: agents $i$ and $j$ in the system.
728: \begin{equation}
729:   \label{eq:cond:indep} %  
730:   \begin{split}
731:   \Pro[&\ditw = \Ditw \wedge \djtw = \Djtw] \\
732:   & = \Pro[\ditw = \Ditw] \cdot \Pro[\djtw = \Djtw]    
733:   \end{split}
734: \end{equation}
735: 
736: 
737: Once we make these two assumptions we can
738: rewrite Eq.~\eqref{eq:main:general} as:
739: \begin{equation} 
740:   \label{eq:main}
741:   \begin{split}
742:   E&[\editt]  = \sum_{w \in W} \Dw (  \Pro[\Dittw = \Ditw] \cdot ( \Pro[\ditw = \Ditw] \cdot ( 1 -r_i) \\
743:   & \qquad \qquad \qquad \qquad \qquad + \Pro[\ditw \neq \Ditw] \cdot (1 - l_i)) \\
744:   & \quad + \Pro[\Dittw \neq \Ditw] \cdot
745:     (\Pro[\ditw = \Ditw] \cdot
746:   \left(
747:     r_i + (1 - r_i) \cdot
748:     \left(
749:       \frac{|A_i| - 2}{|A_i| - 1}
750:     \right)
751:   \right)\\
752:   & \qquad \qquad \qquad \qquad \qquad + \Pro[\ditw \neq \Ditw] \cdot 
753:   \left(
754:     \frac{|A_i| -2 - c_i + 2l_i}{|A_i| - 1}
755:   \right))) \\
756:   \end{split}
757: \end{equation} 
758: Some of the probabilities in this equation are just the definition of
759: $v_i$, and others simplify to the agent's error. This means that we
760: can simplify Eq.~\eqref{eq:main} to:
761: \begin{multline}
762:   \label{eq:main:simp}
763:     E[\editt] = 1 - r_i + v_i
764:     \left(
765:       \frac{|A_i|r_i - 1}{|A_i| -1}
766:     \right) \\
767:     + \edit
768:     \left(
769:       r_i -l_i + v_i
770:       \left(
771:         \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| -1}
772:       \right)
773:     \right)
774: \end{multline}
775: 
776: Eq.~\eqref{eq:main:simp} is a difference equation that can be used to
777: determine the expected error of the agent at any time by simply using
778: $E[\editt]$ as the \edit{} for the next iteration. While it might look
779: complicated, it is just the function for a line $y = mx+b$ where $x =
780: \edit$ and $y= \editt$. Using this observation, and the fact that
781: \editt{} will always be between $0$ and $1$, we can determine that the
782: final convergence point for the error is the point
783: where Eq.~\eqref{eq:main:simp} intersects the line $y=x$. The only
784: exception is if the slope equals $-1$, in which case we will see the
785: error oscillating between two points.
786: \begin{figure}
787:   \psfrag{edit}{{\small \edit}}
788:   \psfrag{editt}{{\small \editt}}
789:   \psfrag{editt2}[B]{{\tiny \editt}}
790:   \psfrag{learning}[B]{{\tiny learning}}
791:   \psfrag{volatility}[B]{{\tiny volatility}}
792:   \begin{center}
793:     \includegraphics[width=3in]{errorprog}
794:     \caption{Error progression for agent $i$, assuming a fixed
795:       volatility $v_i = .2$, $c_i =1$, $l_i = .3$, $r_i = 1$, $|A_i| =
796:       20$. We show the error function (\editt{}), as well as its two
797:       components: learning and volatility. The line $y=x$ allows us to
798:       trace the agent's error as it starts at $.95$ and converges to
799:       $.44$.}
800:     \label{fig:errorprog}
801:   \end{center}
802: \end{figure}
803: 
804: By looking at Eq.~\eqref{eq:main:simp} we can also determine that
805: there are two ``forces'' acting on the agent's error: volatility and
806: the agent's learning abilities. The volatility tends to increase the
807: agent's error past its current value while the learning reduces it. We
808: can better appreciate this effect by separating the $v_i$ terms in
809: Eq.~\eqref{eq:main:simp} and plotting the $v_i$ terms (volatility) and
810: the rest of the terms (learning) as two separate lines. By definition,
811: these will add up to the line given by Eq.~\eqref{eq:main:simp}. We
812: have plotted these three lines and traced a sample error progression
813: in Figure~\ref{fig:errorprog}. The error starts at .95 and then
814: decreases to eventually converge to .44.  We notice the learning curve
815: always tries to reduce the agent's error, as confirmed by the fact
816: that its line always falls below $y=x$.  Meanwhile, the volatility
817: adds an extra error. This extra error is bigger when the agent's error
818: is small, since any change in the target function is then likely to
819: increase the agent's error.
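
The error progression traced in Figure~\ref{fig:errorprog} can be reproduced numerically by iterating Eq.~\eqref{eq:main:simp}. The following Python fragment is a minimal sketch under the caption's parameter values; the function names \texttt{next\_error} and \texttt{converge} and the number of iterations are our own illustrative choices, not part of the framework.
\begin{verbatim}
def next_error(e, v, c, l, r, n_actions):
    """One step of the simplified difference equation: a line in e."""
    a = n_actions
    # Terms that do not depend on the current error.
    intercept = 1 - r + v * (a * r - 1) / (a - 1)
    # Coefficient multiplying the current error.
    slope = r - l + v * (a * (l - r) + l - c) / (a - 1)
    return intercept + slope * e


def converge(e0, v, c, l, r, n_actions, steps=200):
    """Feed each expected error back in as the next current error."""
    e = e0
    for _ in range(steps):
        e = next_error(e, v, c, l, r, n_actions)
    return e


# Parameter values taken from the figure caption; start error is .95.
final = converge(0.95, v=0.2, c=1.0, l=0.3, r=1.0, n_actions=20)
print(round(final, 2))  # 0.44
\end{verbatim}
Because the map is a line with slope of magnitude below one for these values, the iteration settles on the intersection with $y = x$ regardless of the starting error.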
820: 
821: 
822: \section{Volatility and Impact}
823: \label{sec:Volatility}
824: Equation~\eqref{eq:main:simp} is useful for determining the agent's error
825: when we know the volatility of the system.  However, it is likely that
826: this value is not available to us (if we knew it we would already know
827: a lot about the dynamics of the system).  In this section we determine
828: the value of $v_i$ in terms of the other agents' changes in their
829: decision functions. That is, in terms of $\Pro[\djtt \neq \djt]$, for
830: all other agents $j$.
831: 
832: In order to do this we first need to define the \textbf{impact} \Iji{}
833: that agent $j$'s changes in its decision function have on $i$'s target
834: function.
835: \begin{equation}
836:   \label{eq:14}
837:  \forall_{w \in W} \;  I_{ji} = \Pro[\Dittw \neq \Ditw \,|\, \djttw \neq \djtw]
838: \end{equation}
839: 
840: We can now start to define volatility by first determining that, for
841: two agents $i$ and $j$
842: \begin{equation}
843:   \label{eq:15}
844:   \begin{split}
845:      \forall_{w \in W} \; v_{i}^{t} &= \Pro[\Dittw \neq \Ditw] \\
846:      &= \Pro[\Dittw \neq \Ditw \,|\, \djttw \neq \djtw] \cdot \Pro[\djttw
847:      \neq \djtw] \\
848:      &+ \Pro[\Dittw \neq \Ditw \,|\, \djttw = \djtw] \cdot \Pro[\djttw
849:      = \djtw]. \\
850:   \end{split}
851: \end{equation}
852: 
853: The reader should notice that volatility is no longer constant; it
854: varies with time (as recorded by the superscript). The first
855: conditional probability in Eq.~\eqref{eq:15} is just $I_{ji}$. The
856: second one we will set to $0$, since we are specifically interested in
857: MASs where the volatility arises \emph{only} as a side-effect of the
858: other agents' learning. That is, we assume that agent $i$'s target
859: function changes only when $j$'s decision function changes. For cases
860: with more than two agents, we similarly assume that one agent's target
861: function changes only when some other agent's decision function
862: changes.  That is, we ignore the possibility that outside influences
863: might change an agent's target function.
864: 
865: We can simplify Eq.~\eqref{eq:15} and generalize it to $N$ agents,
866: under the assumption that the other agents' changes in their decision
867: functions will not cancel each other out, making \Dit{} stay the same
868: as a consequence. $v_i^t$ then becomes
869: \begin{equation}
870:   \label{eq:16}
871:   \begin{split}
872:        \forall_{w \in W} \; v_{i}^{t} &= \Pro[\Dittw \neq \Ditw] \\
873:        &= 1 - \prod_{j \in N_{-i}}(1 - \Iji \Pro[\djttw \neq \djtw]). \\
874:   \end{split}
875: \end{equation}
876: 
877: We now need to determine the expected value of $\Pro[\djttw \neq
878: \djtw]$ for any agent. Using $i$ instead of $j$ we have
879: \begin{equation}
880:   \label{eq:55}
881:   \begin{split}
882:   \forall_{w \in W} \; \Pro[&\dittw \neq \ditw] \\
883:     &= \Pro{}[\ditw \neq \Ditw] \cdot \Pro[\dittw \neq \ditw \,|\, \ditw
884:     \neq \Ditw] \\
885:     &+ \Pro{}[\ditw = \Ditw] \cdot \Pro[\dittw \neq \ditw \,|\, \ditw
886:     = \Ditw], \\
887:   \end{split}
888: \end{equation}
889: where the expected value is:
890: \begin{equation}
891:   \label{eq:30}
892:   E[\Pro[\dittw \neq \ditw]] = c_i \edit + (1- r_i)\cdot(1 -
893:   \edit).
894: \end{equation}
895: 
896: We can then plug Eq.~\eqref{eq:30} into Eq.~\eqref{eq:16} in order to get the
897: expected volatility
898: \begin{equation}
899:   \label{eq:6}
900:  E[ v_{i}^{t}] =  1 - \prod_{j \in N_{-i}} \left( 1 - \Iji (c_j \edjt + (1- r_j)\cdot(1 -
901:   \edjt)) \right).
902: \end{equation}
903: 
904: We can use this expected value of $v_{i}^{t}$ in Eq.~\eqref{eq:main:simp}
905: in order to find out how the other agents' learning will affect agent
906: $i$. In MASs that have identical learning agents (i.e., their $c$, $l$,
907: $r$, and $I$ rates are all the same and they start with the same
908: initial error) we can replace the product in Eq.~\eqref{eq:6} with an
909: exponent of $|N| -1$. We use this simplification later in
910: Section~\ref{sec:Shoham-Tennenholtz}.
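
The expected volatility of Eq.~\eqref{eq:6} and its identical-agent simplification can be sketched in a few lines of Python. The function names and the tuple layout used to describe the other agents are hypothetical conveniences; only the formulas come from Eqs.~\eqref{eq:30} and \eqref{eq:6}.
\begin{verbatim}
def change_prob(e, c, r):
    """Expected probability that an agent with error e changes its
    decision function, given its change rate c and retention rate r."""
    return c * e + (1 - r) * (1 - e)


def expected_volatility(others):
    """Expected volatility for agent i.

    others: list of (impact_on_i, error, c, r), one per other agent.
    """
    unchanged = 1.0
    for impact, e, c, r in others:
        unchanged *= 1 - impact * change_prob(e, c, r)
    return 1 - unchanged


def expected_volatility_identical(n_agents, impact, e, c, r):
    """Same quantity when all agents share impact, error, c and r:
    the product collapses to a power of n_agents - 1."""
    return 1 - (1 - impact * change_prob(e, c, r)) ** (n_agents - 1)
\end{verbatim}
With a single other agent the product has one factor, so the expected volatility is simply the impact times that agent's expected change probability.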
911: 
912: 
913: \section{An Example with Two Agents}
914: \label{sec:Example-with-2}
915: 
916: In a MAS with just two agents $i$ and $j$, we can use Eq.~\eqref{eq:6} to
917: rewrite Eq.~\eqref{eq:main:simp} as
918: \begin{equation}
919:   \label{eq:7}
920:   \begin{split}
921:     E&[\editt] = 1 - r_i + \Iji(c_j \edjt + (1- r_j)\cdot(1 - \edjt))
922:     \left(
923:       \frac{|A_i|r_i - 1}{|A_i| -1}
924:     \right) \\
925:     &+ \edit
926:     \{ r_i -l_i + \Iji(c_j \edjt + (1- r_j)\cdot(1 - \edjt))
927:     \\
928:     &
929:       \qquad \qquad \cdot
930:     \left(
931:       \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| -1}
932:     \right)
933:   \}.
934: \end{split}
935: \end{equation}
936: 
937: 
938: \begin{figure}
939:   \psfrag{Iij}{{\tiny $\Iij$}}
940:   \psfrag{Iji}{{\tiny $\Iji$}}
941:   \psfrag{Error}[B]{{\tiny Error}}
942:   \psfrag{Final Error for i}[B]{{\small Final Error for $i$}}
943:   \begin{center}
944:     \includegraphics[width=3in]{ii-error}
945:     \caption{Plot of Final Error for agent $i$, given $l_i = l_j = .2$,
946:       $r_i = r_j = 1$, $c_i = c_j =1$, $|A_j| = |A_i| = 20$.}
947:     \label{fig:ii-error}
948:   \end{center}
949: \end{figure}
950: 
951: We can now use Eq.~\eqref{eq:7} to plot values for one particular example.
952: Let us say that $l_i = l_j =.2$, $c_i = c_j = 1$, $r_i = r_j = 1$,
953: $|A_j| = |A_i| = 20$ and we let the impacts $\Iij$ and $\Iji$ vary
954: between zero and one. Figure~\ref{fig:ii-error} shows the final error,
955: after convergence, for this situation. It shows an area where the
956: error is expected to be below $.1$, corresponding to low values for
957: either \Iij{}, \Iji{} or both. This area represents MASs that are
958: loosely coupled, i.e., one agent's change in behavior does not
959: significantly affect the other's target function. In these systems we
960: can expect that the error will eventually\footnote{Notice that we are
961:   not representing how long it takes for the error to converge. This
962:   can easily be done and is just one more of the parameters our theory
963:   allows us to explore.}  reach a value close to zero. We see that as
964: the impact increases the final error also increases, with a fairly
965: abrupt transition between a final error of 0 and bigger final errors.
966: This abrupt transition is characteristic of these types of systems
967: where there are tendencies for the system to either converge or
968: diverge, and both of them are self-enforcing behaviors. Notice also
969: that the graph is not symmetric---\Iij{} has more weight in
970: determining $i$'s final error than \Iji. This result seems
971: counterintuitive, until we realize that it is $j$'s error that makes
972: it hard for $i$ to converge to a small error. If \Iij{} is high and
973: $i$ has a large error, then $j$'s error will increase, which will
974: make $j$ change its decision function often and make it hard for $i$
975: to reduce its error. If \Iij{} is low then, even if \Iji{} is high,
976: $j$ will probably settle down to a low error and as it does $i$ will
977: also be able to settle down to a low error.
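
A surface such as the one in Figure~\ref{fig:ii-error} can be approximated by iterating Eq.~\eqref{eq:7} for both agents at once. The Python sketch below makes several assumptions of our own: both agents start with an error of $.5$, both errors are updated simultaneously, and a fixed number of iterations stands in for convergence. The function names are illustrative.
\begin{verbatim}
def step(e_self, e_other, impact, c, l, r, c_other, r_other, a):
    """Expected next error of one agent in a two-agent system."""
    # Volatility induced by the other agent's expected changes.
    v = impact * (c_other * e_other + (1 - r_other) * (1 - e_other))
    intercept = 1 - r + v * (a * r - 1) / (a - 1)
    slope = r - l + v * (a * (l - r) + l - c) / (a - 1)
    return intercept + slope * e_self


def final_errors(i_ij, i_ji, e0=0.5, l=0.2, c=1.0, r=1.0, a=20,
                 steps=2000):
    """Iterate both agents together; i_ji is j's impact on i."""
    e_i = e_j = e0
    for _ in range(steps):
        e_i, e_j = (step(e_i, e_j, i_ji, c, l, r, c, r, a),
                    step(e_j, e_i, i_ij, c, l, r, c, r, a))
    return e_i, e_j


low = final_errors(0.1, 0.1)    # loosely coupled pair
high = final_errors(1.0, 1.0)   # tightly coupled pair
\end{verbatim}
Under these assumptions the loosely coupled pair decays toward zero error, while the tightly coupled pair settles at a high error, in line with the two regimes described above.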
978: 
979: If we were about to design a MAS we would try to build it so that it
980: lies in the area where the final error is zero. This way we can expect
981: all agents to eventually have the correct behavior. We note that a
982: substantial percentage of the research in DAI and MAS deals with
983: taking systems that are not inherently in this area of near-zero error
984: and designing protocols and rules of encounter so as to move them into
985: this area, as in \cite{rules:of:encounter}.
986: 
987: The fact that the final error is 1 for the case with $\Iij = \Iji = 1$
988: can seem non-intuitive to readers familiar with game theory. In game
989: theory there are many games, such as the ``matching game'' from
990: Section~\ref{sec:Matching-game}, where two agents have an impact of 1
991: on each other. However, it is known \cite{binmore2} that, in these
992: games, two learning agents will eventually converge to one of the
993: equilibria (if there are any), making their final error equal to 0.
994: This is certainly true, and it is exactly what we showed in
995: Section~\ref{sec:Matching-game}. The same result is not seen in
996: Figure~\ref{fig:ii-error} because the figure was plotted using our
997: simplified Eq.~\eqref{eq:main:simp}, which makes the simplifying
998: independence assumption given by Eq.~\eqref{eq:cond:indep}. This
999: assumption cannot be made in games such as the matching game because,
1000: in these games, there is a correlation between the correctness of each
1001: of the agents' actions. Specifically, in the matching game it is always
1002: true that both agents are either correct, or incorrect, but it is
1003: never true that one of them is correct while the other one is
1004: incorrect, i.e., either they matched, or they did not match.
1005: 
1006: 
1007: \begin{figure}
1008:   \psfrag{edit}{{\small \edit}}
1009:   \psfrag{edjt}{{\small \edjt}}
1010:   \begin{center}
1011:     \includegraphics[width=3in]{vector}
1012:     \caption{Vector plot for \edit{} and \edjt{}, where $|A_i| = |A_j|
1013:       = 20$, $l_i=l_j=.2$, $r_i = r_j=1$, $c_i=.5$, $c_j=1$,
1014:       $\Iij=.1$, $\Iji=.3$. It shows the error progression for a pair of
1015:       agents $i$ and $j$. For each pair of errors $(\edit{},\edjt{})$, 
1016:       the arrows indicate the expected $(\editt{},\edjtt{})$.
1017:       }
1018:     \label{fig:vector}
1019:   \end{center}
1020: \end{figure}
1021: 
1022: Another view of the system is given by Figure~\ref{fig:vector} which
1023: shows a vector plot of the agents' errors. We can see how the bigger
1024: errors are quickly reduced but the pace of learning decreases as the
1025: errors get closer to the convergence point. Notice also that an
1026: agent's error need not change in a monotonic fashion. That is, an
1027: agent's error can get bigger for a while before it starts to get
1028: smaller.
1029: 
1030: 
1031: \section{A Simple Application}
1032: \label{sec:Simple-Application}
1033: 
1034: In order to demonstrate how our theory can be used, we tested it on a
1035: simple market-based MAS. The game consists of three agents, one buyer
1036: and two seller agents $i$ and $j$. The buyer will always buy at the
1037: cheapest price---but the sellers do not know this fact. In each time
1038: step the sellers post a price and the buyer decides which of the
1039: sellers to buy from, namely, the one with the lowest bid.  The sellers
1040: can bid any one of 20 prices in an effort to maximize their profits.
1041: The sellers use a reinforcement learning algorithm with their
1042: reinforcements being the profit the agent achieved in each round, or 0
1043: if it did not sell the good at the time. In this system we had one
1044: good being sold ($|W| = 1$). As predicted by economic theory, the price
1045: in this system settles to the sellers' marginal cost, but it takes
1046: time to get there due to the learning inefficiencies.
1047: 
1048: We experimented with different $\alpha_j$ rates\footnote{$\alpha$ is
1049:   the relative weight the algorithm gives to the most recent payoff.
1050:   $\alpha =1$ means that it will forget all previous experience and
1051:   use only the latest payoff to determine what action to take.} for
1052: the reinforcement learning of agent $j$, while keeping $\alpha_i = .1$
1053: fixed, and plotted the running average of the error of agent $i$.
1054: \begin{figure}
1055:   \psfrag{l=.1=}[B]{{\tiny $\alpha_j = .1$}}
1056:   \psfrag{l=.3=}[B]{{\tiny $\alpha_j = .3$}}
1057:   \psfrag{l=.9=}[]{{\tiny $\alpha_j = .9$}}
1058:   \psfrag{l=.005=}[B]{{\tiny $l_j = .005$}}
1059:   \psfrag{l=.04=}[B]{{\tiny $l_j = .04$}}
1060:   \psfrag{l=.055=}[]{{\tiny $l_j = .055$}}
1061:   \psfrag{time}{{\tiny time}}
1062:   \psfrag{error}{{\tiny error}}
1063:   \begin{center}
1064:     \mbox{\subfigure[Experiment]{\label{fig:lre-exp}\charts{simie}} \quad
1065:       \subfigure[Theory]{\label{fig:lre-theory}\charts{lre}}}
1066:     \caption{Comparison of observed and predicted error.}
1067:     \label{fig:application}
1068:   \end{center}
1069: \end{figure}
1070: A comparison is shown in Figure~\ref{fig:application}.
1071: Figure~\ref{fig:lre-exp} gives the experimental results for three
1072: different values of $\alpha_j$. It shows $i$'s average error, over 100
1073: runs, as a function of time.  Since both sellers start with no
1074: knowledge, their initial actions are completely random which makes
1075: their error equal to $.5$. Then, depending on $\alpha_j$, $i$'s error
1076: will either start to go down from there or will first go up some and
1077: then down. Eventually, $i$'s error gets very close to 0, as the system 
1078: reaches a market equilibrium.
1079: 
1080: We can predict this behavior using Eq.~\eqref{eq:7}. Based on the game
1081: description, we set $|A_i| = |A_j| = 20$, since there were $20$
1082: possible actions. We let $r_i = r_j = 1$ because reinforcement
1083: learning with fixed payoffs enforces the condition that once an agent
1084: is taking the correct action it will never change its decision
1085: function to take a different action.  The agent might, however, still
1086: take a wrong action but only when its exploration rate dictates it.
1087: 
1088: We then let $\Iij = \Iji = .17$ based on the rough calculation that
1089: each agent has an equal probability of bidding any one of the 20
1090: prices.  If $\Dit = 20$ then \Iji{} for this situation is the
1091: probability that $j$ was also bidding 20 or above, i.e., $1/20$, times
1092: the probability that $j$'s new price is lower than 20, i.e.  $19/20$.
1093: Similarly, if $\Dit = 19$ then \Iji{} is equal to $2/20$ times
1094: $18/20$.  The average of all of these probabilities is $.17$. A more
1095: precise calculation of the impact would require us to find it via
1096: experimentation by actually running the system.
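
The rough impact calculation above is easy to reproduce. The Python sketch below assumes, as in the text, that prices are the integers 1 through 20 and that every price is equally likely both for the target and for $j$'s bids; the variable names are ours.
\begin{verbatim}
n_prices = 20
terms = []
for target in range(1, n_prices + 1):
    # Probability that j was bidding at or above the target price.
    p_at_or_above = (n_prices - target + 1) / n_prices
    # Probability that j's new price is strictly below the target.
    p_new_below = (target - 1) / n_prices
    terms.append(p_at_or_above * p_new_below)

# Average over all equally likely target prices.
impact = sum(terms) / n_prices
print(round(impact, 2))  # 0.17
\end{verbatim}
The last two entries of \texttt{terms} correspond to the two cases worked out in the text, $1/20 \cdot 19/20$ and $2/20 \cdot 18/20$.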
1097: 
1098: Finally, we chose $l_i = l_j = c_i = c_j = .005$ for the first curve
1099: (i.e., the one that compares with $\alpha_j = .1$). We knew that for
1100: such a low $\alpha_j$ the learning and change rate should be the same.
1101: The actual value was chosen via experimentation. The resulting curve
1102: is shown in Figure~\ref{fig:lre-theory}. At this moment, we do not
1103: possess a formal way of deriving learning and change rates from
1104: $\alpha$-rates.
1105: 
1106: For the second curve ($\alpha_j = .3$) we knew that, since only
1107: $\alpha_j$ had changed from the first experiment, we should only
1108: change $l_j$ and $c_j$. In fact, these two values should only be
1109: increased. We found their exact values, again by experimentation, to
1110: be $l_j = .04$, $c_j =.4$. For the third curve we found the values to
1111: be $l_j = .055$, $c_j = .8$.
1112: 
1113: One difference we notice between the experimental and the theoretical
1114: results is that the experimental results show a longer delay before
1115: the error starts to decrease. We attribute this delay to the agent's
1116: initially high exploration rate. That is, the agents initially start
1117: by taking all random actions but progressively reduce this rate of
1118: exploration. As the exploration rate decreases the discrepancy between
1119: our theoretical predictions and experimental results is reduced.
1120: 
1121: In summary, while it is true that we found $l_j$ and $c_j$ by
1122: experimentation, all the other values were calculated from the
1123: description of the problem. Even the relative values of $l_j$ and
1124: $c_j$ follow the intuitive relation with $\alpha_j$ that, as
1125: $\alpha_j$ increases so does $l_j$ and (even more) $c_j$.
1126: Section~\ref{sec:Bound-Learn-Rate} shows how to calculate lower bounds
1127: on the learning rate. We believe that this experiment provides solid
1128: evidence that our theory can be used to approximately determine the
1129: quantitative behaviors of MASs with learning agents.
1130: 
1131: \section{Application of our Theory to Experiments in the Literature}
1132: \label{sec:Appl-our-Theory}
1133: 
1134: In this section we show how we can apply our theory to experimental
1135: results found in the AI and MAS literature. While we will often not be
1136: able to reproduce the authors' results exactly, we believe
1137: that being able to reproduce the flavor and the main quantitative
1138: characteristics of experimental results in the literature shows that
1139: our theory can be widely applied and used by practitioners in this
1140: area of research.
1141: 
1142: \subsection{Claus and Boutilier}
1143: \begin{figure}
1144:   \psfrag{time}{{\tiny time}}
1145:   \psfrag{1 - error}{{\tiny 1 - error}}
1146:   \psfrag{l=.1}[B]{{\tiny $l_i=.1$}}
1147:   \begin{center}
1148:     \mbox{\subfigure[Experiment]{\label{fig:claus-exp}\charts{claus-figure}} \quad
1149:       \subfigure[Theory]{\label{fig:claus-theory}\charts{claus-theory}}}
1150:     \caption{Comparing theory (b) with results from \cite{claus:97} (a).}
1151:     \label{fig:claus}
1152:   \end{center}
1153: \end{figure}
1154: 
1155: Claus and Boutilier \cite{claus:97} study the dynamics of a system that
1156: contains two reinforcement learning agents. Their first experiment
1157: puts the two agents in a matching game exactly like the one we
1158: describe in Section~\ref{sec:Matching-game} with $|A_i| = |A_j| = 2$.
1159: Their results show the probability that both agents matched (i.e., 1 -
1160: \edit) as time progressed. Since they were using two reinforcement
1161: learning agents, it was not surprising that the curve they saw, seen
1162: in Figure~\ref{fig:claus-exp}, was nearly identical to the curve we
1163: saw in our experiments with the two selling agents
1164: (Figure~\ref{fig:lre-exp} with $\alpha_j = \alpha_i = .1$, except
1165: upside-down).
1166: 
1167: We can reproduce their curve using our equation for the matching
1168: game, Eq.~\eqref{eq:main:matching}. The results can be seen in
1169: Figure~\ref{fig:claus-theory}. Our theory again fails to account for
1170: the initial exploration rate. We can, however, confirm that by time 15
1171: their Boltzmann temperature (the authors used Boltzmann exploration)
1172: had been reduced from an initial value of 16 to $3.29$ and would keep
1173: decreasing by a factor of .9 each time step. This means that by time
1174: 15 the agents were, indeed, starting to do more exploitation (i.e.,
1175: reduce their error) while doing little exploration.
1176: \label{sec:Claus-Boutilier}
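
The quoted temperature is consistent with a geometric decay applied once per time step, which is our reading of their setup rather than a detail stated above. Writing $T_t$ for the temperature at time $t$ (our notation), fifteen decay steps from the initial value give
\begin{equation*}
  T_{15} = 16 \cdot 0.9^{15} \approx 16 \cdot 0.2059 \approx 3.29 .
\end{equation*}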
1177: 
1178: \subsection{Shoham and Tennenholtz}
1179: \label{sec:Shoham-Tennenholtz}
1180: 
1181: Shoham and Tennenholtz \cite{shoham:97} investigate how learning
1182: agents might arrive at social conventions. The authors introduce a
1183: simple learning algorithm (strategy-selection rule) called
1184: \emph{highest cumulative reward} (HCR) which their agents use for
1185: learning these conventions. Shoham and Tennenholtz also provide the
1186: results of a series of experiments using populations of learning
1187: agents. We try to reproduce the results they present in their
1188: Section~4.1 where they study the ``coordination game'' which is
1189: similar to our matching game, but with only two actions.
1190: 
1191: The experiment in question involves 100 agents, all of them identical
1192: and all of them using HCR. At each time instant the agents take one of
1193: two available actions. The aim is for every pair of chosen agents to
1194: take the same action as each other. Agents are randomly made to form
1195: pairs. The agents update their behavior (i.e., apply HCR) after a
1196: given delay. The authors try a series of delays (from 0 to 200) and
1197: show that increasing the update delay decreases the percentage of
1198: trials where, after 1600 iterations, at least 95\% of the agents
1199: reached a convention. The authors show surprise at finding this
1200: phenomenon. Their results are reproduced in
1201: Figure~\ref{fig:shoham-exp} (cf. Figure~1 in their article).
1202: \label{sec:Appl-to-Results}
1203: \begin{figure}
1204:   \psfrag{l}{{\small $l_i$}}
1205:   \psfrag{final error}{{\small final error}}
1206:   \psfrag{theory}[B]{{\tiny theory}}
1207:   \psfrag{experiment}[B]{{\tiny experiment}}
1208:   \begin{center}
1209:     \mbox{\subfigure[Original Experiment]{\label{fig:shoham-exp}\charts{shoham-figure}} \quad
1210:       \subfigure[Theory and Experiment]{\label{fig:shoham-theory}\charts{shoham}}}
1211:     \caption{Comparing theory (b) with results from \cite{shoham:97} (a).}
1212:     \label{fig:shoham}
1213:   \end{center}
1214: \end{figure}
1215: The number of actions for all agents is easily set to $|A_i| =2$,
1216: which implies that we must have $l_i = c_i$.  By examining HCR, it is
1217: easy to determine that $r_i =1$ (i.e., if an agent took the right
1218: action, it will only get more support for it).  At first intuition,
1219: one's impulse is to set $\Iij = 1$ for every pair of agents $i$ and
1220: $j$.  However, since there are 100 of them and only pairs of them
1221: interact at every time instant, the real impact is $\Iij = 1/99$.
1222: 
1223: We will now convert from their units of measurement into ours.  In
1224: Figure~\ref{fig:shoham-exp} we can see that their x-axis is called the
1225: \emph{update delay}, which we will refer to as $d$. This value is the
1226: number of time units that pass before the agent is allowed to learn.
1227: For $d=0$ the agent learns after every interaction (i.e., on every time
1228: $t$), while for $d=200$ the agent takes the same action for 200 time
1229: instances and only learns after every 200 iterations. This means that
1230: we must set $l_i = \frac{1}{p(d +1)}$ where $p>0$. The value of $p$
1231: depends on their learning algorithm's performance, but we know that it
1232: must be a small number ($< 50$) greater than 0. Through some
1233: experimentation we settled on $p=6$ (other values close to this one
1234: give similar results). Since in their graph they look at $0 \leq d
1235: \leq 200$, we must then look at $l_i$ where $\frac{1}{1206} \leq l_i
1236: \leq \frac{1}{6}$. Finally, we find the value of $d$ in terms of $l_i$
1237: to be
1238: \begin{equation}
1239:   \label{eq:9}
1240:   d = \frac{1}{pl_i} - 1
1241: \end{equation}
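As a quick check of this conversion, using the value $p = 6$ chosen
above, the two endpoints of their range of update delays map to
\begin{align*}
  d = 0 &: \quad l_i = \frac{1}{6 (0 + 1)} = \frac{1}{6}, \\
  d = 200 &: \quad l_i = \frac{1}{6 (200 + 1)} = \frac{1}{1206},
\end{align*}
and substituting these learning rates back into Eq.~\eqref{eq:9}
recovers $d = 1 - 1 = 0$ and $d = 201 - 1 = 200$, respectively.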
1242: 
1243: 
1244: The y-axis of Figure~\ref{fig:shoham-exp} is the \emph{success}, i.e.,
1245: the number of trials, out of 4000, where at least 95\% of the agents
1246: reached a convention. We will refer to this value as $s$.  We know
1247: that in $s/4000$ of the trials \emph{at least} 95\% of the
1248: agents have error close to 0 (i.e., reaching a convention means that
1249: the agents take the right action almost all the time), and for the
1250: rest of the trials the error was greater.  We can approximately map
1251: this to an error by saying that in $s/4000$ of the trials the error
1252: was 0 (a slight underestimate), while in $1 - s/4000$ of the trials
1253: the error was 1 (a slight overestimate).  We add these two up (the 0
1254: makes the first term disappear) and arrive at an equation that maps
1255: $s$ to \edit:
1256: \begin{equation}
1257:   \label{eq:26}
1258:   \edit \approx
1259:   \left(
1260:     \frac{4000 - s}{4000}
1261:   \right)
1262: \end{equation}
1263: 
1264: 
1265: The mapping from $d$ to $s$ is given by their actual data. Their data
1266: can be fit by the following function:
1267: \begin{equation}
1268:   \label{eq:28}
1269:   s = 3900 - 4d - \frac{(d -100)^2}{100}
1270: \end{equation}
1271: 
1272: 
1273: Plugging Eq.~\eqref{eq:9} into Eq.~\eqref{eq:28}, and the result
1274: into Eq.~\eqref{eq:26}, we finally arrive at a function that maps their
1275: experimental results into our units:
1276: \begin{equation}
1277:   \label{eq:29}
1278:   \mbox{Final error} = \frac{4000 -
1279:     \left(
1280:       3900 - 4(1/pl_i - 1) -
1281:       \left(
1282:         \frac{(1/pl_i - 1 -100)^2}{100}
1283:       \right)
1284:     \right)}{4000}
1285: \end{equation}
1286: for the range $\frac{1}{1206} \leq l_i \leq \frac{1}{6}$.
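
As an illustration of how Eq.~\eqref{eq:29} behaves, and again assuming
$p = 6$, we can evaluate the fitted success count of
Eq.~\eqref{eq:28} and the resulting error of Eq.~\eqref{eq:26} at the
two ends and at the middle of the range. These are values of the
fitted curve, not additional experimental data:
\begin{align*}
  l_i = \tfrac{1}{6} \; (d = 0) &: \quad
    s = 3900 - 0 - \tfrac{(-100)^2}{100} = 3800,
    & \mbox{Final error} &= \tfrac{200}{4000} = 0.05, \\
  l_i = \tfrac{1}{606} \; (d = 100) &: \quad
    s = 3900 - 400 - 0 = 3500,
    & \mbox{Final error} &= \tfrac{500}{4000} = 0.125, \\
  l_i = \tfrac{1}{1206} \; (d = 200) &: \quad
    s = 3900 - 800 - \tfrac{100^2}{100} = 3000,
    & \mbox{Final error} &= \tfrac{1000}{4000} = 0.25.
\end{align*}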
1287: 
1288: Now that we have values for $c_i$, $l_i$, $r_i$, \Iij, $|A_i|$, a
1289: range for $l_i$ and an equation that maps their experimental results
1290: into our units, we can plot both functions, as seen in
1291: Figure~\ref{fig:shoham-theory}. The x-axis was plotted on a log-scale
1292: in order to better show the shape of the experiment curve, otherwise
1293: it would appear mostly as a straight line. For our theory curve we
1294: used Equations~\eqref{eq:main:simp} and \eqref{eq:6}, and iterated for 1600 time
1295: units, just like in the experiment, and plotted the error at that
1296: point. For the experiment curve we used Eq.~\eqref{eq:29}. We plotted both
1297: of these curves in the specified range for $l_i$.  The reader will
1298: notice that our theory was able to make precise quantitative
1299: predictions. The maximum distance from our theory curve to the
1300: experimental curve is $0.05$, which means that our predictions for the
1301: final error were, at worst, within 5\% of the experimental values.
1302: Also, an error of about 5\% was introduced when mapping from their
1303: success count $s$ to our error.
1304: 
1305: 
1306: \subsection{Others}
1307: \label{sec:Others}
1308: 
1309: There are several other examples in the literature where we believe
1310: our theory can be successfully applied. \cite[chapter 3.7]{ishida:97}
1311: gives results of an experiment where two agents try to find each other
1312: in a 100 by 100 grid. He shows that if the grid has few obstacles it
1313: is faster if both agents move towards each other, while if there are
1314: many obstacles it is faster if one of the agents stays still while the
1315: other one searches for it. We believe that the number of obstacles is
1316: proportional to the change rate that the agents experience and,
1317: perhaps, to the impact that they have on each other.  When there are
1318: no obstacles the agents never change their decision functions (because
1319: their initial Manhattan heuristics lead them along the correct path). As
1320: the number of obstacles increases, the agents will start to change
1321: their decision functions as they move, which will have an impact on
1322: the other agent's target function.  If, however, one of them stays
1323: put, this means that his change rate is 0 so the other agent's target
1324: function will stay still and he will be able to reach his target
1325: (i.e., error 0) quicker.
1326: 
1327: Notice that the problem of a moving target that Ishida studies is
1328: different from the problem of a moving target function which we study.
1329: It is, however, interesting to note their similarities and how our
1330: theory can be applied to some aspects of that domain.
1331: 
1332: Another possible example is given by \cite{sen:94b}.  They show two
1333: Q-learning agents trying to cooperate in order to move a block.  The
1334: authors show how different $\alpha$ rates ($\beta$ in their article)
1335: affect the quality of the result that the agents converge to. This
1336: quality roughly corresponds to our error, except for the fact that
1337: their measurements implicitly consider some actions to be better than
1338: others, while we consider an action to be either correct or incorrect.
1339: This discrepancy would make it harder to apply our theory to their
1340: results but we still believe that a rough approximation is possible.
1341: Our future work includes the extension of the CLRI framework to handle
1342: a more general definition of error---one that attaches a utility to
1343: each state-action pair, rather than the simple correct/incorrect
1344: categorization we use.
1345: 
1346: 
1347: \section{Bounding the Learning Rate with Sample Complexity}
1348: \label{sec:Bound-Learn-Rate}
1349: 
1350: In the previous examples we have used our knowledge of the learning
1351: algorithms to determine the values of the agent's $c_i$, $l_i$, and
1352: $r_i$ parameters. However, there might be cases where this is not
1353: possible---the learning algorithm might be too complicated or unknown.
1354: It would be useful, in these cases, to have some other measure of the
1355: agent's learning abilities, which could be used to determine some
1356: bounds on the values of these parameters.
1357: 
1358: One popular measure of the complexity of learning is given by Probably
1359: Approximately Correct (PAC) theory \cite{intro:clt}, in the form of a
1360: measure called the \emph{sample complexity}. The sample complexity
1361: gives us a loose upper bound on the number of examples that a
1362: consistent learning agent must observe before arriving at a PAC
1363: hypothesis.
1364: 
1365: There are two important assumptions made by PAC-theory. The first
1366: assumption is that the agents are consistent learners\footnote{See
1367:   \cite[p.~162]{machine:learning} for a formal definition of a
1368:   consistent learner.}. Using our notation, a consistent learner is one
1369: that, once it has learned a correct $w \ra a$ mapping, does not forget
1370: it. This simply means that the agent must have $r_i = 1$. The second
1371: assumption is that the agent is trying to learn a fixed concept. This
1372: assumption makes $\Ditt = \Dit$ true for all $t$.
1373: 
1374: The sample complexity $m$ of an agent's learning problem is given by
1375: \begin{equation}
1376:   \label{eq:11}
1377:   m \geq \frac{1}{\epsilon}
1378:   \left(
1379:     \ln \frac{|H|}{\gamma}
1380:   \right),
1381: \end{equation}
1382: where $|H|$ is the size of the hypothesis space for the agent. In
1383: other words, $|H|$ is the total number of different \diw{} functions
1384: that the agent will consider. For an agent with no previous knowledge
1385: we have $|H| = |A_i|^{|W|}$. However, agents with previous knowledge
1386: might have smaller $|H|$, since this knowledge might be used to
1387: eliminate impossible mappings. If a consistent learning agent has seen 
1388: $m$ examples then, with probability at least $(1 - \gamma)$, it has
1389: error at most $\epsilon$.
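
As an illustrative calculation (the numbers here are hypothetical and
are not taken from any of the experiments above), consider an agent
with no previous knowledge, $|A_i| = 2$ actions, and $|W| = 10$ world
states, so that $|H| = 2^{10} = 1024$. Choosing $\epsilon = 0.1$ and
$\gamma = 0.05$, Eq.~\eqref{eq:11} gives
\begin{equation*}
  m \geq \frac{1}{0.1}
  \left(
    \ln \frac{1024}{0.05}
  \right)
  = 10 \ln 20480 \approx 99.3,
\end{equation*}
so observing roughly 100 examples would suffice under these assumed
values.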
1390: 
1391: While we cannot map the sample complexity $m$ to a particular learning
1392: rate $l_i$, we can use it to put a lower bound on the learning rate
1393: for a consistent learning agent. That is, we can find a lower bound
1394: for the learning rate of an agent who does not forget anything it has
1395: seen, and who is trying to learn a fixed target function.  Since the
1396: agent does not forget anything it has seen, we can deduce that its
1397: retention rate must be $r_i = 1$. Since the target function is not
1398: changing, we know that $\Pr[\Dittw \neq \Ditw] = 0$ and $\Pr[\Dittw =
1399: \Ditw] = 1$. We can plug these values into Eq.~\eqref{eq:main} and
1400: simplify in order to get:
1401: \begin{equation}
1402:   \label{eq:13}
1403:   E[\editt] = \edit \cdot (1 - l_i).
1404: \end{equation}
1405: We can solve the difference Eq.~\eqref{eq:13}, for any time $n$, in
1406: order to get:
1407: \begin{equation}
1408:   \label{eq:17}
1409:   E[e(\delta_i^n)] = e(\delta_i^0) \cdot (1 - l_i)^n.
1410: \end{equation}
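This closed form can be checked by unrolling the recursion. Assuming
that the expectation can be carried through each step, which the
linearity of Eq.~\eqref{eq:13} in the error permits, we have
\begin{align*}
  E[e(\delta_i^1)] &= e(\delta_i^0) \cdot (1 - l_i), \\
  E[e(\delta_i^2)] &= E[e(\delta_i^1)] \cdot (1 - l_i)
                    = e(\delta_i^0) \cdot (1 - l_i)^2, \\
  &\;\;\vdots \\
  E[e(\delta_i^n)] &= E[e(\delta_i^{n-1})] \cdot (1 - l_i)
                    = e(\delta_i^0) \cdot (1 - l_i)^n.
\end{align*}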
1411: We now remember that after $m$ time steps we expect, with probability
1412: at least $(1 - \gamma)$, the error to be at most $\epsilon$.  Since
1413: Eq.~\eqref{eq:17} only gives us an expected error, not a probability
1414: distribution over errors, we cannot use it to calculate the likelihood
1415: of the agent having that expected error. That is, we cannot calculate
1416: the ``probably'' ($\gamma$) part of probably approximately correct. We
1417: will, therefore, assume that the $\gamma$ chosen for $m$ is small
1418: enough so that it will be safe to say that, after $m$ time steps, the
1419: error is less than $\epsilon$.  In a typical application one uses a
1420: small $\gamma$ because it guarantees a high degree of certainty on the
1421: upper bound of the error.
1422: 
1423: Since we can now safely say that, after $m$ time steps, the error is
1424: at most $\epsilon$, we can then deduce that the $l_i$ for this agent
1425: should be large enough such that, if $n = m$, then $E[e(\delta_i^n)]
1426: \leq \epsilon$.  This is expressed mathematically as:
1427: \begin{equation}
1428:   \label{eq:18}
1429:   e(\delta_i^0) \cdot (1 - l_i)^m \leq \epsilon.
1430: \end{equation}
1431: 
1432: We solve this inequality for $l_i$ in order to get:
1433: \begin{equation}
1434:   \label{eq:19}
1435:   l_i \geq 1 -
1436:   \left(
1437:     \frac{\epsilon}{e(\delta_i^0)}
1438:   \right)^{1/m}.
1439: \end{equation}
1440: This bound is not defined for $e(\delta_i^0) = 0$. However, given
1441: our assumption of a fixed target function and $r_i = 1$, we already
1442: know, from Eq.~\eqref{eq:13}, that if an agent starts with an error of
1443: $0$ it will maintain this error of $0$ for any future time $t > 0$.
1444: Therefore, in this case, the choice of a learning rate has no bearing
1445: on the agent's error, which will always be $0$.
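
For completeness, the algebra leading from Eq.~\eqref{eq:18} to
Eq.~\eqref{eq:19} can be sketched as follows, assuming
$e(\delta_i^0) > 0$ and $0 \leq l_i \leq 1$:
\begin{align*}
  (1 - l_i)^m &\leq \frac{\epsilon}{e(\delta_i^0)}, \\
  1 - l_i &\leq
  \left(
    \frac{\epsilon}{e(\delta_i^0)}
  \right)^{1/m}, \\
  l_i &\geq 1 -
  \left(
    \frac{\epsilon}{e(\delta_i^0)}
  \right)^{1/m}.
\end{align*}
As a hypothetical numerical example (these values are chosen only for
illustration), an agent with initial error $e(\delta_i^0) = 0.5$,
target error $\epsilon = 0.1$, and sample complexity $m = 100$ would
need
\begin{equation*}
  l_i \geq 1 - (0.2)^{1/100} \approx 1 - 0.984 = 0.016.
\end{equation*}
Indeed, with $l_i = 0.016$ we get
$0.5 \cdot (0.984)^{100} \approx 0.0997 \leq 0.1$.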
1446: 
1447: Equation~\eqref{eq:19} gives us a lower bound on the learning rate
1448: that a consistent learner must have, given that it has sample
1449: complexity $m$, and based on an error $\epsilon$ and a sufficiently
1450: small $\gamma$. A designer of an agent that uses a reasonable learning
1451: algorithm can expect that, if his agent has sample complexity $m$ (for
1452: $\epsilon$ error), then his agent will have a learning rate of at
1453: least $l_i$, as given by Eq.~\eqref{eq:19}. Furthermore, if a designer
1454: is comparing two possible agent designs, each with a different sample
1455: complexity but both with similarly powerful learning algorithms, he
1456: can calculate bounds on the learning rates of both agents and
1457: compare their relative performance.
1458: 
1459: 
1460: \section{Related Work}
1461: \label{sec:Related-Work}
1462: 
1463: The topic of agents learning about agents arises often in the studies
1464: of complexity \cite{complexity}. In fact, systems where the agents try
1465: to adapt to endogenously created dynamics are being widely studied
1466: \cite{hubler94, arthur97b}. In these systems, like in ours, the agents
1467: co-create their expectations as they learn and change their behaviors.
1468: Complexity research uses simulated agents in an effort to understand
1469: the complex behaviors of these systems as observed in the real world.
1470: 
1471: One example is the work of Arthur \emph{et al.} \cite{arthur97b}, who
1472: arrive at the conclusion that systems of adaptive agents, where the
1473: agents are allowed to change the complexity of their learning
1474: algorithms, end up in one of two regimes: a stable/simple regime where
1475: it is trivial to predict an agent's future behavior, and a complex
1476: regime where the agents' behaviors are very complex. It is this second
1477: regime that interests complexity researchers the most. In it, the
1478: agents are able to reach some kind of ``equilibrium'' point in model
1479: building complexity.  These same results are echoed by Darley and
1480: Kauffman \cite{darley97a} in a similar experiment. In this article we
1481: have not allowed the agents to dynamically change the complexity of
1482: their learning algorithms.  Therefore, our dynamics are simpler.
1483: Allowing the agents to change their complexity amounts to allowing
1484: them to change the values of their $c$, $l$, and $r$ parameters while
1485: learning.
1486: 
1487: However, while complexity research is very important and inspiring, it
1488: is only partially relevant to our work. Our emphasis is on finding
1489: ways to predict the behavior of MASs composed of machine-learning
1490: agents.  We are only concerned with the behavior of simpler artificial
1491: programmable agents, rather than the complex behavior of humans or the
1492: unpredictable behavior of animals.
1493: 
1494: The dynamics of MASs have also been studied by Kephart \emph{et al.}
1495: \cite{kephart:90}.  In this work the authors show how simple
1496: predictive agents can lead to globally cyclic or chaotic behaviors. As
1497: the authors explain, the chaotic behaviors were a result of the simple
1498: predictive strategies used by the agents. Unlike our agents, most of
1499: their agents are not engaged in learning; instead, they use simple
1500: fixed predictive strategies, such as ``if the state of the world was
1501: $x$ ten time units before, then it will be $x$ next time so take
1502: action $a$''. The authors later show how learning can be used to
1503: eliminate these chaotic global fluctuations.
1504: 
1505: Matari\'{c} \cite{mataric97a} has studied reinforcement learning in
1506: multi-robot domains. She notes, for example, how learning can give
1507: rise to social behaviors \cite{mataric97b}. The work shows how robots
1508: can be individually programmed to produce certain group behaviors. It
1509: represents a good example of the usefulness and flexibility of
1510: learning agents in multi-agent domains. However, the author does not
1511: offer a mathematical justification for the chosen individual learning
1512: algorithms, nor does she explain why the agents were able to converge
1513: to the global behaviors. Our research hopes to provide the first steps
1514: in this direction.
1515: 
1516: One particularly interesting approach is taken by Carmel and
1517: Markovitch \cite{carmel97}.  They work on model-based learning, that
1518: is, agents build models of other agents via observations. They use
1519: models based on finite state machines.  The authors show how some of
1520: these models can be effectively learned via observation of the other
1521: agent's actions. The authors concentrate on the development of
1522: learning algorithms that would let one agent learn a finite-state
1523: machine model of another agent. They have not considered the case
1524: where two or more agents are simultaneously learning about each other,
1525: which we study in this article. However, their work is more general in
1526: the sense that they model agents as state machines, rather than the
1527: state-action pairs we use.
1528: 
1529: Finally, a lot of experimental work has been done in the area of
1530: agents learning about agents \cite{acl:96, weib:97}.  For example, Sen
1531: and Sekaran \cite{sen:98a} show how learning agents in a simple MAS
1532: converge to system-wide optimal behavior. Their agents use Q-learning
1533: or modified classifier systems in order to learn. The authors
1534: implement these agents and compare the performance of the different
1535: learning algorithms for developing agent coordination. Hu and Wellman
1536: \cite{hu:98a, hu:96} have studied reinforcement learning in
1537: market-based MASs, showing how certain initial learning biases can be
1538: self-fulfilling, and how learning can be useful but is affected by an
1539: agent's models of other agents.  Claus and Boutilier \cite{claus:97}
1540: have also carried out experimental studies of the behavior of
1541: reinforcement learning agents.  We have been able to use the CLRI
1542: framework to predict some of their experimental results
1543: \cite{vidal:thesis}. Other researchers \cite{stone99a, littman94a,
1544:   hu:98b} have extended the basic Q-learning \cite{watkins:92}
1545: algorithm for use with MASs in an effort to either improve or prove
1546: convergence to the optimal behavior.
1547: 
1548: We have also successfully experimented with reinforcement learning
1549: simulations \cite{vidal:98b}, but we believe that the formal
1550: treatment elucidated in these pages will shed more light on the real
1551: nature of the problem and the relative importance of the various
1552: parameters that describe the capabilities of an agent's learning
1553: algorithm.
1554: 
1555: 
1556: \section{Limitations and Future Work}
1557: \label{sec:future-work}
1558: 
1559: The CLRI framework places some constraints on the type of systems it
1560: can model, which limits its usability. However, it is important to
1561: understand that, as we remove the limitations from the CLRI framework,
1562: the dynamics of the system become much harder to predict.  In the
1563: extreme, without any limitations on the agents' abilities, the system
1564: becomes a complex adaptive system, as studied by Holland
1565: \cite{holland95a} and others in the field of complexity.  The dynamic
1566: behavior of these systems continues to be studied by complexity
1567: researchers with only modest progress.  It is only by placing
1568: limitations on the system that we were able to predict the expected
1569: error of agents in the systems modeled by the CLRI framework.
1570: 
1571: Our ongoing work involves the relaxation of some of the constraints
1572: made by the CLRI framework so that it may become more easily and
1573: widely applicable, without making the system dynamics impossible to
1574: analyze. We are targeting three specific constraints.
1575: \begin{enumerate}
1576: \item The values of $c_i$, $l_i$, $r_i$, and $I_{ij}$ cannot, in all
1577:   situations, be mathematically determined from the system's
1578:   description. We have found that bounds for the $c_i$, $l_i$, and
1579:   $r_i$ values can often be determined when using reinforcement
1580:   learning or supervised learning. However, the bounds are often very
1581:   loose. The values of the $I_{ij}$ parameter depend on the particular
1582:   system. Sometimes it is trivial to calculate the impact, sometimes
1583:   it requires extensive simulation.
1584: \item The CLRI framework assumes that an agent's action is either
1585:   correct or incorrect. The framework does not allow degrees of
1586:   correctness.  Specifically, in many systems the agents can often
1587:   take several actions, any one of which is equally good. When
1588:   modeling these systems, the CLRI framework requires the user to
1589:   designate one of those actions as the correct one, thereby ignoring
1590:   some possibly useful information.
1591: \item The world states are taken from a uniform probability
1592:   distribution which does not change over time. The environment is
1593:   assumed to be episodic. As such, the framework is limited in the
1594:   type of domains it can effectively describe.
1595: \end{enumerate}
1596: 
1597: We are attacking these challenges with some of the same tools used by
1598: researchers in complex adaptive systems, namely, agent-based
1599: simulations and co-evolving utility landscapes. We believe we can gain
1600: some insight into the dynamics of adaptive MASs by constructing and
1601: analyzing various types of MASs. We also believe that the next step
1602: for the CLRI framework is the replacement of the current error
1603: definition with a utility function. The agents can then be seen as
1604: searching for the maximum value in the changing utility landscape
1605: defined by their utility function. The degree to which the agents are
1606: successful on their climb to the landscape peaks depends on the
1607: abilities of their learning algorithm (change rate, learning rate, and
1608: retention rate), and the speed at which the landscape changes as the
1609: other agents change their behavior (impact).
1610: 
1611: The use of utility landscapes will allow us to consider an agent's
1612: utility for any particular action, rather than simply considering
1613: whether an action is correct or incorrect. The landscapes will also
1614: allow us to consider systems where agents cannot travel between any
1615: two world states in one time step. That is, the agents' moves on the
1616: landscape will be constrained in the same manner as their actions or
1617: behaviors are constrained in the actual system. Finally, the new theory
1618: will likely need to redefine the CLRI parameters. We hope the new
1619: parameters will be easy to derive directly from the values that govern
1620: the machine-learning algorithms' behavior. These extensions will make
1621: the new theory applicable to a much wider set of domains.
1622: 
1623: 
1624: \section{Summary}
1625: \label{sec:Summary}
1626: 
1627: We have presented a framework for studying and predicting the behavior
1628: of MASs composed of learning agents. We believe that this framework
1629: captures the most important parameters that describe an agent's
1630: learning and the system's rules of encounter. Various comparisons
1631: between the framework's predictions and experimental results were
1632: given.  These comparisons showed that the theoretical predictions
1633: closely match our experimental results and the experimental results
1634: published by others.  Our success in reproducing these results allows
1635: us to confidently state the effectiveness and accuracy of our theory
1636: in predicting the expected error of machine learning agents in MASs.
1637: 
1638: Since our theory describes an agent's behavior at a high level (i.e.,
1639: the agent's error), it is not capable of making system-specific
1640: predictions (e.g., predicting the particular actions that are
1641: favored). These types of system-specific predictions can only be
1642: arrived at by the traditional method of implementing populations of
1643: such agents and testing their behaviors. However, we expect that there
1644: will be times when the predictions from our theory will be enough to
1645: answer a designer's questions. A MAS designer that only needs to
1646: determine how ``good'' the agent's behavior will be could probably use
1647: the CLRI framework. A designer that needs to know which particular
1648: emergent behaviors will be favored by his agents will need to
1649: implement the agents.
1650: 
1651: Finally, while we have given some examples as to how learning rates
1652: can be determined for particular machine learning implementations, we
1653: do not have any general method for determining these rates. However,
1654: we showed how to use the sample complexity of a learning problem to
1655: determine a lower bound on the learning rate of a consistent learning
1656: agent. This bound is useful for quickly ruling out the possibility of
1657: having agents with high expected errors and for stating that an agent's
1658: expected error will be, at most, a certain constant value. Still, if
1659: the agent's learning algorithm is much better than the one assumed by
1660: a consistent learner (e.g., the agent is very good at generalizing
1661: from one world state to many others), then these lower bounds could be
1662: significantly inaccurate.
1663: 
1664: \appendix{}
1665: 
1666: \section{Derivation for Matching Game}
1667: \label{sec:Deriving-C-matching}
1668: 
1669: If we can assume that the action chosen when an agent changes \ditw{}
1670: and the result does not match \Ditw{} (for some specific $w$) is taken
1671: from a flat probability distribution, then we can say that:
1672: 
1673: \begin{equation}
1674:   \label{eq:34}
1675:   B = \frac{|A_i| - 2}{|A_i| - 1}.
1676: \end{equation}
1677: 
1678: We will now show how to calculate the fourth term
1679: in~\eqref{eq:main:matching}.  For the matching game we find that we
1680: can set:
1681: %\begin{equation}
1682: %  C = (1 -c_i) D + l_i + (c_i - l_i)F   \tag{\ref{eq:c}}
1683: %\end{equation}
1684: \begin{align}
1685:   D &= 1 - l_j \\
1686:   F &= l_j + (1 - l_j)
1687:   \left(
1688:     \frac{|A_i| - 3}{|A_i| - 2}
1689:   \right).
1690: \end{align}
1691: Having $|A_i| =2$ implies that $c_i = l_i$, which means
1692: that for this case we have
1693: \begin{equation}
1694:   \label{eq:8}
1695:   (1 -c_i) D + l_i + (c_i - l_i)F    = l_i + (1 -c_i)( 1 - l_j).
1696: \end{equation}
1697: 
1698: For the case where $|A_i|> 2$, which is the case we are interested in,
1699: we can plug in the values for $D$ and $F$ and simplify, in order to
1700: get the fourth term:
1701: \begin{equation}
1702:   \label{eq:33}
1703:   (1 -c_i) D + l_i + (c_i - l_i)F  = 1 - l_j + 
1704:     \frac{c_i l_j(|A_i| -1) + l_i(1 - l_j) - c_i}{|A_i|-2}.
1705: \end{equation}
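The simplification can be sketched in a few steps. First, note that
$\frac{|A_i| - 3}{|A_i| - 2} = 1 - \frac{1}{|A_i| - 2}$, so that
$F = 1 - \frac{1 - l_j}{|A_i| - 2}$. Substituting $D$ and this form of
$F$ then gives
\begin{align*}
  (1 - c_i) D + l_i + (c_i - l_i) F
  &= (1 - c_i)(1 - l_j) + l_i + (c_i - l_i)
     - \frac{(c_i - l_i)(1 - l_j)}{|A_i| - 2} \\
  &= 1 - l_j + c_i l_j
     - \frac{(c_i - l_i)(1 - l_j)}{|A_i| - 2} \\
  &= 1 - l_j +
     \frac{c_i l_j (|A_i| - 2) - c_i + c_i l_j + l_i (1 - l_j)}{|A_i| - 2} \\
  &= 1 - l_j +
     \frac{c_i l_j (|A_i| - 1) + l_i (1 - l_j) - c_i}{|A_i| - 2},
\end{align*}
which agrees with Eq.~\eqref{eq:33}.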
1706: 
1707: %\bibliographystyle{plain}
1708: \bibliographystyle{plain}
1709: \bibliography{../annobib,../vidal}
1710: %\bibliography{clri-bib}
1711: 
1712: \end{document}
1713: