THE GENERALIZED TUTOR-STUDENT LEARNING ALGORITHM FOR AUTONOMOUS MOBILE ROBOTS

by

Kevin Brammer

BSc., University of Northern British Columbia, 2004

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN MATHEMATICAL, COMPUTER AND PHYSICAL SCIENCES (COMPUTER SCIENCE)

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA

October 2005

© Kevin Brammer, 2005

ISBN: 978-0-494-28385-1

Abstract

This thesis introduces the generalized Tutor-Student learning algorithm, which is designed to extend an existing robot controller in order to improve its performance and adaptability. The prototype 2 controller was developed for the tasks of obstacle avoidance and attractor path planning with predator avoidance, in partially observable static and dynamic environments. A version of the learning algorithm has been implemented to extend the prototype 2 robot controller, a behavior-based robot controller; the extended controller is called the prototype 2A robot controller. The performance of the learning algorithm was evaluated by deploying the prototype 2 and prototype 2A controllers in the same environments and comparing their results. The results show that the prototype 2A controller outperforms the prototype 2 controller because of its use of the Tutor-Student learning algorithm. This thesis provides a complete description of the generalized Tutor-Student learning algorithm and of an application of this algorithm.
TABLE OF CONTENTS

Abstract ii
Table of Contents iii
List of Tables viii
List of Figures ix

1 Introduction 1
  1.1 Problem Statement 1
  1.2 Our Contribution 2
  1.3 Overview 3

2 Review of Control Architectures 4
  2.1 Robot Controllers 4
    2.1.1 Autonomous Agents 4
    2.1.2 Deliberative Controller Paradigm 5
    2.1.3 Reactive Controller Paradigm 7
    2.1.4 The Hybrid Controller Paradigm 9
    2.1.5 Discussions of Different Robot Controller Paradigms 12
      2.1.5.1 Deliberative Controllers 12
      2.1.5.2 Reactive Controllers 13
      2.1.5.3 Hybrid Controllers 14
  2.2 Learning in Robot Controllers 15
    2.2.1 Learning Algorithms 15
      2.2.1.1 Cybernetics 15
      2.2.1.2 Neural Networks 16
      2.2.1.3 Reinforcement Learning 17
    2.2.2 Case Studies 21
      2.2.2.1 Example of Competence Acquisition 21
      2.2.2.2 Examples of Competence Acquisition and Competence Adaptation 25
  2.3 Summary 29

3 Methodology 31
  3.1 Inspirations 31
    3.1.1 The Prototype 1 Robot Controller 31
      3.1.1.1 Overview 31
      3.1.1.2 Limitations 32
    3.1.2 Inspirations from Reinforcement Learning 33
    3.1.3 Inspirations from Cybernetics 34
  3.2 Generalized Tutor-Student Learning Algorithm 34
    3.2.1 Overview 34
    3.2.2 Components 35
      3.2.2.1 Tutor 35
      3.2.2.2 Student 36
      3.2.2.3 Action Selector 36
      3.2.2.4 Amplifier 36
      3.2.2.5 Evaluator 36
    3.2.3 Algorithm 37
  3.3 Application 41
    3.3.1 Tutor 42
    3.3.2 Student 45
    3.3.3 Action Selector 46
    3.3.4 Amplifier 46
    3.3.5 Evaluator 47

4 Experiments 48
  4.1 Environments 48
  4.2 Evaluation 50
  4.3 Experiments 51
    4.3.1 Prototype 2A Controller: Parameters 51
    4.3.2 The Chase-Avoid Environment 52
      4.3.2.1 Description 52
      4.3.2.2 Results and Discussion 52
    4.3.3 The Cave-Box Environment 54
      4.3.3.1 Description 55
      4.3.3.2 Results and Discussion 55
    4.3.4 The Cave-Predator-Prey Environment 57
      4.3.4.1 Description 57
      4.3.4.2 Results and Discussion 58
    4.3.5 The Cave-Predator-Prey-Box Environment 59
      4.3.5.1 Description 59
      4.3.5.2 Results and Discussion 60
  4.4 Conclusions 62

5 Conclusion 65
  5.1 Summary 65
  5.2 Open Questions 66
    5.2.1 Streamlining the Tutor-Student Learning Algorithm 66
    5.2.2 Amplification Strategies 70
    5.2.3 Student Transplants 70
    5.2.4 Real Robots 71
    5.2.5 Learning Intervals 71

Acknowledgment 72
References 73
Appendix A 78
Appendix B 79
Appendix C 80

List of Tables

1 Prototype 2A Parameters 51
2 Controllers Results for the Chase-Avoid Environment 53
3 Average Improvement by the Prototype 2A Controller for the Chase-Avoid Environment 53
4 Controllers Results for the Cave-Box Environment 55
5 Average Improvement by the Prototype 2A Controller for the Cave-Box Environment 56
6 Controllers Results for the Cave-Predator-Prey Environment 58
7 Average Improvement by the Prototype 2A Controller for the Cave-Predator-Prey Environment 59
8 Controllers Results for the Cave-Predator-Prey-Box Environment 61
9 Average Improvement by the Prototype 2A Controller for the Cave-Predator-Prey-Box Environment 61

List of Figures

1 Deliberative Controller Model 1 5
2 Deliberative Controller Model 2 5
3 Generalized Reactive Architecture 7
4 Three Layer Architecture 9
5 Cybernetic Controller Model 16
6 Tutor-Student Propagation Phase 37
7 Tutor-Student Pre-Evaluation Phase 38
8 Tutor-Student Evaluation Phase 39
9 The Chase-Avoid Environment 52
10 The Cave-Box Environment 54
11 The Cave-Predator-Prey Environment 57
12 The Cave-Predator-Prey-Box Environment 60
13 Modified Propagation Phase 67
14 Modified Pre-Evaluation Phase 68
15 Modified Evaluation Phase 69

Chapter 1

Introduction

This thesis introduces the generalized Tutor-Student learning algorithm for mobile robot controllers. The learning algorithm is used to extend an existing robot controller in order to give the controller learning capabilities. This chapter will explain some of the problems with conventional robot controllers, as well as explain how the Tutor-Student learning algorithm can help overcome some of these problems.

1.1 Problem Statement

A robot controller is, in essence, the brain of the robot. The job of the robot controller is to map the robot's perceptions to actions. In this paper the terms agent and robot will be used interchangeably. The term agent applies to any entity that is self-contained, independent and can interact with its environment. A mobile robot is an independent agent that has the ability to make decisions based upon a given situation without the influence or interference of a human operator.

Since the 1960s, research has been conducted in furthering the development of robot controllers. Three primary robot controller paradigms have been developed. The deliberative robot controller paradigm was developed in the 1960s. These types of controllers follow a sense → plan → act control cycle. The reactive robot controller paradigm was developed in the mid 1980s. Reactive controllers follow a sense → act control cycle. The hybrid robot controllers are a combination of the deliberative controllers and the reactive controllers.

In all of the above paradigms, actions are produced by the robot controller through reaction, planning or both. As a consequence, none of these paradigms incorporates learning. These robot controllers are incapable of learning from past experiences, which results in the performance of the robot controller being solely dictated by the robot designer. In response, the field of robot learning was founded. Robot learning is a branch of machine learning, which is concerned with developing techniques that allow machines to learn. The field of robot learning has focused on two areas of learning in robot controllers:

Competence Acquisition: The robot controller tries to learn a new competence.
For example, a robot controller may not know how to make the robot walk, but it is able to learn how through negative and positive feedback. C om p eten ce A d ap tation : The robot controller tries to positively modify a compe­ tence through interaction with its environment. For example, if the robot con­ troller knows how to fetch frisbees, can the same robot controller learn to gather more frisbees faster? This thesis looks at the problem of competence adaptation. We ask the question, how can we convert an existing robot controller into a learning robot controller? 1.2 Our C on trib u tion In answering the question of how to convert an existing robot controller into a learning robot controller, we developed the generalized Tutor-Student learning algorithm. The generalized learning algorithm is a method th a t specifies how to add adaptability to the behaviors of a pre-existing robot controller. As well we have constructed a possible implementation of the learning algorithm and have evaluated its performance in four different simulated environments. All four of the simulated environments test a robot controller’s ability to avoid obsta­ cles and to determine paths to attractors while avoiding predators. A robot controller’s performance in all of these environments is judged by the number of times the robot controller was able to successfully complete its tasks (i.e. how many times it avoided obstacles, as well as how many times it found attractors). The experiments th a t we have conducted in these environments have shown th a t a robot controller using the TutorStudent learning algorithm can achieve between almost two to over five times better performance (depending on the environment) in the task of attractor path planning with predator avoidance, than the same controller not using this learning algorithm. There­ fore, we can claim th a t the Tutor-Student learning algorithm is successful since it allows a robot controller to automatically tune its actions according to its environment. 1.3 O verview There are four remaining chapters in this thesis. Chapter 2 is the literature survey, which will focus on conventional robot controllers and robot learning with several case examples. Chapter 3 explains the principles of the Tutor-Student learning algorithm and offers some guidelines on its implementation. Chapter 4 describes the experimental procedures and results. Finally, Chapter 5 will present our conclusions and open questions regarding future research with the Tutor-Student learning algorithm. C hapter 2 R ev iew o f C ontrol A rch itectu res This literature survey will discuss the evolution of robot controllers and how robot learning integrates with robot controllers. In this literature survey, we will examine the three prim ary robot controller paradigms. As well, we will briefly look at the field of robot learning along with three learning robot controller case studies. 2.1 R ob ot C ontrollers A robot controller maps how perceptions from sensors are translated into actions to be executed by actuators. Robot controllers can be implemented as either software or hardware, or as a combination of both. The study of robot controllers is quite diverse, consequently this paper will only be covering three of the prim ary paradigms in robot control architectures: deliberative, reactive and hybrid controllers. 2.1.1 A u ton om ou s A gen ts The term agent is an im portant concept in robotics. The definition of an autonomous agent can be defined iteratively. 
Maes does this [Mae94] as follows: “An agent is a system th a t tries to fulfill a set of goals in a complex, dynamic environment.” “An agent is called autonomous if it operates completely autonomously, i.e. if it decides itself how to relate its sensor data to motor commands in such a way th a t its goals are attended to successfully.” The term agent in this paper is used to describe any entity with the capacity to act intelligently and purposefully in its environment in order to achieve its goals, specifically the term agent will refer to robotic agents. 2.1.2 D elib era tiv e C ontroller P aradigm Deliberative controllers were first developed in the 60s, and dominated the the field of robot controllers until about the mid 80s. In deliberative controllers the flow of control is unidirectional in the general case [GatOS]. Deliberative controllers decompose an agent in one decomposition. The decomposition in deliberative controllers is commonly called a horizontal decomposition due to the unidirectional flow of information through its modules. Sensor Module Planning Module Model Module Act Module Environment Fig. 1: Deliberative Controller Model 1 Sensor Module Planning Module Act Module Environment Fig. 2: Deliberative Controller Model 2 Figure 1 and Figure 2 present two types of deliberative controllers. controllers usually are decomposed into the following modules; Deliberative Sensor M odule: This module’s job is to globally collect all the sensor signals from the agent’s sensors. These signals are usually, but not necessarily, preprocessed into a knowledge representation (e.g., symbolic representation or fuzzy logic representa­ tion). This information is then passed onto either the model module or the planning module. M od el M odule: This module translates the inputs from the sensor module into a rep­ resentation of the agent’s environment. This model representation of the environ­ ment, however, is only as accurate as the sensors sensing the actual environment [Bro91]. This information is then made accessible to the planning module. P lan n in g M odule: This module is also called the planner. Its job is to formulate plans based on the agent’s internal world model and current goal. Once a plan has been formulated, the plan is then given to the act module to execute. It should be noted th a t the model module and the planning module are sometimes combined into one module and referred to as the planning module. A ct M odule: This module is also called the executive. The job of this module is to execute the plan given to it by the planning module. This plan is executed as a computer program is executed [Gat98], by carrying out a sequence of instructions or actions. Prom the description of the above modules the deliberative controllers can be seen as modified general problem solvers in artificial intelligence [Bro91]. deliberative controllers are discussed in section 2.1.5.1. The limitations of 2.1.3 R ea ctiv e C ontroller P aradigm Reactive controllers are characterized by their tight coupling of perception and action [Ark98, Bro86, Bro89]. In reactive controllers, the flow of control is explicitly or implicitly parallel [Bro86, NHS89], which can be seen in Figure 3. The parallel nature of a reactive controller is done for modularity, multiple goals and robustness [Ark98, Bro86]. The reactive controller paradigm decomposes an agent in several decompositions. Reactive controllers typically follow an incremental design and test philosophy [Ark98, Bro86]. 
An incrementally designed agent is designed and tested on a small set of behaviors. Once the agent has been tested successfully, more behaviors can then be added. Behavior Sensors Action Selection Mechanism Behavior Actuators Behavior Environment Fig. 3: Generalized Reactive Architecture A reactive controller is usually decomposed into a set of behaviors and an action selection mechanism [Ark98, Bro86]. The job of a behavior module is to map perceptions into an action. This mapping is typically done so th a t there is very little or no planning involved in generating the action. Each behavior module runs concurrently with other behavior modules. A behavior module can be decomposed in the same manner as a deliberative controller [Bro86]; however, planning is typically kept to a minimum as 7 emphasis is put on the speed of the behavior module. Behavior modules can receive perceptual information in several ways [Ark98, Bro86, Bro89]. According to Arkin there are three ways in which perceptual information can be received by a behavior module [Ark98] : Sensor Fission: Sensor fission occurs when a behavior specific stimulus is used to pro­ duce a response, and thus a dedicated perceptual module (sensor) is used to channel its output directly into the behavior module [Ark98]. A ction -O rien ted Sensor Fusion: Action-oriented sensor fusion occurs when transi­ tory representations local to a behavior are constructed from multiple perceptual modules (sensors) [Ark98]. These transitory representations restrict the sensor in­ put to the requirements of a behavior modules [Ark98]. P ercep tu al Sequencing: Perceptual sequencing occurs when the behavior module al­ lows for the coordination of multiple perceptual algorithms over time in support of a single behavioral activity [Ark98]. These perceptual algorithms are sequenced in and out, based on the agent’s needs and the environmental context in which the agent is situated [Ark98]. The job of the action selection mechanism is to determine which action to select from the set of actions generated by the set of behavior modules [Bro86, Ark98, NHS89]. Once an action has been selected by the action selection mechanism, the action is then executed. Action selection mechanisms can be divided into two branches: arbitration and command fusion [Pir99]. A rbitration is a competitive approach where behaviors compete for control of the robot’s actuators [Pir99], whereas, command fusion is where each behavior contributes to the final action th a t is selected to be executed by the robot’s actuators [Pir99]. The most notable and popular reactive architecture is Brook’s subsumption architec­ ture [Bro86]. It was developed in response to the difficulties and limitations (see page 12) associated with the deliberative controller paradigm [Bro86]. The limitations of reactive controllers are discussed in section 2.1.5.2. 2.1.4 T h e H ybrid C ontroller P aradigm Hybrid controllers attem pt to capture the reactivity of reactive controllers and the deliberative planning of deliberative controllers [Ark98]. Unlike reactive and deliberative controllers, there is no clear breakdown of modules used in hybrid controllers [Ark98]. Hybrid controllers are usually described in terms of layers, where each layer is really a controller in its own right. At the very least, an agent implementing the hybrid controller paradigm requires a deliberative layer and a reactive layer [Ark98]. In fact these are called two level architectures [SRK93a]. 
In two level architectures, a deliberative planner is used to generate guidelines for the reactive controller [SRK93a]. The two level architectures follow Agre and Chapm an’s [AC90] philosophy of using plans as being a resource for the agent as opposed to explicit execution. Some notable two level architectures are the Theo-agent architecture[Mit90], the DAMN architecture [Pir99] and the fuzzy controller architecture in [SRK93b, SRK93a]. The other popular class of hybrid architectures, th a t have been relatively well de­ fined are the three layer architectures. Figure 4, shows a diagram representation of this 9 Sensors Environment Controller Sequencer Deliberator Actuators Fig. 4: Three Layer Architecture architecture, which consists of the following three layers: • A reactive feedback control mechanism called the controller. Gat [Gat98] describes the controller as follows: “The controller consists of one or more threads of computation th a t im­ plement one or more feedback loops, tightly coupling sensors to actuators. The transfer function(s) computed by the controller can be changed at run time. Usually the controller contains a library of hand-crafted trans­ fer functions (called primitive behaviors or skills). Which ones are active at any given time is determined by an external input to the controller.” The controller is in essence a reactive controller th a t implements a set of behaviors. • A slow deliberate planner, which is called the deliberator. G at [Gat98] describes it as follows: “The deliberator is the locus of time-consuming computations. Usually this means such things as planning and other exponential search-based algorithms, but as noted before, it could include polynomial-time algo­ rithm s with large constants such as certain vision processing algorithms 10 in the face of limited computational resources. The key architectural feature of the deliberator is th a t several Behavior transitions can occur between the time a deliberative algorithm is invoked and the time it pro­ duces a result. The deliberator runs as one or more separate threads of control. There are no architectural constraints on algorithms in the deliberator, which are invariably w ritten using standard programming languages.” The deliberator is basically a deliberative controller. • A sequencing mechanism called the sequencer is used to connect the controller and the deliberator. Gat [Gat98] describes it as follows: “The sequencer’s job is to select which primitive Behavior (i.e. which transfer function) the controller should use at a given time, and to sup­ ply parameters for the Behavior. By changing primitive Behaviors at strategic moments the robot can be coaxed into performing useful tasks. The problem, of course, is th a t the outcome of selecting a particular primitive in a particular situation might not be the intended one, and so a simple linear sequence of primitives is unreliable. The sequencer must be able to respond conditionally to the current situation, whatever it might be.” The sequencer is more of an action selection mechanism than a robot controller. Its prim ary purpose is to interface the controller and the deliberator. 11 Some notable three layer architectures are Connell’s SSS architecture [Con92], G a t’s Atlantis architecture [Gat98] and 3T architecture [BK96]. The limitations of hybrid controllers are discussed in section 2.1.5.3. 
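To make the division of labour among the three layers concrete, the following is a minimal Python sketch of a three-layer skeleton in the spirit of Gat's description above. It is not the code of any of the cited architectures: the class names (ReactiveController, Sequencer, Deliberator), the sensor dictionary and the primitive behaviors are invented for illustration, and real implementations run the layers as separate threads at very different rates.

```python
import random
import time

class ReactiveController:
    """Fast loop: maps the latest sensor reading to an actuator command
    using whichever primitive behavior the sequencer has activated."""
    def __init__(self, behaviors):
        self.behaviors = behaviors          # name -> hand-crafted transfer function
        self.active = next(iter(behaviors))

    def step(self, sensors):
        return self.behaviors[self.active](sensors)

class Sequencer:
    """Middle layer: switches primitive behaviors at strategic moments,
    conditioned on the current situation and the deliberator's advice."""
    def select(self, sensors, plan):
        if sensors["obstacle"]:
            return "avoid"
        return plan[0] if plan else "wander"

class Deliberator:
    """Slow layer: time-consuming planning; its output is advice (a plan),
    not a sequence of motor commands executed blindly."""
    def plan(self, goal):
        time.sleep(0.01)                    # stands in for an expensive search
        return ["seek_goal"]

# Primitive behaviors (the controller's library of transfer functions).
behaviors = {
    "wander":    lambda s: random.choice(["left", "right", "forward"]),
    "avoid":     lambda s: "left" if s["obstacle"] else "forward",
    "seek_goal": lambda s: "forward",
}

controller = ReactiveController(behaviors)
sequencer, deliberator = Sequencer(), Deliberator()
plan = deliberator.plan(goal="attractor")
for _ in range(3):                          # the reactive loop cycles far faster than planning
    sensors = {"obstacle": random.random() < 0.3}
    controller.active = sequencer.select(sensors, plan)
    print(controller.active, "->", controller.step(sensors))
```

The point of the sketch is only the shape of the interfaces: the deliberator produces advice, the sequencer decides which primitive behavior is active, and the controller maps sensors to actuator commands on every cycle.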
2.1.5 D iscu ssion s o f D ifferent R o b o t C ontroller Paradigm s This section will discuss the advantages and disadvantages associated with the three primary robot controller paradigms. 2.1.5.1 D elib erative C ontrollers The major advantage of deliberative controllers over reactive controller (Section 2.1.3 on page 7) is their ability to globally plan a course of actions for the agent to perform. The reactive controller’s designer typically focuses on carefully designing the behaviors of the agent in order to achieve a task. Planning in deliberative controllers provides a higher level of control and task execution as compared to the reactive controllers. While planning is a major advantage to the deliberative controllers, it is also the source of most of the deliberative controller’s disadvantages. Planners in deliberative controllers rely heavily on an accurate world model to formulate plans. However, it is impossible to take any sensor data and create an accurate representation of the world [Bro91]. All sensors have margins of error. As well, the environment may not allow sensors to perform optimally. Not to mention, actuators are not accurate enough to precisely carry out actions dictated by the planning module. This kind of error in the actuators can cause an invalidation of the plan and the need for re-planning. Planners in deliberative controllers prevent the agent from being a part of its environment. The 12 executive will execute a plan regardless of whether it is still relevant. Some executives are programmed to stop executing the plan if the plan can not succeed. Then control is given back to the planner so th a t it can formulate a new plan [AC90]. In dynamic environments, the formulation of a plan is time consuming and costly of the robot’s resources. If the agent has to rethink plans continually due to a changing environment, then the agent will be slow to react to situations th a t call for immediate action. Hybrid controllers (Section 2.1.4 on page 9) attem pt to alleviate this problem by combining the deliberative controller paradigm and the reactive controller paradigm together. Agre and Chapman [AC90], alternatively, propose th a t a plan be used more as a resource than as explicit instructions. This appears to be one of the primary ideas in hybrid controller design. Another disadvantage of deliberative controllers is th at they are not robust. A single failure in any module will cause either an error in all proceeding modules, or the complete system failure of the agent. 2 .1.5.2 R ea ctiv e C ontrollers The reactive controllers have several advantages. They are able to react almost immediately to situations. However, immediate reaction may be a disadvantage, in th at the robot may blindly act and miss opportunities presented to it by its environment. Reactive controllers have the ability to contend with multiple (possibly conflicting) goals due to the parallel nature of their behaviors and action selection mechanisms. Reactive controllers are robust in th a t a single failure will not cause the whole system to fail. This robustness is achieved through the parallelization of behaviors in reactive controllers. The biggest difficulty in reactive controllers is their blind reaction to different situa13 tions. 
The robot’s reaction to stimuli is mapped out by the robot designer who assumes what will be im portant stimuli for the robot and how the robot will react to th a t stimuli [Neh92], The problem is th a t the way humans perceive things is not necessarily the same way a robot should perceive things [Neh92], A solution to the problem of robot percep­ tion is to have the robot learn a behavior and determine the stimuli th a t are im portant for th a t behavior (see section 2.2.2 on page 21). Since explicit planning is discouraged in reactive controllers, implicit planning usually takes place through the controller’s action selection mechanism. The vast majority of these implicit plans are hand designed by the designer of the robot, according to the designer’s perceptions as opposed to the robot’s. Another disadvantage in reactive controllers is th a t they do not scale up well to complex tasks [Gat98]. As well reactive controllers typically are unable to adapt from one envi­ ronment to another, as a result of the specificity of the behaviors in reactive controllers for particular environmental cues. 2.1.5.3 H ybrid C ontrollers The major advantage of hybrid controllers is th a t they provide the best features of both deliberative controllers and reactive controllers. They are able to react when necessary and to formulate plans when they have to. The major difficulty with hybrid systems is th a t there is no definitive way of inter­ facing a deliberative planner with a reactive controller. Under the two level architec­ tures, there are a variety of ways of interfacing the deliberative planner with the reactive controller, each of which has its own advantages and disadvantages. The three layer architectures are a bit more defined as to what layers are needed. However, there is not 14 a lot of literature on the sequencer layer and more specifically how it interfaces with the deliberator and the controller. 2.2 Learning in R o b o t C ontrollers The field of robot learning tries to tackle the question of how to make a robot acquire knowledge in order to perform certain tasks within an environment successfully [NSM93] and how to adapt existing behaviors in order to gain improved task performance from the robot [Ark98]. In this section, we present learning algorithms as well as case examples of learning in robot controllers. 2.2.1 Learning A lgorith m s This section will present a selection of learning algorithms from cybernetics, neural networks and reinforcement learning. All of the presented learning algorithms learn incrementally through trial and error. 2.2.1.1 C yb ern etics Robotic controllers are designed to map perception to action. The first held to study this mapping from perceptions to action is cybernetics, which can be described in the following manner: “Cybernetics is the application of control theory to complex systems.” [Neh99]. “A marriage of control theory, information science, and biology th a t seeks to explain the common principles of control and communication in both animals and machines.” [Ark98] 15 Both of the above descriptions include the field of control theory, which is the mathe­ matical study of how to manipulate the param eters affecting the behavior of a system in order to produce an optimal outcome. Error Monitor Sensors Controller f(Error) Actuators Environment Fig. 5; Cybernetic Controller Model A cybernetic control model is shown in Figure 5. 
It uses a function called the monitor to compare the actual status of a system x_t at time t with a desired system status τ_t [Neh99]. The monitor returns the error between the actual system state and the desired system state, which is defined as e_t = x_t − τ_t [Neh99]. The error is then used as an input signal to a controller module [Neh99]. The job of the controller is to minimize the error between the actual system state and the desired system state [Neh99]. Therefore, the controller will choose an action y_{t+k} such that y_{t+k} = f(e_t) = f(x_t − τ_t) at the next time step t + k [Neh99]. Thus, the next action selected by the controller is based on a function of the error e_t. The function f in cybernetics is called the control function [Neh99], and it is where learning takes place. The main focus in cybernetics is to define the control function f and minimize its parameter e_t [Neh99].

2.2.1.2 Neural Networks

The field of neural networks attempts to emulate the neural networks found in biological creatures. Neural network learning falls into two categories: supervised and non-supervised learning. In supervised learning, we already know the inputs and outputs that are used to train the neural network. In non-supervised learning, the neural network attempts to cluster inputs into categories.

The neural network learning algorithm that we will examine is a supervised learning algorithm called the incremental backpropagation learning algorithm. This learning algorithm works on a neural network with a multilayer feed-forward network topology, whose hidden layer and output layer employ sigmoid transfer functions [Mit97]. The learning algorithm employed by a backpropagation neural network allows it to learn non-linearly separable functions [Fau94]. The algorithm for the incremental backpropagation learning algorithm is given in Appendix C (page 80).

2.2.1.3 Reinforcement Learning

Reinforcement learning occurs when an agent learns through trial-and-error interactions with its environment [SB98]. In order to adequately describe how reinforcement learning is connected with competence acquisition and competence adaptation, four main elements of reinforcement learning need to be defined. A description of these four elements is given below:

Policy: "A policy defines the learning agent's way of behaving at a given time. Roughly speaking, a policy is a mapping from perceived states of the environment to actions to be taken when in those states." [SB98]

A policy is a way of choosing an action based upon numerical values. A typical policy used in reinforcement learning is: if an agent is in state i, all the values of actions that can be performed in state i are recalled and the action with the maximum value is performed [SB98]. This process is known as exploitation, in that the agent is exploiting previous knowledge [SB98]. Exploration occurs when a random action in state i is chosen. The policy of an agent usually chooses between actions that exploit the agent's knowledge and actions that expand the agent's knowledge [SB98]. This is done so that the agent can balance exploiting already known good actions and exploring actions that have the potential of being good.

Reward Function: "A reward function defines the goal in a reinforcement learning problem. Roughly speaking, it maps each perceived state (or state-action pair) of the environment to a single number, a reward, indicating the intrinsic desirability of that state.
A reinforcement learning agent’s sole objective is to maximize the total reward it receives in the long run. The reward function defines what are the good and bad events for the agent.” [SB98] The reward function gives an immediate numerical reward to the agent for transi­ tioning from one state to another state [GosOS]. The immediate reward received by an agent can be positive or negative. If the reward is negative, it is typically referred to as a cost. In dynamic programming (which is a precursor to reinforcement learn­ ing), each state is given a immediate reward [GosOS]. However, in reinforcement learning only a subset of all the states in the agent’s world are given immediate rewards, whereas the rest of the states are assumed to have an immediate reward 18 of zero [GosOS]. Thus, only states th a t are considered to be significant are given immediate rewards. V alue Function: "... a value function specifies what is good in the long run. Roughly speaking, the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from th a t state.” [SB98] The value function is used to compute the value of a state or state-action pair. It is a way of looking ahead and judging how valuable it is for an agent to perform an action or enter a certain state. Typically, a value function is defined in a tabular format th a t maps state-action pairs to numerical values. These numerical values provide the basis for selecting which action to perform. Usually a value function can also be defined in a neural network with an incremental style updating rule [GosOS]. An incrementally trained neural network is one th a t is updated in increments rather than in batches. A neural network has the property th a t it can generalize output values across different input values, even input values th a t it has not seen before. By incrementally teaching the neural network, it becomes a function approximator. M odel: “This is something th a t mimics the behavior of the environment.” [SB98] A model is a simulation, which allows the agent to learn quicker than if it had to learn in the actual environment [GosOS]. A simulation allows the agent to experience 19 more states in less time than in its actual environment. In essence the model is a training world for the agent. The reinforcement learning algorithm th a t we will examine in this section is the Q-learning algorithm. Q-learning has a learning component called the value function Q (state, action) and a decision component called the policy. The learning in Q-learning works by shaping a value function as the agent interacts with its environment [SB98]. Decision making in Q-learning is accomplished through the policy [SB98], which is a function th a t chooses an action based on the results of the value function for all combi­ nations of the current state i and the set of possible actions th at can be executed in state i [SB98]. If the chosen action causes the robot to be in an undesirable state, then th at action’s expected reward for the previous state is deceased. Otherwise, the action’s ex­ pected reward for the previous state is increased. By repeating this process, the agent’s policy will eventually converge to the actual optimal policy [SB98]. An algorithm for standard Q-learning is given in Appendix A (page 78). Typically a value function of Q-learning is defined in a tabular format th a t maps stateaction pairs to numerical values. 
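As a concrete illustration of such a tabular value function and of an exploit-versus-explore policy, here is a minimal Python sketch of one-step Q-learning; the full algorithm used in this thesis is the one reproduced in Appendix A. The state names, action set, environment stub and the constants ALPHA, GAMMA and EPSILON are assumptions made only for this example.

```python
import random
from collections import defaultdict

# Q maps (state, action) pairs to numerical values: the tabular form described above.
ACTIONS = ["left", "right", "forward"]
Q = defaultdict(float)

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1       # illustrative learning rate, discount, exploration rate

def policy(state):
    """Epsilon-greedy: usually exploit the best known action, occasionally explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    """One-step Q-learning backup: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

def env_step(state, action):
    """Stand-in for the robot or simulator: returns (next_state, reward)."""
    return ("near_wall" if action == "forward" else "open",
            -1.0 if action == "forward" else 0.0)

# One interaction step.
s = "open"
a = policy(s)
s_next, r = env_step(s, a)
update(s, a, r, s_next)
print(a, r, dict(Q))
```

Repeated calls to update() shape the table so that, under the greedy part of the policy, the action with the maximum value for a state becomes the one selected.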
These numerical values provide the basis of selecting which action to perform for a particular state. A value function can also be defined in a neural network with an incremental style updating rule [GosOS]. In a Q-learning algorithm th a t is coupled with neural networks, each action is associated with its own neural network. Each of these action neural networks takes a state as their input and outputs the value of taking th a t action from the input state. The algorithm for this form of Q-learning is given in Appendix B (page 79). 20 2.2.2 C ase S tu d ies In this section, we will analyze some representative cases of learning algorithms ap­ plied to robotics. A competence acquisition learning algorithm can be defined as the process used for an agent to acquire a new behavior through interacting with its environ­ ment [Neh92]. And a competence adaptation learning algorithm can be defined as the process used by an agent to modify an existing behavior so th a t the behavior’s perfor­ mance improves with respect to the agent’s interactions with its environment [Ark98]. In both competence acquisition and competence adaptation, the goal for the learning algorithm is to improve a behavior’s performance [Ark98]. 2.2.2.1 E xam ple o f C o m p eten ce A cq u isition The Really Useful Robot (RUR) project conducted at the University of Edinburgh provides a good example of competence acquisition learning in a robot agent [NHS89]. In the classification of robot controllers discussed earlier, this architecture would be classified as a reactive controller since it just reacts to situations without any planning. The RUR control architecture was developed in order to explore the question of competence acquisition learning in a robot agent [NHS89]. This control architecture uses a set of instinct rules to determine when the architecture trains an association between an input vector and an action to its associative memory. Instinct rules act as conditions placed on the robot’s sensor readings (e.g. a reading on a distance sensor). This usage of instinct rules shapes the perceptron network of the agent. Moves th at violate the agent’s instinct rules are not learned by the robot’s associative memory, whereas moves th a t do not 21 violate the agent’s instinct rules are learned by the robot. The process of learning in this architecture is accomplished by pushing the outputs of a feed forward perceptron neural network towards a desired output value. It should be pointed out th a t the output value generated by the neural network for a particular state may not be the desired output. The learning by the perceptron network in this architecture does not go on until the desired output has been achieved but rather for a certain number of iterations. The learning algorithm [Neh95] in this architecture works by checking the current state of the robot against the robot’s instinct rules. If any of the instinct rules are violated, an input vector of the current state of the robot is generated. This input vector is then sent into the robots associative memory. A vector of possible actions is generated by the associative memory for the architecture to execute. The choice of possible actions from the action vector proceeds by choosing the first action in the vector to execute first. The retrieved action is executed and if the resulting state satisfies the robot’s violated instinct rule, then the association between the input vector and action is taught to the robot’s associative memory. 
Otherwise, if the robot’s violated instinct rule is still violated, the next action in the action vector is then executed for a longer period of time than the first unsatisfactory action. According to the authors [Neh95], the execution of an action for a longer period of time is done so th a t the robot can get itself out of “deep problems” . The architecture repeats this process until an action th a t satisfies the robot’s violated instinct rule is found. If a satisfactory action cannot be found within the action vector, then a random action is generated from the robot’s set of possible actions. If the randomly generated action satisfies the ro b o t’s violated 22 instinct rule, then the association between the input vector and the action is taught to the robot’s associative memory. Otherwise, if the robot’s violated instinct rule is still violated, another random action is generated and is executed for a slightly longer period of time than the first unsatisfactory action. This process of generating and testing random actions continues until an executed action satisfies the robot’s violated instinct rule. Once tire robot’s instinct rule has been satisfied, the last executed actions continues to be performed until one of the robot’s instinct rules becomes violated. W hen an instinct rule becomes violated, the architecture executes the above learning algorithm until the violated instinct rule becomes satisfied. The RUR control architecture has several difficulties. In the RUR controller, the associative memory is where the robot’s learning of competences is dynamically and incrementally shaped via the robot’s interactions with its environment. Actions th a t are considered good cause the associative memory to be pushed towards choosing an action in a particular state. This architecture picks actions incrementally from the action vector [Neh92]. Thus it does not exploit the knowledge stored in the neural network as much as it should. Another problem with this architecture is th a t when it exhausts all of the suggested actions in the action vector, it will then choose actions at random for different durations until the robot goes into a non-violated instinct rule state. W hen the control architecture gets into the phase of the learning algorithm where solely random actions are chosen, the architecture is no longer exploiting prior knowledge th a t it has learned about a state. In this case, the architecture is just randomly exploring. Once it has found an action th a t satisfies its instinct rules, it then teaches this action to the associate memory for the first instinct rule violated state. However, this action could be inappropriate for 23 the first instinct rule violated state. Despite the above difficulties, the RUR controller was able to learn and perform the tasks of obstacle avoidance, wall following, corridor following, phototaxis and box pushing [Neh94]. These tasks were accomplished through modifying the RUR controller’s set of instinct rules for each task. As stated above instinct rules are conditions th a t are put on the robots sensors. When an instinct rule is violated, actions are performed until the instinct rule is satisfied. For example, say we have a robot with a forward sensor and two front whisker sensors (one to the left and one to the right). A possible set of instinct rules for wall following could be: 1. Forward sensor is true. 2. Front left whisker sensor is true XOR front right whisker sensor is true. 3. Front left whisker is false or front right whisker is false. 
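For reference, the control flow described in the preceding paragraphs can be condensed into a short sketch. This is a paraphrase, not the RUR code: the associative memory is reduced to a lookup table with recall and teach methods, the instinct rule, sensor states, action set and durations are invented for illustration, and the real controller uses a perceptron network and keeps generating random actions until the violated rule is satisfied.

```python
import random

ACTIONS = ["forward", "left", "right", "reverse"]

class AssociativeMemory:
    """Stand-in for the RUR perceptron memory: maps a state to a ranked
    vector of suggested actions and can be taught new associations."""
    def __init__(self):
        self.assoc = {}
    def recall(self, state):
        return [self.assoc[state]] if state in self.assoc else []
    def teach(self, state, action):
        self.assoc[state] = action

def execute(action, duration):
    """Fake actuator/sensor step: returns the new (discretized) state."""
    return "moving" if action == "forward" else "blocked"

# A single instinct rule for illustration: "the robot should be moving".
instinct_rules = [lambda state: state == "moving"]

def violated_rule(state):
    for rule in instinct_rules:
        if not rule(state):
            return rule
    return None

def rur_learning_step(state, memory):
    """Try suggested actions, then random ones, each for a longer duration,
    until the violated instinct rule is satisfied; teach the winning action."""
    rule = violated_rule(state)
    if rule is None:
        return state
    start_state = state
    candidates = memory.recall(state) + [random.choice(ACTIONS) for _ in range(10)]
    for duration, action in enumerate(candidates, start=1):
        state = execute(action, duration)
        if rule(state):
            memory.teach(start_state, action)   # associate the violated state with the fix
            break
    return state

memory = AssociativeMemory()
print(rur_learning_step("blocked", memory), memory.assoc)
```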
It should be noted th a t the order of the instinct rules is im portant. W ith this set of instinct rules the robot will operate in the following manner. W hen the robot is turned on it will initially be stationary, thus instinct rule # 1 is violated, which causes the RUR controller to try actions until the robot is moving forward. After satisfying instinct rule # 1 , instinct rule # 2 is violated, because none of the robot’s whisker sensors are in contact with any objects. This causes the robot to try actions until it hits an obstacle (in our case a wall). Instinct rule # 2 becomes satisfied once the robot has hit an obstacle with one of its whisker sensors. Now the robot is violating instinct rule # 3 , because one of its whisker sensors is tonching the wall. In order, to satisfy instinct rule # 3 the robot must try actions until it has moved away from the wall. After instinct rule # 3 is satisfied the 24 first instinct rule is tried again. The Edinburgh researchers th a t created the RUR controller [Neh94] primarily used simple push button and whisker sensors for all their robots th at used the RUR controller. By using simple sensors the RUR creators made it easy to check if an instinct rule had been violated since the sensors had only two discrete states. Unfortunately, we were unable to find research on how the RUR controller would scale up to sensors th a t return continuous states (e.g. Sonar or IR sensors). R, appears th a t the RUR researchers in designing their architecture have taken some ideas from the fields of Cybernetics, Neural Networks and Reinforcement learning. The idea of using functions as couplings between sensors and motors in the RUR controller is a very common idea in Cybernetics [Bra86]. The RUR researchers used an associative memory (neural network) as a dynamic function th a t uses sensor values as inputs and outputs actions for the motors to execute. In this architecture the instinct rules act as a means of determining which actions to learn (reward) and which actions to throw away (punish). The RUR researchers use the controller’s associate memory almost like a value function from reinforcement learning in th a t it outputs a set of actions for a particular sensor state. However, it differs from a reinforcement learning value function in th a t it does not determine the value of an action for a state. The RUR associate memory only creates a vector of possible actions for the robot to execute and not a particular value associated with those actions. 2.2.2.2 E xam ples o f C om p eten ce A cq u isitio n and C o m p eten ce A d a p ta tio n C arbot A rch itectu re 25 A good example of a competence acquisition and competence adaptation learning algorithm is shown in the Carbot architecture th a t was designed by Meeden, McGraw and Blank [MMB93]. In this architecture, the robot has to learn the value function for the task of navigation in a simple environment [MMB93]. Like in the RUR project [Neh92], the Carbot architecture combines elements of connectionist methods and reinforcement learning into the architecture. In this architecture, the neural network is trained with a sophisticated learning algorithm called ComplemeMtary Reinforcement Back-Propagation (CRBP) [MMB93], Meeden, McGraw, and Blank [MMB93] use conditions on the robots sensors in order to indicate when the robot should either be rewarded or punished. This use of conditions placed on the robot’s sensors is similar to instinct rules in the RUR project [Neh92]. 
This architecture works by using its neural network to generate an action to be executed based on the robot’s current state. If the action does not violate any of the conditions put on its sensors, then the prior state and successful action are updated in the neural network by using the CRBP learning algorithm with a step size set at 0.3. If an action is unsuccessful, then the prior state and unsuccessful action are updated to the neural network by using the CRBP learning algorithm with a step size of 0.1. Thus, if an action is successful (the action satisfied the robot’s sensor conditions), the neural network is pushed towards th a t action, otherwise the neural network is pushed away from th a t action [MMB93]. This process of generating actions repeats for the entire run of the robot. An interesting thing about this architecture is th a t it does not limit the robot’s moves to a certain set of moves like the RUR architecture, because the Carbot architecture has neural networks directly connected to its motor inputs. The number of states for the 26 Carbot architecture to learn is limited by the number of sensors and types of sensors th a t are employed. The Carbot architecture employs touch sensors and light sensors. As well, the environment used to evaluate the Carbot architecture was a simplified partially observable static environment [MMB93]. The Carbot architecture does not randomly generate moves in the same sense as the RUR architecture. The neural network used by the Carbot architecture is initially untrained; thus its output values are going to be random values. The training of the architecture’s neural network as stated above shapes the values th a t are outputted by the neural network. Thus, the Carbot architecture has an implicit random move component. The actions generated by the architecture for unknown states will be essentially random. This architecture has proven to be successful in learning and adapting the compe­ tences of phototaxis and obstacle avoidance [MMB93]. However, the C arbot architecture has two significant drawbacks. The first difficulty with this architecture is th a t it requires a learning phase of 3000 decisions [MMB93] before it is able to perform its tasks compe­ tently within its environment. This leads to the second drawback; the robot’s learning is tailored specifically to the environment in which it has been trained. Unfortunately, we have been unable to find any research involving this architecture where the controller has been trained in one environment and then moved into another environment; thus the adaptability of the architecture is questionable. Finally, the Carbot architecture has only been used in simplified static environments. We have been unable to find any re­ search where the Carbot controller has been deployed in more complex environments. It is therefore questionable whether the architecture would operate well in more complex environments. 27 Su b su m p tion -b ased A rch itectu re Another good example of learning algorithms for competence acquisition and com­ petence adaptation is found in the architecture designed by Mahadevan and Connell [MC92]. In this architecture, the robot has to learn a value function for box pushing in a simple environment. Their architecture was based on the subsumption architecture [Bro86], so th a t the box pushing task could be split into the following behaviors for the robot to acquire and adapt [MC92]: Findin g a Box: This behavior is used by the agent in order to find a box. 
P u sh in g a Box: This behavior is used by the agent in order to push a box. G ettin g unw edged: This behavior is used by the agent in order to get unwedged from an immovable obstacle. These behaviors are prioritized in the subsumption architecture such th a t the get­ ting unwedged behavior has the highest priority, the pushing a box behavior has the second highest priority and finding a box has the lowest priority [MC92]. By splitting up the box pushing task into the above behaviors, Mahadevan and Connell have effec­ tively parallelized the reinforcement learning algorithm used in this architecture. In this architecture, sensors are used to determine the state of the agent and the applicability of the agent’s behaviors. As well, the sensors are used by the reward function of the agent [MC92]. Q-learning is used by this architecture in order to train the agent in the competence of box pushing [MC92]. The designers of this architecture point out th a t it is a non-trivial task to decompose the learning task into subtasks and reward functions [MC92], in th at, there are no guide­ 28 lines on how to setup a reward function for each learning task[MC92]. Another problem th a t the designers of this architecture point out is th a t the number of states and actions can easily make the reinforcement learning problem with Q-learning intractable [MC92], The reason for this is th a t it is not possible to store the value of every state-action pair when there are a large number of states and actions. The designers of this architecture also point out th a t the generalization of states the agent can significantly reduce the number of states th a t need to be considered [MC92]. 2.3 Sum m ary In this chapter, we have reviewed the three primary robot controller paradigms. As well we have briefly examined the field of robot learning. We have also reviewed learning algorithms used in cybernetics, neural networks and reinforcement learning. Finally we studied three representative robot learning controllers. All three of the example robot learning architectures have common drawbacks. The first drawback is th a t these architectures are unable to function with any degree of com­ petence when they are first deployed in an environment. A second difficulty with regard to the Carbot architecture [MMB93] and the subsumption-based architecture [MC92] is th a t learning in these architectures takes a great deal of time. The RUR architecture, however, does possess the ability to learn quickly due to its use of a perceptron network for its associative memory. The RUR architecture suffers a drawback due to its use of a perceptron network in th a t it acts erratically since the perceptron network can not approximate non-linearly separable functions. As well, all of these architectures have been used only in partially observable static environments. We have been unable to find 29 any research th a t would indicate if any of these architectures would scale up to more dynamic environments. The generalized Tutor-Student learning algorithm th a t we have proposed in this thesis allows us to deploy a robot, which is able to function competently in an environment as soon as it is deployed. The Tntor-Student learning algorithm learns quickly and does not make the robot act erratically. As well, onr learning algorithm works in both partially observable static and dynamic environments. 
More importantly this learning algorithm can be used to extend any of the three primary robot controller paradigms, where as, the three example architectures have only been able to add learning capabilities to reactive controllers. 30 C hapter 3 M eth od ology This chapter presents the methodology of the Tutor-Student learning algorithm, which attem pts to im itate the interactions between a tu to r and a student. The job of the tutor is to teach and guide the student, which also includes correcting the student when the student is wrong. The job of the student is to absorb the teachings of the tu to r and try to improve upon those teachings. Through this learning algorithm, an existing robot controller can become more attuned to its environment, as well as becoming more adaptable to changes in its environment. In this chapter, we will start with a review of some prior research th a t has inspired this approach. We will also present the generalized Tutor-Student learning algorithm as well as an application based upon this method. 3.1 Inspirations The Tutor-Student learning algorithm was inspired by prior experiments with our earlier prototype 1 robot controller, as well as some ideas from the fields of reinforce­ ment learning and cybernetics. In the following sections, we will briefly present these inspirations. 3.1.1 3.1.1.1 T h e P r o to ty p e 1 R o b o t C ontroller O verview The prototype 1 controller was designed to acquire behaviors based upon a set of evaluation conditions. This controller is basically a modified subsumption controller, 31 where each behavior is given a rule system, a learning system and a set of conditions. The goal of this robot controller is to teach each behavior’s learning system how to act through a combination of the behavior’s rule system and through interaction with the robot’s environment. The rule system in each behavior is a non-deterministic rule system, in th a t it ran­ domly chooses from a set of situation applicable rules. A non-deterministic rule system was chosen in order to allow equally applicable rules a chance to be activated. For ex­ ample, if there is a fork in the path and no other knowledge is known about which path is better, then taking the left fork is as good as taking the right fork. The learning system for each behavior uses a modified Q-learning algorithm, which is different from the standard Q-learning approach. The main difference between them is in our use of the function approximator, which is a method used to generalize a desired function from only a subset of known inputs and outputs. For each possible action, we used several incremental backpropagation neural networks to approximate the action’s value function. By using multiple neural networks for each action and averaging their outputs, the modified Q-learning can estimate the value function more accurately. The set of conditions for each behavior is used to indicate when the behavior is applicable, as well as when to reward or punish the learning system. 3.1.1.2 L im itations In experiments, the prototype 1 robot controller failed to show any long term com­ petence acquisition. The experimental results for the controller were in fact quite erratic for experiments with long decision runs. There are several possible reasons for this. How­ 32 ever, in this section we will only focus on the limitations th at became apparent with this controller. One of the main limitations is its inability to construct a sequence of actions for getting out of bad situations. 
The prim ary reason is th a t both the Q-learning algorithm and the rule system choose actions based on the current situation. Consequently, the rule system has no proper sequencing of actions th a t the Q-learner could learn from. As well, it takes a lot of time and experiences for the Q-learner to build up a sequence of actions for its value function. Another limitation th at the prototype 1 controller has is th at the rule system and the Q-learner often disagree with each other. A common example of such a situation occurs when the robot hits an obstacle and rule system causes the robot to back up and then the Q-learner has the robot go forward again. The reason why this disagreement occurs can be explained as follows. In each behavior, the Q-learner makes decisions until a behavioral condition is violated. Then on the behavior’s next decision, the rule system chooses an action. However, when the modified Q-learner punishes an action, it does not punish it enough so th a t even though the action’s value is decreased, it is not less than any of the other actions’ values. In this case, insufficient punishment by the Q-learner leads to a repeat of the punished action for th a t state, (i.e., the Q-learner would undo the action previously chosen by the rule system). 3.1.2 Inspirations from R einforcem ent Learning In the field of reinforcement learning, the value function is used to store the values of state-action pairs [SB98]. Typically a function approximator is used instead of an 33 explicit table to represent the value function [GosOS]. The reason for this is th a t a function approximator allows for the generalization of the value for state-action pairs and uses less memory to store those values [GosOS]. This means th a t the learner does not necessarily have to experience every state in its environment, because the value of state-action pairs can be generalized by the function approximator [GosOS]. 3.1.3 Inspirations from C yb ern etics The field of cybernetics has shown th a t the behavior of a machine can be modified through the amplification of its actions [Ash64]. This means th a t it is possible to change the behaviors of an existing machine without changing the rules governing th a t machine. 3.2 G eneralized T utor-S tu d en t Learning A lgorith m 3.2.1 O verview In the Tutor-Student learning algorithm, the tu to r is an existing proven robot con­ troller and the student is a function approximator, which is charged with learning stateaction pairs. This learning algorithm attem pts to improve the actions of either the tutor or the student by amplifying either of those actions. A generalized version of this algorithm works in the following way. The robot’s sensors retrieve the current state of the robot and its environment. The current state is then supplied to both the tutor and student modules. Each of these modules then generates an action based upon the current state. The generated actions, which we call the tutor action and student action respectively, are then outputted to the action selector module, 34 which chooses an action. This selected action is then given to the amplifier component for possible amplification. The amplifier produces an action called the resulting action, which is sent to the robot’s actuators for execution. This entire process is called the propagation phase of the algorithm. Note th a t the student’s actions are based upon previously evaluated actions. 
The evaluation of an action for a state happens during the pre-evaluation phase and evaluation phase of the algorithm. The pre-evaluation phase is used to gather information needed for the evaluation phase and happens during the same time period as the propagation phase. During the evaluation phase the resulting action is evaluated by the evaluator module based on the resulting state of the robot and its environment. The resulting state is produced by executing the resulting action from the propagation phase of the algorithm. The following sections will discuss the components used by the Tutor-Student learning algorithm as well as the three phases of the algorithm. 3.2.2 C om p on en ts The Tutor-Student learning algorithm utilizes the following modules: tutor, student, action selector, amplifier, and evaluator. The robot’s sensors and actuators are not included in this discussion since they are components th a t are provided by the robot. 3 .2.2.1 Tutor The tu to r is the name given to the existing robot controller. This component is responsible for providing basic behaviors in order, to guide the learning of the student. As well the tu to r provides behaviors th a t allow the robot to operate in critical situations. 35 3.2.2.2 Student The student is the learning component for the Tutor-Student learning algorithm. It can be any type of function approximator. The job of the student is to learn actions for given situations. 3.2.2.3 A ctio n S elector The action selector module is used to select either an action generated by the tutor or the student. This module works by comparing the types of actions generated by the tutor and the student. If both actions have the same type, then the student action is selected, otherwise the tutor action is selected. In this way the selected action is always of the same type as the tu to r’s action in order to guarantee the basic behavior of the robot. 3.2.2.4 A m plifier The amplifier module’s job is to decide if the components of an action are to be modified through amplification by a preset amount. Amplification is used to modify the actions of either the tu to r or the student. By using amplification as a quantity for learning, we are able to modify the behaviors of the tu to r without changing the rules governing the tutor. 3 .2.2.5 E valuator The evaluator module is used to determine if the previous action performed by the agent was appropriate for the previous state. The appropriateness of an action is decided 36 by checking if the state resulting from the execution of the action has violated any of the agent’s operational conditions. An operational condition is an assertion based on the current state of the agent. Operational conditions are used to determine which action the student is taught for the previous state. An example of such condition is “If the robot hits an obstacle” . The evaluator module evaluates an action using the procedure outlined in the evaluation phase of the learning algorithm (see page 39 for details). 3.2.3 A lgorith m The Tutor-Student learning algorithm is split into three different phases: the propa­ gation phase, the pre-evaluation phase and the evaluation phase. The propagation phase and the pre-evaluation phase happen in parallel during the same time period. These two phases are separately described in two diagrams and in each diagram only modules relating to the corresponding phase are illustrated. This is done in order to facilitate the understanding of the algorithm. 
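The interaction among these modules can also be condensed into a single code sketch. The Python outline below is illustrative only: the tutor, student, amplify and teach interfaces, and the dictionary used for the evaluator's memory, are assumptions made for this sketch rather than the thesis implementation, and the phase procedures it compresses are spelled out in the remainder of this section.

from dataclasses import dataclass

@dataclass
class Action:
    # Component form used by the controllers in this thesis: translation and
    # rotation in [-1.0, 1.0], duration >= 0.0, plus a type label.
    kind: str
    translation: float = 0.0
    rotation: float = 0.0
    duration: float = 0.0

class TutorStudentAgent:
    """One control cycle of the generalized Tutor-Student algorithm.

    `tutor` and `student` are callables mapping a state to an Action;
    `amplify` possibly modifies an Action and reports whether it did;
    `teach` trains the student on a (state, action) pair; `conditions`
    is a list of operational-condition predicates over states. All of
    these interfaces are assumptions made for this sketch.
    """

    def __init__(self, tutor, student, amplify, teach, conditions):
        self.tutor, self.student = tutor, student
        self.amplify, self.teach = amplify, teach
        self.conditions = conditions
        self.memory = None  # the evaluator's model of the current situation

    def propagate(self, state):
        # Propagation phase (steps 1-9) and pre-evaluation phase (steps a-e).
        tutor_action = self.tutor(state)
        student_action = self.student(state)
        # Action selector: keep the student's action only when it has the
        # same type as the tutor's action.
        if student_action.kind == tutor_action.kind:
            selected = student_action
        else:
            selected = tutor_action
        resulting, amplified = self.amplify(selected)
        self.memory = dict(previous_state=state, tutor=tutor_action,
                           selected=selected, resulting=resulting,
                           amplified=amplified)
        return resulting  # sent to the actuators for execution

    def evaluate(self, resulting_state):
        # Evaluation phase (steps A-C), one time period later.
        m = self.memory
        violated = any(cond(resulting_state) for cond in self.conditions)
        if m["amplified"]:
            target = m["selected"] if violated else m["resulting"]
        else:
            target = m["tutor"] if violated else m["selected"]
        self.teach(m["previous_state"], target)

The selection rule and the four teaching cases in this sketch mirror the action selector and evaluator descriptions above, and the evaluation procedure given below.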
The evaluation phase happens one time period after the propagation and pre-evaluation phases.

Fig. 6: Tutor-Student Propagation Phase

Figure 6 shows the propagation phase of the Tutor-Student learning algorithm. The numerical labels in Figure 6 refer to the steps in the following procedure for the propagation phase.

1. Get the current state from the agent's sensors.
2. The agent's sensors send the current state to the tutor and student modules.
3. The tutor and student modules each generate an action based on the current state.
4. The tutor and student modules send their actions to the action selector module.
5. The action selector module selects an action from its inputs.
6. The action selector module outputs the selected action to the amplifier module.
7. The amplifier module possibly amplifies the selected action.
8. The amplifier module sends its resulting action to the agent's actuators.
9. The agent's actuators execute the resulting action.

Fig. 7: Tutor-Student Pre-Evaluation Phase

Figure 7 demonstrates the pre-evaluation phase, which is a preparation process for the evaluation phase. The information stored in the evaluator can be seen as the evaluator's model of the robot's current situation. The letter labels in Figure 7 correspond to the steps of the following pre-evaluation procedure.

a. Get the current state from the agent's sensors.
b. The agent's sensors send the current state to the evaluator module and the evaluator module stores the current state as the previous state.
c. The tutor sends its action to the evaluator module and the evaluator module stores the tutor action.
d. The action selector sends its selected action to the evaluator module and the evaluator module stores the selected action.
e. The amplifier sends its resulting action to the evaluator module and the evaluator module stores the resulting action.

Fig. 8: Tutor-Student Evaluation Phase

The evaluation phase happens one time period after the propagation and pre-evaluation phases. Its process is shown in Figure 8, whose labels correspond to the steps of the following evaluation procedure.

A. Get the current state from the agent's sensors.
B. The agent's sensors send the current state to the evaluator module and the evaluator module stores the current state as the resulting state.
C. The evaluator module checks if the stored resulting action was amplified.
   (a) If it was amplified, then the evaluator tests the agent's operational conditions with the stored resulting state.
      i. If any of the agent's operational conditions are true, then the evaluator module teaches the student module the stored previous state and the stored selected action.
      ii. Otherwise, the evaluator module teaches the student module the stored previous state and the stored resulting action.
   (b) Otherwise, the evaluator module tests the agent's operational conditions with the stored resulting state.
      i.
If any of the agent’s operational conditions are true, then the evaluator module teaches the student module the stored previous state and the stored tutor action. ii. Otherwise, the evaluator module teaches the student module the stored previous state and the stored selected action. The essence of the evaluation phase is to choose an action stored during the pre­ evaluation phase of the learning algorithm, which is to be taught to the student with the previous state. The choice of action taught to the student is based upon whether the 40 stored resulting action was amplified during the previous propagation phase and whether the resulting action was successful. The criterion of a successful action is th a t none of the robot’s operational conditions has been violated. In this algorithm if any of the operational conditions is true, then a violation has occurred. If the resulting action was amplified, then the evaluator checks the robot’s operational conditions. If any operational conditions are true, then the resulting action should not be taught to the student. Instead the selected action should be taught to the student. However, if none of the operational conditions are true, then the resulting action is taught to the student. If the resulting action was not amplified, then the evaluator checks the ro b o t’s opera­ tional conditions. If any of these conditions are true then the resulting action should not be taught to student. The tutor action is then taught to the student because the tutor action is assumed to always be a successful action. Otherwise, if none of the operational conditions is true then the resulting action is taught to the student. 3.3 A p p lication This section presents the details of an implemented version of the generalized TutorStudent learning algorithm. This version uses a behavior based robot controller called prototype 2 as the tutor and an incremental backpropagation neural network as the stu­ dent. The robot controller resulting from applying the Tutor-Student learning algorithm to the prototype 2 robot controller is called prototype 2A. 41 3.3.1 Tutor The prototype 2 robot controller is designed for the tasks of obstacle avoidance and attractor path planning with predator avoidance in partially observable static and dynamic environments. The task of obstacle avoidance is for the agent to wander around its environment without hitting any obstacles. The task of attractor path planning with predator avoid­ ance is one in which the agent has three subtasks: The first subtask is to avoid hitting obstacles in its environment; The second subtask is for the agent to move towards a prey (also called attractors) and to move away from the prey after it gets within a certain range of the prey; the third subtask is for the agent to detect predators (also called repulsers) and flee from them when they get within a certain range. A partially observable environment is an environment where the agent does not have full knowledge about its environment. A static environment is an environment th a t does not change while the agent is deciding what to do next and a dynamic environment is an environment th a t has the ability to change as the agent is making decisions. The heart of the prototype 2 robot controller is a state planner th a t acts as the sequencing rule system for the controller. A state planner consists of a set of states where each state in the state planner possesses its own set of rules. 
All the rules in the state planner have three components: condition, action and transition. The condition component of a rule is used to indicate if the state rule is applicable. If it is applicable, then the action component gives an action for the robot to execute; as well the transition component gives the next state transition for the state planner. The state rules within 42 each state are tested for applicability from the first rule to the last in the state. Therefore there is a priority ordering of state rules in each state. The state planner used by the prototype 2 robot controller consists of the following states: Su bsum p tion State: This state consists of a priority ordering of the following types of state rules: O bstacle A void an ce Rules: These rules govern how the robot controller avoids hitting obstacles in its environment. P red ator A void an ce Rules: These rules govern how the robot avoids predators. A predators for the prototype 2 robot controller is any blue object found in the controller’s environment. These blue objects come in two varieties: a static blue box and a moving blue robot. F leeing R ules: These rules govern how the robot flees after it gets too close to prey. A prey for the prototype 2 robot controller is any red object found in the controller’s environment. These red objects come in two varieties: a static red box and a moving red robot. Seeking Rules: These rules govern how the robot locates and tracks prey in its environment. W ander R ules: These rules govern how the robot wanders about its environment. These rules consider open areas in the robot’s environment good areas for the robot to wander to. Note th a t the subsumption state rules are not necessarily grouped together by type 43 in the state ’s list of rules. Forward State: This state provides the forward behavior of the robot. A void Spin State: This state provides a spin behavior for obstacle avoidance. This spin behavior happens when the robot is close to obstacles on its front side, left side and right side. P rey Spin State: This state provides a spin behavior for fleeing from prey. This be­ havior is triggered when the robot gets too close to its prey, where upon the robot will spin until it no longer sees its prey. P red ator Spin State: This state provides a spin behavior for fleeing from predators. This behavior is triggered when the robot gets within a certain range of a predator. Once the robot is within th a t range, the robot will spin until it no longer sees the predator. The actions generated by the state planner consist of three components: translation, rotation and duration. Translation is a continuous floating point number between -1.0 and 1.0, which indicates whether the robot is to go forwards or backwards. R otation is a continuous floating point number ranging from -1.0 to 1.0, which indicates whether the robot is to turn left or right. Finally, duration is a continuous floating point value greater than or equal to 0.0, which describes how long the robot should perform the action. The prototype 2 robot controller has five types of actions: stop, forward, backward, left and right. The prototype 2 robot controller has been tested throughly in various environments 44 and has performed its tasks successfully. The success of the prototype 2 controller made it a good candidate for being the tuto r module for our implemented Tutor-Student learning algorithm. 3.3.2 Student The student as mentioned earlier is a function approximator. 
We chose to use an incremental backpropagation neural network as a function approximator, because of its well known ability to approximate a multivariate function [Gos03]. The student’s sensor inputs were converted to a binary representation in order to improve the performance of the neural network. As well, the student’s outputs are in a binary representation and are interpreted to other values which are action description, amplification level and ampbit. These output values are taught to the student during the evaluation phase of the learning algorithm. The action description is used to retrieve the component form of an action used by the tutor. The amplification level is used to compute the amplification value, which is then added or subtracted to the appropriate component of the retrieved action. The amplification value is equal to an amplification constant times the amplification level. For this implementation the amplification constant is equal to 0.1. The am pbit is used to determine if further amplification should occur on the action generated by the student. If the ampbit is 1 then the student’s action is a candidate for further amplification, else if the ampbit is 0 then the student’s action is not a candidate for further amplification. 45 3.3.3 A ctio n S elector In the prototype 2A robot controller, the action selector module compares string descriptions of actions generated by the tu to r and the student. If the tu to r action de­ scription is the same as the student action description, then the student action becomes the selected action along with its amplification level and ampbit, otherwise the tu to r action becomes the selected action along with an amplification level of zero and ampbit set to one. 3.3.4 A m plifier The amplifier module takes the selected action in its component form, (Translation, Rotation, Duration), along with a string description of the action, and the selected ampbit. Based on the string description of the action, the amplifier then amplifies the appropriate component of the action depending on the value of the ampbit or whether the controller is still in its learning phase. A learning phase is employed in the amplifier module because we are using an incre­ mental backpropagation neural network as a function approximator for the student. If the ampbit is zero at the start of the learning algorithm then amplification may not be applied to the selected action. The algorithm for the amplifier checks to see whether the ampbit is true or the robot controller is still in learning phase. If either of these conditions is true then the action is amplified, otherwise the action will not be amplified. The amplifier for this applied Tutor-Student learning algorithm only amplifies the 46 translation or rotation components of an action and not the duration component. The effectiveness of amplifying the duration component will be investigated in our future work (see section 5.2.2 on page 70). 3.3.5 E valuator The evaluator module works as described above in the generalized Tutor-Student learning algorithm (Section 3.2.2.5 on page 36), except th at the evaluator teaches the student an action description, an amplification level and an ampbit value with the previ­ ous state. The prototype 2A robot controller has only one operational condition, which is “If the robot hits an obstacle” . For successful actions, the student is taught with an ampbit equal to 1, whereas for unsuccessful actions the student is taught with an ampbit equal to 0. 
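The amplification arithmetic described across Sections 3.3.2 to 3.3.4 can be summarized in a short sketch. Only the 0.1 amplification constant, the translation-or-rotation-only rule, and the ampbit-or-learning-phase test are taken from the text; the mapping from action descriptions to components, the sign of the adjustment, and the clamping to the documented [-1.0, 1.0] range are assumptions made for illustration.

AMPLIFICATION_CONSTANT = 0.1  # amplification constant used by prototype 2A

def clamp(value, low=-1.0, high=1.0):
    # Keeping components inside the documented range is an assumption here.
    return max(low, min(high, value))

def amplify_action(action, description, level, ampbit, in_learning_phase, sign=1.0):
    """Sketch of the prototype 2A amplifier decision.

    `action` is the (translation, rotation, duration) component form and
    `description` is the action type string ('forward', 'backward', 'left',
    'right', 'stop'). Which component each description maps to, and the
    sign of the adjustment, are illustrative assumptions.
    """
    translation, rotation, duration = action
    if not (ampbit == 1 or in_learning_phase):
        return (translation, rotation, duration), False  # no amplification
    amount = sign * AMPLIFICATION_CONSTANT * level       # amplification value
    if description in ("forward", "backward", "stop"):
        translation = clamp(translation + amount)        # amplify translation
    else:
        rotation = clamp(rotation + amount)              # amplify rotation
    # The duration component is never amplified in this implementation.
    return (translation, rotation, duration), True

For example, under these assumptions a forward action (0.5, 0.0, 1.0) with amplification level 2 would become (0.7, 0.0, 1.0).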
When the evaluator teaches a state-action pair to the student, it is done for a fixed number of iterations, called the teaching rate. The teaching rate for the incremental backpropagation neural network is how many times the state-action pair is taught to the neural network at one time.

Chapter 4

Experiments

This chapter focuses on presenting the experiments that have been conducted using the prototype 2 and prototype 2A robot controllers that we introduced in Chapter 3. The parameters of the experiments will be explained, as well as the evaluation procedures used. The experimental results for each experiment will also be presented and discussed.

4.1 Environments

The prototype 2 and prototype 2A robot controllers will operate in two types of environments: a partially observable static simulated environment and a partially observable dynamic simulated environment. A partially observable environment is an environment where the robot's sensors only have partial access to the state of the environment. For example, a robot in an office-like environment would be unable to know what is happening on the other side of a wall (unless the robot has x-ray vision). A static environment is an environment that does not change while the robot is deciding what to do next. A dynamic environment is an environment that can change while the robot is deliberating what to do next.

The controller's environment is a simulated environment generated by the Player/Stage simulator, which is controlled through the Pyro programming environment. Pyro is a programming environment designed for constructing robot controllers and artificial intelligence programs [BKMY03]. (Player and Stage are freely available under the GNU General Public License from http://playerstage.sourceforge.net; Pyro is freely available under the GNU General Public License from http://pyrorobotics.org.)

The robot controller's environment consists of two parts. The first part is the robot's internal environment, which is a simulated Pioneer 2 DX robot. (The simulated Pioneer 2 DX is provided with the Player/Stage simulator software; the Pioneer 2 DX itself is an actual robot realized in hardware.) The Pioneer 2 DX has a differential drive train and 16 sonar sensors positioned around its 50 cm x 50 cm body. A subset of these sonar sensors will be used in the robot controller experiments. The second part of the controller's environment is its external environment, which is its immediate surroundings. For the experiments involving a static environment, the prototype 2 and prototype 2A robot controllers will operate alone in their environments, whereas in the experiments involving dynamic environments both controllers will operate in their environments alongside predator and prey robots. A predator robot uses a subsumption-like controller to seek and chase the prototype 2 and prototype 2A robot controllers around the environment. A prey robot uses a subsumption-like controller to wander around the environment.

Generally speaking, there are several concerns about simulated environments. The first is that simulators provide an agent with accurate sensors and actuators, which real robots do not possess. The second concern is that simulators usually make strong assumptions about the robots and environments. With regard to the first concern, the Stage simulator does provide relatively accurate sensors and actuators. However, since we are in the early stages of development, we preferred to take the approach of starting with accurate sensors and actuators to make sure that the algorithm would function properly under ideal situations.
The question of how the learning algorithm cooperates with real robots is left for future work. As for the second concern, we have not been able to find any papers that discuss the assumptions made by the Stage simulator. However, we have found several papers [JS02, MWBDW02, VSSM02] where researchers have developed clients using the Stage simulator that work with little or no modification on real robots.

4.2 Evaluation

The prototype 2 and prototype 2A robot controllers are evaluated based upon how well they perform the tasks of obstacle avoidance and attractor path planning with predator avoidance in four different environments. The task of obstacle avoidance will be evaluated by recording how many times the robot controller experiences a collision. The task of attractor path planning with predator avoidance will be evaluated by recording the number of times the robot controller locates attractors.

The robots will be started at the same locations and in the same order. The individual performances of the controllers in the tasks of obstacle avoidance and attractor path planning with predator avoidance will then be averaged over all trial runs. In each environment, the prototype 2 robot controller and each of three different learning phases of the prototype 2A robot controller will be given 10 trial runs, each of which will consist of 10,000 decisions. Three different learning phases of the prototype 2A robot controller were tested: 2500, 5000 and 7500 decisions.

4.3 Experiments

The prototype 2 and prototype 2A robot controllers are tested in four different environments: the Chase-Avoid environment, the Cave-Box environment, the Cave-Predator-Prey environment, and the Cave-Predator-Prey-Box environment. All of these environments are modified versions of environments that come with the Player/Stage software.

4.3.1 Prototype 2A Controller: Parameters

In each of the following experiments, the prototype 2A robot controller is set with the parameters shown in Table 1.

Parameters                  Values
Number of Neural Networks   1
Input Units                 14
Hidden Units                30
Output Units                7
Learning Rate               0.3
Momentum Rate               0.9
Teaching Rate               3
Amplify Amount              0.1

Table 1: Prototype 2A Parameters

4.3.2 The Chase-Avoid Environment

Fig. 9: The Chase-Avoid Environment

4.3.2.1 Description

The Chase-Avoid environment shown in Figure 9 is a partially observable dynamic environment. This environment is designed to examine how well both controllers, 2 and 2A, are able to avoid obstacles and locate a prey robot in a small enclosed environment, while avoiding a predator robot. Due to the small area of the environment, there is a strong possibility of collisions with other robots.

4.3.2.2 Results and Discussion

The results of the prototype 2 and prototype 2A robot controllers for their sets of trial runs are given in Table 2.

                                    Obstacle Avoidance            Attractor Path Planning with Predator Avoidance
Controller                          Avg. Collisions   Std. Dev.   Avg. Attractors Found   Std. Dev.
Prototype 2                         31.2              57.62       84.6                    18.11
Prototype 2A, 2500-decision phase   20.6              6.82        130.1                   12.93
Prototype 2A, 5000-decision phase   19.4              5.85        125.4                   16.59
Prototype 2A, 7500-decision phase   31.5              17.10       125.5                   18.42

Table 2: Controllers Results for the Chase-Avoid Environment

Table 3 presents the average improvement of the prototype 2A robot controller over the prototype 2 robot controller for the tasks of obstacle avoidance and attractor path planning with predator avoidance.

Learning Phase           Obstacle Avoidance   Attractor Path Planning with
(Number of Decisions)    Improvement          Predator Avoidance Improvement
2500                     34.0%                53.8%
5000                     37.8%                48.2%
7500                     -0.96%               48.3%

Table 3: Average Improvement by the Prototype 2A Controller for the Chase-Avoid Environment

From the results shown in Table 2 and Table 3, it is demonstrated that for the task of obstacle avoidance, the prototype 2A robot controller performed significantly better than, or almost as well as, the prototype 2 robot controller. The results also demonstrate that, for the task of attractor path planning with predator avoidance, the prototype 2A robot controller performed significantly better than the prototype 2 robot controller.

For the Chase-Avoid environment, we can say that the Tutor-Student learning algorithm has greatly improved the performance of the prototype 2 robot controller.

4.3.3 The Cave-Box Environment

Fig. 10: The Cave-Box Environment

4.3.3.1 Description

The Cave-Box environment shown in Figure 10 is a partially observable static environment. It is designed to examine how well both controllers, 2 and 2A, are able to perform the tasks of obstacle avoidance and attractor path planning with predator avoidance in a large complex environment with three static attractors and three static repulsers. This environment presents a challenge for both controllers in that they have to explore a complex environment in order to locate attractors.

4.3.3.2 Results and Discussion

The results of the prototype 2 and prototype 2A robot controllers for their sets of trial runs are given in Table 4.

                                    Obstacle Avoidance            Attractor Path Planning with Predator Avoidance
Controller                          Avg. Collisions   Std. Dev.   Avg. Attractors Found   Std. Dev.
Prototype 2                         0.0               0.0         8.1                     2.02
Prototype 2A, 2500-decision phase   4.4               4.14        41.6                    4.22
Prototype 2A, 5000-decision phase   7.1               9.80        39.1                    7.20
Prototype 2A, 7500-decision phase   2.0               3.37        40.9                    5.86

Table 4: Controllers Results for the Cave-Box Environment

From the results in Table 4, it is clear that for the task of obstacle avoidance, the prototype 2A robot controller performed slightly worse than the prototype 2 robot controller. Despite this, the prototype 2A robot controller performed very well considering that, in the worst case, the prototype 2A controller averaged 7.1 collisions out of the 10,000 decisions made. Table 5 presents the average improvement of the prototype 2A robot controller over the prototype 2 robot controller for the task of attractor path planning with predator avoidance.
Learning Phase           Attractor Path Planning with
(Number of Decisions)    Predator Avoidance Improvement
2500                     413.6%
5000                     382.7%
7500                     404.9%

Table 5: Average Improvement by the Prototype 2A Controller for the Cave-Box Environment

The results shown in Table 4 and Table 5 demonstrate that for the task of attractor path planning with predator avoidance, the prototype 2A robot controller performed significantly better than the prototype 2 robot controller.

For the Cave-Box environment, we can say that the Tutor-Student learning algorithm did not improve the performance of obstacle avoidance in the prototype 2 robot controller. However, the learning algorithm did significantly improve the prototype 2A robot controller at the task of attractor path planning with predator avoidance. It is within our expectations that, due to the nature of this learning algorithm (i.e., it modifies the tutor's actions), more collisions may occur as part of this process.

4.3.4 The Cave-Predator-Prey Environment

Fig. 11: The Cave-Predator-Prey Environment

4.3.4.1 Description

The Cave-Predator-Prey environment shown in Figure 11 is a partially observable dynamic environment. This environment is designed to examine how well both controllers, 2 and 2A, are able to avoid obstacles and locate prey (attractors) in a large complex environment with two prey robots and one predator robot. The task of attractor path planning with predator avoidance in this environment is a difficult one in that there are only two attractors and they are moving around a large complex environment.

4.3.4.2 Results and Discussion

The results of the prototype 2 and prototype 2A robot controllers for their sets of trial runs are given in Table 6.

                                    Obstacle Avoidance            Attractor Path Planning with Predator Avoidance
Controller                          Avg. Collisions   Std. Dev.   Avg. Attractors Found   Std. Dev.
Prototype 2                         1.6               1.84        29.2                    22.00
Prototype 2A, 2500-decision phase   7.2               4.21        68.8                    29.94
Prototype 2A, 5000-decision phase   9.1               4.82        42.1                    12.42
Prototype 2A, 7500-decision phase   5.7               4.85        49.7                    15.09

Table 6: Controllers Results for the Cave-Predator-Prey Environment

From the results shown in Table 6, it is clear that for the task of obstacle avoidance the prototype 2A robot controller performed slightly worse than the prototype 2 robot controller. Despite this, the prototype 2A robot controller performed very well considering that its highest average collision count is 9.1 out of the 10,000 decisions made. Table 7 presents the average improvement of the prototype 2A robot controller over the prototype 2 robot controller for the task of attractor path planning with predator avoidance.

Learning Phase           Attractor Path Planning with
(Number of Decisions)    Predator Avoidance Improvement
2500                     135.6%
5000                     44.2%
7500                     70.2%

Table 7: Average Improvement by the Prototype 2A Controller for the Cave-Predator-Prey Environment

The results in Table 6 and Table 7 demonstrate that for the task of attractor path planning with predator avoidance, the prototype 2A robot controller performed significantly better than the prototype 2 robot controller.
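The improvement percentages reported in Tables 3, 5 and 7 (and in Table 9 below) are consistent with a simple relative change against the prototype 2 averages, with the sign flipped for collisions since fewer collisions are better. The thesis does not spell the formula out, so the snippet below is our reading of the calculation, checked against a few of the tabulated values.

def improvement(prototype2_avg, prototype2a_avg, fewer_is_better=False):
    # Relative change of the prototype 2A average against the prototype 2
    # average, as a percentage; collisions are counted so that a reduction
    # shows up as a positive improvement.
    change = (prototype2a_avg - prototype2_avg) / prototype2_avg
    return 100.0 * (-change if fewer_is_better else change)

# Spot checks against the tables above:
print(round(improvement(84.6, 130.1), 1))                       # 53.8  (Table 3, 2500, attractors)
print(round(improvement(31.2, 20.6, fewer_is_better=True), 1))  # 34.0  (Table 3, 2500, collisions)
print(round(improvement(8.1, 41.6), 1))                         # 413.6 (Table 5, 2500)
print(round(improvement(29.2, 68.8), 1))                        # 135.6 (Table 7, 2500)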
For the Cave-Predator-Prey environment, we can say that the Tutor-Student learning algorithm did not improve the performance of obstacle avoidance in the prototype 2 robot controller; however, the learning algorithm did significantly improve the prototype 2A robot controller at the task of attractor path planning with predator avoidance.

4.3.5 The Cave-Predator-Prey-Box Environment

4.3.5.1 Description

Fig. 12: The Cave-Predator-Prey-Box Environment

The Cave-Predator-Prey-Box environment shown in Figure 12 is a partially observable dynamic environment. This environment is designed to examine how well both controllers, 2 and 2A, are able to avoid obstacles and locate attractors in a large complex environment. In this environment, the robot will be accompanied by two prey robots and one predator robot, which adds a dynamic element to this environment. As well, there will be three static attractors and three static repulsers located throughout the environment.

4.3.5.2 Results and Discussion

The results of the prototype 2 and prototype 2A robot controllers for their sets of trial runs are given in Table 8.

                                    Obstacle Avoidance            Attractor Path Planning with Predator Avoidance
Controller                          Avg. Collisions   Std. Dev.   Avg. Attractors Found   Std. Dev.
Prototype 2                         0.3               0.67        30.3                    15.60
Prototype 2A, 2500-decision phase   9.2               5.47        85.1                    18.13
Prototype 2A, 5000-decision phase   8.5               4.57        76.2                    13.69
Prototype 2A, 7500-decision phase   4.4               1.78        78.5                    9.50

Table 8: Controllers Results for the Cave-Predator-Prey-Box Environment

The results in Table 8 demonstrate that for the task of obstacle avoidance, the prototype 2A robot controller performed slightly worse than the prototype 2 robot controller. Despite this, the prototype 2A robot controller performed very well considering that its highest number of average collisions is 9.2 out of the 10,000 decisions made. Table 9 presents the average improvement of the prototype 2A robot controller over the prototype 2 robot controller for the task of attractor path planning with predator avoidance.

Learning Phase           Attractor Path Planning with
(Number of Decisions)    Predator Avoidance Improvement
2500                     180.9%
5000                     151.5%
7500                     159.1%

Table 9: Average Improvement by the Prototype 2A Controller for the Cave-Predator-Prey-Box Environment

The results shown in Table 8 and Table 9 clearly demonstrate that for the task of attractor path planning with predator avoidance the prototype 2A robot controller performed significantly better than the prototype 2 robot controller. For the Cave-Predator-Prey-Box environment, we can say that the Tutor-Student learning algorithm did not improve the performance of obstacle avoidance in the prototype 2 robot controller; however, the learning algorithm did significantly improve the prototype 2A robot controller at the task of attractor path planning with predator avoidance.

4.4 Conclusions

The experiments conducted on the prototype 2 and prototype 2A robot controllers have demonstrated the effectiveness of the Tutor-Student learning algorithm. For the task of obstacle avoidance, the learning algorithm was able to modify the actions of the prototype 2 robot controller so that the controller performed either better or only slightly worse. As stated above, the prototype 2A robot controller having slightly worse results than the prototype 2 robot controller for the task of obstacle avoidance is expected.
The Tutor-Student learning algorithm learns through trial and error interactions with its environment, and it is expected that during this learning process the robot will make mistakes. However, for all of these experiments, the prototype 2A robot controller was involved in collisions for less than 0.1% of the total decisions made by the controller, which is a significant result since, as far as we know, there is no other trial and error learning method that can perform as well.

In all of our experiments, the prototype 2A robot controller obtained the best results when the controller's learning phase was set to 2500 decisions. As stated previously (Section 3.3.4 on page 46), a learning phase is used to force the learning algorithm to amplify the controller's actions, in order to ensure that the function approximator does not stop the amplification of the controller's actions too early due to an initial spurious output value for the ampbit. Forcing amplification of the controller's actions past 2500 decisions appears to have decreased the controller's performance in all of the experimental environments. This decrease could be because the student has reached a stable point where amplification of the controller's actions is no longer required, and further amplification past this point may cause the controller's performance to degrade.

From our experiments using the implemented Tutor-Student learning algorithm we are unable to say whether the vast majority of learning occurs during the learning phases. There are two reasons that make this difficult to determine. First, it is possible that one function approximator may require less training than another (e.g., the initial weights in one neural network can be more conducive to a particular task than the initial weights in another neural network of the same type). Second, the Tutor-Student learning algorithm is designed to allow the robot to learn throughout its entire set of decisions, so learning may be occurring all through the trial run of the robot. Determining whether there is a decision interval, or set of decision intervals, where the majority of learning takes place is left for future work (see Section 5.2.5 on page 71 for more details).

In all of our experiments, the Tutor-Student learning algorithm considerably improved the performance of the prototype 2 robot controller in the task of attractor path planning with predator avoidance. This learning algorithm improved the average performance of the prototype 2 robot controller's ability to locate attractors by almost 100% to over 400% (depending on the environment).

Chapter 5

Conclusion

This chapter will present a summary of this thesis, as well as open questions for future research with the Tutor-Student learning algorithm.

5.1 Summary

This thesis has proposed the generalized Tutor-Student learning algorithm, an innovative method designed to convert an existing robot controller into a learning robot controller. This learning algorithm mimics the interactions between a human tutor and student. A sample application of the Tutor-Student learning algorithm has been presented.

In contrast to the learning robot controllers explored in Section 2.2.2 on page 21, the generalized Tutor-Student learning algorithm allows us to deploy a robot that is able to function competently in an environment as soon as it is deployed.
The robot controller is able to learn quickly and does not make the robot act erratically. As well, our learning algorithm works in both partially observable static and dynamic environments. The Tutor-Student learning algorithm is able to add learning capabilities to any of the three primary robot controller paradigms, whereas other methods in the literature have only been able to add learning capabilities to reactive controllers.

Experiments have been conducted employing the generalized Tutor-Student learning algorithm on a behavior-based robot controller in four different test environments. The experimental results have shown the effectiveness of the generalized Tutor-Student learning algorithm upon a robot controller. For the task of obstacle avoidance, the learning algorithm was able to modify the actions of the prototype 2 robot controller so that the controller performed either better or slightly worse. For all of these experiments, the prototype 2A robot controller was involved in collisions for less than 0.1% of the total decisions made by the controller, which is a significant result since, as far as we know, there is no other trial and error learning method that can perform as well.

The Tutor-Student learning algorithm also considerably improved the performance of the prototype 2 robot controller in the task of attractor path planning with predator avoidance. This learning algorithm improved the average performance of the prototype 2 robot controller's ability to locate attractors by almost 100% to over 400% (depending on the environment).

5.2 Open Questions

This section presents the open questions for future research with the Tutor-Student learning algorithm.

5.2.1 Streamlining the Tutor-Student Learning Algorithm

Can the Tutor-Student learning algorithm be streamlined? Since amplification is the only necessary learning quantity used by this algorithm, is it possible for the student to learn state-amplification level pairs instead of state-action pairs? This modification would affect all three phases of the learning algorithm. The modified Tutor-Student learning algorithm is briefly described below.

Fig. 13: Modified Propagation Phase

Figure 13 shows the modified propagation phase. The numerical labels in Figure 13 refer to the steps in the following procedure for the propagation phase.

1. Get the current state from the agent's sensors.
2. The agent's sensors send the current state to the tutor and student modules.
3. The tutor module generates an action based on the current state.
4. The student module generates an amplification level based on the current state.
5. The tutor sends its action to the action modifier module.
6. The student sends its amplification level to the action modifier module.
7. The action modifier module modifies the tutor's action with the student's amplification level.
8. The action modifier then sends the modified action to the amplifier module.
9. The amplifier module possibly amplifies the modified action.
10. The amplifier module sends its resulting action to the agent's actuators.
11. The agent's actuators execute the resulting action.

As can be seen from above, the student only outputs an amplification level for a given state, which is then applied to the tutor's action.
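A minimal sketch of this proposed data flow is given below, under the same illustrative assumptions as the earlier sketches; the function names and the choice of which component the amplification level modifies are not part of the thesis.

def apply_amplification(action, level, constant=0.1):
    # Action modifier: add constant * level to one component of the tutor's
    # action. Modifying the translation component here is purely illustrative;
    # in practice the component would depend on the action type.
    translation, rotation, duration = action
    return (translation + constant * level, rotation, duration)

def streamlined_propagation(state, tutor, student_level, amplify):
    # Steps 1-11 of the modified propagation phase: the student now produces
    # only an amplification level, which the action modifier applies to the
    # tutor's action before the usual (possible) amplification.
    tutor_action = tutor(state)                            # step 3
    level = student_level(state)                           # step 4
    modified = apply_amplification(tutor_action, level)    # steps 5-7
    resulting, amplified = amplify(modified)               # steps 8-9
    # The tutor, modified and resulting actions would be stored by the
    # evaluator during the pre-evaluation phase, as in Figure 14.
    return resulting, (tutor_action, modified, amplified)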
Due to this change, the action selector module is no longer needed. Figure 14 shows the modified pre-evaluation phase.

Fig. 14: Modified Pre-Evaluation Phase

The letter labels in Figure 14 correspond to the steps of the following pre-evaluation procedure.

a. Get the current state from the agent's sensors.
b. The agent's sensors send the current state to the evaluator module and the evaluator module stores the current state as the previous state.
c. The tutor sends its action to the evaluator module and the evaluator module stores the tutor action.
d. The action modifier sends its modified action to the evaluator module and the evaluator module stores the modified action.
e. The amplifier sends its resulting action to the evaluator module and the evaluator module stores the resulting action.

Fig. 15: Modified Evaluation Phase

The evaluation phase happens one time period after the propagation and pre-evaluation phases. Its process is shown in Figure 15, whose labels correspond to the steps of the following evaluation procedure.

A. Get the current state from the agent's sensors.
B. The agent's sensors send the current state to the evaluator module and the evaluator module stores the current state as the resulting state.
C. The evaluator module checks if the stored resulting action was amplified.
   (a) If it was amplified, then the evaluator tests the agent's operational conditions with the stored resulting state.
      i. If any of the agent's operational conditions are true, then the evaluator module teaches the student module the stored previous state and the stored modified action.
      ii. Otherwise, the evaluator module teaches the student module the stored previous state and the stored resulting action.
   (b) Otherwise, the evaluator module tests the agent's operational conditions with the stored resulting state.
      i. If any of the agent's operational conditions are true, then the evaluator module teaches the student module the stored previous state and the stored tutor action.
      ii. Otherwise, the evaluator module teaches the student module the stored previous state and the stored modified action.

5.2.2 Amplification Strategies

In this thesis, amplification is handled by adding a preset amplification amount to the appropriate component of an action. Are there any other amplification strategies? As well, is it possible to amplify more than one component of an action (i.e., either the translation or rotation component along with the duration component)?

5.2.3 Student Transplants

Is it possible to transplant a student that is trained for a task in a partially observable static environment into a partially observable dynamic environment? Can it adapt more quickly than if the student were trained for the same task only in the partially observable dynamic environment? That is, is it possible to give the student a "head start" by training in a partially observable static environment? Would one expect this to be true of human performance?

5.2.4 Real Robots

Experiments with the Tutor-Student learning algorithm on real robots need to be conducted.
Will this learning algorithm perform as well in real robots as in simulated robots? 5.2.5 Learning Intervals An interesting experiment would be to try and determine if there is a decision interval or set of decision intervals where the majority of learning takes place. 71 A cknow ledgm ent I would like to thank my supervisor, Dr. Charles Brown, for his encouragement and guidance during my graduate studies. He gave me the freedom to pursue my own research and my graduate experience has been enhanced by his expertise, understanding and patience. As well, I would like to thank Professor Desa Polajnar and Dr. William Owen for their counsel and valuable suggestions on how to improve my research and this thesis. I would like to give special thanks to my wife Jia Zeng for her support, help and advice. As well I would like to thank my parents for helping me to reach this point. 72 R eferences [AC90] P. E. Agre and D. Chapman. W hat are plans for? In P. Maes, editor, Designing Autonomous Agents: Theory and Practice from Biology to E n­ gineering and Back, pages 17-34. The MIT Press: Cambridge, MA, USA, 1990. [Ark98] Ronald C. Arkin. Behavior-based Robotics. MIT Press, 1998. [Ash64] W. Ross Ashby. A n Introduction to Cybernetics. Rontledge Kegan and Paul, 1964. [BK96] R. P. Bonasso and D. Kortenkamp. Using a layered control architecture to alleviate planning with incomplete information. In In Proceedings of the A A A I Spring Symposium ’Planning with Incomplete Information for Robot Problems’, pages 1-4. Menlo Park, CA: AAAI Press, Stanford, CA, 1996. [BKMY03] D. S. Blank, D. Kumar, L. Meeden, and H. Yanco. Pyro: A pythonbased versatile programming environment for teaching robotics. Journal of Educational Resources in Computing (JERIC), 3(4):1531-4278, 2003. [Bra86] Valentino Braitenberg. Vehicles: Experiments in Synthetic Psychology. MIT Press, 1986. [Bro86] R. A. Brooks. A robust layered control system for a mobile robot. IE EE Journal of Robotics and Automation, RA-2(l):14-23, 1986. 73 [Bro89] R. A. Brooks. A robot th a t walks: Emergent behaviors from a carefully evolved network. Technical Report AI MEMO 1091, MIT, 1989. [Bro91] R. A. Brooks. Intelligence without reason. In John Myopoulos and Ray Reiter, editors, Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI-91), pages 569-595, Sydney, Australia, 1991. Morgan Kaufmann publishers Inc.: San Mateo, CA, USA. [Con92] J. Connell. Sss: A hybrid architecture applied to robot navigation. In Proceedings of the 1992 IE E E International Conference on Robotics and Automation (ICRA-92), pages 2719-2724, 1992. [Fau94] L. Fausett. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Prentice Hall, 1994. [Gat98] E. Gat. On three-layer architectures. In D. Kortenkamp, R. P. Bonasso, and R. Murphy, editors. Artificial Intelligence and Mobile Robots: Case Studies of Successfid Robot Systems, pages 195-210. MIT Press, Gambridge MA, 1998. [GosOS] A. Gosavi. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Kluwer Academic Publishers, 2003. [JS02] B. Jung and G. S. Sukhatme. Tracking targets using multiple robots: The effect of environment occlusion. Autonomous Robots, 13(3): 191-205, 2002. 74 [Mae94] P. Maes. Modeling adaptive autonomous agents. Artificial Life, /, (1&2)(9), 1994. [MC92] S. Mahadevan and J. Connell. Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55(2-3):311365,1992. [Mit90] T. 
5.2.3 Student Transplants

Is it possible to transplant a student that is trained for a task in a partially observable static environment into a partially observable dynamic environment? Can it adapt more quickly than if the student were trained for the same task only in the partially observable dynamic environment? That is, is it possible to give the student a "head-start" by first training it in a partially observable static environment? Would one expect this to be true of human performance?

5.2.4 Real Robots

Experiments with the Tutor-Student learning algorithm on real robots need to be conducted. Will this learning algorithm perform as well on real robots as it does on simulated robots?

5.2.5 Learning Intervals

An interesting experiment would be to try to determine whether there is a decision interval, or set of decision intervals, during which the majority of learning takes place.

Acknowledgment

I would like to thank my supervisor, Dr. Charles Brown, for his encouragement and guidance during my graduate studies. He gave me the freedom to pursue my own research, and my graduate experience has been enhanced by his expertise, understanding and patience. As well, I would like to thank Professor Desa Polajnar and Dr. William Owen for their counsel and valuable suggestions on how to improve my research and this thesis. I would like to give special thanks to my wife Jia Zeng for her support, help and advice. As well, I would like to thank my parents for helping me to reach this point.

References

[AC90] P. E. Agre and D. Chapman. What are plans for? In P. Maes, editor, Designing Autonomous Agents: Theory and Practice from Biology to Engineering and Back, pages 17-34. The MIT Press, Cambridge, MA, USA, 1990.

[Ark98] Ronald C. Arkin. Behavior-based Robotics. MIT Press, 1998.

[Ash64] W. Ross Ashby. An Introduction to Cybernetics. Routledge & Kegan Paul, 1964.

[BK96] R. P. Bonasso and D. Kortenkamp. Using a layered control architecture to alleviate planning with incomplete information. In Proceedings of the AAAI Spring Symposium "Planning with Incomplete Information for Robot Problems", pages 1-4. AAAI Press, Menlo Park, CA, 1996.

[BKMY03] D. S. Blank, D. Kumar, L. Meeden, and H. Yanco. Pyro: A Python-based versatile programming environment for teaching robotics. Journal of Educational Resources in Computing (JERIC), 3(4), 2003.

[Bra86] Valentino Braitenberg. Vehicles: Experiments in Synthetic Psychology. MIT Press, 1986.

[Bro86] R. A. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, RA-2(1):14-23, 1986.

[Bro89] R. A. Brooks. A robot that walks: Emergent behaviors from a carefully evolved network. Technical Report AI Memo 1091, MIT, 1989.

[Bro91] R. A. Brooks. Intelligence without reason. In John Mylopoulos and Ray Reiter, editors, Proceedings of the 12th International Joint Conference on Artificial Intelligence (IJCAI-91), pages 569-595, Sydney, Australia, 1991. Morgan Kaufmann Publishers Inc., San Mateo, CA, USA.

[Con92] J. Connell. SSS: A hybrid architecture applied to robot navigation. In Proceedings of the 1992 IEEE International Conference on Robotics and Automation (ICRA-92), pages 2719-2724, 1992.

[Fau94] L. Fausett. Fundamentals of Neural Networks: Architectures, Algorithms, and Applications. Prentice Hall, 1994.

[Gat98] E. Gat. On three-layer architectures. In D. Kortenkamp, R. P. Bonasso, and R. Murphy, editors, Artificial Intelligence and Mobile Robots: Case Studies of Successful Robot Systems, pages 195-210. MIT Press, Cambridge, MA, 1998.

[Gos03] A. Gosavi. Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning. Kluwer Academic Publishers, 2003.

[JS02] B. Jung and G. S. Sukhatme. Tracking targets using multiple robots: The effect of environment occlusion. Autonomous Robots, 13(3):191-205, 2002.

[Mae94] P. Maes. Modeling adaptive autonomous agents. Artificial Life, 1(1&2), 1994.

[MC92] S. Mahadevan and J. Connell. Automatic programming of behavior-based robots using reinforcement learning. Artificial Intelligence, 55(2-3):311-365, 1992.

[Mit90] T. Mitchell. Becoming increasingly reactive. In Proceedings of the 8th National Conference on Artificial Intelligence (AAAI-90), pages 1051-1058. Morgan Kaufmann, 1990.

[Mit97] T. Mitchell. Machine Learning. WCB/McGraw-Hill, 1997.

[MMB93] L. Meeden, G. McGraw, and D. Blank. Emergent control and planning in an autonomous vehicle. In D. S. Touretzky, editor, Proceedings of the Fifteenth Annual Meeting of the Cognitive Science Society, pages 735-740. Lawrence Erlbaum Associates, Hillsdale, NJ, 1993.

[MWBDW02] A. Makarenko, S. Williams, F. Bourgault, and H. F. Durrant-Whyte. An experiment in integrated exploration. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 534-539, 2002.

[Neh92] U. Nehmzow. Experiments in Competence Acquisition for Autonomous Mobile Robots. PhD thesis, University of Edinburgh, Department of Artificial Intelligence, 1992.

[Neh94] U. Nehmzow. Autonomous acquisition of sensor-motor couplings in robots. Technical Report UMCS-94-11-1, The University of Manchester, Department of Computer Science, 1994.

[Neh95] U. Nehmzow. Flexible control of mobile robots through autonomous competence acquisition. Measurement and Control, 28(2), 1995.

[Neh99] U. Nehmzow. Mobile Robotics: A Practical Introduction. Springer Verlag, 1999.

[NHS89] U. Nehmzow, J. Hallam, and T. Smithers. Really useful robots. In Proceedings of Intelligent Autonomous Systems, volume 2, 1989.

[NSM93] U. Nehmzow, T. Smithers, and B. McGonigle. Increasing behavioural repertoire in a mobile robot. In From Animals to Animats, volume 2, pages 291-297. MIT Press, 1993.

[Pir99] P. Pirjanian. Behavior coordination mechanisms: state-of-the-art. Technical Report IRIS-99-375, University of Southern California, 1999.

[SB98] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[SRK93a] A. Saffiotti, E. H. Ruspini, and K. Konolige. Blending reactivity and goal-directedness in a fuzzy controller. In Proceedings of the 2nd IEEE International Conference on Fuzzy Systems, pages 134-139, San Francisco, CA, 1993.

[SRK93b] A. Saffiotti, E. H. Ruspini, and K. Konolige. Robust execution of robot plans using fuzzy logic. In A. L. Ralescu, editor, Fuzzy Logic in Artificial Intelligence: Proceedings of the IJCAI'93 Workshop, pages 24-37. LNAI 847, Springer, Berlin, Germany, 1993.

[VSSM02] R. T. Vaughan, K. Stoy, G. Sukhatme, and M. Mataric. Localization-space trails for robot teams. IEEE Transactions on Robotics and Automation, pages 796-812, 2002.

Appendix A

Standard Q-learning

An algorithm for standard Q-learning is given below [SB98, Gos03].

$\alpha$ = step size
$\lambda$ = discount value

1. Initialize $Q(state, action)$ to arbitrary values

2. Repeat:

(a) Get current state $i$

(b) Choose action $a$ from $i$ using a policy

(c) Execute action $a$, observe the received reward $r$ and the next state $j$

(d) Update $Q(i, a) \leftarrow (1 - \alpha)\,Q(i, a) + \alpha\left[r(i, a, j) + \lambda \max_{b \in A(j)} Q(j, b)\right]$

3. Until: End of the agent's trial run
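The update in step 2(d) can be illustrated with a short Python sketch. The environment interface (reset and step methods), the epsilon-greedy exploration policy, and the parameter values are assumptions made for illustration; this is not the simulator or learner used in this thesis.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes, alpha=0.1, lam=0.9, epsilon=0.1):
    """Tabular Q-learning in the style of Appendix A. `env` is a hypothetical
    object with reset() -> state and step(action) -> (next_state, reward, done)."""
    Q = defaultdict(float)                      # Q[(state, action)], arbitrary (zero) initial values
    for _ in range(episodes):
        i = env.reset()                         # (a) get current state i
        done = False
        while not done:
            # (b) choose action a from i using an epsilon-greedy policy
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda b: Q[(i, b)])
            j, r, done = env.step(a)            # (c) execute a, observe reward r and next state j
            # (d) Q(i,a) <- (1 - alpha) Q(i,a) + alpha [r + lambda * max_b Q(j,b)]
            best_next = max(Q[(j, b)] for b in actions)
            Q[(i, a)] = (1 - alpha) * Q[(i, a)] + alpha * (r + lam * best_next)
            i = j
    return Q
```

The epsilon-greedy rule is used here only as a placeholder for "a policy" in step 2(b); any exploration policy could be substituted.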
Appendix B

Standard Q-learning with a Function Approximator

An algorithm for standard Q-learning with a function approximator is given below [Gos03].

$\alpha$ = step size
$\lambda$ = discount value

1. Repeat:

(a) Get current state $i$

(b) Choose action $a$ from $i$ using a policy

(c) Execute action $a$, observe the received reward $r$ and the next state $j$

(d) Determine the output $q$ of the action neural network for action $a$ in state $i$

(e) Set $q_{next}$ equal to the maximum value of the outputs of the action neural networks with state $j$ as input to these networks

(f) Update $q \leftarrow (1 - \alpha)\,q + \alpha\left[r(i, a, j) + \lambda\, q_{next}\right]$

(g) Update the action neural network associated with action $a$ using the updated value $q$

2. Until: End of the agent's trial run

Appendix C

Incremental Backpropagation Learning Algorithm

The algorithm for the incremental backpropagation learning algorithm is given below [Mit97].

input vector = $x$
output vector = $o$
target vector = $t$
$w$ = weight
$\eta$ = learning rate

- The input from neuron $i$ to neuron $j$ is denoted by $x_{ji}$.
- The weight from neuron $i$ to neuron $j$ is denoted by $w_{ji}$.

For a specified number of iterations:

For all $(x, t)$ in the set of training examples:

Input $x$ to the network and compute the output of each neuron.

For all neurons $k$ in the output layer:

Calculate output neuron $k$'s error $\delta_k = o_k(1 - o_k)(t_k - o_k)$

For all neurons $h$ in the hidden layer:

Calculate hidden neuron $h$'s error $\delta_h = o_h(1 - o_h)\sum_{k \in outputs} w_{kh}\,\delta_k$

For all network weights:

$w_{ji} \leftarrow w_{ji} + \eta\,\delta_j\,x_{ji}$

End For
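The nested loops above map directly onto a few lines of NumPy. The sketch below assumes a single hidden layer of sigmoid units and omits bias weights for brevity; it is an illustration of the update rules in Appendix C, not the network used by the student module.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_incremental_backprop(examples, n_in, n_hidden, n_out,
                               eta=0.05, iterations=1000, seed=0):
    """Incremental backpropagation for one hidden layer of sigmoid units.
    `examples` is a list of (input_vector, target_vector) NumPy arrays."""
    rng = np.random.default_rng(seed)
    W_hid = rng.uniform(-0.05, 0.05, (n_hidden, n_in))   # w_ji: input -> hidden
    W_out = rng.uniform(-0.05, 0.05, (n_out, n_hidden))  # w_kh: hidden -> output

    for _ in range(iterations):
        for x, t in examples:
            # Forward pass: compute the output of each neuron.
            o_hid = sigmoid(W_hid @ x)
            o_out = sigmoid(W_out @ o_hid)

            # Output-layer errors: delta_k = o_k (1 - o_k) (t_k - o_k)
            delta_out = o_out * (1.0 - o_out) * (t - o_out)
            # Hidden-layer errors: delta_h = o_h (1 - o_h) sum_k w_kh delta_k
            delta_hid = o_hid * (1.0 - o_hid) * (W_out.T @ delta_out)

            # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji
            W_out += eta * np.outer(delta_out, o_hid)
            W_hid += eta * np.outer(delta_hid, x)

    return W_hid, W_out
```

Each (x, t) pair updates the weights immediately after its forward pass, which is what makes the algorithm incremental rather than batch.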