A FLEXIBLE SIMULATION FRAMEWORK FOR PROCESSOR SCHEDULING ALGORITHMS IN MULTICORE SYSTEMS

by

Viswanathan Manickam

B.E., Anna University, Chennai, India, 2006

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN MATHEMATICAL, COMPUTER AND PHYSICAL SCIENCES

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA

April 2012

© Viswanathan Manickam, 2012

ISBN: 978-0-494-94438-7

Abstract

In traditional uniprocessor systems, processor scheduling is the responsibility of the operating system. In high performance computing (HPC) domains that largely involve parallel processors, the responsibility of scheduling is usually left to the applications. So far, parallel computing has been confined to a small group of specialized HPC users. In this context, the hardware, operating system, and applications have mostly been designed independently, with minimal interaction. As multicore processors become the norm, parallel programming is expected to emerge as the mainstream software development approach. This new trend poses several challenges, including performance, power management, system utilization, and predictable response. Such demands are hard to meet without cooperation among the hardware, the operating system, and the applications. In particular, efficient scheduling of cores to application threads is fundamentally important in assuring the above mentioned characteristics.
We believe the operating system needs to take a larger responsibility in ensuring efficient multicore scheduling of application threads. To study the performance of new scheduling algorithms for future multicore systems with hundreds or thousands of cores, we need a flexible scheduling simulation testbed. Designing such a multicore scheduling simulation testbed and illustrating its functionality by studying the well known Linux and Solaris scheduling algorithms are the main contributions of this thesis. In addition to studying the Linux and Solaris scheduling algorithms, we demonstrate the power, flexibility, and use of the proposed scheduling testbed by simulating two popular gang scheduling algorithms: adaptive first-come-first-served (AFCFS) and largest gang first served (LGFS). As a result of this performance study, we designed a new gang scheduling algorithm and compared its performance with AFCFS. The proposed scheduling simulation testbed is developed in Java and is expected to be released for public use.

Acknowledgements

Throughout my time as a graduate student at UNBC, I came across many people who encouraged me in one way or another. I would like to thank them all. Among them, some deserve special thanks. First of all, I would like to thank my supervisor Dr. Alex Aravind for his continuous support and encouragement. Alex has been a mentor, a well-wisher, and a friend. I thoroughly enjoyed working with him; his style and ideas drew me into research, and I enjoy doing it. Next to my supervisor, the people who contributed to my thesis are Dr. Waqar Haque and Dr. Andrea Gorrell. I thank them for their valuable time, effort, suggestions, and encouragement. In addition, I would like to thank the external examiner Dr. Balbinder Deo and Dr. Ajit Dayanandan, chair of my defence, for reading my thesis. I would like to thank my peers and friends who supported me and involved me in informative discussions. I want to specially thank Hassan Tahir for all his support, from the day I arrived in Prince George until I got settled, and for his brilliant programming ideas which helped me a lot. I thank my fellow graduate students Baldeep, Adiba Mahjabin Nitu, Azizur Rahman, Manoj Nambiar, Narek Nalbandyan, Nahid Mahmud, Fakhar Ul Islam, and Behnish Mann for their support. I thank our computer administrator Heinrich Butow for his continuous support. I would like to sincerely thank Dr. Mahi Aravind for the wonderful family dinners and parties on various events and occasions. Finally, much of the credit goes to my family, especially my parents, who supported me right to the end. A special thanks to my sister Kavitha and my brother-in-law Ramasamy for encouraging me to apply to UNBC. Once again, I thank them all. Thanks guys!

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Hardware Trend
    1.1.1 Multicore Processors vs. Parallel Computers
  1.2 Computing Trend
  1.3 Applications Trend
  1.4 Operating Systems Trend
  1.5 Schedulers Trend
  1.6 Where do the contributions of this thesis fit in?
  1.7 Thesis Organization

2 Multicore Scheduling
  2.1 Load Balancing
  2.2 Scheduling Approaches
    2.2.1 Gang Scheduling in Multicore and Cloud Computing
  2.3 Related Work
    2.3.1 Scheduler Simulators
    2.3.2 Fairness and Performance Issues in Gang Scheduling
    2.3.3 Performance Study
  2.4 Summary

3 Motivation and Contributions
  3.1 Multicore Scheduler Simulation Framework
  3.2 A New Gang Scheduling Algorithm
  3.3 Contributions
  3.4 Research Methodology
  3.5 Summary

4 Multicore Scheduler Simulation Framework
  4.1 Simulation
  4.2 Multicore Scheduler Simulation
  4.3 Architecture of MSS Framework
    4.3.1 Terminology
    4.3.2 Workload Generator
    4.3.3 Multicore Machine
    4.3.4 Multicore Scheduler
    4.3.5 Execution Trace
    4.3.6 Performance Calculation Engine
    4.3.7 Statistical Measures
  4.4 Activity Profile Generator
  4.5 User Interface
    4.5.1 Performance Parameter Setting Window
    4.5.2 Performance Observation Window
    4.5.3 Activity Monitor Window
  4.6 Summary

5 Case Studies - Linux and Solaris Scheduling Algorithms
  5.1 Load Balancing
    5.1.1 Real Time Scheduler
  5.2 Linux Scheduler - Completely Fair Scheduler (CFS)
    5.2.1 Calculation of Quanta
    5.2.2 Calculation of vruntime
  5.3 Solaris Scheduler
    5.3.1 The Default Solaris Scheduler
  5.4 Simulation Experiments
    5.4.1 Observations on Experiment 1
    5.4.2 Observations on Experiment 2
  5.5 Summary

6 A Fair and Efficient Gang Scheduling Algorithm
  6.1 Popular Gang Scheduling Algorithms
    6.1.1 AFCFS
    6.1.2 LGFS
    6.1.3 Simulation Experiments
  6.2 A New Gang Scheduling Algorithm
  6.3 The Algorithm
  6.4 Simulation Experiments
    6.4.1 Simulation Setup
  6.5 Summary

7 Conclusion and Future Directions
  7.1 Future Directions

Bibliography

List of Figures

4.1 The Scheduler
4.2 Architecture of MSS Framework
4.3 A Multicore Machine
4.4 Sample Execution Trace
4.5 Sample I/O Trace
4.6 Parameter Setting Window
4.7 Simulations Run Window
4.8 Performance Observation Window
4.9 Activity Monitor Window
4.10 Core Monitor Window
5.1 Average Turnaround Time: Linux and Solaris Scheduling Algorithms (by varying the number of cores)
5.2 Standard Deviation of Turnaround Time: Linux and Solaris Scheduling Algorithms (by varying the number of cores)
5.3 Average Interactive Response Time: Linux and Solaris Scheduling Algorithms (by varying the number of cores)
5.4 Standard Deviation of Interactive Response Time: Linux and Solaris Scheduling Algorithms (by varying the number of cores)
5.5 Average Turnaround Time: Linux and Solaris Scheduling Algorithms (by varying the workload)
5.6 Standard Deviation of Turnaround Time: Linux and Solaris Scheduling Algorithms (by varying the workload)
5.7 Average Interactive Response Time: Linux and Solaris Scheduling Algorithms (by varying the workload)
5.8 Standard Deviation of Interactive Response Time: Linux and Solaris Scheduling Algorithms (by varying the workload)
6.1 Average Response Time for AFCFS and LGFS Algorithms
6.2 Standard Deviation of Response Time for AFCFS and LGFS Algorithms
6.3 Core Utilization of AFCFS and LGFS Algorithms
6.4 Average Response Time of AFCFS and Proposed Algorithms
6.5 Standard Deviation of Response Time of AFCFS and Proposed Algorithms
6.6 Average Core Utilization of AFCFS and Proposed Algorithms
6.7 Bypass Count of Gangs of AFCFS and Proposed Algorithms

List of Tables

5.1 Completely Fair Scheduling Algorithm
5.2 Solaris 10 Scheduling Algorithm
5.3 Simulation Parameters
5.4 Simulation Parameters
6.1 AFCFS Gang Scheduling Algorithm
6.2 LGFS Gang Scheduling Algorithm
6.3 Simulation Parameters
6.4 New Gang Scheduling Algorithm
6.5 Simulation Parameters
7.1 Solaris 10 Scheduling Classes Priority Range
7.2 Dispatch Table for RT Scheduling Class
7.3 Dispatch Table for FX Scheduling Class
7.4 Dispatch Table for TS and IA Scheduling Classes
7.5 Tick Processing of Solaris Scheduling Algorithm
7.6 Update Processing of Solaris Scheduling Algorithm

Chapter 1

Introduction

The scheduling problem in computing systems has been studied over the last several decades. With the recent arrival of multicore processors, the scheduling problem has gained renewed interest. The contribution of this thesis relates to the scheduling problem in multicore computer systems.

The primary goal of computer systems is to execute applications safely, securely, correctly, and efficiently. The hardware and system software have been designed to work together to achieve this goal, and until recently their developments have kept pace with each other. The operating system, which has scheduling as its central component, is the major part of system software. Hence, scheduling plays a pivotal role in the effective use of computer systems.

A paradigm shift in hardware technology has happened very recently. Instead of increasing speed, chip designers are starting to put a number of execution cores within a single processor. Such processors are called multicore processors. To exploit the capabilities of multicore processors effectively, the software community must make at least two main changes: (i) a new way of designing software for multicore processors, and (ii) system software redesigned with new capabilities to manage multicore processors. This thesis relates to the efforts in meeting the latter demand. In particular, this thesis proposes a flexible simulation framework to study the performance of scheduling algorithms for multicore processors.

The purpose of this chapter is to explain why multicore scheduling is interesting and worthy of investigation. For that, we first briefly describe the recent trends in hardware, computing, applications, operating systems, and schedulers.

1.1 Hardware Trend

Nearly forty-five years ago, Intel co-founder Gordon Moore predicted that transistor density on integrated circuits would double about every two years. In terms of speed, this law is equivalent to: processor speed will double about every eighteen months. Ever since Moore's prediction, hardware technology has been driven to produce processors with the highest possible speed by increasing the clock rate. Limited by the laws of physics, microprocessor design eventually hit the clock cycle wall, and the chip designers had to come up with a new way of exploiting the benefit of Moore's law. The new way is to use the extra transistors to add multiple execution units (referred to as cores) within a single processor. Each core is capable of executing independently of the other cores in the processor. This trend indicates that all future processors will be multicore [1]. As Intel, AMD, Fujitsu, IBM, and Sun Microsystems are already shipping their desktops and workstations with multicore processors [2], multicore systems are rapidly emerging as the mainstream computing platforms.
This hardware trend seems to continue, and we will have hundreds and even thousands of cores in a single processor in the near future [lj. The abundant availability of execution cores is expected to 2 revolutionize the way we will design software for these systems in the future. Another hardware trend, driven by high performance computing groups, has been introduction of various parallel computers. A parallel com puter is a single com puter with multiple internal processors or multiple com puters interconnected to form a co­ herent high-performance computing platform [3]. Multicore processors have similarity to some parallel com puters in structure and functionality, b u t they differ in several other aspects. 1.1.1 M u ltic o r e P r o c e sso r s v s. P a r a lle l C o m p u ters Parallel processors or parallel com puters were around as early as single processor systems. In term s of execution units, parallel processors and multicore processors are similar. M ulticore processors and parallel processors have two or more execution units. They differ mainly in their purpose, other resources, and application domains. The main purpose of parallel computers is to increase the performance of applications which have longer run times. Parallel processors are often designed for certain types of applications and usually run them in static partitions. In order to utilize th e parallel processors effectively, the applications must be parallelized. In parallel processors, the scheduling is mostly done at th e user level rath er th an taken care by the operating system. Typically, users of parallel computing design their application programs incorporating the logic of scheduling and synchronization. Of­ ten, compilers or libraries like OpenM P [4] and MPI [5] help th e application designers in achieving this task. Multicore systems are emerging as general purpose computers, and they are ex­ pected to be used in a wide range of domains - from desktop, w orkstations, and servers. T hat is, multicore systems are expected to handle multiple types of applica­ 3 tions including interactive workloads, realtim e tasks, and norm al workloads. There­ fore, unlike parallel computers, the task of scheduling in multicore processors cannot be left to applications. The operating system requires to take a major responsibility of scheduling applications to multicore processors. So, there are fundamental differences between parallel processor systems and multicore processor systems [6], and m ulti­ core processor systems require redesign of scheduling algorithms. Designing efficient scheduling algorithm for multicore processor systems, we believe, is a complex task. 1.2 C om puting Trend Until recently, most general purpose com puting are desktop based and most high performance computing are based on parallel com puting and cluster com puting [3j. Parallel computing often divides the tasks into smaller ones and uses parallel com put­ ers to execute them simultaneously. Cluster com puting has th e same objective, but the computing infrastructure is a set of loosely connected computers w ith a suitable software module to coordinate the com puters in executing parallel programs. A nother computing paradigm called grid com puting [7] has taken th is trend one step further by pooling computing resources from multiple adm inistrative domain to solve a single problem. The most recent trend is cloud computing. Cloud is a computing infrastruc­ ture paradigm th a t offers com putation and storage as web-based services [8]. 
Cloud computing typically has parallel computers and/or cluster computers as its server nodes. With the emergence of multicore processors, future clouds are expected to have multicore blades as their servers [8,9].

1.3 Applications Trend

In the 1980s, only a small group of people knew about computers, and an even smaller group used one. Now, almost everyone knows what a computer is, and most of us use one on a daily basis for reading news, listening to music, and so on. Thus, the use of computing continues to infiltrate every aspect of our lives, and automation in the practical world continues to increase. With the recent computing trends of cloud computing and mobile access, and with access to social media and the Internet, the need for more automated services is expected to accelerate. That means more application software is expected to be developed, and it has to be developed in a way that runs effectively on future computing platforms.

1.4 Operating Systems Trend

Operating systems is one of the core areas of computer science, and it has accumulated a large body of literature. However, most of the past work was done either in the domain of uniprocessor systems, where the operating system has full responsibility for managing applications, including scheduling, or in the domain of parallel processing, where the responsibility for task scheduling is largely left to application designers¹.

¹Parallel processing typically involves complex problems requiring high computational time. Using software specifically designed for parallel programming, the program designers divide a complex problem into component parts and then assign the component parts to be executed on individual processors [3].

An operating system is one of the most complex software systems, and designing one is a challenging task. It has an influence on almost all other systems, both hardware and software, that involve computing. Operating systems spend a huge portion of their time executing applications [10].

Generally, the functionality of operating systems evolves and grows constantly to meet the needs of new technologies and applications. However, the operating system design related to processor management has not changed much in the last several years, because the number of processors has not changed much. Current operating systems were designed for one or a few cores. Thus, most developments have been in the domains of file systems, security, device management, and user interfaces. These developments are not aimed at meeting the demands of multicore processors. The traditional approach of building an operating system for every individual hardware model is no longer acceptable, because it will be out of date once new hardware arrives [11]. Some of the main issues with current operating systems are:

1. Scalability: Current operating systems are not designed to be scalable for multicore processors. Therefore, adding hardware resources would require redesigning the operating system, as current operating systems cannot utilize newly added hardware to increase their efficiency [1].

2. Resource allocation: Current operating systems are designed for computation with limited resources. This may not be the case for multicore systems. Since multicores require abundant resources for their computation, such resources are expected to be added accordingly for effective execution. For example, with the recent advancements in hardware multilevel caches, inefficient allocation of the cache will result in performance degradation [12].
For example, with the recent advancements of hardware multilevel caches, inefficient allocation of cache will result in performance degradation [12]. 3. Parallelism : If the current trend of multicore processor continues, the workload of an operating system managing the number of cores will continue to increase. Dividing the core management workload and handling concurrently require par­ allelism in kernel level. Parallelizing the kernel is difficult and m arginally suc­ cessful. This pushes us to seek new approaches [13]. Simply tuning applications to get advantage of the available cores may not be good enough when there is a mass deployment of cores. Therefore, the operating systems for the future mul­ ticore systems have to be redesigned or developed to effectively manage the cores among the applications to achieve/exceed the expectations of the users [14j. 6 As multicore systems offer more cores, handling these cores to serve the applica­ tions is becoming more challenging. 1.5 Schedulers Trend To meet the requirement of multicore systems, th e most im portant com ponent of operating system th a t m ight need a radical change is the processor scheduler. Most developments in the operating system dom ain in th e last several years have been on file system, security, device management, and user interface, and only minimal changes have been proposed for schedulers. These schedulers have been focusing on effectively multiplexing the CPU among th e com peting processes to assure fairness, quick response, and minimize the turnaround time. T he traditional operating systems such as Linux, Windows, and Solaris schedule the processes using time multiplexing. T he approach of tim e multiplexing alone is not suitable for multicore schedulers for several reasons. F irst of all, with the ad­ vancements of hardware and the availability of multilevel cache hierarchy, scheduling a core to a thread exclusively could reduce the latency and hence increase the perfor­ mance. T hat is. the thread scheduled alone in a core can effectively use the cache to reduce its execution time. Secondly, tim e multiplexing does not effectively deal with distributing the work among different cores so th a t no core sits idle when there is heavy workload on peer cores. Finally, tim e multiplexing does not reduce the im pact of access to cache and DRAM, which are considered expensive operations. Rather, it could increase the overhead on accessing shared resources such as cache, memory, and network. These reasons force us to re-think th e scheduler design. The operating system literature on multicore systems is relatively limited and most of the publications are within the last five years [1,6,8,10,12-36]. In th a t, only 7 a small portion is related to multicore scheduling algorithm s [12]. Most of them are related to effectively dealing w ith resource contention. The proposed approaches to design scheduling algorithm s for multicore systems vary greatly and differ in their recommendations. One view is th a t multicore systems have in fact simplified the scheduling. T h at is, since we have plenty of cores, there is no need to worry about intricate time multiplexing strategies, simply giving enough cores to applications will simplify the scheduling. On the other extreme, several researchers feel th a t multicore systems have complicated the scheduling task, as it needs to con­ sider several factors such as cores, caches, networks, and application requirements together in offering best possible service. 
A number of work suggest ideas in between these extreme cases. Most of these ideas are related to cache contention. Again, related to resource contention, there are two views. One group strongly advocates to incorporate contention aspects into th e scheduling algorithm s [12,23]. A nother group argues to decouple contention management from scheduling [34,36]. There is another direction of research th a t explores th e question of whether to keep operating systems and applications together or separately [13,22]. Overall, the field of multicore scheduling is very young and the proposed ideas on multicore scheduling are preliminary. 1.6 W here do th e contributions of th is th esis fit in? From the above discussion, we infer th a t the software systems th a t worked well for sequential systems might not effectively work w ith the multicore systems. There­ fore, there is a gap between the rapidly emerging hardware technology and relatively slow software technology. The recent research trend indicates that the software sys­ tems, particularly the operating system m ust be redesigned to reduce this gap. More importantly, new scheduling algorithms must be developed to utilize the multiple re­ sources offered by multicore systems. This thesis is an effort to help achieve this goal. More specifically, this thesis contributes to help develop new scheduling algorithms for multicore systems. The most difficult aspects of developing a novel scheduling algorithm are imple­ menting and testing its performance [25]. We believe a flexible multicore scheduler simulator framework with proper support for simulation and testing would be very useful. Developing such a comprehensive framework is the prim ary goal of this thesis. 1.7 T hesis O rganization The fundamentals of multicore scheduling and th e related work are presented in C hapter 2. Next, in C hapter 3, we present the motivation, contributions, and re­ search methodology. In C hapter 4, we present the design and the im plem entation of the multicore scheduler simulation framework. The framework and its im plem enta­ tion are the m ajor contributions of this thesis. We present the im plem entation and simulation study of Linux and Solaris scheduling algorithms in C hapter 5, and a new gang scheduling algorithm is presented and its performance compared to two well known gang scheduling algorithms in C hapter 6. Finally, in C hapter 7, we conclude the thesis and list some future directions to extend th e work carried out in this thesis. 9 Chapter 2 Multicore Scheduling Scheduling is a fundam ental problem in several systems. In processor scheduling, threads are assigned to processors for execution w ith the objective of optim izing cer­ tain performance metrics such as maximum throughput, minimum average response time, minimum average waiting time, an d /o r maximum C PU utilization. Threads are schedulable entities which achieve the intended tasks by their execution. Cores are physical execution units. In a multicore context, scheduling can be viewed at two levels: balancing the system load among the cores and multiplexing threads on a single core. In actual im plementations, these two tasks could be integrated as one scheduling module. A scheduling algorithm is a set of rules th a t define how to select the next thread for execution. This problem is well studied in single processor(eore) context and numerous scheduling algorithms exist in the literature [37,38]. 
Present multiprocessor operating systems such as Linux and Solaris use a two-level scheduling approach [12]. In one level the scheduler balances the load across cores, and in another level the scheduler uses a distributed run queue model with per core queues and local scheduling policies 10 to manage each core. 2.1 Load B alancing In general, load balancing in multicore or homogeneous multiprocessor systems can be done in several ways [12,39]. In one extreme, all the jobs can be kept in one shared global queue and schedule a job from this queue whenever a core becomes free. This approach is simple and balances the workload effectively, and lienee appears to increase the core utilization. But, in reality, this approach degrades th e performance due to cache pollution as there is a high probability of jobs frequently m igrating from one core to another. Modern systems achieve high performance by effectively exploiting the locality of reference, and by keeping frequently accessed d ata in local cache. Job migrations rarely utilize this benefit, and hence a significant am ount of time is spent on accessing d ata from farthest locations such as last level cache and memory. On the other extreme, each core can m aintain its own queue of jobs to b etter m an­ age cache affinity and other local resources. This case allows several load balancing strategies by m igrating jobs from one local queue to another [39]. There are four sim­ ple approaches. The first one is called sender-initiated policy in which lightly loaded cores initiate requests for jobs from other cores. This technique is also referred to as work stealing from other cores. In the second approach, called receiver-initiated pol­ icy, the heavily loaded cores request other lightly loaded cores to take jobs. T he third approach is the combination of both sender-initiated policy and receiver-initiated pol­ icy, and therefore called symmetric policy. The fourth one is that the heavily loaded core simply chooses a random destinations to m igrate some of its jobs. This simple strategy is found to be working well in practice. 11 The above general load balancing schemes are applicable only for systems with homogeneous processors, and are not suitable for parallel applications or the types of jobs which are heterogeneous with differing priorities. 2.2 Scheduling A pproaches In single processor system, scheduling of different jobs on a processor is typically done by tim e sharing. The basic idea of tim e sharing scheduling is th a t th e processor time is divided into chunks of time called tim e quanta, and each application executes in different, time quanta to complete its task. The critical factor affecting this tech­ nique is the tim e quanta, say, q. W hen a q is set very large, the applications run longer to complete their executions. W hen q is set smaller, th e applications interleave frequently. The popular time sharing technique with effective interleaving executions is round robin scheduling [37,38]. Almost all uniprocessor scheduling algorithms used in modern operating systems are time sharing. Among the tim e sharing algorithms, the most practical algorithms use multilevel feedback scheduling strategy. The basic idea behind multilevel feedback algorithms is th a t they use different priority queues to manage jobs w ith varying importance, and the jobs move between queues as their priorities change. The jobs with a lower priority will be served only if the higher priority queues are empty. 
The scheduling algorithms used by popular operating systems such as Linux, So­ laris, Mac OS, and Windows are some sort of multilevel feedback algorithm s. These algorithms with suitable load balancing technique have been adapted for multicore processors. (For this thesis, we have simulated Linux and Solaris schedulers.) The orthogonal technique to time sharing is space sharing, and it is applicable only in multiprocessor systems. It is an effective generic approach of scheduling m ulti­ 12 threaded applications on multiprocessor systems. T he basic idea is th a t different applications use different sets of processors during their lifetime. That is, the scheduler dedicates a set of processors to an application for its entire lifetime. Although this technique might offer excellent service to the applications, it may not be good for the utilization of system resources. In this approach, the processor will be idle when the application goes for I/O , or waiting for an event or synchronization. An effective variant of space sharing approach to parallel applications is th a t, instead of dedicating a set of processors to an application for its lifetime, parallel threads of an application are scheduled together for a fixed period of time. This technique, originally called co-scheduling later referred to in the literature as gang scheduling, was introduced by O usterhout [40]. Gang scheduling efficiently uses busy waiting for frequent synchronization. In th e literature, frequent synchronization is also referred to as fine-grained synchronization. The idea behind gang scheduling is simple th a t threads of a same process are scheduled together as a ‘gang’ on distinct processors so th a t they can progress in parallel and synchronize with minimal busy waiting involved. A gang is an application containing a set of parallel threads th a t frequently communicate with each other. During their executions, threads in a gang communicate for synchronization and d ata exchange. Often, a thread in a gang cannot proceed further w ithout sufficient progress from other threads. Such threads either do busy waiting or block themselves by suspending from execution until other threads progressed enough. A long busy wait on a processor wastes its execution time. On th e other hand, suspending and resuming processes often are also not good, when only a small wait is needed. Blocking results in context switches, which are costly. For several applications inducing small waits, research shows th a t a busy wait is b etter th a n blocking. G ang scheduling algorithms are typically designed to exploit the above observation. 13 In gang scheduling approach, different applications can use the same set of proces­ sors in different time quanta, and same application can use different sets of processors in different time quanta. This is an effective technique th a t provides an excellent service to parallel applications with threads involving similar loads and fine-grained synchronization. Using our scheduling simulator developed for this thesis, we study two popular gang scheduling algorithms and propose an improved gang scheduling algorithm . We believe th a t the proposed algorithm can be used for scheduling gangs in cloud com­ puting, as explained next. 
2.2.1 Gang Scheduling in Multicore and Cloud Computing

It has recently been predicted that the next decade will bring microprocessors containing hundreds, thousands, or even tens of thousands of computing cores, and that computational clusters and clouds built out of these multicore processors will offer unprecedented quantities of computational resources [8,22,41]. We discuss the relevance of multicore scheduling in cloud computing assuming that this prediction will come true.

Cloud computing is a service-oriented computing paradigm. It is designed to transparently provide services such as computation, software applications, data access, data management, and storage resources to customers through the Internet [8]. As cloud computing offers computing as a service, customer satisfaction with the services received is extremely important. Customer satisfaction is mainly related to cost, fairness, and quality of service. Of these, fairness and quality of service are often related to system performance. In particular, these metrics are primarily influenced by the execution of applications in the cloud, which in turn depends heavily on processor scheduling in the cloud. That is, to offer services effectively to customers, the service requests must be properly mapped to the available computing resources in the cloud. This is simply a scheduling problem, and it generally involves sequencing and assigning a set of applications to one or more processors (servers) so that the intended criterion is met, while maintaining the maximum possible utilization of system resources. Therefore, processor scheduling is a fundamental problem in cloud computing, as it is involved in almost all services that the cloud can offer. As the servers of the cloud are expected to be built from multicore processors, scheduling of multicore processors is an integral component of cloud scheduling [8,22,41].

Among the applications of cloud computing, a considerable portion is expected to come from high performance computing groups. Such applications require huge computational resources. Some of these applications are typically designed as parallel threads with frequent synchronization among themselves; these applications are basically gangs. Recent research suggests that gang scheduling can be effective in cloud computing [42].

2.3 Related Work

To set the context for our work, we reviewed the work on operating systems for multicore processors. In this section, we review the work specifically related to our contributions. This thesis has contributions relating to multicore scheduling simulation, a performance study of the Linux and Solaris scheduling algorithms, a performance study of three gang scheduling algorithms, and the fairness aspect of gang scheduling algorithms. Next, we review the work related to these contributions.

2.3.1 Scheduler Simulators

A number of simulators for multiprocessors have been proposed in the literature [25,26,28,43-47]. Among them, Simics [28], SimOS [46], SimpleScalar [43], and AMD SimNow [47] emulate the processor at the instruction set level. They differ in the architectures they emulate and in how they simulate other components such as I/O and the network. Simics simulates hardware that can run unmodified operating systems such as Solaris, Linux, Windows XP, and Tru64. Simics supports the following processor models: UltraSPARC, Alpha, x86, x86-64, PowerPC, MIPS, IPF, and ARM.
In addition, Simics simulates the device models well enough to execute the device drivers.

SimOS simulates the hardware components needed to boot, study, and run the IRIX operating system and the applications that run on IRIX. SimOS speeds up simulation by changing the mode of execution. Three modes of execution are provided in SimOS: emulation, rough characterization, and accurate. Emulation mode models only the hardware required to execute the workloads, leaving out uninteresting execution such as booting the operating system, reading from the disk, and initializing the workload; this is the fastest mode. The rough characterization mode approximates the behavior of the system by also simulating those uninteresting executions; it is two or three times slower than the emulation mode. The accurate mode emulates the complete system, and is therefore the slowest and most time consuming; it can be used for measuring the accuracy of the system under simulation.

AMD SimNow simulates the AMD family of processors. SimpleScalar simulates a close derivative of the MIPS architecture. Turandot [44] emulates PowerPC. A recent simulator called COTSon [45] builds on AMD SimNow and employs a statistical sampling technique that can selectively turn the simulation on and off to reduce the overall simulation time.

Since all these simulators emulate machine instructions completely, they all have fine-grained accuracy at the instruction level. Some of them are used as virtual machines. However, they are very slow, as they have to interpret each machine instruction in software. For example, SimOS, the fastest of the group, can execute workloads only somewhat less than 10 times slower than the underlying hardware, and it attains even this speed by simulating other components only approximately. Also, as these simulators run on a host operating system, the host's own scheduling further slows down instruction execution. Such fine-grained simulators are better suited to studying the low-level functionality of processors. They are not suitable for rapid simulations aimed at getting quick insight and guidance for developing new scheduling algorithms for future multicore processors. They are machine dependent and emulate only existing hardware. Implementing a scheduling algorithm in a system supported by a regular operating system is hard; therefore, simulating a new scheduling algorithm for a performance study in these simulators is time consuming and hard.

A simulator of the Linux scheduler called LinSched was proposed in [26]. This simulator was built by modifying the original Linux kernel, and it runs in user mode. The objective of LinSched was to study the Linux scheduler in depth; it was not intended for implementing new scheduling algorithms or for comparisons with existing scheduling algorithms.

Recently, a toolset called AKULA [25] was proposed to study the scheduling of threads on multicores so as to reduce their cache contention. The AKULA toolset assumes the availability of a task profile, termed bootstrap data, describing cache behavior.
Solo execution tim e is m easured when th e thread runs alone in the real machine and the degradation m atrix is th e degradation value from their solo execution time when a thread is scheduled w ith other threads in different cores which share the cache. For example, there are two threads A and B which are scheduled in two core system. The degradation m atrix contains degradation value of of thread A when thread B is scheduled in another core and vice versa. Suppose the degra­ dation value of thread A when scheduled with th read B is 0.75, then the slow down percentage is 75. Threads on a multicore system can be scheduled in a num ber of ways. Consider a system with two cores and two level cache memories LI - local to each core, and L2 shared by both. In this system, all threads can be scheduled to one core or they can be distributed between two cores in several ways. These different ways of scheduling will have different influence on both LI and L2 level caches. To observe th e cache behavior, we need to collect the cache d ata in all possible ways of scheduling, which will result in large number of combinations. AKULA collects the cache behavior for a lim ited set of scheduling combinations. It assumes th a t the threads scheduled in the cores are allotted with full LI cache and observes the L2 level cache effect. T h at is, tim e sharing on a core is not allowed. For example, there are four threads, say A, B, C, and D. need to be scheduled and there are two cores in the system which share L2 cache. T he degradation m atrix will have the degradation value for the following 12 com binations of threads: AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, and DC. 18 If the number of cores is increased to 3, then a total of 24 combination of d ata have to be collected. So, the size of the d a ta increases drastically w ith increasing number of cores as well as increasing number of threads. Also, executing these tasks in real machine to collect the bootstrap trace is tedious and very tim e consuming task. More importantly, AKULA only runs on th e profile d ata created by actual executions, it cannot be used for testing new workloads. Therefore, it limits its usage only to the system and the workload for which the trace is collected. Even changing a single param eter such as speed, number of cores, cache, workload will make th e trace unusable. Due to this restriction, the use of AKULA is very limited. 2 .3 .2 F airn ess a n d P erfo rm a n ce Issu e s in G a n g S c h e d u lin g Gang scheduling has been extensively analysed and several studies have concluded th a t gang scheduling is one of the best approaches for parallel applications, and hence several gang scheduling algorithms under different conditions have been proposed in the literature [40,42,48-55]. Among them First F it (or First-Come-First-Served (FCFS)) and Best F it (or Largest-Gang-First-Served (LGFS)) are popular [53]. W hen enough processors are free, FCFS chooses the job at the head of the queue to schedule and LGFS chooses the largest job in the queue to schedule. FCFS assures high fairness, b u t does not guarantee the best processor utilization. Consider th a t a larger gang G is in the head of the queue, and several other smaller gangs are waiting behind G. Assume th at there are not enough processors to schedule G, but several other processes from the queue can be scheduled. Now, in FCFS, these processors will be idle until enough processors become free and G is scheduled. 
Such situations will not only make the processor utilization low, but also have the potential to increase the average waiting time. To avoid such situations, a modification called adaptive FCFS (AFCFS) was introduced [53]. W hen a gang in the head of the queue cannot 19 be scheduled, AFCFS schedules other gangs behind in the queue. All these algorithm s are susceptible to starvation. To avoid starvation, these algorithms adopt the policy of migrating jobs from processor to processor [53]. Such task m igration may not be effective for multicorc systems. Task migration is hardly possible between two heterogeneous m ulticore systems, and generally expensive even between two homogeneous systems [39,56]. Also, an efficient task migration can reduce th e wait tim e, but it does not guarantee to eliminate starvation. Therefore, a simple solution to avoid starvation in gang scheduling is an interesting open problem. Avoiding starvation is an interesting theoretical problem. But, for practical appli­ cations, a b etter fairness measure th a n free of starvation is most desirable. We found th a t no such fairness metric has been introduced and used in this context. Regarding performance, the most widely used metric in the processor schedul­ ing context is average response time. We believe a predictable performance is more valuable than b etter average response tim e1. 2 .3 .3 P erfo rm a n ce S tu d y We have conducted performance studies on two sets of scheduling algorithms. Next, we discuss the work related those studies. 2.3.3.1 L inux vs. Solaris S ch ed u lin g A lg o rith m s Linux and Solaris schedulers have been constantly tuned and updated [12,57]. Even the most popular 0(1) Linux scheduler introduced in version 2.6 was initially luit is m ore im portant to m inim ize v a ria n c e in th e response tim e th a n to m inim ize th e average response tim e. A system w ith reasonable and p re d ic ta b le response tim e may b e considered m ore desirable th a n a system th a t is faster on th e average b u t highly variable. However, little w ork has been done on C P U scheduling algorithm s to m inim ize th e variance." [37] 20 expected to be used for a long time, has been overshadowed by the introduction of a completely fair scheduler (CFS) [12,58]. We decided to study th e performance comparison between Linux CFS and the Solaris 10 scheduler. To th e best of our knowledge, we could not find a performance comparison between Linux and Solaris schedulers. Even the complete description of the scheduling algo­ rithm s of these operating systems are not comprehensively described in one place. We put a lot of effort to construct the complete algorithm in bits and pieces from various sources for our simulation study. 2.3.3 .2 A F C F S vs. LG FS G an g S ch ed u lin g A lg o rith m s Among the popular gang scheduling algorithm s, AFCFS alleviates the problem of low processor utilization, and performs b etter th a n LGFS under light loads. The performance of these two algorithms have been studied in [51-53]. LGFS, on the other hand performs b etter than AFCFS under heavy loads. 2.4 Sum m ary In this chapter, we explained multicore scheduling in a higher level, and looked at load balancing and scheduling approaches. Then, we discussed an interesting class of parallel job scheduling called gang scheduling and its relevance to multicore and cloud computing. W ith this background, we are ready to present the m otivation and contributions of the thesis. 
Chapter 3

Motivation and Contributions

This chapter presents the motivation and contributions of this thesis, and the methodology used. This thesis contains three main contributions: (i) a multicore scheduler simulation framework and its implementation; (ii) simulation studies of two popular scheduling algorithms to illustrate the use of the proposed scheduler simulation framework; and (iii) a new scheduling algorithm and its performance study. We start with the motivation for the first contribution.

3.1 Multicore Scheduler Simulation Framework

From the literature study presented in Chapter 2, we identified two potential choices for our thesis work: (a) design and propose a new or improved multicore scheduling algorithm; and (b) design and propose a multicore scheduler simulation testbed in which any new scheduling algorithm can be easily simulated and analysed.

The first choice stems from the observation that, as the field is young and no widely accepted, concrete multicore scheduling algorithms have been proposed in the literature, there is a good possibility of inventing an efficient multicore scheduling algorithm. The second choice stems from the observation that, as the trend in processor technology indicates that future systems will have hundreds and thousands of cores, any new algorithm proposed for such systems must be studied carefully on a large number of cores, using a proper set of experiments, before it can be adopted for practical use.

After several brainstorming discussions, we started to realize that both choices are risky and challenging, as they have open-ended goals. However, we felt the second choice had the potential for wider impact, and we therefore chose to proceed with designing and implementing a flexible multicore scheduler simulation framework.

Designing and implementing a multicore scheduler simulation framework involves exploring several research questions, among them:

• What would be the main purpose of the framework?
• What would be the components of the framework?
• How accurately can the components be modeled?
• What level of accuracy do we want in modeling the components of the framework?
• To illustrate the use and flexibility of the framework, which scheduling algorithms can we implement and study?
• What kinds of simulation experiments would we like to conduct?

From the literature, we understood that no simulator can replace a real system. However, the design and implementation choices vary greatly depending on cost and accuracy. Here, the cost is a function of effort, time, complexity, and performance. The accuracy depends on the purpose of the simulation. Our design choices for the scheduler simulation framework are mainly motivated by the following observations:

• "Simplicity is the key to understanding. ... Simplified simulations provide the best grounds for extracting major properties quickly. ... Simulations done with realistic physical layers normally lead to investigating phenomena with too many variables, too many puzzles, leading to too few explanations, and too few hints for future progress." [59]

• "So, in practice, models that attempt to be highly accurate end up running very small "toy" workloads." [28]

• "... the biggest difficulties in scheduling algorithm development: the difficulty of implementation and the duration of testing. The difficulty of implementation refers to the time and effort needed to convert an idea into the actual code ...
The difficulty of implementation and the duration of testing make it infeasible to explore many different scheduling algorithms." [25]

These observations motivated us to design a multicore scheduler simulation framework that is simple yet flexible and comprehensive, so that the design space of scheduling algorithms for multicore systems can be explored rapidly. Our design objective for the framework is mainly to provide accuracy sufficient to gain initial insights into the performance of the scheduling algorithm under study. We believe such insights will be valuable in guiding researchers to develop new scheduling algorithms with a specific objective in mind. As a matter of fact, we encountered such an opportunity in developing an improved scheduling algorithm for a specific class of parallel applications called gangs. Next, we briefly explain how we were motivated to study gang scheduling algorithms and to propose an improved gang scheduling algorithm.

3.2 A New Gang Scheduling Algorithm

Cloud computing is an emerging class of computational platform that has the potential to provide unprecedented compute capacity to future organizations and average users [8]. The trend in multicore processors indicates that all future processors will be multicore, and hence future cloud systems are expected to have their nodes and clusters based on multicore processors [60]. So processor scheduling in future systems will most likely be multicore processor scheduling. Therefore, multicore scheduling is fundamental to future cloud computing performance. Also, due to the multicore revolution, a considerable portion of large applications will be parallel programs. From the literature, we can see that gang scheduling is a dominant strategy for scheduling parallel programs that require frequent synchronization.

Among the popular gang scheduling algorithms, AFCFS alleviates the problem of low processor utilization, but is susceptible to starvation. LGFS, as claimed in the literature [42], outperforms AFCFS under large loads, but is again susceptible to starvation. To avoid starvation, these algorithms adopt a process migration policy. Process migration in this context means migrating gangs between multicore systems. Migrating gangs may not even be possible between two heterogeneous multicore systems, and is generally expensive even between two homogeneous systems [39, 56]. Gang migration can reduce the overall wait time of the migrating gang, but it does not guarantee the elimination of starvation. These observations raise a question: can we design a gang scheduling algorithm with the following characteristics?

1. Freedom from starvation.
2. Predictable and acceptable response.
3. Better processor utilization.

Since LGFS favors large gangs, the smaller gangs are susceptible to starvation or will have longer wait times. This is unacceptable, particularly in a cloud environment where customer satisfaction hugely depends on fairness and predictable response time. In practice, the customers who receive slightly faster service (at the expense of others' long wait) may not be noticeably more satisfied [61]. But the customers who experience unpredictably long delays, on the other hand, will readily notice the unfairness and unpredictable response, and that could potentially drive the cloud business in a negative direction.
Therefore, in addition to fast response and high processor utilization, minimal variance in response time is extremely important for quality of service in cloud systems.

3.3 Contributions

This thesis has contributions in the following three categories:

1. Inventive
• A new multicore scheduler simulation framework
• A new gang scheduling algorithm with increased fairness

2. Creative
• A multicore scheduler simulator (expected to be released as open source software)

3. Experimental
• Performance study of the Linux and Solaris scheduling algorithms
• Performance study of two popular gang scheduling algorithms, AFCFS and LGFS
• Performance study of a new gang scheduling algorithm and comparison with AFCFS

From another angle, this thesis has contributions in theory, algorithms, implementation, and experiments. The scheduling framework is a theoretical abstraction of the scheduling system. The new gang scheduling algorithm is an algorithmic contribution. The scheduling simulator is an implementation of the framework. Finally, the experimental contribution is the performance study of five scheduling algorithms: the Linux and Solaris scheduling algorithms, and three gang scheduling algorithms.

These contributions have several benefits. The insights obtained from the experimental evaluation will help: (i) users, to effectively exploit the hidden power of the above-mentioned schedulers; and (ii) researchers, to design new and efficient multicore scheduling algorithms by combining the best ideas of the studied algorithms and perhaps adding new ones. The simulation tool can be used to evaluate any newly designed multicore scheduling algorithm under various conditions before it is adopted for real systems.

The proposed gang scheduling algorithm is simple, fair, and gives predictable performance. Such predictable performance is attractive from the service point of view. Also, the algorithm is scalable, as it solves the starvation problem locally without using process migration. High performance, fairness, and scalability are attractive properties for cloud computing. Therefore, the algorithm is applicable to cloud systems built from multicore processors.

3.4 Research Methodology

The methodology we followed is generally referred to as constructive research methodology, where the construct could be a new theory, algorithm, model, software, system, or framework. This methodology is most common in engineering and computer science, particularly in systems research. Constructive research methodology involves innovative modeling, design, implementation, and experimentation.

Multicore scheduling is a challenging problem, and we determined to explore it in a systematic fashion, using a combination of theory and practice with an experimental approach. First, the focus is on thoroughly understanding the theory behind processor scheduling by studying and evaluating existing scheduling schemes. Second, based on this understanding of the literature, a multicore scheduler simulation framework is designed and implemented, with components that can be useful in implementing a new multicore scheduling algorithm. Third, the components necessary for studying the performance of multicore scheduling algorithms are determined and added to the framework.
Finally, the functionality of the proposed framework is demonstrated by simulating five scheduling algorithms, and then illustrating the performance of the simulated scheduling algorithms through performance monitoring and performance metrics.

The five major steps involved in the methodology are:

1. Literature survey
2. Design and implementation of the multicore scheduler simulation framework
3. Identification and implementation of performance monitoring components and performance metrics
4. Implementation of three popular classes of scheduling algorithms: Linux, Solaris, and gang scheduling
5. Conducting simulation experiments on five scheduling algorithms and illustrating their performance results

3.5 Summary

In this chapter, we presented the motivation for our contributions, listed the contributions, and described the research methodology that we followed. With this background, we are ready to present the major contribution of this thesis, the multicore scheduler simulation framework, in the next chapter.

Chapter 4

Multicore Scheduler Simulation Framework

This chapter presents the architecture of the multicore scheduling simulation (MSS) framework, which is the major contribution of this thesis. The primary use of this framework is to help researchers implement and simulate multicore scheduling algorithms and to evaluate their performance visually and statistically. Before presenting the framework, we briefly explain the simulation technique we used.

4.1 Simulation

Computer simulation is a technique to model and observe the behavior of some real or imagined system over time [62]. A simulator is simply a computer program that transforms the state of the system at discrete time points over a specified period of time. Simulation is widely used to study the dynamic behavior of complex systems.

Based on how the system state is modeled and simulated, computer simulations are classified as either continuous or discrete. If the state variables change continuously over time, the simulation is called continuous; if the state variables change only at discrete times, it is called discrete. In reality, most systems are a combination of both. However, depending on the purpose, most systems are simulated either as continuous or as discrete, and rarely as a hybrid of both.

Discrete simulation is further divided into time-stepped and event-driven simulation, based on how the simulation time is advanced and the system state is updated. In time-stepped simulation, the system state is updated at every time step. In event-driven simulation, the system state is updated at the occurrence of events. Discrete event simulation consists of an events list, a simulation clock, and an event scheduler. For instance, in simulating the behavior of a queue at a bank teller, the number of customers arrived and the number of customers served are state variables, and they are updated at the occurrence of events in the system: the number of customers arrived is updated when a customer arrives at the bank, and the number of customers served is updated when the teller serves a customer. Simulation continues until either the events list becomes empty or the simulation time ends. In discrete event simulation, the simulator maintains a queue of events (also called the event list) sorted by the simulated time at which they should occur.
The simulation clock maintains the simulation time, and it is advanced to the time of occurrence of the next event in the event list. Since it is not important to execute the simulation in real time, the advancement of the simulation time can be the same as, faster than, or slower than real time. For example, in a simulation of humans evacuating a building, the queue buildup can be visualized faster than real time; the current flow through an electric circuit can be simulated slower than real time; and in-training simulations (for example, flight simulators) can run at real-time speed. The event scheduler executes events from the events list, and the system state changes at the occurrence of each event. We use both discrete time-stepped and discrete event simulation to implement different components of our framework.

4.2 Multicore Scheduler Simulation

From a systems point of view, multicore scheduling essentially has two tasks: maintaining the load among the cores, and multitasking threads on each core. Generally, the first task is referred to as load balancing and the second as local scheduling. We maintain these abstractions in our scheduling framework. Load balancing manages the jobs across the cores, and local scheduling directs the core to switch between threads. Thread switching (or context switching), say from Ti to Tj, requires saving the context of Ti and loading or restoring the context of Tj. Also, certain tasks must be performed when a thread completes its execution. In the simulation context, these tasks amount to updating the simulation system state. In our framework, we provide generic routines for these tasks.

In essence, implementing the load balancing and local scheduling routines is the programming effort needed to use our simulator to study the performance of a new scheduling algorithm. To reduce the effort of implementing a scheduler further, we provide a default load balancing routine and sample local scheduling routines. These routines can be suitably modified to implement a new scheduling algorithm, unless the new algorithm is completely novel and does not follow the load balancer and local scheduler structure; even in that case, the entire scheduler can be designed from scratch with little effort using our framework. In any case, from our experience, we are confident that once the logic of a new scheduling algorithm is completely understood, implementing it in our simulator is straightforward.

With this background on the multicore scheduler, we next explain the overall multicore scheduler simulator, which is a part of the framework. The simulator is illustrated in Fig. 4.1. It has seven components: simulation clock, simulation manager, arrival handler, arrival queue, scheduler, state, and trace. The scheduler has four routines: load balancer, local scheduler, context switch handler, and completion handler.

[Figure 4.1: The Scheduler. The simulation manager advances the simulation clock; the arrival handler adds new jobs to the arrival queue; the scheduler updates the state and the trace.]

The simulation of the system basically involves updating the state of the system at every simulation time point and incrementing the simulation clock. Simulation clock advancement and state update could be done in an integrated fashion, but for a modular design of the scheduling framework we decided to keep these two tasks separate.
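To make the event-driven part of this design concrete, the following is a minimal sketch of how a simulation manager can advance the clock through a time-sorted event list. It is written in Java, the language of our simulator, but the class and method names (Event, SimulationManager, and so on) are illustrative stand-ins, not the actual simulator API.

    import java.util.PriorityQueue;

    // A minimal discrete-event simulation core: events are ordered by their
    // simulated occurrence time, and the clock jumps from event to event.
    abstract class Event implements Comparable<Event> {
        final long time;                        // simulated time of occurrence
        Event(long time) { this.time = time; }
        public int compareTo(Event other) { return Long.compare(time, other.time); }
        abstract void execute(SimulationManager sim);   // updates system state
    }

    class SimulationManager {
        private final PriorityQueue<Event> eventList = new PriorityQueue<>();
        private long clock = 0;                 // the simulation clock

        void schedule(Event e) { eventList.add(e); }

        // Run until the event list is empty or the simulation horizon is reached.
        void run(long endTime) {
            while (!eventList.isEmpty() && eventList.peek().time <= endTime) {
                Event next = eventList.poll();
                clock = next.time;              // advance clock to the next event
                next.execute(this);             // state changes only at events
            }
        }

        long now() { return clock; }
    }

A time-stepped component can be layered on top of this loop by having a recurring tick event reschedule itself at a fixed interval.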
In our framework, we designed a component called the simulation manager to advance the simulation clock and to initiate the appropriate routines to update the system state. Updating the simulation system state typically involves recording the arrival of new jobs and updating the state related to scheduling. To implement the arrival of new jobs, we introduced a queue called the arrival queue and implemented a routine to register newly arrived jobs in it. Updating the state related to scheduling depends on the scheduler logic, and it must be done as part of the implementation of the scheduler.

The simulated executions of jobs are recorded as a simulation trace, and a trace entry is recorded at every scheduling point. To make this task generic, we have standardized the format of the execution trace and implemented a routine to add trace entries appropriately during the simulation. This task of updating the trace is handled automatically by the context switch handler routine. With this background, we now introduce the architecture of the multicore scheduler simulation framework.

4.3 Architecture of the MSS Framework

First we introduce some basic terminology used in our framework. Hereafter, to avoid confusion about what the term processor really refers to in the simulation context, we avoid its usage in the rest of the thesis. (Before the multicore era, the term processor referred to a single execution unit, and was therefore used synonymously with central processing unit (CPU). Now, with multicore technology, a processor has several execution units.)

4.3.1 Terminology

• Core: The hardware execution unit.
• Chip: An integrated circuit containing one or more cores.
• Machine: A collection of chips designed to work together.
• Thread: The smallest unit that can be scheduled to a core for execution.
• Gang: A set of parallel threads that are executed together to achieve a task.

A higher-level architecture of the MSS framework is given in Fig. 4.2. The framework has five main logical components: workload generator, multicore machine, multicore scheduler, execution trace, and performance calculation engine. We explain them next.

[Figure 4.2: Architecture of the MSS Framework. The workload generator feeds the scheduler, which dispatches threads to the cores of the machine's chips; core execution is recorded as a trace, which drives the performance calculation engine and the activity profile generator.]

4.3.2 Workload Generator

We implemented the workload generator to generate two types of workloads: threads and gangs. Threads are generated for traditional applications and gangs for parallel applications. The gangs of parallel threads typically execute in a synchronized manner.

The inputs for generating a workload of threads are the number of threads, the mean arrival rate, and the mean service rate. To generate a workload of gangs, the number of gangs is given instead of the number of threads; the number of threads within each gang is generated uniformly in the range 2 to M, where M is the number of cores. The output is a set of threads or gangs, depending on the generator chosen. Each thread has a unique id, arrival time, execution time, I/O points, priority, and application class. Similarly, each gang has a gang id, arrival time, execution time, and a set of parallel threads with their own ids. A sketch of this generation scheme is given below.
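The following hedged sketch illustrates the generation scheme just described, drawing Poisson arrivals (exponential inter-arrival times) and exponential execution times by inverse-transform sampling. The SimThread record is a simplified stand-in for the simulator's actual job attributes: I/O points, priority, and application class are omitted, and a gang generator would additionally draw a uniform gang size in [2..M].

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Simplified workload generator: Poisson arrivals, exponential service times.
    class WorkloadGenerator {
        private final Random rng = new Random();

        // Inverse-transform sampling of an exponential with the given rate.
        private double exp(double rate) {
            return -Math.log(1.0 - rng.nextDouble()) / rate;
        }

        record SimThread(int id, long arrivalTime, long executionTime) {}

        List<SimThread> generateThreads(int count, double arrivalRate, double serviceRate) {
            List<SimThread> threads = new ArrayList<>();
            double clock = 0.0;
            for (int id = 0; id < count; id++) {
                clock += exp(arrivalRate);          // exponential inter-arrival gap
                long execution = Math.max(1, Math.round(exp(serviceRate)));
                // Real-valued times are rounded to integer simulation-clock units.
                threads.add(new SimThread(id, Math.round(clock), execution));
            }
            return threads;
        }
    }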
The I/O points of a thread are the times at which the thread initiates I/O. For example, assume that a thread has an execution time of 50 and will go for I/O at times 5, 20, and 30. In this case, I/O requests will be invoked after the thread has executed for 5 units, 20 units, and 30 units. The number of I/O points is generated uniformly within the execution time range, and the I/O wait time is generated uniformly within a range specified in the configuration.

4.3.3 Multicore Machine

The multicore machine is the computing unit, organized as a hierarchy with the machine at the highest level, as shown in Fig. 4.3.

[Figure 4.3: A Multicore Machine. Each chip contains cores with private L1 caches and a shared L2 cache; the chips share an L3 cache and main memory.]

A multicore machine contains one or more chips, and each chip contains two or more cores. The core is the physical execution unit. In practice, a core is capable of executing one or more threads in an interleaved way, referred to as hyper-threading. In our simulator, we model a core as executing one thread at a time. We believe that, with a suitable scheduling policy, the effect of hyper-threading can be approximately simulated using single-threaded cores.

A multicore machine has a hierarchy of cache memories. Current multicore systems have three levels of cache, referred to as L1, L2, and L3. As shown in Fig. 4.3, L1 is local to each core, the cores in a chip share that chip's L2, and L3 is shared by all the cores in the machine.

4.3.4 Multicore Scheduler

As explained earlier, our framework implements a multicore scheduler with two logical components, a load balancer and a local scheduler. The load balancer is responsible for maintaining a desired balance of the system load. This task involves dispatching new jobs to the appropriate local scheduler, and migrating jobs from one local scheduler to another when necessary. The local scheduler is responsible for scheduling jobs to the cores for execution. Generally, each core is assigned a local scheduler, but other choices are possible; in our simulation, local scheduling is implemented with centralized control.

In the simulation context, the local scheduler primarily chooses a job for execution, determines the amount of execution time, calculates the job's progress rate, and produces the trace. The progress rate of a job is the crucial design factor affecting the accuracy of execution, and it depends on several factors such as the execution speed of the core and the contention for shared resources. We have used a simple cache contention model, but it can easily be replaced with an implementation of a more refined model.

4.3.5 Execution Trace

During the simulation, the execution trace is recorded at every context switch, both to generate the activity profile and to compute the performance metrics. Two types of traces are collected: the execution trace and the I/O trace. The execution trace is a collection of quintuples, as shown in Fig. 4.4:

<cid, tid, Ex-start, Ex-end, S>
<0, 0, 4, 12, 0>
<1, 1, 9, 13, 0>
<1, 1, 13, 17, 2>
<3, 4, 16, 19, 2>
<0, 2, 12, 22, 0>
<4, 354, 12459, 12459, 3>

Figure 4.4: Sample Execution Trace

Each quintuple has a core id (cid), a thread id (tid), an execution start time (Ex-start), an execution end time (Ex-end), and a status (S). The status is 0 if the thread was preempted by quantum expiration, 1 if it was preempted by a higher priority thread, 2 if it went for I/O, and 3 if it completed.
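To show how this standardized trace format might look in code, here is a small, assumed representation of the two record types together with the status encoding. The enum and record names are illustrative rather than the simulator's actual types.

    // One execution-trace entry: which core ran which thread, over what
    // simulated interval, and why the interval ended.
    enum EndStatus {
        QUANTUM_EXPIRED,   // S = 0: preempted by quantum expiration
        PREEMPTED,         // S = 1: preempted by a higher priority thread
        IO_WAIT,           // S = 2: thread went for I/O
        COMPLETED          // S = 3: thread finished execution
    }

    record TraceEntry(int coreId, int threadId, long exStart, long exEnd, EndStatus status) {}

    // One I/O-trace entry: the thread and its I/O wait interval.
    record IoTraceEntry(int threadId, long ioStart, long ioEnd) {}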
The I/O trace is a collection of triples, as shown in Fig. 4.5; each triple has a thread id, an I/O start time (I/O-start), and an I/O end time (I/O-end):

<tid, I/O-start, I/O-end>
<4, 19, 49>
<4, 52, 69>
<19, 52, 76>
<13, 43, 83>
<34, 85, 87>
<681, 12045, 12070>

Figure 4.5: Sample I/O Trace

4.3.6 Performance Calculation Engine

The performance study primarily aims to determine how well the algorithm satisfies certain criteria such as response time, fairness, and utilization of resources. The performance calculation engine calculates these values for a given set of data, and the results can be passed to the performance observation window for users to study.

The performance criteria widely used to study scheduling algorithms are the five metrics: throughput, CPU utilization, turnaround time, waiting time, and response time [37, 38]. In addition, we have included three more measures: (i) interactive response time, (ii) bypass count, and (iii) slowdown factor. Interactive response time is introduced to observe the interactive response of tasks; it is defined as the time from when a thread is ready for execution to the subsequent start of execution. To capture starvation, we introduce the bypass count metric: the bypass count indicates how many threads have bypassed a waiting thread in scheduling, and is thus an indication of unfair scheduling. We use a bypass count graph to see the level of fairness that a scheduling algorithm can assure. Finally, a relative performance measure is useful for resource allocation purposes; for that, we include the slowdown factor. The following performance metrics are supported in the proposed framework.

1. Throughput: the total number of jobs that complete execution in one unit of time. Assuming the simulation starts at time 0 and n jobs complete in a period Ts, the throughput TP is computed as follows:

    TP = n / Ts    (4.1)

2. Core utilization: the percentage of time the core spends executing jobs. Let Ts be the total simulation time and Tb the amount of time the core is busy. Then the utilization Uc of core c is computed as follows:

    Uc = (Tb / Ts) × 100%    (4.2)

3. Core idle time: the percentage of time the core is idle, without jobs to execute. The idle time Idlec of core c is computed as follows:

    Idlec = (1 − Tb / Ts) × 100%    (4.3)

4. Core wait time: the total time that the thread waits for a core.

5. I/O wait time: the total time taken by all I/O waits of the thread.

6. Wait time: the sum of the core wait time and the I/O wait time of the thread.

7. Turnaround time: the sum of the wait time and the execution time of the thread.

8. First response time: the time from submission to the start of execution of the thread.

9. Interactive response time: the time from when the thread is ready for execution to the subsequent start of execution.

10. Slowdown factor: the ratio between a thread's turnaround time and its execution time. This metric measures the delay suffered by a job relative to its actual execution time. If wti and exi are, respectively, the wait and execution times of thread i, then the slowdown factor si of thread i is computed as follows:

    si = (wti + exi) / exi    (4.4)

11. Bypass count graph: every thread is associated with a bypass counter, which is incremented whenever the thread is bypassed by another thread in scheduling.
The bypass count graph shows how many threads have bypassed a waiting thread in scheduling; it gives a sense of the fairness that the scheduling algorithm can assure.

Let x1, x2, ..., xn be the observed values; the average x̄ and the standard deviation σ are computed as follows:

    x̄ = (1/n) Σ_{i=1}^{n} xi    (4.5)

    σ = √( (1/n) Σ_{i=1}^{n} (xi − x̄)² )    (4.6)

4.3.7 Statistical Measures

For these metrics, where applicable, we calculated the minimum, maximum, average, and standard deviation. Traditionally, averages and percentages are used to study performance in this context: throughput and utilization are computed as percentages, and averages are computed for turnaround time, waiting time, and response time. We believe that better measures than the average must be used in analysing these metrics.

As cloud computing is fundamentally a service-oriented system, its primary goal is to provide quality service to its customers. Although the term quality of service is often used and directly related to response time, it is not well interpreted in the context of computing and communication systems. As indicated in [61], the quality perceived by customers need not be directly related to minimal response time; that study indicates that users are often unaware of quality differences until they cross a certain threshold. Therefore, quality of service need not always be tied to the widely used metrics such as minimal response time or minimal average response time. Thus, we believe that a predictable response time is a more appropriate measure of quality of service in the cloud computing context than measures such as average response time or resource utilization. Predictable response can be measured using variance or standard deviation. Therefore, we decided to include the standard deviation (i.e., the square root of the variance) in our framework; we believe the standard deviation is a better measure of predictability than the average value. A sketch of these statistical computations is given below.
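As a rough illustration of how the performance calculation engine can derive these statistics, the following assumed helper computes the mean and (population) standard deviation of equations (4.5) and (4.6), and the per-thread slowdown factor of equation (4.4). It is a sketch, not the engine's actual code.

    import java.util.List;

    // Minimal statistics helpers mirroring equations (4.4)-(4.6).
    class PerformanceStats {

        static double mean(List<Double> xs) {
            double sum = 0.0;
            for (double x : xs) sum += x;
            return sum / xs.size();
        }

        static double stdDev(List<Double> xs) {
            double avg = mean(xs);
            double sq = 0.0;
            for (double x : xs) sq += (x - avg) * (x - avg);
            return Math.sqrt(sq / xs.size());   // population standard deviation
        }

        // Slowdown factor: turnaround time (wait + execution) over execution time.
        static double slowdown(double waitTime, double executionTime) {
            return (waitTime + executionTime) / executionTime;
        }
    }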
4.4 Activity Profile Generator

The activity monitor provides a visualization of how the algorithm schedules the threads among the cores, and the activity profile generator is responsible for providing the data for this visualization. It derives the data from the execution trace, generating data for every clock tick. The data contains the core status, idle or occupied; if occupied, it also contains information about the thread and its status: a newly arrived thread, a preempted thread, or a thread migrated from a different core. The activity profile generator differentiates these states of the core by assigning different colors. The activity monitor window visualizes the derived core-to-thread allocation data. Using this component, we can visually observe core utilization, load balancing, and individual core statistics.

4.5 User Interface

The multicore scheduling simulator has three main user windows, and the interface within each window is organized as hierarchical panels. We explain these windows next.

4.5.1 Performance Parameter Setting Window

The performance parameter setting window is used to configure the parameters for the simulation. It consists of two screens. The first screen, shown in Fig. 4.6, is used to set the workload generation parameters. The inputs for the workload generator are the number of jobs, the arrival time distribution, the mean arrival rate, and the job execution time range.

[Figure 4.6: Parameter Setting Window]

The parameter setting window has a provision to create more than one workload. The window takes the number of processes, the arrival rate distribution, the arrival rate, and the execution range as the parameters for each workload. Once these parameters are configured, the workload can be added to the workload list by pressing the 'Add to Load' button, which adds it to the list of workloads and displays it on the right side of the window. Using this option, we can configure more than one workload.

The simulations run window, shown in Fig. 4.7, is designed to accept the input for creating simulation runs. Using this window, several simulation runs can be configured. The parameters required for each run are the number of cores, the workload selection, and the scheduling algorithm.

[Figure 4.7: Simulations Run Window]

Using the simulation run window, we can create simulation runs with the different combinations listed below:

1. number of cores vs. workload, with the same scheduling policy
2. number of cores vs. scheduling policy, under the same workload
3. workload vs. scheduling policy, when executing with the same number of cores

This simulation run setting feature simplifies the effort involved in simulating the scheduling algorithms under various conditions.

4.5.2 Performance Observation Window

The performance observation window, shown in Fig. 4.8, offers various performance metrics that can be represented as charts. The window has a list of performance metrics which can be chosen to see the computed values. Additionally, there is an option to choose two different simulation runs and compare their results. The performance metrics can be analysed by varying the number of cores and the arrival rate.

[Figure 4.8: Performance Observation Window]

4.5.3 Activity Monitor Window

The activity monitor window is an interesting component of this simulator. This screen shows the core usage in a graphical representation. With this screen, shown in Fig. 4.9, we can analyse the following:

1. Core utilization
2. Thread migration
3. The execution thread list

Using the activity monitor window, we can visualize how the threads are distributed among the cores and how effectively the load is balanced. To see individual core performance, a core monitor child window is attached to each core to show its statistics. The child window is shown in Fig. 4.10.

[Figure 4.9: Activity Monitor Window]

[Figure 4.10: Core Monitor Window]

The core monitor window has three parts. The first part (left) displays the threads running on that core using different colors. The second part (upper right) displays statistics on the busy and idle time of the core. A thread scheduled on that core could be new, preempted, or migrated from another core; the ratio of these three types of threads scheduled on that core is displayed in the third part (lower right) of the window. This part shows how many threads were migrated from other cores due to load balancing. Such visual analysis is sometimes useful for capturing unusual behavior and patterns.

4.6 Summary

In this chapter, we presented a new framework for the simulation of multicore scheduling algorithms. The framework is flexible, and it serves as a base for designing new scheduling algorithms and for conducting experimental studies of existing scheduling algorithms for multicore processors.

Chapter 5

Case Studies - Linux and Solaris Scheduling Algorithms

To test and illustrate the functionality of the proposed simulator, we implemented recent versions of the Linux and Solaris 10 scheduling algorithms. We first present the algorithms, and then describe the simulation experiments and observations. The recent version of the Linux scheduler is called the completely fair scheduler (CFS), implemented starting from Linux 2.6.23; the Solaris scheduler is referred to as the kernel dispatcher.

Although Linux and Solaris are popularly used operating systems, documentation precisely describing their scheduling algorithms is rarely published. We had to reconstruct the algorithms using information obtained from different sources.

Both schedulers have two logical components, a load balancer and a local scheduler. The load balancer is responsible for maintaining a desired balance of the system load. This task involves dispatching new threads to the appropriate local scheduler, and migrating threads from one local scheduler to another when necessary. The local scheduler is responsible for scheduling threads to the cores for execution. We present the load balancers of the two algorithms together first, and then the local scheduling algorithms separately.

5.1 Load Balancing

Load balancing involves four factors: (i) initial placement; (ii) migration criteria; (iii) migration policy; and (iv) frequency of balancing.

• Initial Thread Placement: Initial thread placement in both Linux and Solaris is the same: new threads are dispatched to the lightly loaded cores.

• Load Balancing Criteria: The difference in the number of threads between any two cores should be less than one.

• Migration Policy: Linux migrates threads from heavily loaded cores to lightly loaded cores to satisfy the load balancing criteria.
Solaris, when the choice arises, moves a thread to a core on a different chip rather than to a core on the same chip, to reduce cache conflicts.

• Balancing Frequency: Load balancing in Linux is done every 200 ms, and load balancing in Solaris is done every 100 ms.

Next, we present the local scheduling algorithms of Linux and Solaris. The threads in both algorithms are classified as real-time threads and normal threads. Here, real-time implies that these threads have higher priorities than the normal threads, and therefore real-time threads are always executed before normal threads. (The term real-time in this context is a misnomer: these threads have no specific deadlines to meet, and no traditional real-time scheduling algorithms such as rate monotonic (RM) or earliest deadline first (EDF) are used to schedule them.) We first present how the real-time threads are handled in both Linux and Solaris.

5.1.1 Real-Time Scheduler

Whenever a real-time thread arrives, it preempts the normal thread, and the real-time thread is scheduled to run. When a real-time thread is executing and another real-time thread with higher priority arrives, the current thread is preempted and the higher priority thread is scheduled. Within the same priority level of real-time threads, Linux uses either the First-In-First-Out (FIFO) or the Round Robin (RR) scheduling policy, and Solaris uses FIFO.

5.2 Linux Scheduler - Completely Fair Scheduler (CFS)

The basic idea behind CFS is to ensure a fair share among threads over the overall execution. This is achieved through quantum allocation, in which the execution time plays a key role. CFS maintains the amount of time that a thread has utilized the core so far, referred to as its vruntime. The thread with the smallest vruntime has the highest preference to be selected next for execution. The calculation of the quantum and of vruntime is given below.

Instead of a run queue, for efficiency, CFS maintains a data structure called a red-black tree, sorted by the vruntime key. The scheduler picks the left-most child of the tree, the thread with the smallest vruntime, for execution. The pseudocode of the CFS algorithm is given in Table 5.1.

Data Structures: Red-black tree RBT; Thread Ti
1. while (RBT ≠ empty) do
2.   select the leftmost thread Ti from RBT
3.   compute the time quantum
4.   schedule Ti
5. end while

Table 5.1: Completely Fair Scheduling Algorithm

5.2.1 Calculation of Quanta

The quantum value calculation plays a key role in CFS. The time quantum Q[Ti] of a thread Ti is calculated using the following formula:

    Q[Ti] = ( Ti.weight / Σ_{Tj ∈ RBT} Tj.weight ) × P    (5.1)

where Ti.weight is the weight value corresponding to the nice value of Ti (every nice value is mapped to a weight value), and

    P = sched_latency, if n ≤ nr_latency
    P = min_granularity × n, otherwise    (5.2)

where n is the number of threads in the RBT, and sched_latency, nr_latency, and min_granularity are constants. In the current implementation, these values are 6, 8, and 0.75, respectively [63]. The details of how these values were determined, and their significance, are not documented.

(In Linux, the priority of a thread is controlled with the nice value, which ranges between -20 and 19; a lower value corresponds to a higher priority. The user-level command nice can be used to lower the priority of a thread, i.e., to be nice to other users.)
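The following is a minimal sketch of this quantum computation, assuming threads expose a weight field as described above. The constants mirror the values cited for the current implementation, interpreting sched_latency and min_granularity in milliseconds, which is an assumption; the class itself is illustrative, not kernel code.

    import java.util.Collection;

    // Sketch of the CFS time-quantum computation from equations (5.1)-(5.2).
    class CfsQuantum {
        static final double SCHED_LATENCY = 6.0;      // assumed unit: ms
        static final double MIN_GRANULARITY = 0.75;   // assumed unit: ms
        static final int NR_LATENCY = 8;

        // vruntime is the key by which the red-black tree orders the threads.
        record SchedThread(int id, double weight, double vruntime) {}

        // Scheduling period P: stretched once there are many runnable threads.
        static double period(int n) {
            return (n <= NR_LATENCY) ? SCHED_LATENCY : MIN_GRANULARITY * n;
        }

        // Q[Ti] = (Ti.weight / sum of all weights in the tree) * P
        static double quantum(SchedThread t, Collection<SchedThread> runnable) {
            double totalWeight = 0.0;
            for (SchedThread s : runnable) totalWeight += s.weight();
            return (t.weight() / totalWeight) * period(runnable.size());
        }
    }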
5.2.2 Calculation of vruntime

At every clock tick, the scheduler updates the vruntime of the executing thread and also decreases its remaining quantum. Preemption of a thread occurs when its quantum expires or when the RBT contains a thread with a smaller vruntime. The vruntime of a thread Ti is computed as follows:

    Ti.vruntime = ( weight_0 / Ti.weight ) × Ti.runtime    (5.3)

where weight_0 is the weight value corresponding to a nice value of 0, and Ti.runtime is the execution time consumed so far by the thread Ti.

5.3 Solaris Scheduler

The Solaris 10 kernel dispatcher does both load balancing and local scheduling. Solaris implements two schedulers, a fair share scheduler and a default scheduler. The fair share scheduler is typically used in server environments. We have implemented the default scheduler, which we describe next. For simplicity, in Solaris, we consider just threads being scheduled for execution. (In the actual implementation, each application process in Solaris may contain one or more application threads, and each application thread is scheduled, or mapped, to a virtual core called a light weight process (LWP). Each LWP is implemented using a kernel thread, and these kernel threads are eventually scheduled and executed by the physical cores.)

Solaris maintains a priority range of 0-169. The priority range 160-169 is reserved for interrupts, and the rest of the priorities are assigned to different scheduling classes (see Appendix). Solaris manages threads in the following six scheduling classes:

1. Time share (TS)
2. Interactive (IA)
3. Fair share scheduling (FSS)
4. Fixed priority (FX)
5. Real time (RT)
6. System (SYS)

By default, threads created by the window manager are assigned to the IA class for better interactivity, and the rest are assigned to the TS class. System threads are created by the operating system; other threads are created using different levels of system privileges. The FSS class is used when the fair share scheduler is invoked. The priority of a thread may be specified or inherited from its parent. In the fixed priority class, threads keep the same priority throughout their execution. RT is the highest priority thread class; its threads require attention right away and need to be scheduled immediately. After the threads in the real-time class, the kernel threads get attention. Finally, the threads in the TS, IA, and FX classes are scheduled. The scheduler always chooses the highest priority thread for execution. The scheduling class priority ranges are summarised in Table 7.1 (see Appendix).

5.3.1 The Default Solaris Scheduler

In Solaris, every scheduling class except RT has a local scheduling queue for every core. For the RT class, a global kernel preemption queue is maintained for every chip.

After choosing a thread for scheduling, the next task is to obtain the quantum value. For that, Solaris maintains a set of tables, called dispatch tables (see Appendix), from which the quantum value is obtained. The scheduler also provides fairness among the threads by boosting priorities up or down in response to the following events (a sketch of this adjustment logic follows the list):

• A thread successfully completes its execution for the specified time quantum. Here, the thread's priority is boosted down to give fairness to the other threads.

• A thread comes back from an I/O wait. Here, its priority is boosted up so that it will execute soon.

• A thread has been waiting in its ready queue beyond a certain threshold time period. Here, its priority is boosted up to give it a chance to execute soon.
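To make the boosting mechanism concrete, here is a hedged sketch of how a simulated dispatcher might adjust priorities on these three events. The step sizes and method names are invented placeholders; in real Solaris, the new priorities and the wait threshold come from the per-class dispatch tables rather than from fixed increments.

    // Illustrative priority adjustment on Solaris-style scheduling events.
    class DispatcherSketch {
        static final int MAX_USER_PRIORITY = 59;   // top of the TS/IA range
        static final int BOOST_STEP = 10;          // assumed step size

        static class SimThread {
            int priority;
            long readySince;    // when the thread entered the ready queue
        }

        // Tick processing path: a thread used up its full quantum.
        void onQuantumExpired(SimThread t) {
            t.priority = Math.max(0, t.priority - BOOST_STEP);       // boost down
        }

        // A thread returns from an I/O wait: boost it up to run soon.
        void onIoReturn(SimThread t) {
            t.priority = Math.min(MAX_USER_PRIORITY, t.priority + BOOST_STEP);
        }

        // Update processing path: boost threads that have waited too long.
        void onUpdate(SimThread t, long now, long waitThreshold) {
            if (now - t.readySince > waitThreshold) {
                t.priority = Math.min(MAX_USER_PRIORITY, t.priority + BOOST_STEP);
            }
        }
    }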
These scheduling subtasks are performed by specific routines called tick processing and update processing (see Appendix). Tick processing is responsible for managing quanta and invoking preemption. Update processing is responsible for boosting priorities up or down. The dispatch tables are used to obtain the new quantum, the waiting threshold, and the new priority.

With this setup, the scheduling policy is: at any scheduling time point, choose the highest priority thread in the system and schedule it for execution. If a higher priority thread arrives while a lower priority thread is executing, the lower priority thread is preempted to allow the higher priority thread to execute. A high-level description of the Solaris scheduling algorithm is given in Table 5.2.

Data Structures: Scheduling queues DQ_TS, DQ_IA, DQ_FX, KPQ_RT; Thread Ti
1. while (KPQ_RT, DQ_TS, DQ_IA, or DQ_FX ≠ empty) do
2.   if (KPQ_RT ≠ empty)
3.     select the highest priority thread Ti from KPQ_RT
4.     get the time quantum Q[Ti] from the dispatch table
5.     schedule Ti
6.   else
7.     pick the highest priority thread Ti from DQ_TS, DQ_IA, DQ_FX
8.     get the time quantum Q[Ti] from the dispatch table
9.     schedule Ti
10.  end if
11. end while

Table 5.2: Solaris 10 Scheduling Algorithm

5.4 Simulation Experiments

To illustrate the functionality and use of the proposed simulator, we conducted two sets of experiments.

1. Experiment 1: In this experiment, the workload is fixed and the number of cores is varied. The simulation parameters used for this experiment are given in Table 5.3.

Number of threads: 5000
Mean arrival rate: 2.5
Arrival distribution: Poisson
Execution time distribution: Exponential
Number of cores: 10, 50, 100, 200, 500

Table 5.3: Simulation Parameters

(We use a generic unit for the arrival rate. The inter-arrival times derived from the distribution are real numbers; they are scaled to integer units of the simulation clock. For example, the inter-arrival time 0.4 is scaled to 4 simulation clock units.)
The consistent behavior is due to their load balancing which periodically runs to evenly distribute the workload 57 j — Unu< A vyapeT -j^ atsyr-d *Jne - -S olans A v y ^ T u n ; a r j j r g t r a j :;3,ooo ho.om- V -, \n lOO.OCtfl ■ n « . S3.SJO- ') \ J£ 600 0 0 0 5 ) 000 . 40,000- \ \ \ % 75 o f cores Figure 5.1: Average Turnaround time: Linux and Solaris Scheduling A lgorithms (by varying the number of cores) among the cores. • Observation on Average Turnaround Time: The average turnaround tim e graphs of Linux and Solaris scheduling algorithms are shown in Fig. 5.1. The average turnaround tim e gradually decreases as the num ber of cores increases. Both Linux and Solaris algorithms show almost same behavior until 200 cores, and after 200 cores, Solaris performs b etter than Linux by a factor of 10 when the number of cores is 500. This is due to th e fact th a t Solaris boosts up the th read ’s priority when the thread is waiting in th e ready queue longer. • Observation on Standard deviation of Turnaround Time: The standard devia­ tion of turnaround tim e graphs of Linux and Solaris scheduling algorithm s are shown in Fig. 5.2. We observe th a t Solaris and Linux show similar behavior between 10 and 200 cores. As we increase th e number of cores, Solaris shows 58 Figure 5 .2 : Standard deviation of Turnaround time: Linux and Solaris Scheduling Algorithms (by varying the number of cores) b etter predictability than Linux. • Observation on Interactive Response Time: The average interactive response tim e graphs of Linux and Solaris scheduling algorithm s are shown in Fig.5.3. We observe th a t Solaris outperform s Linux consistently providing b etter interactive response time. This is because, whenever a th read stays in ready queue and reaches the maximum wait time, Solaris boosts up the thread priority. Also, whenever a thread returns from I/O , th e thread priority is boosted up so th a t it can be scheduled soon. • Observation on Standard deviation o f Interactive Response Time: The standard deviation of interactive response time graphs of Linux and Solaris scheduling algorithms are given in Fig. 5.4. The prediction of how soon the thread will be scheduled for execution, when it is ready, is measured by computing the standard 45.000.000 •• 4 2 ,5 0 0 ,0 0 0 ; 40.000.090 • 37-500,000 35 .000.000; \ 32.500.000-| Clock (tic k s ) 30 .0 0 0 .0 0 0 ; 27;509;000-; 25,000,030 22 500,000 • \ 20 .000,000 17.500,000 ■ \ 15,009.000 \ 12.500,000 •; 10;000;000 ’,500,000 5 000,000 2,500.000 25 50 75 ICO 125 150 1 75 2SC 225 250 275 303 325 350 375 400 425 450 475 500 Number of cores Figure 5.3: Average Interactive Response tim e : Linux and Solaris Scheduling Algo­ rithm s (by varying the number of cores) deviation in interactive response time. It is clearly visible that Solaris performs better than Linux by assuring b etter fairness to the threads. As mentioned before, the boosting of the priority by Solaris influences the predictability b etter than Linux. 5 .4.2 O b serv a tio n s on E x p e r im e n t 2 • Observation on Core Utilization: We observed th a t both algorithms keep the cores busy and there is no significant difference. • Observation on Turnaround Time: T he average turnaround tim e graphs of Linux and Solaris scheduling algorithm s are shown in Fig. 5.5. The average turnaround time constantly increases when th e arrival rate increases. 
From these experiments, it is clearly seen that Solaris performs better than Linux when the number of cores is 50, but the trend gradually converges as the number of cores increases.

• Observation on standard deviation of turnaround time: The standard deviation of turnaround time graphs of the Linux and Solaris scheduling algorithms are shown in Fig. 5.6. We observe that the deviation also increases as the arrival rate increases. Compared to Linux, the predictability of Solaris is always better.

• Observation on interactive response time: The average interactive response time graphs of the Linux and Solaris scheduling algorithms are shown in Fig. 5.7. Solaris consistently outperforms Linux, providing low interactive response time. As explained earlier, the priority boosting of Solaris heavily influences the interactive response time. This is the expected behavior.

[Figure 5.5: Average Turnaround Time: Linux and Solaris Scheduling Algorithms (varying the workload)]

[Figure 5.6: Standard Deviation of Turnaround Time: Linux and Solaris Scheduling Algorithms (varying the workload)]

[Figure 5.7: Average Interactive Response Time: Linux and Solaris Scheduling Algorithms (varying the workload)]

[Figure 5.8: Standard Deviation of Interactive Response Time: Linux and Solaris Scheduling Algorithms (varying the workload)]

• Observation on standard deviation of interactive response time: The standard deviation of interactive response time graphs of the Linux and Solaris scheduling algorithms are shown in Fig. 5.8. We observe that the predictability of Solaris is much better compared to Linux, and the trend is consistent.

These observations confirm our hypothesis, which is evidence that the implementation of the schedulers is fairly accurate.

5.5 Summary

In this chapter, we presented the simulation experiments on the Linux and Solaris scheduling algorithms. As expected from our hypothesis, the simulation experiments show that the Solaris scheduler offers better and more predictable response time than the Linux scheduler. From these experiments, we are confident that our scheduler simulator implementation is fairly sound.
Chapter 6

A Fair and Efficient Gang Scheduling Algorithm

The trend in multicore processors indicates that all future processors will be multicore, and hence future cloud systems are expected to have their nodes and clusters based on multicore processors. So processor scheduling in future systems will most likely be multicore processor scheduling. Therefore, we believe, multicore scheduling is fundamental to future cloud computing performance. Also, due to the multicore revolution, a considerable portion of large applications will be parallel programs. From the literature, we can see that gang scheduling is a dominant strategy for scheduling parallel programs.

6.1 Popular Gang Scheduling Algorithms

Among the gang scheduling algorithms, AFCFS and LGFS are the most popular, and we present them next.

6.1.1 AFCFS

The AFCFS scheduling algorithm places the gangs in the run queue in order of their arrival. Scheduling starts from the head of the queue, and the gangs that fit into the available cores are scheduled for execution. Unlike FCFS, which stops scheduling when it finds a gang that cannot fit into the free cores, AFCFS iterates over the whole run queue and schedules all the gangs that fit into the free cores. The AFCFS algorithm is given in Table 6.1.

Data Structures: RQ: queue of gangs
1. while (RQ ≠ empty) do
2.   for i = 1 to size of RQ do
3.     if RQ[i] fits in free cores then schedule RQ[i]
4.   end for
5. end while

Table 6.1: AFCFS Gang Scheduling Algorithm

6.1.2 LGFS

The largest gang first served scheduling algorithm orders the run queue by the size of the gangs, where the size of a gang is determined by its number of threads. LGFS schedules gangs from the head of the queue (the largest gang) as long as the next gang fits into the available free cores. LGFS favors the large gangs over the small gangs. This affects the response time of small gangs, making them wait longer for their turn; moreover, when a new large gang arrives, it may overtake these small gangs. The LGFS algorithm is given in Table 6.2.

Data Structures: RQ: queue of gangs
1. while (RQ ≠ empty) do
2.   sort RQ largest gang first
3.   for i = 1 to size of RQ do
4.     if RQ[i] fits in free cores then schedule RQ[i]
5.   end for
6. end while

Table 6.2: LGFS Gang Scheduling Algorithm

From the literature [53], we note that AFCFS achieves better response time than LGFS under lighter workloads with small gangs. Our experiments using the proposed simulator confirm this result; a sketch of the two selection policies is given below.
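The following is a minimal Java sketch of the two selection policies of Tables 6.1 and 6.2, assuming a Gang type that carries its thread count and a pool of free cores. It mirrors the pseudocode above rather than any production scheduler.

    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    // Sketch of the AFCFS and LGFS selection policies (Tables 6.1 and 6.2).
    class GangScheduling {

        record Gang(int id, int size) {}       // size = number of parallel threads

        // AFCFS: scan the arrival-ordered run queue and start every gang that fits.
        static int afcfs(List<Gang> runQueue, int freeCores) {
            Iterator<Gang> it = runQueue.iterator();
            while (it.hasNext()) {
                Gang g = it.next();
                if (g.size() <= freeCores) {   // gang fits into the free cores
                    freeCores -= g.size();     // schedule all of its threads
                    it.remove();
                }                              // otherwise skip it and keep scanning
            }
            return freeCores;                  // cores still free after the pass
        }

        // LGFS: order the run queue largest gang first, then do the same scan.
        static int lgfs(List<Gang> runQueue, int freeCores) {
            runQueue.sort(Comparator.comparingInt(Gang::size).reversed());
            return afcfs(runQueue, freeCores);
        }
    }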
Since LGFS favors 67 — AFCFS ■ LGFS 13.0SC 17.000 16.000 15,000 „ 14-000 | 13,000 “ 17.000 | 11,000 * 1 3 ,0 0 0 | 3,000 iX | «»» 3 5,050 4.000 3.000 2.000 1,45 1,50 1.55 1.50 1.65 1.73 1.75 1,85 1.90 1.95 2.00 2.05 2,10 2.15 2.20 2.25 230 235 2,40 2,45 2,50 Figure 6.1: Average Response Tim e for AFCFS and LGFS Algorithms the large size gangs, it pushes the small sized gangs for longer wait tim es which is reflected in their average response time. This result confirms the observation from the literature. Observation on Standard deviation in Response time: The standard deviation response time graphs of AFCFS and LGFS algorithm s are shown in Fig. 6.2. The predictability of AFCFS is worse than LGFS. Since the AFCFS favors the small size gang, the large size gangs have to wait longer for their turn for execution which in tu rn increase the deviation in response time. Observation on Average Core Utilization: T he average core utilization graphs of AFCFS and LGFS algorithms are shown in Fig. 6.3. LGFS performs b etter th an AFCFS, because it favors large jobs which fits into more num ber of cores and makes the core busy executing these larger jobs. As AFCFS favors smaller jobs, the possibility of core to stay idle is higher. 68 -A F C F S LGFS 12.000 < j; 11,000 1 0,000-i; I 5.000 |i v | S.COC i: a 7.000 |! I 5,0 0 0 1! i 4 ,0 j0 i: * 11 3,oao-; 2.000jj 1.000 ji 1.45 1.50 1.55 1.60 1.65 1.7D 1.75 1.60 1,65 1.90 1.95 2.00 2.05 2.10 2 15 2 20 2 25 2.30 2.35 2 40 2,45 2.5C Arrival R ate (A) Figure 6 .2 : Standard deviation of Response Time for AFCFS and LGFS Algorithms I— AFCFS ■ LGFS I i 1.45 ISO 1.55 1.60 1.55 1.70 1.75 IK 1.S5 l.M 1.9S 2,00 2 05 2.10 2,15 2,20 2 25 2.30 2.35 2.40 Figure 6.3: Core Utilization of AFCFS and LGFS Algorithms 69 2.45 2,50 From the observations shown in Fig. 6.2 and Fig. 6.3, we can see th a t LGFS is a preferred algorithm , despite A FC FS 7 low average response time. This is because LGFS utilizes the system resources b etter and offers b etter predictability in response time. This proves our earlier point th a t th e average response time is not a desirable metric. Fairness and predictability are particularly im portant th a t the expectation of users under light load is normally high, and failure to provide such guarantee even under light load could expose the system very badly. Therefore, it would be nice to have an algorithm which yields low average response time and standard deviation w ith high processor utilization. This is the m otivation for our algorithm presented next. 6.2 A N ew G ang Scheduling A lgorithm The gang scheduling algorithms AFCFS and LGFS are susceptible to starvation. To avoid starvation, these algorithms adopt a process migration policy. Process mi­ gration may not be even possible between two heterogeneous multicore systems, and is generally expensive even between two homogeneous systems [39,56]. Also, although it alleviates, process migration does not eliminate starvation. These observations bring us a question: Can we design a gang scheduling algorithm with the following characteristics? 1. Freedom from starvation. 2. Predictable and acceptable response time. 3. B etter processor utilization. 4. Simple. 70 Since AFCFS favors small gangs, th e larger gangs are susceptible to starvation or to longer wait times. This is unacceptable particularly in cloud environment where customer satisfaction hugely depends on fairness and predictable response time. 
We conducted simulation experiments to compute the average response time, the standard deviation of response time, and the average core utilization of the AFCFS and LGFS algorithms. The observations are presented next.

• Observation on Average Response Time: The average response time graphs of the AFCFS and LGFS algorithms are shown in Fig. 6.1. AFCFS outperforms LGFS in average response time for small gangs. Since LGFS favors large gangs, it pushes the small gangs toward longer wait times, which is reflected in their average response time. This result confirms the observation from the literature.

Figure 6.1: Average Response Time for AFCFS and LGFS Algorithms (response time in clock ticks vs. arrival rate λ)

• Observation on Standard Deviation of Response Time: The standard deviation of response time graphs of the AFCFS and LGFS algorithms are shown in Fig. 6.2. The predictability of AFCFS is worse than that of LGFS. Since AFCFS favors small gangs, the large gangs have to wait longer for their turn to execute, which in turn increases the deviation in response time.

Figure 6.2: Standard Deviation of Response Time for AFCFS and LGFS Algorithms (vs. arrival rate λ)

• Observation on Average Core Utilization: The average core utilization graphs of the AFCFS and LGFS algorithms are shown in Fig. 6.3. LGFS performs better than AFCFS because it favors large jobs, which fit into more cores and keep the cores busy executing these larger jobs. As AFCFS favors smaller jobs, the possibility of cores staying idle is higher.

Figure 6.3: Core Utilization of AFCFS and LGFS Algorithms (vs. arrival rate λ)

From the observations shown in Fig. 6.2 and Fig. 6.3, we can see that LGFS is the preferable algorithm, despite AFCFS's low average response time. This is because LGFS utilizes the system resources better and offers better predictability in response time. This supports our earlier point that average response time alone is not a sufficient metric. Fairness and predictability are particularly important because user expectations under light load are normally high, and failure to provide such guarantees even under light load could reflect very badly on the system. Therefore, it would be desirable to have an algorithm that yields low average response time and standard deviation together with high processor utilization. This is the motivation for our algorithm, presented next.

6.2 A New Gang Scheduling Algorithm

The gang scheduling algorithms AFCFS and LGFS are susceptible to starvation. To avoid starvation, these algorithms adopt a process migration policy. Process migration may not even be possible between two heterogeneous multicore systems, and is generally expensive even between two homogeneous systems [39, 56]. Also, although it alleviates starvation, process migration does not eliminate it. These observations raise a question: can we design a gang scheduling algorithm with the following characteristics?

1. Freedom from starvation.
2. Predictable and acceptable response time.
3. Better processor utilization.
4. Simplicity.

Since AFCFS favors small gangs, the larger gangs are susceptible to starvation or to long wait times. This is unacceptable, particularly in a cloud environment, where customer satisfaction hugely depends on fairness and predictable response time. In practice, the customers who receive slightly faster service (at the expense of others' long waits) may not be noticeably more satisfied [61]; but the customers who experience unpredictably long delays will readily notice the unfairness and unpredictable response, and that could potentially drive the cloud business in a negative direction. Therefore, in addition to fast response and high processor utilization, minimal variance in response is extremely important for better cloud services.

6.3 The Algorithm

The gang scheduling algorithm proposed in this thesis combines the ideas of AFCFS and priority boosting. In the AFCFS algorithm proposed for multicore clusters in [53], each multicore has a run queue, and every gang stays in the run queue until it gets a chance to execute. The scheduler always chooses the next fitting gang from the run queue so that overall response time is reduced. Such behavior degrades the overall core utilization, which in turn increases the variation in response time, as seen in the experiments.

The proposed algorithm uses an additional variable for each gang, which stores how many gangs bypassed it for execution while it stayed in the run queue. We call this variable the bypass count. When a gang's bypass count reaches the threshold value T, the gang gets the highest priority and is scheduled next. This forces the other gangs to wait until the highest priority gang gets scheduled. The proposed algorithm is given in Table 6.4.

A new gang joins RQ with its bypass count set to zero. Whenever a gang is scheduled, the bypass counts of the gangs preceding the scheduled gang in RQ are incremented by 1. At any time, a gang in RQ with a bypass count greater than or equal to the threshold value has priority over all other gangs. This guarantees that the gangs will be served within a predictable time period. When there is no gang with a bypass count greater than or equal to the threshold value, the algorithm behaves exactly like AFCFS. When enough cores are not available to schedule the highest priority gang, the system has to wait for some of the currently executing gangs to leave; this delay is unavoidable to assure fairness and predictable response. The simulation results show that such a wait rarely happens.

Data Structures: RQ: Queue of gangs; T: Threshold value; i.bpc: bypass count of gang i
1. while (RQ ≠ empty) do
2.   for i = 1 to size of RQ do
3.     if RQ[i].bpc ≥ T then wait until RQ[i] fits in free cores
4.     if RQ[i] fits in free cores then
5.       for k = 1 to i - 1 do RQ[k].bpc++ end for
6.       schedule RQ[i]
7.     end if
8.   end for
9. end while

Table 6.4: New Gang Scheduling Algorithm

Note: The proposed algorithm becomes AFCFS if the threshold is set to ∞. When the threshold is 0, it emulates the FCFS algorithm. So, choosing a proper threshold is the key to the proposed algorithm.
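A minimal Java sketch of one scheduling pass of the proposed algorithm follows. As before, the classes are simplified stand-ins for illustration; the essential points are the bypass-count bookkeeping and the hard wait once a gang reaches the threshold T.

    import java.util.ArrayList;
    import java.util.List;

    class FairGangScheduler {
        static class Gang {
            final int id, size;
            int bypassCount = 0;            // incremented each time it is overtaken
            Gang(int id, int size) { this.id = id; this.size = size; }
        }

        private final List<Gang> runQueue = new ArrayList<>();
        private final int threshold;        // T, e.g. 700 in our experiments
        private int freeCores;

        FairGangScheduler(int cores, int threshold) {
            this.freeCores = cores;
            this.threshold = threshold;
        }

        void submit(Gang g) { runQueue.add(g); }   // bypass count starts at zero

        // One pass: behaves like AFCFS until some gang's bypass count reaches T,
        // at which point that gang must be scheduled before any gang behind it.
        void schedulePass() {
            for (int i = 0; i < runQueue.size(); ) {
                Gang g = runQueue.get(i);
                if (g.bypassCount >= threshold && g.size > freeCores) {
                    return;                 // wait until this gang fits; no bypassing
                }
                if (g.size <= freeCores) {
                    for (int k = 0; k < i; k++) {
                        runQueue.get(k).bypassCount++;   // g overtook these gangs
                    }
                    freeCores -= g.size;
                    runQueue.remove(i);     // next gang shifts into slot i
                } else {
                    i++;                    // skip g; it will be credited a bypass
                }
            }
        }

        void release(Gang g) { freeCores += g.size; }    // a gang finished
    }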
6.4 Simulation Experiments

As explained earlier, the proposed algorithm tries to achieve predictable and fast response for the gangs (users) and better utilization for the system.

6.4.1 Simulation Setup

We used the simulation parameters listed in Table 6.5 for our experiments. To keep the results generic, execution is reported in terms of simulation clock ticks.

Parameter                      Value(s)
Number of cores                200
Time period                    1 minute
Tasks per gang                 Uniformly distributed over [2..200]
Mean arrival rate              5, 7.5, 10
Arrival distribution           Poisson
Execution rate                 2
Execution time distribution    Exponential
Threshold                      700

Table 6.5: Simulation Parameters

Through the simulation study, we computed the average response time, the standard deviation of response time, the average core utilization, and the bypass count of the gangs. The observations are presented next.

• Observation on Average Response Time: The average response time graphs of AFCFS and the proposed algorithm are shown in Fig. 6.4. The average response time of the proposed algorithm is better than that of AFCFS. This is because, whenever a gang's bypass count reaches the threshold, the gang is guaranteed to be scheduled next, which reduces the response time of long-waiting gangs. Choosing a proper threshold value is crucial: choosing a small number will unnecessarily make the other gangs wait more often, which will in turn increase the average response time. For our experiments, we chose a bypass count of 700 as the threshold value, derived from repeated experiments for consistent behavior. From Fig. 6.4, it is clear that the proposed algorithm consistently outperforms AFCFS in average response time.

Figure 6.4: Average Response Time of AFCFS and Proposed Algorithms (response time in clock ticks vs. arrival rate λ)

• Observation on Standard Deviation of Response Time: The standard deviation of response time graphs of AFCFS and the proposed algorithm are shown in Fig. 6.5. The proposed algorithm offers better predictability in response time than AFCFS. Since the proposed algorithm avoids the longest wait times, its deviation in response time is lower than that of AFCFS. By controlling the bypass threshold value, better predictability may be assured.

Figure 6.5: Standard Deviation of Response Time of AFCFS and Proposed Algorithms (vs. arrival rate λ)

• Observation on Average Core Utilization: The average core utilization graphs of AFCFS and the proposed algorithm are shown in Fig. 6.6. The proposed algorithm outperforms AFCFS. This is because AFCFS favors only small gangs, whereas the proposed algorithm favors gangs of all sizes once the threshold is reached.

Figure 6.6: Average Core Utilization of AFCFS and Proposed Algorithms (vs. arrival rate λ)

• Observation on Bypass Count: The bypass count graphs of AFCFS and the proposed algorithm are shown in Fig. 6.7. The proposed algorithm shows fairness among the gangs: once a gang's bypass count reaches the threshold, the algorithm starts giving priority to the long-waiting gangs. Extremely long wait times are thus completely avoided in the proposed algorithm, which makes our algorithm interesting.

Figure 6.7: Bypass Count of Gangs of AFCFS and Proposed Algorithms (bypass count per job)
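For reference, the first three metrics reported above can be computed from simple per-gang records of arrival and completion ticks, together with a running total of busy core-ticks. The following sketch shows one straightforward way to do so; the record and field names are illustrative, not the testbed's.

    import java.util.List;

    class Metrics {
        static class GangRecord {
            final double arrivalTick, completionTick;
            GangRecord(double a, double c) { arrivalTick = a; completionTick = c; }
        }

        // Response time of a gang = completion tick - arrival tick.
        static double averageResponse(List<GangRecord> records) {
            double sum = 0;
            for (GangRecord r : records) sum += r.completionTick - r.arrivalTick;
            return sum / records.size();
        }

        static double responseStdDev(List<GangRecord> records) {
            double mean = averageResponse(records);
            double ss = 0;
            for (GangRecord r : records) {
                double d = (r.completionTick - r.arrivalTick) - mean;
                ss += d * d;
            }
            return Math.sqrt(ss / records.size());
        }

        // Utilization = busy core-ticks / (cores * observed ticks).
        static double coreUtilization(double busyCoreTicks, int cores, double ticks) {
            return busyCoreTicks / (cores * ticks);
        }
    }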
From these observations, we conclude that the proposed algorithm outperforms AFCFS in all three metrics, and of course solves the starvation problem completely.

6.5 Summary

In this chapter, we proposed a fair and efficient gang scheduling algorithm for multicore processors. The algorithm is simple, fair, and gives predictable performance. Such predictable performance is attractive from the service point of view. Since the algorithm solves the starvation problem locally, without using process migration, it is highly scalable and attractive for cloud computing involving a large number of multicore processors.

Chapter 7

Conclusion and Future Directions

Recently, multicore processors and software development for multicore systems have received increasing attention from the research community. The contributions of this thesis are: (i) the design and implementation of a flexible multicore scheduler simulation framework; (ii) illustration of the power and flexibility of the proposed framework by simulating the scheduling algorithms of Linux and Solaris, and a simulation study of two gang scheduling algorithms; and (iii) a new gang scheduling algorithm and its performance study.

The experience gained by developing this simulator was very rich. It involved software design, implementation, algorithm discovery and design, and performance analysis. The proposed simulator can be used for rapid simulation studies of multicore scheduling algorithms, which can provide initial insights into how a proposed algorithm will perform in practical multicore systems. These insights are helpful not only in testing the performance of a proposed algorithm, but also in identifying bottlenecks and offering guidelines for improvements. Also, from our experience, we believe that the simulator can be used to develop new scheduling algorithms by experimenting with and identifying the limitations of the existing algorithms.

Although we did not initially expect to propose a new multicore scheduling algorithm, we ended up designing one for an important class of parallel applications called gangs. We compared the performance of the new gang scheduling algorithm with the best known algorithm in its category, and the proposed algorithm appears to perform better.

7.1 Future Directions

We believe the proposed scheduling framework for multicore systems is an important first step in the performance study of multicore scheduling algorithms. There are many directions in which the work presented in this thesis can be expanded to study the performance of scheduling algorithms more deeply and more accurately. We outline some of them next.

• The workload model could be refined and improved.
• The model of I/O waits could be refined and improved.
• The cache model could be refined and improved.
• The framework could be expanded to model heterogeneous cores.
• Cache effects could be inferred from the values of the hardware performance counters available in recent multicore machines, and used in scheduling simulations.
• More statistical metrics could be included.
• In the cloud computing context, modeling clusters with many chips and their associated cores would be interesting.
• The proposed gang scheduling algorithm works better for light loads. Is there a gang scheduling algorithm that can perform better under all workload conditions?
This is an interesting research question to be explored.

I would like to continue working on some of these directions in the future.

Appendix

Tables used by the Solaris Scheduler

The Solaris scheduler uses the information given in the following four tables:

1. Priority Range Table (Table 7.1) - specifies the priority range for the scheduling classes.
2. Dispatch Table for Real-time Tasks (Table 7.2) - defines the time quanta of the real-time scheduling class.
3. Dispatch Table for Fixed Priority Tasks (Table 7.3) - defines the time quanta of the fixed priority scheduling class.
4. Dispatch Table for Normal Tasks (Table 7.4) - defines the time quanta of the time sharing and interactive scheduling classes.

Scheduling class    Global priority range    User level priority range
Realtime            100 - 159                0 - 59
System              60 - 99                  -
Fair share          0 - 59                   -60 - 60
Fixed priority      0 - 59                   0 - 60
Time share          0 - 59                   -60 - 60
Interactive         0 - 59                   -60 - 60

Table 7.1: Solaris 10 Scheduling Classes Priority Range

Priority    Quanta
100         100
110         80
120         60
130         40
140         20
150         10
159         10

Table 7.2: Dispatch Table for RT Scheduling Class

Priority    Quanta
0           20
10          16
20          12
30          8
40          4
59          2

Table 7.3: Dispatch Table for FX Scheduling Class

Priority    Quanta    Priority on      Priority on    Wait         Priority
                      Quanta Expiry    IO Return      Threshold    on Wait
0           20        0                50             0            50
10          16        0                51             0            51
20          12        10               52             0            52
30          8         20               53             0            53
40          4         30               55             0            55
59          2         49               59             3200         59

Table 7.4: Dispatch Table for TS and IA Scheduling Classes

Tick Processing and Update Processing

Here we present two key routines used in Solaris scheduling.

Tick Processing

Tick processing is executed on every scheduling tick. This method is mainly responsible for managing the time quanta and preemption control. An overview of tick processing is given in Table 7.5.

When the tick processing method is invoked, it first checks whether the executing thread is in system mode. This is because system threads are scheduled for their full execution time and preemption of system threads is not allowed; all other scheduling classes are preemption enabled. The main task of tick processing is to manage the allocated time quanta. First, it decrements the quanta allocated to the thread and checks whether the thread has executed for its whole time quantum. If so, it additionally checks whether preemption control is enabled for the thread. Preemption control is a configurable variable that grants a thread a little more time when it finishes its allotted time quantum. If preemption control is not enabled for the thread, the first task is to re-assign the priority of the thread from its dispatch table, in the case of TS and IA class threads; RT and FX class threads are maintained at the same priority. Then, it invokes preemption and places the thread in its dispatch queue based on its priority. When the time quantum is not over, it checks whether the thread is going for I/O and, if so, places the thread in the sleep queue. The method also checks for any higher priority thread in the dispatch queues and preempts the current thread; when the thread is preempted by a high priority thread, the priority of the current thread does not change.

Data Structures: Thread t
1.  if thread t is not in system priority then
2.    decrement the time quanta t.quanta
3.    if (t.quanta < 0) then
4.      if thread preemption control enabled then give additional time quanta for execution
5.      else
6.        if t ∈ TS or IA scheduling class then
7.          re-assign t.priority from dispatch_table_ts with the value of ts_tqexp
8.        end if
9.        enable preemption and place thread t in its dispatch queue
10.     end if
11.   end if
12.   if t going for I/O then
13.     enable preemption and place thread t in sleep queue
14.   end if
15.   if t.priority < highest thread's priority in dispatch queues then
16.     enable preemption and place thread t in dispatch queue
17.   end if
18. end if

Table 7.5: Tick Processing of Solaris Scheduling Algorithm
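As one illustration of how Table 7.5 might map into simulator code, the following Java sketch follows the same structure; the SimThread fields, parameters, and helper methods are illustrative stand-ins rather than the simulator's actual classes.

    class SimThread {
        boolean systemClass;        // system threads are never preempted
        boolean tsOrIa;             // thread belongs to the TS or IA class
        boolean preemptionControl;  // may run a little past its quantum
        boolean goingToIO;
        int quanta;                 // remaining time quanta
        int priority;
        int waitTicks;              // used by update processing below
    }

    class TickProcessor {
        // Invoked on every scheduling tick for the running thread t.
        void tick(SimThread t, int tqexpPriority, int extraQuanta,
                  int highestWaitingPriority) {
            if (t.systemClass) return;          // system threads run to completion
            t.quanta--;
            if (t.quanta < 0) {
                if (t.preemptionControl) {
                    t.quanta = extraQuanta;     // grant a little more time
                } else {
                    if (t.tsOrIa) {
                        t.priority = tqexpPriority;  // ts_tqexp from the dispatch table
                    }
                    preemptToDispatchQueue(t);  // RT and FX keep their priority
                }
            } else if (t.goingToIO) {
                preemptToSleepQueue(t);
            } else if (t.priority < highestWaitingPriority) {
                preemptToDispatchQueue(t);      // priority unchanged in this case
            }
        }

        void preemptToDispatchQueue(SimThread t) { /* enqueue by priority */ }
        void preemptToSleepQueue(SimThread t)    { /* enqueue in sleep queue */ }
    }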
Update Processing

The update processing method is invoked only for the time sharing and interactive scheduling classes. The main purpose of this method is to boost thread priority. A high-level view of update processing is given in Table 7.6.

The update processing method is invoked periodically. It iterates over the dispatch and sleep queues to increment the waiting time of the threads. Once the waiting time of a thread reaches the maximum wait time specified in the dispatch table, the thread's priority is boosted with the ts_lwait value. This is done to provide fairness among the threads.

Data Structures: Dispatch queues - DQ_TS, DQ_IA; sleep queue - SQ; Thread t
1. for each thread t in DQ_TS ∪ DQ_IA ∪ SQ do
2.   increment t.wait value by 1
3.   if (t.wait > max_wait) then
4.     boost thread priority from dispatch_table_ts with the value of ts_lwait
5.   end if
6. end for

Table 7.6: Update Processing of Solaris Scheduling Algorithm
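A corresponding sketch of update processing, reusing the SimThread class from the previous sketch (with its waitTicks field), could look as follows; maxWait and lwaitPriority stand in for the ts_maxwait and ts_lwait dispatch-table values.

    import java.util.List;

    class UpdateProcessor {
        // Invoked periodically for TS and IA threads waiting in the
        // dispatch and sleep queues; boosts priority after a long wait.
        void update(List<SimThread> waitingThreads, int maxWait, int lwaitPriority) {
            for (SimThread t : waitingThreads) {
                t.waitTicks++;
                if (t.waitTicks > maxWait) {
                    t.priority = lwaitPriority;  // boost from the dispatch table
                    t.waitTicks = 0;             // restart the wait measurement
                }
            }
        }
    }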
Bibliography

[1] D. Wentzlaff and A. Agarwal, "Factored operating systems (fos): the case for a scalable operating system for multicores," SIGOPS Operating Systems Review, vol. 43, pp. 76-85, April 2009.

[2] A. Fedorova, M. Seltzer, and M. D. Smith, "Cache-fair thread scheduling for multicore processors," tech. rep., Harvard University, 2006.

[3] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Second Edition. Pearson Prentice Hall, 2005.

[4] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," Computational Science & Engineering, IEEE, vol. 5, pp. 46-55, Jan.-Mar. 1998.

[5] M. Snir, S. W. Otto, D. W. Walker, J. Dongarra, and S. Huss-Lederman, MPI: The Complete Reference. Cambridge, MA, USA: MIT Press, 1995.

[6] S. Peter, A. Schüpbach, P. Barham, A. Baumann, R. Isaacs, T. Harris, and T. Roscoe, "Design principles for end-to-end multicore schedulers," in Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar '10, (Berkeley, CA, USA), pp. 10-10, USENIX Association, 2010.

[7] C. Kesselman and I. Foster, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, Nov. 1998.

[8] D. Wentzlaff, C. G. III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. E. Miller, and A. Agarwal, "An operating system for multicore and clouds: Mechanisms and implementation," in SoCC, pp. 3-14, 2010.

[9] "The future accelerated: Multi-core goes mainstream, computing pushed to extremes." Intel Newsroom, September 2011.

[10] J. C. Mogul, A. Baumann, T. Roscoe, and L. Soares, "Mind the gap: reconnecting architecture and OS research," in Proceedings of the 13th USENIX conference on Hot topics in operating systems, HotOS '13, (Berkeley, CA, USA), pp. 1-1, USENIX Association, 2011.

[11] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, "The multikernel: a new OS architecture for scalable multicore systems," in Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, SOSP '09, (New York, NY, USA), pp. 29-44, ACM, 2009.

[12] S. Zhuravlev, J. C. Saez, A. Fedorova, and M. Prieto, "Survey of scheduling techniques for addressing shared resources in multicore processors," ACM Computing Surveys, in press.

[13] D. Wentzlaff, C. G. III, N. Beckmann, A. Belay, H. Kasture, K. Modzelewski, L. Youseff, J. E. Miller, and A. Agarwal, "Fleets: Scalable services in a factored operating system," tech. rep., CSAIL, Massachusetts Institute of Technology, 2011.

[14] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, and R. Isaacs, "Embracing diversity in the Barrelfish manycore operating system," in Proceedings of the Workshop on Managed Many-Core Systems, 2008.

[15] A. Belay, D. Wentzlaff, and A. Agarwal, "Vote the OS off your core," tech. rep., CSAIL, Massachusetts Institute of Technology, 2011.

[16] A. Baumann, S. Peter, A. Schüpbach, A. Singhania, T. Roscoe, P. Barham, and R. Isaacs, "Your computer is already a distributed system, why isn't your OS?," in Proceedings of the 12th conference on Hot topics in operating systems, HotOS '09, (Berkeley, CA, USA), pp. 12-12, USENIX Association, 2009.

[17] S. Panneerselvam and M. M. Swift, "Dynamic processors demand dynamic operating systems," in Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar '10, (Berkeley, CA, USA), pp. 9-9, USENIX Association, 2010.

[18] I. Kuz, Z. Anderson, P. Shinde, and T. Roscoe, "Multicore OS benchmarks: we can do better," in Proceedings of the 13th USENIX conference on Hot topics in operating systems, HotOS '13, (Berkeley, CA, USA), pp. 10-10, USENIX Association, 2011.

[19] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang, "Corey: an operating system for many cores," in Proceedings of the 8th USENIX conference on Operating systems design and implementation, OSDI '08, (Berkeley, CA, USA), pp. 43-57, USENIX Association, 2008.

[20] S. Boyd-Wickizer, R. Morris, and M. F. Kaashoek, "Reinventing scheduling for multicore systems," in Proceedings of the 12th conference on Hot topics in operating systems, HotOS '09, (Berkeley, CA, USA), pp. 21-21, USENIX Association, 2009.

[21] R. Liu, K. Klues, S. Bird, S. Hofmeyr, K. Asanovic, and J. Kubiatowicz, "Tessellation: space-time partitioning in a manycore client OS," in Proceedings of the First USENIX conference on Hot topics in parallelism, HotPar '09, (Berkeley, CA, USA), pp. 10-10, USENIX Association, 2009.

[22] D. Wentzlaff, C. G. III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. Miller, and A. Agarwal, "A unified operating system for clouds and manycore: fos," 1st Workshop on Computer Architecture and Operating System co-design (CAOS), 2010.

[23] S. Blagodurov, S. Zhuravlev, and A. Fedorova, "Contention-aware scheduling on multicore systems," ACM Transactions on Computer Systems, vol. 28, pp. 8:1-8:45, December 2010.

[24] A. Fedorova, C. Small, D. Nussbaum, and M. Seltzer, "Chip multithreading systems need a new operating system scheduler," in Proceedings of the 11th workshop on ACM SIGOPS European workshop, EW 11, (New York, NY, USA), ACM, 2004.
[25] S. Zhuravlev, S. Blagodurov, and A. Fedorova, "AKULA: a toolset for experimenting and developing thread placement algorithms on multicore systems," in Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT '10, (New York, NY, USA), pp. 249-260, ACM, 2010.

[26] J. M. Calandrino, D. P. Baumberger, T. Li, J. C. Young, and S. Hahn, "LinSched: The Linux scheduler simulator," in ISCA PDCCS (J. Jacob and D. N. Serpanos, eds.), pp. 171-176, ISCA, 2008.

[27] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, "Complete computer system simulation: the SimOS approach," Parallel & Distributed Technology: Systems & Applications, IEEE, vol. 3, pp. 34-43, Winter 1995.

[28] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer, vol. 35, pp. 50-58, Feb. 2002.

[29] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, (New York, NY, USA), pp. 220-229, ACM, 2008.

[30] T. C. Xu, P. Liljeberg, and H. Tenhunen, "Process scheduling for future multicore processors," in Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip, INA-OCMC '11, (New York, NY, USA), pp. 15-18, ACM, 2011.

[31] J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov, "A comprehensive scheduler for asymmetric multicore systems," in Proceedings of the 5th European conference on Computer systems, EuroSys '10, (New York, NY, USA), pp. 139-152, ACM, 2010.

[32] V. Kazempour, A. Kamali, and A. Fedorova, "AASH: an asymmetry-aware scheduler for hypervisors," in Proceedings of the 6th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, VEE '10, (New York, NY, USA), pp. 85-96, ACM, 2010.

[33] S. Hofmeyr, C. Iancu, and F. Blagojevic, "Load balancing on speed," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, (New York, NY, USA), pp. 147-158, ACM, 2010.

[34] F. R. Johnson, R. Stoica, A. Ailamaki, and T. C. Mowry, "Decoupling contention management from scheduling," in Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ASPLOS '10, (New York, NY, USA), pp. 117-128, ACM, 2010.

[35] N. Guan, M. Stigge, W. Yi, and G. Yu, "Cache-aware scheduling and analysis for multicores," in Proceedings of the seventh ACM international conference on Embedded software, EMSOFT '09, (New York, NY, USA), pp. 245-254, ACM, 2009.

[36] L. Tang, J. Mars, and M. L. Soffa, "Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures," in Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era (co-located with PLDI 2011), (New York, NY, USA), pp. 12-21, ACM, 2011.

[37] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts. John Wiley & Sons, Inc., 2010.

[38] S. Haldar and A. Aravind, Operating Systems. Pearson Education, 2010.

[39] D. S. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou, "Process migration," ACM Computing Surveys, vol. 32, pp. 241-299, Sept. 2000.
[40] J. K. Ousterhout, "Scheduling techniques for concurrent systems," in Proceedings of the IEEE Distributed Computing Systems, pp. 22-30, 1982.

[41] L. Chai, Q. Gao, and D. K. Panda, "Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system," in Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, CCGRID '07, (Washington, DC, USA), pp. 471-478, IEEE Computer Society, 2007.

[42] I. A. Moschakis and H. D. Karatza, "Evaluation of gang scheduling performance and cost in a cloud computing system," Journal of Supercomputing, vol. 59, pp. 975-992, Feb. 2012.

[43] D. Burger and T. M. Austin, "The SimpleScalar tool set, version 2.0," tech. rep., 1997.

[44] M. Moudgill, P. Bose, and J. Moreno, "Validation of Turandot, a fast processor model for microarchitecture exploration," in Performance, Computing and Communications Conference, 1999 IEEE International, pp. 451-457, Feb. 1999.

[45] E. Argollo, A. Falcon, P. Faraboschi, M. Monchiero, and D. Ortega, "COTSon: infrastructure for full system simulation," SIGOPS Operating Systems Review, vol. 43, pp. 52-61, Jan. 2009.

[46] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod, "Using the SimOS machine simulator to study complex computer systems," ACM Trans. Model. Comput. Simul., vol. 7, pp. 78-103, Jan. 1997.

[47] AMD Developer Central, AMD SimNow Simulator, http://developer.amd.com/tools/simnow/pages/default.aspx.

[48] A. Batat and D. G. Feitelson, "Gang scheduling with memory considerations," in Proc. of the 14th Intl. Parallel and Distributed Processing Symp., pp. 109-114, 2000.

[49] K. Hyoudou, Y. Kozakai, and Y. Nakayama, "An implementation of a concurrent gang scheduler for a PC-based cluster system," Systems and Computers in Japan, vol. 38, pp. 39-48, Mar. 2007.

[50] H. D. Karatza, "Scheduling gangs in a distributed system," International Journal of Simulation, vol. 7(1), pp. 15-22, 2006.

[51] Z. C. Papazachos and H. D. Karatza, "The impact of task service time variability on gang scheduling performance in a two-cluster system," Simulation Modelling Practice and Theory, vol. 17, no. 7, pp. 1276-1289, 2009.

[52] Z. C. Papazachos and H. D. Karatza, "Gang scheduling in a two-cluster system implementing migrations and periodic feedback," SIMULATION, 2010.

[53] Z. C. Papazachos and H. D. Karatza, "Gang scheduling in multi-core clusters implementing migrations," Future Generation Computer Systems, vol. 27, no. 8, pp. 1153-1165, 2011.

[54] Y. Wiseman and D. G. Feitelson, "Paired gang scheduling," IEEE Transactions on Parallel and Distributed Systems, vol. 14, pp. 581-592, June 2003.

[55] Y. Zhang, H. Franke, J. E. Moreira, and A. Sivasubramaniam, "An integrated approach to parallel scheduling using gang-scheduling, backfilling, and migration," in Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP '01, (London, UK), pp. 133-158, Springer-Verlag, 2001.

[56] S. Frechette and D. R. Avresky, "Method for task migration in grid environments," in Proceedings of the Fourth IEEE International Symposium on Network Computing and Applications, NCA '05, (Washington, DC, USA), pp. 49-58, IEEE Computer Society, 2005.

[57] R. McDougall and J. Mauro, Solaris Internals (2nd Edition). Upper Saddle River, NJ, USA: Prentice Hall PTR, 2006.

[58] C. S. Pabla, Completely Fair Scheduler, http://www.linuxjournal.com/magazine/completely-fair-scheduler, 2009.
[59] I. Stojmenovic, "Simulations in wireless sensor and ad hoc networks: matching and advancing models, metrics, and solutions," Communications Magazine, IEEE, vol. 46, pp. 102-107, December 2008.

[60] K. Sankaralingam and R. H. Arpaci-Dusseau, "Get the parallelism out of my cloud," in Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar '10, (Berkeley, CA, USA), pp. 8-8, USENIX Association, 2010.

[61] G. Ghinea and S. Chen, "Perceived quality of multimedia educational content: A cognitive style approach," Multimedia Systems, vol. 11, pp. 271-279, 2006. 10.1007/s00530-005-0007-8.

[62] R. M. Fujimoto, Parallel and Distributed Simulation Systems. John Wiley & Sons, Inc., 2000.

[63] Completely Fair Scheduling (CFS) Class, Linux v3 kernel source, kernel/sched/fair.c, http://lxr.linux.no/.