A FLEXIBLE SIMULATION FRAMEWORK FOR PROCESSOR SCHEDULING ALGORITHMS IN MULTICORE SYSTEMS

by

Viswanathan Manickam

B.E., Anna University, Chennai, India, 2006

THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN MATHEMATICAL, COMPUTER AND PHYSICAL SCIENCES

THE UNIVERSITY OF NORTHERN BRITISH COLUMBIA

April 2012

© Viswanathan Manickam, 2012

ISBN: 978-0-494-94438-7

Abstract

In traditional uniprocessor systems, processor scheduling is the responsibility of the operating system. In high performance computing (HPC) domains that largely involve parallel processors, the responsibility of scheduling is usually left to the applications. So far, parallel computing has been confined to a small group of specialized HPC users. In this context, the hardware, operating system, and applications have mostly been designed independently, with minimal interaction. As multicore processors become the norm, parallel programming is expected to emerge as the mainstream software development approach. This new trend poses several challenges, including performance, power management, system utilization, and predictable response. Such demands are hard to meet without cooperation among the hardware, the operating system, and the applications. In particular, efficient scheduling of cores to application threads is fundamentally important in assuring the above mentioned characteristics.
We believe the operating system needs to take a larger responsibility in ensuring efficient multicore scheduling of application threads. To study the performance of new scheduling algorithms for future multicore systems with hundreds or thousands of cores, we need a flexible scheduling simulation testbed. Designing such a multicore scheduling simulation testbed and illustrating its functionality by studying the well known Linux and Solaris scheduling algorithms are the main contributions of this thesis. In addition to studying the Linux and Solaris scheduling algorithms, we demonstrate the power, flexibility, and use of the proposed scheduling testbed by simulating two popular gang scheduling algorithms: adaptive first-come-first-served (AFCFS) and largest gang first served (LGFS). As a result of this performance study, we designed a new gang scheduling algorithm and compared its performance with AFCFS. The proposed scheduling simulation testbed is developed in Java and is expected to be released for public use.

Acknowledgements

Throughout my time as a graduate student at UNBC, I came across many people who encouraged me in one way or another. I would like to thank them all. Among them, some deserve special thanks. First of all, I would like to thank my supervisor Dr. Alex Aravind for his continuous support and encouragement. Alex has been a mentor, a well-wisher, and a friend. I thoroughly enjoyed working with him; his style and ideas drew me into research, and I enjoy doing it. Next to my supervisor, the people who contributed to my thesis are Dr. Waqar Haque and Dr. Andrea Gorrell. I thank them for their valuable time, effort, suggestions, and encouragement. In addition, I would like to thank the external examiner Dr. Balbinder Deo and Dr. Ajit Dayanandan, chair of my defence, for reading my thesis. I would like to thank my peers and friends who supported me and involved me in informative discussions. I want to specially thank Hassan Tahir for all his support, from the day I arrived in Prince George until I got settled, and for his brilliant programming ideas which helped me a lot. I thank my fellow graduate students Baldeep, Adiba Mahjabin Nitu, Azizur Rahman, Manoj Nambiar, Narek Nalbandyan, Nahid Mahmud, Fakhar Ul Islam, and Behnish Mann for their support. I thank our computer administrator Heinrich Butow for his continuous support. I would like to sincerely thank Dr. Mahi Aravind for the wonderful family dinners and parties on various events and occasions. Finally, much of the credit goes to my family, especially my parents, who supported me right to the end. A special thanks to my sister Kavitha and my brother-in-law Ramasamy for encouraging me to apply to UNBC. Once again, I thank them all. Thanks guys!

Contents

Abstract
Acknowledgements
Contents
List of Figures
List of Tables

1 Introduction
  1.1 Hardware Trend
    1.1.1 Multicore Processors vs. Parallel Computers
  1.2 Computing Trend
  1.3 Applications Trend
  1.4 Operating Systems Trend
  1.5 Schedulers Trend
  1.6 Where do the contributions of this thesis fit in?
  1.7 Thesis Organization

2 Multicore Scheduling
  2.1 Load Balancing
  2.2 Scheduling Approaches
    2.2.1 Gang Scheduling in Multicore and Cloud Computing
  2.3 Related Work
    2.3.1 Scheduler Simulators
    2.3.2 Fairness and Performance Issues in Gang Scheduling
    2.3.3 Performance Study
  2.4 Summary

3 Motivation and Contributions
  3.1 Multicore Scheduler Simulation Framework
  3.2 A New Gang Scheduling Algorithm
  3.3 Contributions
  3.4 Research Methodology
  3.5 Summary

4 Multicore Scheduler Simulation Framework
  4.1 Simulation
  4.2 Multicore Scheduler Simulation
  4.3 Architecture of MSS Framework
    4.3.1 Terminology
    4.3.2 Workload Generator
    4.3.3 Multicore Machine
    4.3.4 Multicore Scheduler
    4.3.5 Execution Trace
    4.3.6 Performance Calculation Engine
    4.3.7 Statistical Measures
  4.4 Activity Profile Generator
  4.5 User Interface
    4.5.1 Performance Parameter Setting Window
    4.5.2 Performance Observation Window
    4.5.3 Activity Monitor Window
  4.6 Summary

5 Case Studies - Linux and Solaris Scheduling Algorithms
  5.1 Load Balancing
    5.1.1 Real Time Scheduler
  5.2 Linux Scheduler - Completely Fair Scheduler (CFS)
    5.2.1 Calculation of Quanta
    5.2.2 Calculation of vruntime
  5.3 Solaris Scheduler
    5.3.1 The Default Solaris Scheduler
  5.4 Simulation Experiments
    5.4.1 Observations on Experiment 1
    5.4.2 Observations on Experiment 2
  5.5 Summary

6 A Fair and Efficient Gang Scheduling Algorithm
  6.1 Popular Gang Scheduling Algorithms
    6.1.1 AFCFS
    6.1.2 LGFS
    6.1.3 Simulation Experiments
  6.2 A New Gang Scheduling Algorithm
  6.3 The Algorithm
  6.4 Simulation Experiments
    6.4.1 Simulation Setup
  6.5 Summary

7 Conclusion and Future Directions
  7.1 Future Directions

Bibliography

List of Figures

4.1 The Scheduler
4.2 Architecture of MSS Framework
4.3 A Multicore Machine
4.4 Sample Execution Trace
4.5 Sample I/O Trace
4.6 Parameter Setting Window
4.7 Simulations Run Window
4.8 Performance Observation Window
4.9 Activity Monitor Window
4.10 Core Monitor Window
5.1 Average Turnaround Time: Linux and Solaris Scheduling Algorithms (by varying the number of cores)
5.2 Standard Deviation of Turnaround Time: Linux and Solaris Scheduling Algorithms (by varying the number of cores)
5.3 Average Interactive Response Time: Linux and Solaris Scheduling Algorithms (by varying the number of cores)
5.4 Standard Deviation of Interactive Response Time: Linux and Solaris Scheduling Algorithms (by varying the number of cores)
5.5 Average Turnaround Time: Linux and Solaris Scheduling Algorithms (by varying the workload)
5.6 Standard Deviation of Turnaround Time: Linux and Solaris Scheduling Algorithms (by varying the workload)
5.7 Average Interactive Response Time: Linux and Solaris Scheduling Algorithms (by varying the workload)
5.8 Standard Deviation of Interactive Response Time: Linux and Solaris Scheduling Algorithms (by varying the workload)
6.1 Average Response Time for AFCFS and LGFS Algorithms
6.2 Standard Deviation of Response Time for AFCFS and LGFS Algorithms
6.3 Core Utilization of AFCFS and LGFS Algorithms
6.4 Average Response Time of AFCFS and Proposed Algorithms
6.5 Standard Deviation of Response Time of AFCFS and Proposed Algorithms
6.6 Average Core Utilization of AFCFS and Proposed Algorithms
6.7 Bypass Count of Gangs of AFCFS and Proposed Algorithms

List of Tables

5.1 Completely Fair Scheduling Algorithm
5.2 Solaris 10 Scheduling Algorithm
5.3 Simulation Parameters
5.4 Simulation Parameters
6.1 AFCFS Gang Scheduling Algorithm
6.2 LGFS Gang Scheduling Algorithm
6.3 Simulation Parameters
6.4 New Gang Scheduling Algorithm
6.5 Simulation Parameters
7.1 Solaris 10 Scheduling Classes Priority Range
7.2 Dispatch Table for RT Scheduling Class
7.3 Dispatch Table for FX Scheduling Class
7.4 Dispatch Table for TS and IA Scheduling Classes
7.5 Tick Processing of Solaris Scheduling Algorithm
7.6 Update Processing of Solaris Scheduling Algorithm

Chapter 1

Introduction

The scheduling problem in computing systems has been studied over the last several decades. With the recent arrival of multicore processors, the scheduling problem has gained renewed interest. The contribution of this thesis relates to the scheduling problem in multicore computer systems.

The primary goal of computer systems is to execute applications safely, securely, correctly, and efficiently. The hardware and system software have been designed to work together to achieve this goal, and until recently their developments have kept pace with each other. The operating system, which has scheduling as its central component, is the major part of system software. Hence, scheduling plays a pivotal role in the effective use of computer systems.

A paradigm shift in hardware technology has happened very recently. Instead of increasing speed, chip designers are starting to put a number of execution cores within a single processor. Such processors are called multicore processors. To exploit the capabilities of multicore processors effectively, the software community must make at least two main changes: (i) a new way of designing software for multicore processors, and (ii) system software redesigned with new capabilities to manage multicore processors. This thesis relates to the efforts in meeting the latter demand. In particular, this thesis proposes a flexible simulation framework to study the performance of scheduling algorithms for multicore processors.

The purpose of this chapter is to explain why multicore scheduling is interesting and worthy of investigation. For that, we first briefly describe the recent trends in hardware, computing, applications, operating systems, and schedulers.

1.1 Hardware Trend

Nearly forty-five years ago, Intel co-founder Gordon Moore predicted that transistor density on integrated circuits would double about every two years. In terms of speed, this law is equivalent to: processor speed will double about every eighteen months. Ever since Moore's prediction, hardware technology has been driven to produce processors with the highest possible speed by increasing the clock rate. Limited by the laws of physics, microprocessor design eventually hit the clock cycle wall, and the chip designers had to come up with a new way of exploiting the benefit of Moore's law. The new way is to use the extra transistors to add multiple execution units (referred to as cores) within a single processor. Each core is capable of executing independently of the other cores in the processor. This trend indicates that all future processors will be multicore [1]. As Intel, AMD, Fujitsu, IBM, and Sun Microsystems are already shipping their desktops and workstations with multicore processors [2], multicore systems are rapidly emerging as the mainstream computing platforms.
This hardware trend seems to continue, and we will have hundreds and even thousands of cores in a single processor in the near future [lj. The abundant availability of execution cores is expected to 2 revolutionize the way we will design software for these systems in the future. Another hardware trend, driven by high performance computing groups, has been introduction of various parallel computers. A parallel com puter is a single com puter with multiple internal processors or multiple com puters interconnected to form a co­ herent high-performance computing platform [3]. Multicore processors have similarity to some parallel com puters in structure and functionality, b u t they differ in several other aspects. 1.1.1 M u ltic o r e P r o c e sso r s v s. P a r a lle l C o m p u ters Parallel processors or parallel com puters were around as early as single processor systems. In term s of execution units, parallel processors and multicore processors are similar. M ulticore processors and parallel processors have two or more execution units. They differ mainly in their purpose, other resources, and application domains. The main purpose of parallel computers is to increase the performance of applications which have longer run times. Parallel processors are often designed for certain types of applications and usually run them in static partitions. In order to utilize th e parallel processors effectively, the applications must be parallelized. In parallel processors, the scheduling is mostly done at th e user level rath er th an taken care by the operating system. Typically, users of parallel computing design their application programs incorporating the logic of scheduling and synchronization. Of­ ten, compilers or libraries like OpenM P [4] and MPI [5] help th e application designers in achieving this task. Multicore systems are emerging as general purpose computers, and they are ex­ pected to be used in a wide range of domains - from desktop, w orkstations, and servers. T hat is, multicore systems are expected to handle multiple types of applica­ 3 tions including interactive workloads, realtim e tasks, and norm al workloads. There­ fore, unlike parallel computers, the task of scheduling in multicore processors cannot be left to applications. The operating system requires to take a major responsibility of scheduling applications to multicore processors. So, there are fundamental differences between parallel processor systems and multicore processor systems [6], and m ulti­ core processor systems require redesign of scheduling algorithms. Designing efficient scheduling algorithm for multicore processor systems, we believe, is a complex task. 1.2 C om puting Trend Until recently, most general purpose com puting are desktop based and most high performance computing are based on parallel com puting and cluster com puting [3j. Parallel computing often divides the tasks into smaller ones and uses parallel com put­ ers to execute them simultaneously. Cluster com puting has th e same objective, but the computing infrastructure is a set of loosely connected computers w ith a suitable software module to coordinate the com puters in executing parallel programs. A nother computing paradigm called grid com puting [7] has taken th is trend one step further by pooling computing resources from multiple adm inistrative domain to solve a single problem. The most recent trend is cloud computing. Cloud is a computing infrastruc­ ture paradigm th a t offers com putation and storage as web-based services [8]. 
Cloud computing typically has parallel computers and/or cluster computers as its server nodes. With the emergence of multicore processors, future clouds are expected to have multicore blades as their servers [8,9].

1.3 Applications Trend

In the 1980s, only a small group of people knew about computers, and an even smaller group used one. Now, almost everyone knows what a computer is, and most of us use one on a daily basis for reading news, listening to music, and so on. Thus, the use of computing continues to infiltrate every aspect of our lives, and automation in the practical world continues to increase. With the recent computing trends of cloud computing and mobile access, and with access to social media and the Internet, the need for more automated services is expected to accelerate. That means more application software is expected to be developed, and it has to be developed in a way that runs effectively on future computing platforms.

1.4 Operating Systems Trend

Operating systems is one of the core areas of computer science, and it has accumulated a large body of literature. However, most of the past work was done either in the domain of uniprocessor systems, where the operating system has full responsibility for managing applications, including scheduling, or in the domain of parallel processing, where the responsibility for task scheduling is largely left to application designers¹.

¹Parallel processing typically involves complex problems requiring high computational time. Using software specifically designed for parallel programming, the program designers divide a complex problem into component parts and then assign the component parts to be executed on individual processors [3].

An operating system is one of the most complex software systems, and designing one is a challenging task. It has an influence on almost all other systems, both hardware and software, that involve computing. Operating systems spend a huge portion of their time executing applications [10].

Generally, the functionality of operating systems evolves and grows constantly to meet the needs of new technologies and applications. However, the operating system design related to processor management has not changed much in the last several years, because the number of processors has not changed much. Current operating systems were designed for one or a few cores. Thus, most developments have been in the domains of file systems, security, device management, and user interfaces. These developments are not aimed at meeting the demands of multicore processors. The traditional approach of building an operating system for every individual hardware model is no longer acceptable, because it will be out of date once new hardware arrives [11]. Some of the main issues with current operating systems are:

1. Scalability: Current operating systems are not designed to be scalable for multicore processors. Therefore, adding hardware resources would require redesigning the operating system, as current operating systems cannot utilize newly added hardware to increase their efficiency [1].

2. Resource allocation: Current operating systems are designed for computation with limited resources. This may not be the case for multicore systems. Since multicores require abundant resources for their computation, such resources are expected to be added accordingly for effective execution. For example, with the recent advancements in hardware multilevel caches, inefficient allocation of the cache will result in performance degradation [12].
For example, with the recent advancements of hardware multilevel caches, inefficient allocation of cache will result in performance degradation [12]. 3. Parallelism : If the current trend of multicore processor continues, the workload of an operating system managing the number of cores will continue to increase. Dividing the core management workload and handling concurrently require par­ allelism in kernel level. Parallelizing the kernel is difficult and m arginally suc­ cessful. This pushes us to seek new approaches [13]. Simply tuning applications to get advantage of the available cores may not be good enough when there is a mass deployment of cores. Therefore, the operating systems for the future mul­ ticore systems have to be redesigned or developed to effectively manage the cores among the applications to achieve/exceed the expectations of the users [14j. 6 As multicore systems offer more cores, handling these cores to serve the applica­ tions is becoming more challenging. 1.5 Schedulers Trend To meet the requirement of multicore systems, th e most im portant com ponent of operating system th a t m ight need a radical change is the processor scheduler. Most developments in the operating system dom ain in th e last several years have been on file system, security, device management, and user interface, and only minimal changes have been proposed for schedulers. These schedulers have been focusing on effectively multiplexing the CPU among th e com peting processes to assure fairness, quick response, and minimize the turnaround time. T he traditional operating systems such as Linux, Windows, and Solaris schedule the processes using time multiplexing. T he approach of tim e multiplexing alone is not suitable for multicore schedulers for several reasons. F irst of all, with the ad­ vancements of hardware and the availability of multilevel cache hierarchy, scheduling a core to a thread exclusively could reduce the latency and hence increase the perfor­ mance. T hat is. the thread scheduled alone in a core can effectively use the cache to reduce its execution time. Secondly, tim e multiplexing does not effectively deal with distributing the work among different cores so th a t no core sits idle when there is heavy workload on peer cores. Finally, tim e multiplexing does not reduce the im pact of access to cache and DRAM, which are considered expensive operations. Rather, it could increase the overhead on accessing shared resources such as cache, memory, and network. These reasons force us to re-think th e scheduler design. The operating system literature on multicore systems is relatively limited and most of the publications are within the last five years [1,6,8,10,12-36]. In th a t, only 7 a small portion is related to multicore scheduling algorithm s [12]. Most of them are related to effectively dealing w ith resource contention. The proposed approaches to design scheduling algorithm s for multicore systems vary greatly and differ in their recommendations. One view is th a t multicore systems have in fact simplified the scheduling. T h at is, since we have plenty of cores, there is no need to worry about intricate time multiplexing strategies, simply giving enough cores to applications will simplify the scheduling. On the other extreme, several researchers feel th a t multicore systems have complicated the scheduling task, as it needs to con­ sider several factors such as cores, caches, networks, and application requirements together in offering best possible service. 
A number of work suggest ideas in between these extreme cases. Most of these ideas are related to cache contention. Again, related to resource contention, there are two views. One group strongly advocates to incorporate contention aspects into th e scheduling algorithm s [12,23]. A nother group argues to decouple contention management from scheduling [34,36]. There is another direction of research th a t explores th e question of whether to keep operating systems and applications together or separately [13,22]. Overall, the field of multicore scheduling is very young and the proposed ideas on multicore scheduling are preliminary. 1.6 W here do th e contributions of th is th esis fit in? From the above discussion, we infer th a t the software systems th a t worked well for sequential systems might not effectively work w ith the multicore systems. There­ fore, there is a gap between the rapidly emerging hardware technology and relatively slow software technology. The recent research trend indicates that the software sys­ tems, particularly the operating system m ust be redesigned to reduce this gap. More importantly, new scheduling algorithms must be developed to utilize the multiple re­ sources offered by multicore systems. This thesis is an effort to help achieve this goal. More specifically, this thesis contributes to help develop new scheduling algorithms for multicore systems. The most difficult aspects of developing a novel scheduling algorithm are imple­ menting and testing its performance [25]. We believe a flexible multicore scheduler simulator framework with proper support for simulation and testing would be very useful. Developing such a comprehensive framework is the prim ary goal of this thesis. 1.7 T hesis O rganization The fundamentals of multicore scheduling and th e related work are presented in C hapter 2. Next, in C hapter 3, we present the motivation, contributions, and re­ search methodology. In C hapter 4, we present the design and the im plem entation of the multicore scheduler simulation framework. The framework and its im plem enta­ tion are the m ajor contributions of this thesis. We present the im plem entation and simulation study of Linux and Solaris scheduling algorithms in C hapter 5, and a new gang scheduling algorithm is presented and its performance compared to two well known gang scheduling algorithms in C hapter 6. Finally, in C hapter 7, we conclude the thesis and list some future directions to extend th e work carried out in this thesis. 9 Chapter 2 Multicore Scheduling Scheduling is a fundam ental problem in several systems. In processor scheduling, threads are assigned to processors for execution w ith the objective of optim izing cer­ tain performance metrics such as maximum throughput, minimum average response time, minimum average waiting time, an d /o r maximum C PU utilization. Threads are schedulable entities which achieve the intended tasks by their execution. Cores are physical execution units. In a multicore context, scheduling can be viewed at two levels: balancing the system load among the cores and multiplexing threads on a single core. In actual im plementations, these two tasks could be integrated as one scheduling module. A scheduling algorithm is a set of rules th a t define how to select the next thread for execution. This problem is well studied in single processor(eore) context and numerous scheduling algorithms exist in the literature [37,38]. 
Present multiprocessor operating systems such as Linux and Solaris use a two-level scheduling approach [12]. In one level the scheduler balances the load across cores, and in another level the scheduler uses a distributed run queue model with per core queues and local scheduling policies 10 to manage each core. 2.1 Load B alancing In general, load balancing in multicore or homogeneous multiprocessor systems can be done in several ways [12,39]. In one extreme, all the jobs can be kept in one shared global queue and schedule a job from this queue whenever a core becomes free. This approach is simple and balances the workload effectively, and lienee appears to increase the core utilization. But, in reality, this approach degrades th e performance due to cache pollution as there is a high probability of jobs frequently m igrating from one core to another. Modern systems achieve high performance by effectively exploiting the locality of reference, and by keeping frequently accessed d ata in local cache. Job migrations rarely utilize this benefit, and hence a significant am ount of time is spent on accessing d ata from farthest locations such as last level cache and memory. On the other extreme, each core can m aintain its own queue of jobs to b etter m an­ age cache affinity and other local resources. This case allows several load balancing strategies by m igrating jobs from one local queue to another [39]. There are four sim­ ple approaches. The first one is called sender-initiated policy in which lightly loaded cores initiate requests for jobs from other cores. This technique is also referred to as work stealing from other cores. In the second approach, called receiver-initiated pol­ icy, the heavily loaded cores request other lightly loaded cores to take jobs. T he third approach is the combination of both sender-initiated policy and receiver-initiated pol­ icy, and therefore called symmetric policy. The fourth one is that the heavily loaded core simply chooses a random destinations to m igrate some of its jobs. This simple strategy is found to be working well in practice. 11 The above general load balancing schemes are applicable only for systems with homogeneous processors, and are not suitable for parallel applications or the types of jobs which are heterogeneous with differing priorities. 2.2 Scheduling A pproaches In single processor system, scheduling of different jobs on a processor is typically done by tim e sharing. The basic idea of tim e sharing scheduling is th a t th e processor time is divided into chunks of time called tim e quanta, and each application executes in different, time quanta to complete its task. The critical factor affecting this tech­ nique is the tim e quanta, say, q. W hen a q is set very large, the applications run longer to complete their executions. W hen q is set smaller, th e applications interleave frequently. The popular time sharing technique with effective interleaving executions is round robin scheduling [37,38]. Almost all uniprocessor scheduling algorithms used in modern operating systems are time sharing. Among the tim e sharing algorithms, the most practical algorithms use multilevel feedback scheduling strategy. The basic idea behind multilevel feedback algorithms is th a t they use different priority queues to manage jobs w ith varying importance, and the jobs move between queues as their priorities change. The jobs with a lower priority will be served only if the higher priority queues are empty. 
The scheduling algorithms used by popular operating systems such as Linux, So­ laris, Mac OS, and Windows are some sort of multilevel feedback algorithm s. These algorithms with suitable load balancing technique have been adapted for multicore processors. (For this thesis, we have simulated Linux and Solaris schedulers.) The orthogonal technique to time sharing is space sharing, and it is applicable only in multiprocessor systems. It is an effective generic approach of scheduling m ulti­ 12 threaded applications on multiprocessor systems. T he basic idea is th a t different applications use different sets of processors during their lifetime. That is, the scheduler dedicates a set of processors to an application for its entire lifetime. Although this technique might offer excellent service to the applications, it may not be good for the utilization of system resources. In this approach, the processor will be idle when the application goes for I/O , or waiting for an event or synchronization. An effective variant of space sharing approach to parallel applications is th a t, instead of dedicating a set of processors to an application for its lifetime, parallel threads of an application are scheduled together for a fixed period of time. This technique, originally called co-scheduling later referred to in the literature as gang scheduling, was introduced by O usterhout [40]. Gang scheduling efficiently uses busy waiting for frequent synchronization. In th e literature, frequent synchronization is also referred to as fine-grained synchronization. The idea behind gang scheduling is simple th a t threads of a same process are scheduled together as a ‘gang’ on distinct processors so th a t they can progress in parallel and synchronize with minimal busy waiting involved. A gang is an application containing a set of parallel threads th a t frequently communicate with each other. During their executions, threads in a gang communicate for synchronization and d ata exchange. Often, a thread in a gang cannot proceed further w ithout sufficient progress from other threads. Such threads either do busy waiting or block themselves by suspending from execution until other threads progressed enough. A long busy wait on a processor wastes its execution time. On th e other hand, suspending and resuming processes often are also not good, when only a small wait is needed. Blocking results in context switches, which are costly. For several applications inducing small waits, research shows th a t a busy wait is b etter th a n blocking. G ang scheduling algorithms are typically designed to exploit the above observation. 13 In gang scheduling approach, different applications can use the same set of proces­ sors in different time quanta, and same application can use different sets of processors in different time quanta. This is an effective technique th a t provides an excellent service to parallel applications with threads involving similar loads and fine-grained synchronization. Using our scheduling simulator developed for this thesis, we study two popular gang scheduling algorithms and propose an improved gang scheduling algorithm . We believe th a t the proposed algorithm can be used for scheduling gangs in cloud com­ puting, as explained next. 
2.2.1 Gang Scheduling in Multicore and Cloud Computing

It has recently been predicted that the next decade will bring microprocessors containing hundreds, thousands, or even tens of thousands of computing cores, and that computational clusters and clouds built out of these multicore processors will offer unprecedented quantities of computational resources [8,22,41]. We discuss the relevance of multicore scheduling in cloud computing assuming that this prediction will come true.

Cloud computing is a service-oriented computing paradigm. It is designed to transparently provide services such as computation, software applications, data access, data management, and storage resources to customers through the Internet [8]. As cloud computing offers computing as a service, customer satisfaction with the services received is extremely important. Customer satisfaction is mainly related to cost, fairness, and quality of service. Of these, fairness and quality of service are often related to system performance. In particular, these metrics are primarily influenced by the execution of applications in the cloud, which in turn depends heavily on processor scheduling in the cloud. That is, to offer services effectively to customers, the service requests must be properly mapped to the available computing resources in the cloud. This is simply a scheduling problem, and it generally involves sequencing and assigning a set of applications to one or more processors (servers) so that the intended criterion is met, while maintaining the maximum possible utilization of system resources. Therefore, processor scheduling is a fundamental problem in cloud computing, as it is involved in almost all services that the cloud can offer. As the servers of the cloud are expected to be built from multicore processors, scheduling of multicore processors is an integral component of cloud scheduling [8,22,41].

Among the applications of cloud computing, a considerable portion is expected to come from high performance computing groups. Such applications require huge computational resources. Some of these applications are typically designed as parallel threads with frequent synchronization among themselves; these applications are basically gangs. Recent research suggests that gang scheduling can be effective in cloud computing [42].

2.3 Related Work

To set the context for our work, we reviewed the work on operating systems for multicore processors. In this section, we review the work specifically related to our contributions. This thesis has contributions relating to multicore scheduling simulation, a performance study of the Linux and Solaris scheduling algorithms, a performance study of three gang scheduling algorithms, and the fairness aspect of gang scheduling algorithms. Next, we review the work related to these contributions.

2.3.1 Scheduler Simulators

A number of simulators for multiprocessors have been proposed in the literature [25,26,28,43-47]. Among them, Simics [28], SimOS [46], SimpleScalar [43], and AMD SimNow [47] emulate the processor at the instruction set level. They differ in the architectures they emulate and in how they simulate other components such as I/O and the network. Simics simulates hardware that can run unmodified operating systems such as Solaris, Linux, Windows XP, and Tru64. Simics supports the following processor models: UltraSPARC, Alpha, x86, x86-64, PowerPC, MIPS, IPF, and ARM.
In addition, Simics simulates the device models well enough to execute the device drivers.

SimOS simulates the hardware components needed to boot, study, and run the IRIX operating system and the applications that run on IRIX. SimOS speeds up simulation by changing the mode of execution. Three modes of execution are provided in SimOS: emulation, rough characterization, and accurate. Emulation mode models only the hardware required to execute the workloads, leaving out uninteresting execution such as booting the operating system, reading from the disk, and initializing the workload; this is the fastest mode. The rough characterization mode approximates the behavior of the system by also simulating those uninteresting executions; it is two or three times slower than the emulation mode. The accurate mode emulates the complete system, and is therefore the slowest and most time consuming; it can be used for measuring the accuracy of the system under simulation.

AMD SimNow simulates the AMD family of processors. SimpleScalar simulates a close derivative of the MIPS architecture. Turandot [44] emulates PowerPC. A recent simulator called COTSon [45] builds on AMD SimNow and employs a statistical sampling technique that can selectively turn the simulation on and off to reduce the overall simulation time.

Since all these simulators emulate machine instructions completely, they all have fine-grained accuracy at the instruction level. Some of them are used as virtual machines. However, they are very slow, as they have to interpret each machine instruction in software. For example, SimOS, the fastest of the group, can execute workloads only somewhat less than 10 times slower than the underlying hardware, and it attains even this speed by simulating other components only approximately. Also, as these simulators run on a host operating system, the host's own scheduling further slows down instruction execution. Such fine-grained simulators are better suited to studying the low-level functionality of processors. They are not suitable for rapid simulations aimed at getting quick insight and guidance for developing new scheduling algorithms for future multicore processors. They are machine dependent and emulate only existing hardware. Implementing a scheduling algorithm in a system supported by a regular operating system is hard; therefore, simulating a new scheduling algorithm for a performance study in these simulators is time consuming and hard.

A simulator of the Linux scheduler called LinSched was proposed in [26]. This simulator was built by modifying the original Linux kernel, and it runs in user mode. The objective of LinSched was to study the Linux scheduler in depth; it was not intended for implementing new scheduling algorithms or for comparisons with existing scheduling algorithms.

Recently, a toolset called AKULA [25] was proposed to study the scheduling of threads on multicores so as to reduce their cache contention. The AKULA toolset assumes the availability of a task profile, termed bootstrap data, describing cache behavior.
Solo execution tim e is m easured when th e thread runs alone in the real machine and the degradation m atrix is th e degradation value from their solo execution time when a thread is scheduled w ith other threads in different cores which share the cache. For example, there are two threads A and B which are scheduled in two core system. The degradation m atrix contains degradation value of of thread A when thread B is scheduled in another core and vice versa. Suppose the degra­ dation value of thread A when scheduled with th read B is 0.75, then the slow down percentage is 75. Threads on a multicore system can be scheduled in a num ber of ways. Consider a system with two cores and two level cache memories LI - local to each core, and L2 shared by both. In this system, all threads can be scheduled to one core or they can be distributed between two cores in several ways. These different ways of scheduling will have different influence on both LI and L2 level caches. To observe th e cache behavior, we need to collect the cache d ata in all possible ways of scheduling, which will result in large number of combinations. AKULA collects the cache behavior for a lim ited set of scheduling combinations. It assumes th a t the threads scheduled in the cores are allotted with full LI cache and observes the L2 level cache effect. T h at is, tim e sharing on a core is not allowed. For example, there are four threads, say A, B, C, and D. need to be scheduled and there are two cores in the system which share L2 cache. T he degradation m atrix will have the degradation value for the following 12 com binations of threads: AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, and DC. 18 If the number of cores is increased to 3, then a total of 24 combination of d ata have to be collected. So, the size of the d a ta increases drastically w ith increasing number of cores as well as increasing number of threads. Also, executing these tasks in real machine to collect the bootstrap trace is tedious and very tim e consuming task. More importantly, AKULA only runs on th e profile d ata created by actual executions, it cannot be used for testing new workloads. Therefore, it limits its usage only to the system and the workload for which the trace is collected. Even changing a single param eter such as speed, number of cores, cache, workload will make th e trace unusable. Due to this restriction, the use of AKULA is very limited. 2 .3 .2 F airn ess a n d P erfo rm a n ce Issu e s in G a n g S c h e d u lin g Gang scheduling has been extensively analysed and several studies have concluded th a t gang scheduling is one of the best approaches for parallel applications, and hence several gang scheduling algorithms under different conditions have been proposed in the literature [40,42,48-55]. Among them First F it (or First-Come-First-Served (FCFS)) and Best F it (or Largest-Gang-First-Served (LGFS)) are popular [53]. W hen enough processors are free, FCFS chooses the job at the head of the queue to schedule and LGFS chooses the largest job in the queue to schedule. FCFS assures high fairness, b u t does not guarantee the best processor utilization. Consider th a t a larger gang G is in the head of the queue, and several other smaller gangs are waiting behind G. Assume th at there are not enough processors to schedule G, but several other processes from the queue can be scheduled. Now, in FCFS, these processors will be idle until enough processors become free and G is scheduled. 
Such situations will not only make the processor utilization low, but also have the potential to increase the average waiting time. To avoid such situations, a modification called adaptive FCFS (AFCFS) was introduced [53]. W hen a gang in the head of the queue cannot 19 be scheduled, AFCFS schedules other gangs behind in the queue. All these algorithm s are susceptible to starvation. To avoid starvation, these algorithms adopt the policy of migrating jobs from processor to processor [53]. Such task m igration may not be effective for multicorc systems. Task migration is hardly possible between two heterogeneous m ulticore systems, and generally expensive even between two homogeneous systems [39,56]. Also, an efficient task migration can reduce th e wait tim e, but it does not guarantee to eliminate starvation. Therefore, a simple solution to avoid starvation in gang scheduling is an interesting open problem. Avoiding starvation is an interesting theoretical problem. But, for practical appli­ cations, a b etter fairness measure th a n free of starvation is most desirable. We found th a t no such fairness metric has been introduced and used in this context. Regarding performance, the most widely used metric in the processor schedul­ ing context is average response time. We believe a predictable performance is more valuable than b etter average response tim e1. 2 .3 .3 P erfo rm a n ce S tu d y We have conducted performance studies on two sets of scheduling algorithms. Next, we discuss the work related those studies. 2.3.3.1 L inux vs. Solaris S ch ed u lin g A lg o rith m s Linux and Solaris schedulers have been constantly tuned and updated [12,57]. Even the most popular 0(1) Linux scheduler introduced in version 2.6 was initially luit is m ore im portant to m inim ize v a ria n c e in th e response tim e th a n to m inim ize th e average response tim e. A system w ith reasonable and p re d ic ta b le response tim e may b e considered m ore desirable th a n a system th a t is faster on th e average b u t highly variable. However, little w ork has been done on C P U scheduling algorithm s to m inim ize th e variance." [37] 20 expected to be used for a long time, has been overshadowed by the introduction of a completely fair scheduler (CFS) [12,58]. We decided to study th e performance comparison between Linux CFS and the Solaris 10 scheduler. To th e best of our knowledge, we could not find a performance comparison between Linux and Solaris schedulers. Even the complete description of the scheduling algo­ rithm s of these operating systems are not comprehensively described in one place. We put a lot of effort to construct the complete algorithm in bits and pieces from various sources for our simulation study. 2.3.3 .2 A F C F S vs. LG FS G an g S ch ed u lin g A lg o rith m s Among the popular gang scheduling algorithm s, AFCFS alleviates the problem of low processor utilization, and performs b etter th a n LGFS under light loads. The performance of these two algorithms have been studied in [51-53]. LGFS, on the other hand performs b etter than AFCFS under heavy loads. 2.4 Sum m ary In this chapter, we explained multicore scheduling in a higher level, and looked at load balancing and scheduling approaches. Then, we discussed an interesting class of parallel job scheduling called gang scheduling and its relevance to multicore and cloud computing. W ith this background, we are ready to present the m otivation and contributions of the thesis. 
Chapter 3

Motivation and Contributions

This chapter presents the motivation and contributions of this thesis, and the methodology used. This thesis contains three main contributions: (i) a multicore scheduler simulation framework and its implementation; (ii) simulation studies of two popular scheduling algorithms to illustrate the use of the proposed scheduler simulation framework; and (iii) a new scheduling algorithm and its performance study. We start with the motivation for the first contribution.

3.1 Multicore Scheduler Simulation Framework

From the literature study presented in Chapter 2, we identified two potential choices for our thesis work: (a) design and propose a new or improved multicore scheduling algorithm; and (b) design and propose a multicore scheduler simulation testbed in which any new scheduling algorithm can be easily simulated and analysed.

The first choice stems from the observation that, as the field is young and no widely accepted, concrete multicore scheduling algorithms have been proposed in the literature, there is a good possibility of inventing an efficient multicore scheduling algorithm. The second choice stems from the observation that, as the trend in processor technology indicates that future systems will have hundreds and thousands of cores, any new algorithm proposed for such systems must be studied carefully on a large number of cores, using a proper set of experiments, before it can be adopted for practical use.

After several brainstorming discussions, we started to realize that both choices are risky and challenging, as they have open-ended goals. However, we felt the second choice had the potential for wider impact, and we therefore chose to proceed with designing and implementing a flexible multicore scheduler simulation framework.

Designing and implementing a multicore scheduler simulation framework involves exploring several research questions, among them:

• What would be the main purpose of the framework?
• What would be the components of the framework?
• How accurately can the components be modeled?
• What level of accuracy do we want in modeling the components of the framework?
• To illustrate the use and flexibility of the framework, which scheduling algorithms can we implement and study?
• What kinds of simulation experiments would we like to conduct?

From the literature, we understood that no simulator can replace a real system. However, the design and implementation choices vary greatly depending on cost and accuracy. Here, the cost is a function of effort, time, complexity, and performance. The accuracy depends on the purpose of the simulation. Our design choices for the scheduler simulation framework are mainly motivated by the following observations:

• "Simplicity is the key to understanding. ... Simplified simulations provide the best grounds for extracting major properties quickly. ... Simulations done with realistic physical layers normally lead to investigating phenomena with too many variables, too many puzzles, leading to too few explanations, and too few hints for future progress." [59]

• "So, in practice, models that attempt to be highly accurate end up running very small "toy" workloads." [28]

• "... the biggest difficulties in scheduling algorithm development: the difficulty of implementation and the duration of testing. The difficulty of implementation refers to the time and effort needed to convert an idea into the actual code ...
The difficulty of implementation and the duration of testing make it infeasible to explore many different scheduling algorithms." [25]

These observations motivated us to design a multicore scheduler simulation framework that is simple yet flexible and comprehensive, so that the design space of scheduling algorithms for multicore systems can be explored rapidly. Our design objective for the framework is mainly to provide accuracy sufficient to gain initial insights into the performance of the scheduling algorithm under study. We believe such insights will be valuable in guiding researchers to develop new scheduling algorithms with a specific objective in mind. As a matter of fact, we encountered such an opportunity in developing an improved scheduling algorithm for a specific class of parallel applications called gangs. Next, we briefly explain how we were motivated to study gang scheduling algorithms and to propose an improved gang scheduling algorithm.

3.2 A New Gang Scheduling Algorithm

Cloud computing is an emerging class of computational platform that has the potential to provide unprecedented compute capacity to future organizations and average users [8]. The trend in multicore processors indicates that all future processors will be multicore, and hence future cloud systems are expected to have their nodes and clusters based on multicore processors [60]. So processor scheduling in future systems will most likely be multicore processor scheduling. Therefore, multicore scheduling is fundamental to future cloud computing performance. Also, due to the multicore revolution, a considerable portion of large applications will be parallel programs. From the literature, we can see that gang scheduling is a dominant strategy for scheduling parallel programs that require frequent synchronization.

Among the popular gang scheduling algorithms, AFCFS alleviates the problem of low processor utilization, but is susceptible to starvation. LGFS, as claimed in the literature [42], outperforms AFCFS under large loads, but is again susceptible to starvation. To avoid starvation, these algorithms adopt a process migration policy. Process migration in this context means migrating gangs between multicore systems. Migrating gangs may not even be possible between two heterogeneous multicore systems, and is generally expensive even between two homogeneous systems [39, 56]. Gang migration can reduce the overall wait time of the migrating gang, but it does not guarantee the elimination of starvation. These observations raise a question: can we design a gang scheduling algorithm with the following characteristics?

1. Freedom from starvation.
2. Predictable and acceptable response.
3. Better processor utilization.

Since LGFS favors large gangs, the smaller gangs are susceptible to starvation or will have longer wait times. This is unacceptable, particularly in a cloud environment where customer satisfaction hugely depends on fairness and predictable response time. In practice, the customers who receive slightly faster service (at the expense of others' long wait) may not be noticeably more satisfied [61]. But the customers who experience unpredictably long delays, on the other hand, will readily notice the unfairness and unpredictable response, and that could potentially drive the cloud business in a negative direction.
Therefore, in addition to fast response and high processor utilization, minimal variance in response time is extremely important for quality of service in cloud systems.

3.3 Contributions

This thesis has contributions in the following three categories:

1. Inventive
• A new multicore scheduler simulation framework
• A new gang scheduling algorithm with increased fairness

2. Creative
• A multicore scheduler simulator (expected to be released as open source software)

3. Experimental
• Performance study of the Linux and Solaris scheduling algorithms
• Performance study of two popular gang scheduling algorithms, AFCFS and LGFS
• Performance study of a new gang scheduling algorithm and comparison with AFCFS

From another angle, this thesis has contributions in theory, algorithms, implementation, and experiments. The scheduling framework is a theoretical abstraction of the scheduling system. The new gang scheduling algorithm is an algorithmic contribution. The scheduling simulator is an implementation of the framework. Finally, the experimental contribution is the performance study of five scheduling algorithms: the Linux and Solaris scheduling algorithms, and three gang scheduling algorithms.

These contributions have several benefits. The insights obtained from the experimental evaluation will help: (i) users, to effectively exploit the hidden power of the above-mentioned schedulers; and (ii) researchers, to design new and efficient multicore scheduling algorithms by combining the best ideas of the studied algorithms and perhaps adding new ones. The simulation tool can be used to evaluate any newly designed multicore scheduling algorithm under various conditions before it is adopted for real systems.

The proposed gang scheduling algorithm is simple, fair, and gives predictable performance. Such predictable performance is attractive from the service point of view. Also, the algorithm is scalable, as it solves the starvation problem locally without using process migration. High performance, fairness, and scalability are attractive properties for cloud computing. Therefore, the algorithm is applicable to cloud systems built from multicore processors.

3.4 Research Methodology

The methodology we followed is generally referred to as constructive research methodology, where the construct could be a new theory, algorithm, model, software, system, or framework. This methodology is most common in engineering and computer science, particularly in systems research. Constructive research methodology involves innovative modeling, design, implementation, and experimentation.

Multicore scheduling is a challenging problem, and we determined to explore it in a systematic fashion, using a combination of theory and practice with an experimental approach. First, the focus is on thoroughly understanding the theory behind processor scheduling by studying and evaluating existing scheduling schemes. Second, based on this understanding of the literature, a multicore scheduler simulation framework is designed and implemented, with components that can be useful in implementing a new multicore scheduling algorithm. Third, the components necessary for studying the performance of multicore scheduling algorithms are determined and added to the framework.
Finally, the functionality of the proposed framework is demonstrated by simulating five scheduling algorithms, and then illustrating the performance of the simulated scheduling algorithms through performance monitoring and performance metrics.

The five major steps involved in the methodology are:

1. Literature survey
2. Design and implementation of the multicore scheduler simulation framework
3. Identification and implementation of performance monitoring components and performance metrics
4. Implementation of three popular classes of scheduling algorithms: Linux, Solaris, and gang scheduling
5. Conducting simulation experiments on five scheduling algorithms and illustrating their performance results

3.5 Summary

In this chapter, we presented the motivation for our contributions, listed the contributions, and described the research methodology that we followed. With this background, we are ready to present the major contribution of this thesis, the multicore scheduler simulation framework, in the next chapter.

Chapter 4

Multicore Scheduler Simulation Framework

This chapter presents the architecture of the multicore scheduling simulation (MSS) framework, which is the major contribution of this thesis. The primary use of this framework is to help researchers implement and simulate multicore scheduling algorithms and to evaluate their performance visually and statistically. Before presenting the framework, we briefly explain the simulation technique we used.

4.1 Simulation

Computer simulation is a technique to model and observe the behavior of some real or imagined system over time [62]. A simulator is simply a computer program that transforms the state of the system at discrete time points over a specified period of time. Simulation is widely used to study the dynamic behavior of complex systems.

Based on how the system state is modeled and simulated, computer simulations are classified as either continuous or discrete. If the state variables change continuously over time, the simulation is called continuous; if the state variables change only at discrete times, it is called discrete. In reality, most systems are a combination of both. However, depending on the purpose, most systems are simulated either as continuous or as discrete, and rarely as a hybrid of both.

Discrete simulation is further divided into time-stepped and event-driven simulation, based on how the simulation time is advanced and the system state is updated. In time-stepped simulation, the system state is updated at every time step. In event-driven simulation, the system state is updated at the occurrence of events. Discrete event simulation consists of an events list, a simulation clock, and an event scheduler. For instance, in simulating the behavior of a queue at a bank teller, the number of customers arrived and the number of customers served are state variables, and they are updated at the occurrence of events in the system: the number of customers arrived is updated when a customer arrives at the bank, and the number of customers served is updated when the teller serves a customer. Simulation continues until either the events list becomes empty or the simulation time ends. In discrete event simulation, the simulator maintains a queue of events (also called the event list) sorted by the simulated time at which they should occur.
The simulation clock maintains the simulation time, and it is advanced to the time of occurrence of the next event in the event list. Since it is not important to execute the simulation in real time, the advancement of the simulation time can be the same as, faster than, or slower than real time. For example, in a simulation of humans evacuating a building, the queue buildup can be visualized faster than real time; the current flow through an electric circuit can be simulated slower than real time; and in-training simulations (for example, flight simulators) can run at real-time speed. The event scheduler executes events from the events list, and the system state changes at the occurrence of each event. We use both discrete time-stepped and discrete event simulation to implement different components of our framework.

4.2 Multicore Scheduler Simulation

From a systems point of view, multicore scheduling essentially has two tasks: maintaining the load among the cores, and multitasking threads on each core. Generally, the first task is referred to as load balancing and the second as local scheduling. We maintain these abstractions in our scheduling framework. Load balancing manages the jobs across the cores, and local scheduling directs the core to switch between threads. Thread switching (or context switching), say from Ti to Tj, requires saving the context of Ti and loading or restoring the context of Tj. Also, certain tasks must be performed when a thread completes its execution. In the simulation context, these tasks amount to updating the simulation system state. In our framework, we provide generic routines for these tasks.

In essence, implementing the load balancing and local scheduling routines is the programming effort needed to use our simulator to study the performance of a new scheduling algorithm. To reduce the effort of implementing a scheduler further, we provide a default load balancing routine and sample local scheduling routines. These routines can be suitably modified to implement a new scheduling algorithm, unless the new algorithm is completely novel and does not follow the load balancer and local scheduler structure; even in that case, the entire scheduler can be designed from scratch with little effort using our framework. In any case, from our experience, we are confident that once the logic of a new scheduling algorithm is completely understood, implementing it in our simulator is straightforward.

With this background on the multicore scheduler, we next explain the overall multicore scheduler simulator, which is a part of the framework. The simulator is illustrated in Fig. 4.1. It has seven components: simulation clock, simulation manager, arrival handler, arrival queue, scheduler, state, and trace. The scheduler has four routines: load balancer, local scheduler, context switch handler, and completion handler.

[Figure 4.1: The Scheduler. The simulation manager advances the simulation clock; the arrival handler adds new jobs to the arrival queue; the scheduler updates the state and the trace.]

The simulation of the system basically involves updating the state of the system at every simulation time point and incrementing the simulation clock. Simulation clock advancement and state update could be done in an integrated fashion, but for a modular design of the scheduling framework we decided to keep these two tasks separate.
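To make the event-driven part of this design concrete, the following is a minimal sketch of how a simulation manager can advance the clock through a time-sorted event list. It is written in Java, the language of our simulator, but the class and method names (Event, SimulationManager, and so on) are illustrative stand-ins, not the actual simulator API.

    import java.util.PriorityQueue;

    // A minimal discrete-event simulation core: events are ordered by their
    // simulated occurrence time, and the clock jumps from event to event.
    abstract class Event implements Comparable<Event> {
        final long time;                        // simulated time of occurrence
        Event(long time) { this.time = time; }
        public int compareTo(Event other) { return Long.compare(time, other.time); }
        abstract void execute(SimulationManager sim);   // updates system state
    }

    class SimulationManager {
        private final PriorityQueue<Event> eventList = new PriorityQueue<>();
        private long clock = 0;                 // the simulation clock

        void schedule(Event e) { eventList.add(e); }

        // Run until the event list is empty or the simulation horizon is reached.
        void run(long endTime) {
            while (!eventList.isEmpty() && eventList.peek().time <= endTime) {
                Event next = eventList.poll();
                clock = next.time;              // advance clock to the next event
                next.execute(this);             // state changes only at events
            }
        }

        long now() { return clock; }
    }

A time-stepped component can be layered on top of this loop by having a recurring tick event reschedule itself at a fixed interval.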
In our framework, we designed a component called the simulation manager to advance the simulation clock and to initiate the appropriate routines to update the system state. Updating the simulation system state typically involves recording the arrival of new jobs and updating the state related to scheduling. To implement the arrival of new jobs, we introduced a queue called the arrival queue and implemented a routine to register newly arrived jobs in it. Updating the state related to scheduling depends on the scheduler logic, and it must be done as part of the implementation of the scheduler.

The simulated executions of jobs are recorded as a simulation trace, and a trace entry is recorded at every scheduling point. To make this task generic, we have standardized the format of the execution trace and implemented a routine to add trace entries appropriately during the simulation. This task of updating the trace is handled automatically by the context switch handler routine. With this background, we now introduce the architecture of the multicore scheduler simulation framework.

4.3 Architecture of the MSS Framework

First we introduce some basic terminology used in our framework. Hereafter, to avoid confusion about what the term processor really refers to in the simulation context, we avoid its usage in the rest of the thesis. (Before the multicore era, the term processor referred to a single execution unit, and was therefore used synonymously with central processing unit (CPU). Now, with multicore technology, a processor has several execution units.)

4.3.1 Terminology

• Core: The hardware execution unit.
• Chip: An integrated circuit containing one or more cores.
• Machine: A collection of chips designed to work together.
• Thread: The smallest unit that can be scheduled to a core for execution.
• Gang: A set of parallel threads that are executed together to achieve a task.

A higher-level architecture of the MSS framework is given in Fig. 4.2. The framework has five main logical components: workload generator, multicore machine, multicore scheduler, execution trace, and performance calculation engine. We explain them next.

[Figure 4.2: Architecture of the MSS Framework. The workload generator feeds the scheduler, which dispatches threads to the cores of the machine's chips; core execution is recorded as a trace, which drives the performance calculation engine and the activity profile generator.]

4.3.2 Workload Generator

We implemented the workload generator to generate two types of workloads: threads and gangs. Threads are generated for traditional applications and gangs for parallel applications. The gangs of parallel threads typically execute in a synchronized manner.

The inputs for generating a workload of threads are the number of threads, the mean arrival rate, and the mean service rate. To generate a workload of gangs, the number of gangs is given instead of the number of threads; the number of threads within each gang is generated uniformly in the range 2 to M, where M is the number of cores. The output is a set of threads or gangs, depending on the generator chosen. Each thread has a unique id, arrival time, execution time, I/O points, priority, and application class. Similarly, each gang has a gang id, arrival time, execution time, and a set of parallel threads with their own ids. A sketch of this generation scheme is given below.
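The following hedged sketch illustrates the generation scheme just described, drawing Poisson arrivals (exponential inter-arrival times) and exponential execution times by inverse-transform sampling. The SimThread record is a simplified stand-in for the simulator's actual job attributes: I/O points, priority, and application class are omitted, and a gang generator would additionally draw a uniform gang size in [2..M].

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    // Simplified workload generator: Poisson arrivals, exponential service times.
    class WorkloadGenerator {
        private final Random rng = new Random();

        // Inverse-transform sampling of an exponential with the given rate.
        private double exp(double rate) {
            return -Math.log(1.0 - rng.nextDouble()) / rate;
        }

        record SimThread(int id, long arrivalTime, long executionTime) {}

        List<SimThread> generateThreads(int count, double arrivalRate, double serviceRate) {
            List<SimThread> threads = new ArrayList<>();
            double clock = 0.0;
            for (int id = 0; id < count; id++) {
                clock += exp(arrivalRate);          // exponential inter-arrival gap
                long execution = Math.max(1, Math.round(exp(serviceRate)));
                // Real-valued times are rounded to integer simulation-clock units.
                threads.add(new SimThread(id, Math.round(clock), execution));
            }
            return threads;
        }
    }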
The I/O points of a thread are the times at which the thread initiates I/O. For example, assume that a thread has an execution time of 50 and will go for I/O at times 5, 20, and 30. In this case, I/O requests will be invoked after the thread has executed for 5 units, 20 units, and 30 units. The number of I/O points is generated uniformly within the execution time range, and the I/O wait time is generated uniformly within a range specified in the configuration.

4.3.3 Multicore Machine

The multicore machine is the computing unit, organized as a hierarchy with the machine at the highest level, as shown in Fig. 4.3.

[Figure 4.3: A Multicore Machine. Each chip contains cores with private L1 caches and a shared L2 cache; the chips share an L3 cache and main memory.]

A multicore machine contains one or more chips, and each chip contains two or more cores. The core is the physical execution unit. In practice, a core is capable of executing one or more threads in an interleaved way, referred to as hyper-threading. In our simulator, we model a core as executing one thread at a time. We believe that, with a suitable scheduling policy, the effect of hyper-threading can be approximately simulated using single-threaded cores.

A multicore machine has a hierarchy of cache memories. Current multicore systems have three levels of cache, referred to as L1, L2, and L3. As shown in Fig. 4.3, L1 is local to each core, the cores in a chip share that chip's L2, and L3 is shared by all the cores in the machine.

4.3.4 Multicore Scheduler

As explained earlier, our framework implements a multicore scheduler with two logical components, a load balancer and a local scheduler. The load balancer is responsible for maintaining a desired balance of the system load. This task involves dispatching new jobs to the appropriate local scheduler, and migrating jobs from one local scheduler to another when necessary. The local scheduler is responsible for scheduling jobs to the cores for execution. Generally, each core is assigned a local scheduler, but other choices are possible; in our simulation, local scheduling is implemented with centralized control.

In the simulation context, the local scheduler primarily chooses a job for execution, determines the amount of execution time, calculates the job's progress rate, and produces the trace. The progress rate of a job is the crucial design factor affecting the accuracy of execution, and it depends on several factors such as the execution speed of the core and the contention for shared resources. We have used a simple cache contention model, but it can easily be replaced with an implementation of a more refined model.

4.3.5 Execution Trace

During the simulation, the execution trace is recorded at every context switch, both to generate the activity profile and to compute the performance metrics. Two types of traces are collected: the execution trace and the I/O trace. The execution trace is a collection of quintuples, as shown in Fig. 4.4:

<cid, tid, Ex-start, Ex-end, S>
<0, 0, 4, 12, 0>
<1, 1, 9, 13, 0>
<1, 1, 13, 17, 2>
<3, 4, 16, 19, 2>
<0, 2, 12, 22, 0>
<4, 354, 12459, 12459, 3>

Figure 4.4: Sample Execution Trace

Each quintuple has a core id (cid), a thread id (tid), an execution start time (Ex-start), an execution end time (Ex-end), and a status (S). The status is 0 if the thread was preempted by quantum expiration, 1 if it was preempted by a higher priority thread, 2 if it went for I/O, and 3 if it completed.
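To show how this standardized trace format might look in code, here is a small, assumed representation of the two record types together with the status encoding. The enum and record names are illustrative rather than the simulator's actual types.

    // One execution-trace entry: which core ran which thread, over what
    // simulated interval, and why the interval ended.
    enum EndStatus {
        QUANTUM_EXPIRED,   // S = 0: preempted by quantum expiration
        PREEMPTED,         // S = 1: preempted by a higher priority thread
        IO_WAIT,           // S = 2: thread went for I/O
        COMPLETED          // S = 3: thread finished execution
    }

    record TraceEntry(int coreId, int threadId, long exStart, long exEnd, EndStatus status) {}

    // One I/O-trace entry: the thread and its I/O wait interval.
    record IoTraceEntry(int threadId, long ioStart, long ioEnd) {}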
The I/O trace is a collection of triples, as shown in Fig. 4.5; each triple has a thread id, an I/O start time (I/O-start), and an I/O end time (I/O-end):

<tid, I/O-start, I/O-end>
<4, 19, 49>
<4, 52, 69>
<19, 52, 76>
<13, 43, 83>
<34, 85, 87>
<681, 12045, 12070>

Figure 4.5: Sample I/O Trace

4.3.6 Performance Calculation Engine

The performance study primarily aims to determine how well the algorithm satisfies certain criteria such as response time, fairness, and utilization of resources. The performance calculation engine calculates these values for a given set of data, and the results can be passed to the performance observation window for users to study.

The performance criteria widely used to study scheduling algorithms are the five metrics: throughput, CPU utilization, turnaround time, waiting time, and response time [37, 38]. In addition, we have included three more measures: (i) interactive response time, (ii) bypass count, and (iii) slowdown factor. Interactive response time is introduced to observe the interactive response of tasks; it is defined as the time from when a thread is ready for execution to the subsequent start of execution. To capture starvation, we introduce the bypass count metric: the bypass count indicates how many threads have bypassed a waiting thread in scheduling, and is thus an indication of unfair scheduling. We use a bypass count graph to see the level of fairness that a scheduling algorithm can assure. Finally, a relative performance measure is useful for resource allocation purposes; for that, we include the slowdown factor. The following performance metrics are supported in the proposed framework.

1. Throughput: the total number of jobs that complete execution in one unit of time. Assuming the simulation starts at time 0 and n jobs complete in a period Ts, the throughput TP is computed as follows:

    TP = n / Ts    (4.1)

2. Core utilization: the percentage of time the core spends executing jobs. Let Ts be the total simulation time and Tb the amount of time the core is busy. Then the utilization Uc of core c is computed as follows:

    Uc = (Tb / Ts) × 100%    (4.2)

3. Core idle time: the percentage of time the core is idle, without jobs to execute. The idle time Idlec of core c is computed as follows:

    Idlec = (1 − Tb / Ts) × 100%    (4.3)

4. Core wait time: the total time that the thread waits for a core.

5. I/O wait time: the total time taken by all I/O waits of the thread.

6. Wait time: the sum of the core wait time and the I/O wait time of the thread.

7. Turnaround time: the sum of the wait time and the execution time of the thread.

8. First response time: the time from submission to the start of execution of the thread.

9. Interactive response time: the time from when the thread is ready for execution to the subsequent start of execution.

10. Slowdown factor: the ratio between a thread's turnaround time and its execution time. This metric measures the delay suffered by a job relative to its actual execution time. If wti and exi are, respectively, the wait and execution times of thread i, then the slowdown factor si of thread i is computed as follows:

    si = (wti + exi) / exi    (4.4)

11. Bypass count graph: every thread is associated with a bypass counter, which is incremented whenever the thread is bypassed by another thread in scheduling.
The bypass count graph shows how many threads have bypassed a waiting thread in scheduling; it gives a sense of the fairness that the scheduling algorithm can assure.

Let x1, x2, ..., xn be the observed values; the average x̄ and the standard deviation σ are computed as follows:

    x̄ = (1/n) Σ_{i=1}^{n} xi    (4.5)

    σ = √( (1/n) Σ_{i=1}^{n} (xi − x̄)² )    (4.6)

4.3.7 Statistical Measures

For these metrics, where applicable, we calculated the minimum, maximum, average, and standard deviation. Traditionally, averages and percentages are used to study performance in this context: throughput and utilization are computed as percentages, and averages are computed for turnaround time, waiting time, and response time. We believe that better measures than the average must be used in analysing these metrics.

As cloud computing is fundamentally a service-oriented system, its primary goal is to provide quality service to its customers. Although the term quality of service is often used and directly related to response time, it is not well interpreted in the context of computing and communication systems. As indicated in [61], the quality perceived by customers need not be directly related to minimal response time; that study indicates that users are often unaware of quality differences until they cross a certain threshold. Therefore, quality of service need not always be tied to the widely used metrics such as minimal response time or minimal average response time. Thus, we believe that a predictable response time is a more appropriate measure of quality of service in the cloud computing context than measures such as average response time or resource utilization. Predictable response can be measured using variance or standard deviation. Therefore, we decided to include the standard deviation (i.e., the square root of the variance) in our framework; we believe the standard deviation is a better measure of predictability than the average value. A sketch of these statistical computations is given below.
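As a rough illustration of how the performance calculation engine can derive these statistics, the following assumed helper computes the mean and (population) standard deviation of equations (4.5) and (4.6), and the per-thread slowdown factor of equation (4.4). It is a sketch, not the engine's actual code.

    import java.util.List;

    // Minimal statistics helpers mirroring equations (4.4)-(4.6).
    class PerformanceStats {

        static double mean(List<Double> xs) {
            double sum = 0.0;
            for (double x : xs) sum += x;
            return sum / xs.size();
        }

        static double stdDev(List<Double> xs) {
            double avg = mean(xs);
            double sq = 0.0;
            for (double x : xs) sq += (x - avg) * (x - avg);
            return Math.sqrt(sq / xs.size());   // population standard deviation
        }

        // Slowdown factor: turnaround time (wait + execution) over execution time.
        static double slowdown(double waitTime, double executionTime) {
            return (waitTime + executionTime) / executionTime;
        }
    }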
4.4 Activity Profile Generator

The activity monitor provides a visualization of how the algorithm schedules the threads among the cores, and the activity profile generator is responsible for providing the data for this visualization. It derives the data from the execution trace, generating data for every clock tick. The data contains the core status, idle or occupied; if occupied, it also contains information about the thread and its status: a newly arrived thread, a preempted thread, or a thread migrated from a different core. The activity profile generator differentiates these states of the core by assigning different colors. The activity monitor window visualizes the derived core-to-thread allocation data. Using this component, we can visually observe core utilization, load balancing, and individual core statistics.

4.5 User Interface

The multicore scheduling simulator has three main user windows, and the interface within each window is organized as hierarchical panels. We explain these windows next.

4.5.1 Performance Parameter Setting Window

The performance parameter setting window is used to configure the parameters for the simulation. It consists of two screens. The first screen, shown in Fig. 4.6, is used to set the workload generation parameters. The inputs for the workload generator are the number of jobs, the arrival time distribution, the mean arrival rate, and the job execution time range.

[Figure 4.6: Parameter Setting Window]

The parameter setting window has a provision to create more than one workload. The window takes the number of processes, the arrival rate distribution, the arrival rate, and the execution range as the parameters for each workload. Once these parameters are configured, the workload can be added to the workload list by pressing the 'Add to Load' button, which adds it to the list of workloads and displays it on the right side of the window. Using this option, we can configure more than one workload.

The simulations run window, shown in Fig. 4.7, is designed to accept the input for creating simulation runs. Using this window, several simulation runs can be configured. The parameters required for each run are the number of cores, the workload selection, and the scheduling algorithm.

[Figure 4.7: Simulations Run Window]

Using the simulation run window, we can create simulation runs with the different combinations listed below:

1. number of cores vs. workload, with the same scheduling policy
2. number of cores vs. scheduling policy, under the same workload
3. workload vs. scheduling policy, when executing with the same number of cores

This simulation run setting feature simplifies the effort involved in simulating the scheduling algorithms under various conditions.

4.5.2 Performance Observation Window

The performance observation window, shown in Fig. 4.8, offers various performance metrics that can be represented as charts. The window has a list of performance metrics which can be chosen to see the computed values. Additionally, there is an option to choose two different simulation runs and compare their results. The performance metrics can be analysed by varying the number of cores and the arrival rate.

[Figure 4.8: Performance Observation Window]

4.5.3 Activity Monitor Window

The activity monitor window is an interesting component of this simulator. This screen shows the core usage in a graphical representation. With this screen, shown in Fig. 4.9, we can analyse the following:

1. Core utilization
2. Thread migration
3. The execution thread list

Using the activity monitor window, we can visualize how the threads are distributed among the cores and how effectively the load is balanced. To see individual core performance, a core monitor child window is attached to each core to show its statistics. The child window is shown in Fig. 4.10.

[Figure 4.9: Activity Monitor Window]

[Figure 4.10: Core Monitor Window]

The core monitor window has three parts. The first part (left) displays the threads running on that core using different colors. The second part (upper right) displays statistics on the busy and idle time of the core. A thread scheduled on that core could be new, preempted, or migrated from another core; the ratio of these three types of threads scheduled on that core is displayed in the third part (lower right) of the window. This part shows how many threads were migrated from other cores due to load balancing. Such visual analysis is sometimes useful for capturing unusual behavior and patterns.

4.6 Summary

In this chapter, we presented a new framework for the simulation of multicore scheduling algorithms. The framework is flexible, and it serves as a base for designing new scheduling algorithms and for conducting experimental studies of existing scheduling algorithms for multicore processors.

Chapter 5

Case Studies - Linux and Solaris Scheduling Algorithms

To test and illustrate the functionality of the proposed simulator, we implemented recent versions of the Linux and Solaris 10 scheduling algorithms. We first present the algorithms, and then describe the simulation experiments and observations. The recent version of the Linux scheduler is called the completely fair scheduler (CFS), implemented starting from Linux 2.6.23; the Solaris scheduler is referred to as the kernel dispatcher.

Although Linux and Solaris are popularly used operating systems, documentation precisely describing their scheduling algorithms is rarely published. We had to reconstruct the algorithms using information obtained from different sources.

Both schedulers have two logical components, a load balancer and a local scheduler. The load balancer is responsible for maintaining a desired balance of the system load. This task involves dispatching new threads to the appropriate local scheduler, and migrating threads from one local scheduler to another when necessary. The local scheduler is responsible for scheduling threads to the cores for execution. We present the load balancers of the two algorithms together first, and then the local scheduling algorithms separately.

5.1 Load Balancing

Load balancing involves four factors: (i) initial placement; (ii) migration criteria; (iii) migration policy; and (iv) frequency of balancing.

• Initial Thread Placement: Initial thread placement in both Linux and Solaris is the same: new threads are dispatched to the lightly loaded cores.

• Load Balancing Criteria: The difference in the number of threads between any two cores should be less than one.

• Migration Policy: Linux migrates threads from heavily loaded cores to lightly loaded cores to satisfy the load balancing criteria.
Solaris, when the choice arises, moves a thread to a core on a different chip rather than to a core on the same chip, to reduce cache conflicts.

• Balancing Frequency: Load balancing in Linux is done every 200 ms, and load balancing in Solaris is done every 100 ms.

Next, we present the local scheduling algorithms of Linux and Solaris. The threads in both algorithms are classified as real-time threads and normal threads. Here, real-time implies that these threads have higher priorities than the normal threads, and therefore real-time threads are always executed before normal threads. (The term real-time in this context is a misnomer: these threads have no specific deadlines to meet, and no traditional real-time scheduling algorithms such as rate monotonic (RM) or earliest deadline first (EDF) are used to schedule them.) We first present how the real-time threads are handled in both Linux and Solaris.

5.1.1 Real-Time Scheduler

Whenever a real-time thread arrives, it preempts the normal thread, and the real-time thread is scheduled to run. When a real-time thread is executing and another real-time thread with higher priority arrives, the current thread is preempted and the higher priority thread is scheduled. Within the same priority level of real-time threads, Linux uses either the First-In-First-Out (FIFO) or the Round Robin (RR) scheduling policy, and Solaris uses FIFO.

5.2 Linux Scheduler - Completely Fair Scheduler (CFS)

The basic idea behind CFS is to ensure a fair share among threads over the overall execution. This is achieved through quantum allocation, in which the execution time plays a key role. CFS maintains the amount of time that a thread has utilized the core so far, referred to as its vruntime. The thread with the smallest vruntime has the highest preference to be selected next for execution. The calculation of the quantum and of vruntime is given below.

Instead of a run queue, for efficiency, CFS maintains a data structure called a red-black tree, sorted by the vruntime key. The scheduler picks the left-most child of the tree, the thread with the smallest vruntime, for execution. The pseudocode of the CFS algorithm is given in Table 5.1.

Data Structures: Red-black tree RBT; Thread Ti
1. while (RBT ≠ empty) do
2.   select the leftmost thread Ti from RBT
3.   compute the time quantum
4.   schedule Ti
5. end while

Table 5.1: Completely Fair Scheduling Algorithm

5.2.1 Calculation of Quanta

The quantum value calculation plays a key role in CFS. The time quantum Q[Ti] of a thread Ti is calculated using the following formula:

    Q[Ti] = ( Ti.weight / Σ_{Tj ∈ RBT} Tj.weight ) × P    (5.1)

where Ti.weight is the weight value corresponding to the nice value of Ti (every nice value is mapped to a weight value), and

    P = sched_latency, if n ≤ nr_latency
    P = min_granularity × n, otherwise    (5.2)

where n is the number of threads in the RBT, and sched_latency, nr_latency, and min_granularity are constants. In the current implementation, these values are 6, 8, and 0.75, respectively [63]. The details of how these values were determined, and their significance, are not documented.

(In Linux, the priority of a thread is controlled with the nice value, which ranges between -20 and 19; a lower value corresponds to a higher priority. The user-level command nice can be used to lower the priority of a thread, i.e., to be nice to other users.)
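The following is a minimal sketch of this quantum computation, assuming threads expose a weight field as described above. The constants mirror the values cited for the current implementation, interpreting sched_latency and min_granularity in milliseconds, which is an assumption; the class itself is illustrative, not kernel code.

    import java.util.Collection;

    // Sketch of the CFS time-quantum computation from equations (5.1)-(5.2).
    class CfsQuantum {
        static final double SCHED_LATENCY = 6.0;      // assumed unit: ms
        static final double MIN_GRANULARITY = 0.75;   // assumed unit: ms
        static final int NR_LATENCY = 8;

        // vruntime is the key by which the red-black tree orders the threads.
        record SchedThread(int id, double weight, double vruntime) {}

        // Scheduling period P: stretched once there are many runnable threads.
        static double period(int n) {
            return (n <= NR_LATENCY) ? SCHED_LATENCY : MIN_GRANULARITY * n;
        }

        // Q[Ti] = (Ti.weight / sum of all weights in the tree) * P
        static double quantum(SchedThread t, Collection<SchedThread> runnable) {
            double totalWeight = 0.0;
            for (SchedThread s : runnable) totalWeight += s.weight();
            return (t.weight() / totalWeight) * period(runnable.size());
        }
    }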
5.2.2 Calculation of vruntime

At every clock tick, the scheduler updates the vruntime of the executing thread and also decreases its remaining quantum. Preemption of a thread occurs when its quantum expires or when the RBT contains a thread with a smaller vruntime. The vruntime of a thread Ti is computed as follows:

    Ti.vruntime = ( weight_0 / Ti.weight ) × Ti.runtime    (5.3)

where weight_0 is the weight value corresponding to a nice value of 0, and Ti.runtime is the execution time consumed so far by the thread Ti.

5.3 Solaris Scheduler

The Solaris 10 kernel dispatcher does both load balancing and local scheduling. Solaris implements two schedulers, a fair share scheduler and a default scheduler. The fair share scheduler is typically used in server environments. We have implemented the default scheduler, which we describe next. For simplicity, in Solaris, we consider just threads being scheduled for execution. (In the actual implementation, each application process in Solaris may contain one or more application threads, and each application thread is scheduled, or mapped, to a virtual core called a light weight process (LWP). Each LWP is implemented using a kernel thread, and these kernel threads are eventually scheduled and executed by the physical cores.)

Solaris maintains a priority range of 0-169. The priority range 160-169 is reserved for interrupts, and the rest of the priorities are assigned to different scheduling classes (see Appendix). Solaris manages threads in the following six scheduling classes:

1. Time share (TS)
2. Interactive (IA)
3. Fair share scheduling (FSS)
4. Fixed priority (FX)
5. Real time (RT)
6. System (SYS)

By default, threads created by the window manager are assigned to the IA class for better interactivity, and the rest are assigned to the TS class. System threads are created by the operating system; other threads are created using different levels of system privileges. The FSS class is used when the fair share scheduler is invoked. The priority of a thread may be specified or inherited from its parent. In the fixed priority class, threads keep the same priority throughout their execution. RT is the highest priority thread class; its threads require attention right away and need to be scheduled immediately. After the threads in the real-time class, the kernel threads get attention. Finally, the threads in the TS, IA, and FX classes are scheduled. The scheduler always chooses the highest priority thread for execution. The scheduling class priority ranges are summarised in Table 7.1 (see Appendix).

5.3.1 The Default Solaris Scheduler

In Solaris, every scheduling class except RT has a local scheduling queue for every core. For the RT class, a global kernel preemption queue is maintained for every chip.

After choosing a thread for scheduling, the next task is to obtain the quantum value. For that, Solaris maintains a set of tables, called dispatch tables (see Appendix), from which the quantum value is obtained. The scheduler also provides fairness among the threads by boosting priorities up or down in response to the following events (a sketch of this adjustment logic follows the list):

• A thread successfully completes its execution for the specified time quantum. Here, the thread's priority is boosted down to give fairness to the other threads.

• A thread comes back from an I/O wait. Here, its priority is boosted up so that it will execute soon.

• A thread has been waiting in its ready queue beyond a certain threshold time period. Here, its priority is boosted up to give it a chance to execute soon.
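To make the boosting mechanism concrete, here is a hedged sketch of how a simulated dispatcher might adjust priorities on these three events. The step sizes and method names are invented placeholders; in real Solaris, the new priorities and the wait threshold come from the per-class dispatch tables rather than from fixed increments.

    // Illustrative priority adjustment on Solaris-style scheduling events.
    class DispatcherSketch {
        static final int MAX_USER_PRIORITY = 59;   // top of the TS/IA range
        static final int BOOST_STEP = 10;          // assumed step size

        static class SimThread {
            int priority;
            long readySince;    // when the thread entered the ready queue
        }

        // Tick processing path: a thread used up its full quantum.
        void onQuantumExpired(SimThread t) {
            t.priority = Math.max(0, t.priority - BOOST_STEP);       // boost down
        }

        // A thread returns from an I/O wait: boost it up to run soon.
        void onIoReturn(SimThread t) {
            t.priority = Math.min(MAX_USER_PRIORITY, t.priority + BOOST_STEP);
        }

        // Update processing path: boost threads that have waited too long.
        void onUpdate(SimThread t, long now, long waitThreshold) {
            if (now - t.readySince > waitThreshold) {
                t.priority = Math.min(MAX_USER_PRIORITY, t.priority + BOOST_STEP);
            }
        }
    }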
These scheduling subtasks are performed by specific routines called tick processing and update processing (see Appendix). Tick processing is responsible for managing quanta and invoking preemption. Update processing is responsible for boosting priorities up or down. The dispatch tables are used to obtain the new quantum, the waiting threshold, and the new priority.

With this setup, the scheduling policy is: at any scheduling time point, choose the highest priority thread in the system and schedule it for execution. If a higher priority thread arrives while a lower priority thread is executing, the lower priority thread is preempted to allow the higher priority thread to execute. A high-level description of the Solaris scheduling algorithm is given in Table 5.2.

Data Structures: Scheduling queues DQ_TS, DQ_IA, DQ_FX, KPQ_RT; Thread Ti
1. while (KPQ_RT, DQ_TS, DQ_IA, or DQ_FX ≠ empty) do
2.   if (KPQ_RT ≠ empty)
3.     select the highest priority thread Ti from KPQ_RT
4.     get the time quantum Q[Ti] from the dispatch table
5.     schedule Ti
6.   else
7.     pick the highest priority thread Ti from DQ_TS, DQ_IA, DQ_FX
8.     get the time quantum Q[Ti] from the dispatch table
9.     schedule Ti
10.  end if
11. end while

Table 5.2: Solaris 10 Scheduling Algorithm

5.4 Simulation Experiments

To illustrate the functionality and use of the proposed simulator, we conducted two sets of experiments.

1. Experiment 1: In this experiment, the workload is fixed and the number of cores is varied. The simulation parameters used for this experiment are given in Table 5.3.

Number of threads: 5000
Mean arrival rate: 2.5
Arrival distribution: Poisson
Execution time distribution: Exponential
Number of cores: 10, 50, 100, 200, 500

Table 5.3: Simulation Parameters

(We use a generic unit for the arrival rate. The inter-arrival times derived from the distribution are real numbers; they are scaled to integer units of the simulation clock. For example, the inter-arrival time 0.4 is scaled to 4 simulation clock units.)
The consistent behavior is due to their load balancing which periodically runs to evenly distribute the workload 57 j — Unu< A vyapeT -j^ atsyr-d *Jne - -S olans A v y ^ T u n ; a r j j r g t r a j :;3,ooo ho.om- V -, \n lOO.OCtfl ■ n « . S3.SJO- ') \ J£ 600 0 0 0 5 ) 000 . 40,000- \ \ \ % 75 o f cores Figure 5.1: Average Turnaround time: Linux and Solaris Scheduling A lgorithms (by varying the number of cores) among the cores. • Observation on Average Turnaround Time: The average turnaround tim e graphs of Linux and Solaris scheduling algorithms are shown in Fig. 5.1. The average turnaround tim e gradually decreases as the num ber of cores increases. Both Linux and Solaris algorithms show almost same behavior until 200 cores, and after 200 cores, Solaris performs b etter than Linux by a factor of 10 when the number of cores is 500. This is due to th e fact th a t Solaris boosts up the th read ’s priority when the thread is waiting in th e ready queue longer. • Observation on Standard deviation of Turnaround Time: The standard devia­ tion of turnaround tim e graphs of Linux and Solaris scheduling algorithm s are shown in Fig. 5.2. We observe th a t Solaris and Linux show similar behavior between 10 and 200 cores. As we increase th e number of cores, Solaris shows 58 Figure 5 .2 : Standard deviation of Turnaround time: Linux and Solaris Scheduling Algorithms (by varying the number of cores) b etter predictability than Linux. • Observation on Interactive Response Time: The average interactive response tim e graphs of Linux and Solaris scheduling algorithm s are shown in Fig.5.3. We observe th a t Solaris outperform s Linux consistently providing b etter interactive response time. This is because, whenever a th read stays in ready queue and reaches the maximum wait time, Solaris boosts up the thread priority. Also, whenever a thread returns from I/O , th e thread priority is boosted up so th a t it can be scheduled soon. • Observation on Standard deviation o f Interactive Response Time: The standard deviation of interactive response time graphs of Linux and Solaris scheduling algorithms are given in Fig. 5.4. The prediction of how soon the thread will be scheduled for execution, when it is ready, is measured by computing the standard 45.000.000 •• 4 2 ,5 0 0 ,0 0 0 ; 40.000.090 • 37-500,000 35 .000.000; \ 32.500.000-| Clock (tic k s ) 30 .0 0 0 .0 0 0 ; 27;509;000-; 25,000,030 22 500,000 • \ 20 .000,000 17.500,000 ■ \ 15,009.000 \ 12.500,000 •; 10;000;000 ’,500,000 5 000,000 2,500.000 25 50 75 ICO 125 150 1 75 2SC 225 250 275 303 325 350 375 400 425 450 475 500 Number of cores Figure 5.3: Average Interactive Response tim e : Linux and Solaris Scheduling Algo­ rithm s (by varying the number of cores) deviation in interactive response time. It is clearly visible that Solaris performs better than Linux by assuring b etter fairness to the threads. As mentioned before, the boosting of the priority by Solaris influences the predictability b etter than Linux. 5 .4.2 O b serv a tio n s on E x p e r im e n t 2 • Observation on Core Utilization: We observed th a t both algorithms keep the cores busy and there is no significant difference. • Observation on Turnaround Time: T he average turnaround tim e graphs of Linux and Solaris scheduling algorithm s are shown in Fig. 5.5. The average turnaround time constantly increases when th e arrival rate increases. 
From these experiments, it is clearly seen that Solaris performs better than Linux when the number of cores is 50, but the trend gradually converges as the number of cores increases.

• Observation on standard deviation of turnaround time: The standard deviation of turnaround time graphs of the Linux and Solaris scheduling algorithms are shown in Fig. 5.6. We observe that the deviation also increases as the arrival rate increases. Compared to Linux, the predictability of Solaris is always better.

• Observation on interactive response time: The average interactive response time graphs of the Linux and Solaris scheduling algorithms are shown in Fig. 5.7. Solaris consistently outperforms Linux, providing low interactive response time. As explained earlier, the priority boosting of Solaris heavily influences the interactive response time. This is the expected behavior.

[Figure 5.5: Average Turnaround Time: Linux and Solaris Scheduling Algorithms (varying the workload)]

[Figure 5.6: Standard Deviation of Turnaround Time: Linux and Solaris Scheduling Algorithms (varying the workload)]

[Figure 5.7: Average Interactive Response Time: Linux and Solaris Scheduling Algorithms (varying the workload)]

[Figure 5.8: Standard Deviation of Interactive Response Time: Linux and Solaris Scheduling Algorithms (varying the workload)]

• Observation on standard deviation of interactive response time: The standard deviation of interactive response time graphs of the Linux and Solaris scheduling algorithms are shown in Fig. 5.8. We observe that the predictability of Solaris is much better compared to Linux, and the trend is consistent.

These observations confirm our hypothesis, which is evidence that the implementation of the schedulers is fairly accurate.

5.5 Summary

In this chapter, we presented the simulation experiments on the Linux and Solaris scheduling algorithms. As expected from our hypothesis, the simulation experiments show that the Solaris scheduler offers better and more predictable response time than the Linux scheduler. From these experiments, we are confident that our scheduler simulator implementation is fairly sound.
Chapter 6

A Fair and Efficient Gang Scheduling Algorithm

The trend in multicore processors indicates that all future processors will be multicore, and hence future cloud systems are expected to have their nodes and clusters based on multicore processors. So processor scheduling in future systems will most likely be multicore processor scheduling. Therefore, we believe, multicore scheduling is fundamental to future cloud computing performance. Also, due to the multicore revolution, a considerable portion of large applications will be parallel programs. From the literature, we can see that gang scheduling is a dominant strategy for scheduling parallel programs.

6.1 Popular Gang Scheduling Algorithms

Among the gang scheduling algorithms, AFCFS and LGFS are the most popular, and we present them next.

6.1.1 AFCFS

The AFCFS scheduling algorithm places the gangs in the run queue in order of their arrival. Scheduling starts from the head of the queue, and the gangs that fit into the available cores are scheduled for execution. Unlike FCFS, which stops scheduling when it finds a gang that cannot fit into the free cores, AFCFS iterates over the whole run queue and schedules all the gangs that fit into the free cores. The AFCFS algorithm is given in Table 6.1.

Data Structures: RQ: queue of gangs
1. while (RQ ≠ empty) do
2.   for i = 1 to size of RQ do
3.     if RQ[i] fits in free cores then schedule RQ[i]
4.   end for
5. end while

Table 6.1: AFCFS Gang Scheduling Algorithm

6.1.2 LGFS

The largest gang first served scheduling algorithm orders the run queue by the size of the gangs, where the size of a gang is determined by its number of threads. LGFS schedules gangs from the head of the queue (the largest gang) as long as the next gang fits into the available free cores. LGFS favors the large gangs over the small gangs. This affects the response time of small gangs, making them wait longer for their turn; moreover, when a new large gang arrives, it may overtake these small gangs. The LGFS algorithm is given in Table 6.2.

Data Structures: RQ: queue of gangs
1. while (RQ ≠ empty) do
2.   sort RQ largest gang first
3.   for i = 1 to size of RQ do
4.     if RQ[i] fits in free cores then schedule RQ[i]
5.   end for
6. end while

Table 6.2: LGFS Gang Scheduling Algorithm

From the literature [53], we note that AFCFS achieves better response time than LGFS under lighter workloads with small gangs. Our experiments using the proposed simulator confirm this result; a sketch of the two selection policies is given below.
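The following is a minimal Java sketch of the two selection policies of Tables 6.1 and 6.2, assuming a Gang type that carries its thread count and a pool of free cores. It mirrors the pseudocode above rather than any production scheduler.

    import java.util.Comparator;
    import java.util.Iterator;
    import java.util.List;

    // Sketch of the AFCFS and LGFS selection policies (Tables 6.1 and 6.2).
    class GangScheduling {

        record Gang(int id, int size) {}       // size = number of parallel threads

        // AFCFS: scan the arrival-ordered run queue and start every gang that fits.
        static int afcfs(List<Gang> runQueue, int freeCores) {
            Iterator<Gang> it = runQueue.iterator();
            while (it.hasNext()) {
                Gang g = it.next();
                if (g.size() <= freeCores) {   // gang fits into the free cores
                    freeCores -= g.size();     // schedule all of its threads
                    it.remove();
                }                              // otherwise skip it and keep scanning
            }
            return freeCores;                  // cores still free after the pass
        }

        // LGFS: order the run queue largest gang first, then do the same scan.
        static int lgfs(List<Gang> runQueue, int freeCores) {
            runQueue.sort(Comparator.comparingInt(Gang::size).reversed());
            return afcfs(runQueue, freeCores);
        }
    }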
Since LGFS favors 67 — AFCFS ■ LGFS 13.0SC 17.000 16.000 15,000 „ 14-000 | 13,000 “ 17.000 | 11,000 * 1 3 ,0 0 0 | 3,000 iX | «»» 3 5,050 4.000 3.000 2.000 1,45 1,50 1.55 1.50 1.65 1.73 1.75 1,85 1.90 1.95 2.00 2.05 2,10 2.15 2.20 2.25 230 235 2,40 2,45 2,50 Figure 6.1: Average Response Tim e for AFCFS and LGFS Algorithms the large size gangs, it pushes the small sized gangs for longer wait tim es which is reflected in their average response time. This result confirms the observation from the literature. Observation on Standard deviation in Response time: The standard deviation response time graphs of AFCFS and LGFS algorithm s are shown in Fig. 6.2. The predictability of AFCFS is worse than LGFS. Since the AFCFS favors the small size gang, the large size gangs have to wait longer for their turn for execution which in tu rn increase the deviation in response time. Observation on Average Core Utilization: T he average core utilization graphs of AFCFS and LGFS algorithms are shown in Fig. 6.3. LGFS performs b etter th an AFCFS, because it favors large jobs which fits into more num ber of cores and makes the core busy executing these larger jobs. As AFCFS favors smaller jobs, the possibility of core to stay idle is higher. 68 -A F C F S LGFS 12.000 < j; 11,000 1 0,000-i; I 5.000 |i v | S.COC i: a 7.000 |! I 5,0 0 0 1! i 4 ,0 j0 i: * 11 3,oao-; 2.000jj 1.000 ji 1.45 1.50 1.55 1.60 1.65 1.7D 1.75 1.60 1,65 1.90 1.95 2.00 2.05 2.10 2 15 2 20 2 25 2.30 2.35 2 40 2,45 2.5C Arrival R ate (A) Figure 6 .2 : Standard deviation of Response Time for AFCFS and LGFS Algorithms I— AFCFS ■ LGFS I i 1.45 ISO 1.55 1.60 1.55 1.70 1.75 IK 1.S5 l.M 1.9S 2,00 2 05 2.10 2,15 2,20 2 25 2.30 2.35 2.40 Figure 6.3: Core Utilization of AFCFS and LGFS Algorithms 69 2.45 2,50 From the observations shown in Fig. 6.2 and Fig. 6.3, we can see th a t LGFS is a preferred algorithm , despite A FC FS 7 low average response time. This is because LGFS utilizes the system resources b etter and offers b etter predictability in response time. This proves our earlier point th a t th e average response time is not a desirable metric. Fairness and predictability are particularly im portant th a t the expectation of users under light load is normally high, and failure to provide such guarantee even under light load could expose the system very badly. Therefore, it would be nice to have an algorithm which yields low average response time and standard deviation w ith high processor utilization. This is the m otivation for our algorithm presented next. 6.2 A N ew G ang Scheduling A lgorithm The gang scheduling algorithms AFCFS and LGFS are susceptible to starvation. To avoid starvation, these algorithms adopt a process migration policy. Process mi­ gration may not be even possible between two heterogeneous multicore systems, and is generally expensive even between two homogeneous systems [39,56]. Also, although it alleviates, process migration does not eliminate starvation. These observations bring us a question: Can we design a gang scheduling algorithm with the following characteristics? 1. Freedom from starvation. 2. Predictable and acceptable response time. 3. B etter processor utilization. 4. Simple. 70 Since AFCFS favors small gangs, th e larger gangs are susceptible to starvation or to longer wait times. This is unacceptable particularly in cloud environment where customer satisfaction hugely depends on fairness and predictable response time. 
We conducted simulation experiments to compute the average response time, the standard deviation of response time, and the average core utilization of the AFCFS and LGFS algorithms. The observations are presented next.

• Observation on Average Response Time: The average response time graphs of the AFCFS and LGFS algorithms are shown in Fig. 6.1. AFCFS outperforms LGFS in average response time for small gangs. Since LGFS favors large gangs, it pushes the small gangs toward longer wait times, which is reflected in their average response time. This result confirms the observation from the literature.

Figure 6.1: Average Response Time for AFCFS and LGFS Algorithms (response time in clock ticks vs. arrival rate λ)

• Observation on Standard Deviation of Response Time: The standard deviation of response time graphs of the AFCFS and LGFS algorithms are shown in Fig. 6.2. The predictability of AFCFS is worse than that of LGFS. Since AFCFS favors small gangs, the large gangs have to wait longer for their turn to execute, which in turn increases the deviation in response time.

Figure 6.2: Standard Deviation of Response Time for AFCFS and LGFS Algorithms (vs. arrival rate λ)

• Observation on Average Core Utilization: The average core utilization graphs of the AFCFS and LGFS algorithms are shown in Fig. 6.3. LGFS performs better than AFCFS because it favors large jobs, which fit into more cores and keep the cores busy executing these larger jobs. As AFCFS favors smaller jobs, the possibility of cores staying idle is higher.

Figure 6.3: Core Utilization of AFCFS and LGFS Algorithms (vs. arrival rate λ)

From the observations shown in Fig. 6.2 and Fig. 6.3, we can see that LGFS is the preferable algorithm, despite AFCFS's low average response time. This is because LGFS utilizes the system resources better and offers better predictability in response time. This supports our earlier point that average response time alone is not a sufficient metric. Fairness and predictability are particularly important because user expectations under light load are normally high, and failure to provide such guarantees even under light load could reflect very badly on the system. Therefore, it would be desirable to have an algorithm that yields low average response time and standard deviation together with high processor utilization. This is the motivation for our algorithm, presented next.

6.2 A New Gang Scheduling Algorithm

The gang scheduling algorithms AFCFS and LGFS are susceptible to starvation. To avoid starvation, these algorithms adopt a process migration policy. Process migration may not even be possible between two heterogeneous multicore systems, and is generally expensive even between two homogeneous systems [39, 56]. Also, although it alleviates starvation, process migration does not eliminate it. These observations raise a question: can we design a gang scheduling algorithm with the following characteristics?

1. Freedom from starvation.
2. Predictable and acceptable response time.
3. Better processor utilization.
4. Simplicity.

Since AFCFS favors small gangs, the larger gangs are susceptible to starvation or to long wait times. This is unacceptable, particularly in a cloud environment, where customer satisfaction hugely depends on fairness and predictable response time. In practice, the customers who receive slightly faster service (at the expense of others' long waits) may not be noticeably more satisfied [61]; but the customers who experience unpredictably long delays will readily notice the unfairness and unpredictable response, and that could potentially drive the cloud business in a negative direction. Therefore, in addition to fast response and high processor utilization, minimal variance in response is extremely important for better cloud services.

6.3 The Algorithm

The gang scheduling algorithm proposed in this thesis combines the ideas of AFCFS and priority boosting. In the AFCFS algorithm proposed for multicore clusters in [53], each multicore has a run queue, and every gang stays in the run queue until it gets a chance to execute. The scheduler always chooses the next fitting gang from the run queue so that overall response time is reduced. Such behavior degrades the overall core utilization, which in turn increases the variation in response time, as seen in the experiments.

The proposed algorithm uses an additional variable for each gang, which stores how many gangs bypassed it for execution while it stayed in the run queue. We call this variable the bypass count. When a gang's bypass count reaches the threshold value T, the gang gets the highest priority and is scheduled next. This forces the other gangs to wait until the highest priority gang gets scheduled. The proposed algorithm is given in Table 6.4.

A new gang joins RQ with its bypass count set to zero. Whenever a gang is scheduled, the bypass counts of the gangs preceding the scheduled gang in RQ are incremented by 1. At any time, a gang in RQ with a bypass count greater than or equal to the threshold value has priority over all other gangs. This guarantees that the gangs will be served within a predictable time period. When there is no gang with a bypass count greater than or equal to the threshold value, the algorithm behaves exactly like AFCFS. When enough cores are not available to schedule the highest priority gang, the system has to wait for some of the currently executing gangs to leave; this delay is unavoidable to assure fairness and predictable response. The simulation results show that such a wait rarely happens.

Data Structures: RQ: Queue of gangs; T: Threshold value; i.bpc: bypass count of gang i
1. while (RQ ≠ empty) do
2.   for i = 1 to size of RQ do
3.     if RQ[i].bpc ≥ T then wait until RQ[i] fits in free cores
4.     if RQ[i] fits in free cores then
5.       for k = 1 to i - 1 do RQ[k].bpc++ end for
6.       schedule RQ[i]
7.     end if
8.   end for
9. end while

Table 6.4: New Gang Scheduling Algorithm

Note: The proposed algorithm becomes AFCFS if the threshold is set to ∞. When the threshold is 0, it emulates the FCFS algorithm. So, choosing a proper threshold is the key to the proposed algorithm.
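A minimal Java sketch of one scheduling pass of the proposed algorithm follows. As before, the classes are simplified stand-ins for illustration; the essential points are the bypass-count bookkeeping and the hard wait once a gang reaches the threshold T.

    import java.util.ArrayList;
    import java.util.List;

    class FairGangScheduler {
        static class Gang {
            final int id, size;
            int bypassCount = 0;            // incremented each time it is overtaken
            Gang(int id, int size) { this.id = id; this.size = size; }
        }

        private final List<Gang> runQueue = new ArrayList<>();
        private final int threshold;        // T, e.g. 700 in our experiments
        private int freeCores;

        FairGangScheduler(int cores, int threshold) {
            this.freeCores = cores;
            this.threshold = threshold;
        }

        void submit(Gang g) { runQueue.add(g); }   // bypass count starts at zero

        // One pass: behaves like AFCFS until some gang's bypass count reaches T,
        // at which point that gang must be scheduled before any gang behind it.
        void schedulePass() {
            for (int i = 0; i < runQueue.size(); ) {
                Gang g = runQueue.get(i);
                if (g.bypassCount >= threshold && g.size > freeCores) {
                    return;                 // wait until this gang fits; no bypassing
                }
                if (g.size <= freeCores) {
                    for (int k = 0; k < i; k++) {
                        runQueue.get(k).bypassCount++;   // g overtook these gangs
                    }
                    freeCores -= g.size;
                    runQueue.remove(i);     // next gang shifts into slot i
                } else {
                    i++;                    // skip g; it will be credited a bypass
                }
            }
        }

        void release(Gang g) { freeCores += g.size; }    // a gang finished
    }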
6.4 Simulation Experiments

As explained earlier, the proposed algorithm tries to achieve predictable and fast response for the gangs (users) and better utilization for the system.

6.4.1 Simulation Setup

We used the simulation parameters listed in Table 6.5 for our experiments. To keep the results generic, execution is reported in terms of simulation clock ticks.

Parameter                      Value(s)
Number of cores                200
Time period                    1 minute
Tasks per gang                 Uniformly distributed over [2..200]
Mean arrival rate              5, 7.5, 10
Arrival distribution           Poisson
Execution rate                 2
Execution time distribution    Exponential
Threshold                      700

Table 6.5: Simulation Parameters

Through the simulation study, we computed the average response time, the standard deviation of response time, the average core utilization, and the bypass count of the gangs. The observations are presented next.

• Observation on Average Response Time: The average response time graphs of AFCFS and the proposed algorithm are shown in Fig. 6.4. The average response time of the proposed algorithm is better than that of AFCFS. This is because, whenever a gang's bypass count reaches the threshold, the gang is guaranteed to be scheduled next, which reduces the response time of long-waiting gangs. Choosing a proper threshold value is crucial: choosing a small number will unnecessarily make the other gangs wait more often, which will in turn increase the average response time. For our experiments, we chose a bypass count of 700 as the threshold value, derived from repeated experiments for consistent behavior. From Fig. 6.4, it is clear that the proposed algorithm consistently outperforms AFCFS in average response time.

Figure 6.4: Average Response Time of AFCFS and Proposed Algorithms (response time in clock ticks vs. arrival rate λ)

• Observation on Standard Deviation of Response Time: The standard deviation of response time graphs of AFCFS and the proposed algorithm are shown in Fig. 6.5. The proposed algorithm offers better predictability in response time than AFCFS. Since the proposed algorithm avoids the longest wait times, its deviation in response time is lower than that of AFCFS. By controlling the bypass threshold value, better predictability may be assured.

Figure 6.5: Standard Deviation of Response Time of AFCFS and Proposed Algorithms (vs. arrival rate λ)

• Observation on Average Core Utilization: The average core utilization graphs of AFCFS and the proposed algorithm are shown in Fig. 6.6. The proposed algorithm outperforms AFCFS. This is because AFCFS favors only small gangs, whereas the proposed algorithm favors gangs of all sizes once the threshold is reached.

Figure 6.6: Average Core Utilization of AFCFS and Proposed Algorithms (vs. arrival rate λ)

• Observation on Bypass Count: The bypass count graphs of AFCFS and the proposed algorithm are shown in Fig. 6.7. The proposed algorithm shows fairness among the gangs: once a gang's bypass count reaches the threshold, the algorithm starts giving priority to the long-waiting gangs. Extremely long wait times are thus completely avoided in the proposed algorithm, which makes our algorithm interesting.

Figure 6.7: Bypass Count of Gangs of AFCFS and Proposed Algorithms (bypass count per job)
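For reference, the first three metrics reported above can be computed from simple per-gang records of arrival and completion ticks, together with a running total of busy core-ticks. The following sketch shows one straightforward way to do so; the record and field names are illustrative, not the testbed's.

    import java.util.List;

    class Metrics {
        static class GangRecord {
            final double arrivalTick, completionTick;
            GangRecord(double a, double c) { arrivalTick = a; completionTick = c; }
        }

        // Response time of a gang = completion tick - arrival tick.
        static double averageResponse(List<GangRecord> records) {
            double sum = 0;
            for (GangRecord r : records) sum += r.completionTick - r.arrivalTick;
            return sum / records.size();
        }

        static double responseStdDev(List<GangRecord> records) {
            double mean = averageResponse(records);
            double ss = 0;
            for (GangRecord r : records) {
                double d = (r.completionTick - r.arrivalTick) - mean;
                ss += d * d;
            }
            return Math.sqrt(ss / records.size());
        }

        // Utilization = busy core-ticks / (cores * observed ticks).
        static double coreUtilization(double busyCoreTicks, int cores, double ticks) {
            return busyCoreTicks / (cores * ticks);
        }
    }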
From these observations, we conclude that the proposed algorithm outperforms AFCFS in all three metrics, and of course solves the starvation problem completely.

6.5 Summary

In this chapter, we proposed a fair and efficient gang scheduling algorithm for multicore processors. The algorithm is simple, fair, and gives predictable performance. Such predictable performance is attractive from the service point of view. Since the algorithm solves the starvation problem locally, without using process migration, it is highly scalable and attractive for cloud computing involving a large number of multicore processors.

Chapter 7

Conclusion and Future Directions

Recently, multicore processors and software development for multicore systems have received increasing attention from the research community. The contributions of this thesis are: (i) the design and implementation of a flexible multicore scheduler simulation framework; (ii) illustration of the power and flexibility of the proposed framework by simulating the scheduling algorithms of Linux and Solaris, and a simulation study of two gang scheduling algorithms; and (iii) a new gang scheduling algorithm and its performance study.

The experience gained by developing this simulator was very rich. It involved software design, implementation, algorithm discovery and design, and performance analysis. The proposed simulator can be used for rapid simulation studies of multicore scheduling algorithms, which can provide initial insights into how a proposed algorithm will perform in practical multicore systems. These insights are helpful not only in testing the performance of a proposed algorithm, but also in identifying bottlenecks and offering guidelines for improvements. Also, from our experience, we believe that the simulator can be used to develop new scheduling algorithms by experimenting with and identifying the limitations of the existing algorithms.

Although we did not initially expect to propose a new multicore scheduling algorithm, we ended up designing one for an important class of parallel applications called gangs. We compared the performance of the new gang scheduling algorithm with the best known algorithm in its category, and the proposed algorithm appears to perform better.

7.1 Future Directions

We believe the proposed scheduling framework for multicore systems is an important first step in the performance study of multicore scheduling algorithms. There are many directions in which the work presented in this thesis can be expanded to study the performance of scheduling algorithms more deeply and more accurately. We outline some of them next.

• The workload model could be refined and improved.
• The model of I/O waits could be refined and improved.
• The cache model could be refined and improved.
• The framework could be expanded to model heterogeneous cores.
• Cache effects could be inferred from the values of the hardware performance counters available in recent multicore machines, and used in scheduling simulations.
• More statistical metrics could be included.
• In the cloud computing context, modeling clusters with many chips and their associated cores would be interesting.
• The proposed gang scheduling algorithm works better for light loads. Is there a gang scheduling algorithm that can perform better under all workload conditions?
This is an interesting research question to be explored.

I would like to continue working on some of these directions in the future.

Appendix

Tables used by the Solaris Scheduler

The Solaris scheduler uses the information given in the following four tables:

1. Priority Range Table (Table 7.1) - specifies the priority range for the scheduling classes.
2. Dispatch Table for Real-time Tasks (Table 7.2) - defines the time quanta of the real-time scheduling class.
3. Dispatch Table for Fixed Priority Tasks (Table 7.3) - defines the time quanta of the fixed priority scheduling class.
4. Dispatch Table for Normal Tasks (Table 7.4) - defines the time quanta of the time sharing and interactive scheduling classes.

Scheduling class    Global priority range    User level priority range
Realtime            100 - 159                0 - 59
System              60 - 99                  -
Fair share          0 - 59                   -60 - 60
Fixed priority      0 - 59                   0 - 60
Time share          0 - 59                   -60 - 60
Interactive         0 - 59                   -60 - 60

Table 7.1: Solaris 10 Scheduling Classes Priority Range

Priority    Quanta
100         100
110         80
120         60
130         40
140         20
150         10
159         10

Table 7.2: Dispatch Table for RT Scheduling Class

Priority    Quanta
0           20
10          16
20          12
30          8
40          4
59          2

Table 7.3: Dispatch Table for FX Scheduling Class

Priority    Quanta    Priority on      Priority on    Wait         Priority
                      Quanta Expiry    IO Return      Threshold    on Wait
0           20        0                50             0            50
10          16        0                51             0            51
20          12        10               52             0            52
30          8         20               53             0            53
40          4         30               55             0            55
59          2         49               59             3200         59

Table 7.4: Dispatch Table for TS and IA Scheduling Classes

Tick Processing and Update Processing

Here we present two key routines used in Solaris scheduling.

Tick Processing

Tick processing is executed on every scheduling tick. This method is mainly responsible for managing the time quanta and preemption control. An overview of tick processing is given in Table 7.5.

When the tick processing method is invoked, it first checks whether the executing thread is in system mode. This is because system threads are scheduled for their full execution time and preemption of system threads is not allowed; all other scheduling classes are preemption enabled. The main task of tick processing is to manage the allocated time quanta. First, it decrements the quanta allocated to the thread and checks whether the thread has executed for its whole time quantum. If so, it additionally checks whether preemption control is enabled for the thread. Preemption control is a configurable variable that grants a thread a little more time when it finishes its allotted time quantum. If preemption control is not enabled for the thread, the first task is to re-assign the priority of the thread from its dispatch table, in the case of TS and IA class threads; RT and FX class threads are maintained at the same priority. Then, it invokes preemption and places the thread in its dispatch queue based on its priority. When the time quantum is not over, it checks whether the thread is going for I/O and, if so, places the thread in the sleep queue. The method also checks for any higher priority thread in the dispatch queues and preempts the current thread; when the thread is preempted by a high priority thread, the priority of the current thread does not change.

Data Structures: Thread t
1.  if thread t is not in system priority then
2.    decrement the time quanta t.quanta
3.    if (t.quanta < 0) then
4.      if thread preemption control enabled then give additional time quanta for execution
5.      else
6.        if t ∈ TS or IA scheduling class then
7.          re-assign t.priority from dispatch_table_ts with the value of ts_tqexp
8.        end if
9.        enable preemption and place thread t in its dispatch queue
10.     end if
11.   end if
12.   if t going for I/O then
13.     enable preemption and place thread t in sleep queue
14.   end if
15.   if t.priority < highest thread's priority in dispatch queues then
16.     enable preemption and place thread t in dispatch queue
17.   end if
18. end if

Table 7.5: Tick Processing of Solaris Scheduling Algorithm
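As one illustration of how Table 7.5 might map into simulator code, the following Java sketch follows the same structure; the SimThread fields, parameters, and helper methods are illustrative stand-ins rather than the simulator's actual classes.

    class SimThread {
        boolean systemClass;        // system threads are never preempted
        boolean tsOrIa;             // thread belongs to the TS or IA class
        boolean preemptionControl;  // may run a little past its quantum
        boolean goingToIO;
        int quanta;                 // remaining time quanta
        int priority;
        int waitTicks;              // used by update processing below
    }

    class TickProcessor {
        // Invoked on every scheduling tick for the running thread t.
        void tick(SimThread t, int tqexpPriority, int extraQuanta,
                  int highestWaitingPriority) {
            if (t.systemClass) return;          // system threads run to completion
            t.quanta--;
            if (t.quanta < 0) {
                if (t.preemptionControl) {
                    t.quanta = extraQuanta;     // grant a little more time
                } else {
                    if (t.tsOrIa) {
                        t.priority = tqexpPriority;  // ts_tqexp from the dispatch table
                    }
                    preemptToDispatchQueue(t);  // RT and FX keep their priority
                }
            } else if (t.goingToIO) {
                preemptToSleepQueue(t);
            } else if (t.priority < highestWaitingPriority) {
                preemptToDispatchQueue(t);      // priority unchanged in this case
            }
        }

        void preemptToDispatchQueue(SimThread t) { /* enqueue by priority */ }
        void preemptToSleepQueue(SimThread t)    { /* enqueue in sleep queue */ }
    }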
Update Processing

The update processing method is invoked only for the time sharing and interactive scheduling classes. The main purpose of this method is to boost thread priority. A high-level view of update processing is given in Table 7.6.

The update processing method is invoked periodically. It iterates over the dispatch and sleep queues to increment the waiting time of the threads. Once the waiting time of a thread reaches the maximum wait time specified in the dispatch table, the thread's priority is boosted with the ts_lwait value. This is done to provide fairness among the threads.

Data Structures: Dispatch queues - DQ_TS, DQ_IA; sleep queue - SQ; Thread t
1. for each thread t in DQ_TS ∪ DQ_IA ∪ SQ do
2.   increment t.wait value by 1
3.   if (t.wait > max_wait) then
4.     boost thread priority from dispatch_table_ts with the value of ts_lwait
5.   end if
6. end for

Table 7.6: Update Processing of Solaris Scheduling Algorithm
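A corresponding sketch of update processing, reusing the SimThread class from the previous sketch (with its waitTicks field), could look as follows; maxWait and lwaitPriority stand in for the ts_maxwait and ts_lwait dispatch-table values.

    import java.util.List;

    class UpdateProcessor {
        // Invoked periodically for TS and IA threads waiting in the
        // dispatch and sleep queues; boosts priority after a long wait.
        void update(List<SimThread> waitingThreads, int maxWait, int lwaitPriority) {
            for (SimThread t : waitingThreads) {
                t.waitTicks++;
                if (t.waitTicks > maxWait) {
                    t.priority = lwaitPriority;  // boost from the dispatch table
                    t.waitTicks = 0;             // restart the wait measurement
                }
            }
        }
    }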
Bibliography

[1] D. Wentzlaff and A. Agarwal, "Factored operating systems (fos): the case for a scalable operating system for multicores," SIGOPS Operating Systems Review, vol. 43, pp. 76-85, April 2009.

[2] A. Fedorova, M. Seltzer, and M. D. Smith, "Cache-fair thread scheduling for multicore processors," tech. rep., Harvard University, 2006.

[3] B. Wilkinson and M. Allen, Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Second Edition. Pearson Prentice Hall, 2005.

[4] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," Computational Science & Engineering, IEEE, vol. 5, pp. 46-55, Jan.-Mar. 1998.

[5] M. Snir, S. W. Otto, D. W. Walker, J. Dongarra, and S. Huss-Lederman, MPI: The Complete Reference. Cambridge, MA, USA: MIT Press, 1995.

[6] S. Peter, A. Schüpbach, P. Barham, A. Baumann, R. Isaacs, T. Harris, and T. Roscoe, "Design principles for end-to-end multicore schedulers," in Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar '10, (Berkeley, CA, USA), pp. 10-10, USENIX Association, 2010.

[7] C. Kesselman and I. Foster, The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, Nov. 1998.

[8] D. Wentzlaff, C. G. III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. E. Miller, and A. Agarwal, "An operating system for multicore and clouds: Mechanisms and implementation," in SoCC, pp. 3-14, 2010.

[9] "The future accelerated: Multi-core goes mainstream, computing pushed to extremes." Intel Newsroom, September 2011.

[10] J. C. Mogul, A. Baumann, T. Roscoe, and L. Soares, "Mind the gap: reconnecting architecture and OS research," in Proceedings of the 13th USENIX conference on Hot topics in operating systems, HotOS '13, (Berkeley, CA, USA), pp. 1-1, USENIX Association, 2011.

[11] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs, S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania, "The multikernel: a new OS architecture for scalable multicore systems," in Proceedings of the ACM SIGOPS 22nd symposium on Operating systems principles, SOSP '09, (New York, NY, USA), pp. 29-44, ACM, 2009.

[12] S. Zhuravlev, J. C. Saez, A. Fedorova, and M. Prieto, "Survey of scheduling techniques for addressing shared resources in multicore processors," ACM Computing Surveys, in press.

[13] D. Wentzlaff, C. G. III, N. Beckmann, A. Belay, H. Kasture, K. Modzelewski, L. Youseff, J. E. Miller, and A. Agarwal, "Fleets: Scalable services in a factored operating system," tech. rep., CSAIL, Massachusetts Institute of Technology, 2011.

[14] A. Schüpbach, S. Peter, A. Baumann, T. Roscoe, P. Barham, T. Harris, and R. Isaacs, "Embracing diversity in the Barrelfish manycore operating system," in Proceedings of the Workshop on Managed Many-Core Systems, 2008.

[15] A. Belay, D. Wentzlaff, and A. Agarwal, "Vote the OS off your core," tech. rep., CSAIL, Massachusetts Institute of Technology, 2011.

[16] A. Baumann, S. Peter, A. Schüpbach, A. Singhania, T. Roscoe, P. Barham, and R. Isaacs, "Your computer is already a distributed system, why isn't your OS?," in Proceedings of the 12th conference on Hot topics in operating systems, HotOS '09, (Berkeley, CA, USA), pp. 12-12, USENIX Association, 2009.

[17] S. Panneerselvam and M. M. Swift, "Dynamic processors demand dynamic operating systems," in Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar '10, (Berkeley, CA, USA), pp. 9-9, USENIX Association, 2010.

[18] I. Kuz, Z. Anderson, P. Shinde, and T. Roscoe, "Multicore OS benchmarks: we can do better," in Proceedings of the 13th USENIX conference on Hot topics in operating systems, HotOS '13, (Berkeley, CA, USA), pp. 10-10, USENIX Association, 2011.

[19] S. Boyd-Wickizer, H. Chen, R. Chen, Y. Mao, F. Kaashoek, R. Morris, A. Pesterev, L. Stein, M. Wu, Y. Dai, Y. Zhang, and Z. Zhang, "Corey: an operating system for many cores," in Proceedings of the 8th USENIX conference on Operating systems design and implementation, OSDI '08, (Berkeley, CA, USA), pp. 43-57, USENIX Association, 2008.

[20] S. Boyd-Wickizer, R. Morris, and M. F. Kaashoek, "Reinventing scheduling for multicore systems," in Proceedings of the 12th conference on Hot topics in operating systems, HotOS '09, (Berkeley, CA, USA), pp. 21-21, USENIX Association, 2009.

[21] R. Liu, K. Klues, S. Bird, S. Hofmeyr, K. Asanovic, and J. Kubiatowicz, "Tessellation: space-time partitioning in a manycore client OS," in Proceedings of the First USENIX conference on Hot topics in parallelism, HotPar '09, (Berkeley, CA, USA), pp. 10-10, USENIX Association, 2009.

[22] D. Wentzlaff, C. G. III, N. Beckmann, K. Modzelewski, A. Belay, L. Youseff, J. Miller, and A. Agarwal, "A unified operating system for clouds and manycore: fos," 1st Workshop on Computer Architecture and Operating System co-design (CAOS), 2010.

[23] S. Blagodurov, S. Zhuravlev, and A. Fedorova, "Contention-aware scheduling on multicore systems," ACM Transactions on Computer Systems, vol. 28, pp. 8:1-8:45, December 2010.

[24] A. Fedorova, C. Small, D. Nussbaum, and M. Seltzer, "Chip multithreading systems need a new operating system scheduler," in Proceedings of the 11th workshop on ACM SIGOPS European workshop, EW 11, (New York, NY, USA), ACM, 2004.
[25] S. Zhuravlev, S. Blagodurov, and A. Fedorova, "AKULA: a toolset for experimenting and developing thread placement algorithms on multicore systems," in Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT '10, (New York, NY, USA), pp. 249-260, ACM, 2010.

[26] J. M. Calandrino, D. P. Baumberger, T. Li, J. C. Young, and S. Hahn, "LinSched: The Linux scheduler simulator," in ISCA PDCCS (J. Jacob and D. N. Serpanos, eds.), pp. 171-176, ISCA, 2008.

[27] M. Rosenblum, S. A. Herrod, E. Witchel, and A. Gupta, "Complete computer system simulation: the SimOS approach," Parallel & Distributed Technology: Systems & Applications, IEEE, vol. 3, pp. 34-43, Winter 1995.

[28] P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, F. Larsson, A. Moestedt, and B. Werner, "Simics: A full system simulation platform," Computer, vol. 35, pp. 50-58, Feb. 2002.

[29] Y. Jiang, X. Shen, J. Chen, and R. Tripathi, "Analysis and approximation of optimal co-scheduling on chip multiprocessors," in Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT '08, (New York, NY, USA), pp. 220-229, ACM, 2008.

[30] T. C. Xu, P. Liljeberg, and H. Tenhunen, "Process scheduling for future multicore processors," in Proceedings of the Fifth International Workshop on Interconnection Network Architecture: On-Chip, Multi-Chip, INA-OCMC '11, (New York, NY, USA), pp. 15-18, ACM, 2011.

[31] J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov, "A comprehensive scheduler for asymmetric multicore systems," in Proceedings of the 5th European conference on Computer systems, EuroSys '10, (New York, NY, USA), pp. 139-152, ACM, 2010.

[32] V. Kazempour, A. Kamali, and A. Fedorova, "AASH: an asymmetry-aware scheduler for hypervisors," in Proceedings of the 6th ACM SIGPLAN/SIGOPS international conference on Virtual execution environments, VEE '10, (New York, NY, USA), pp. 85-96, ACM, 2010.

[33] S. Hofmeyr, C. Iancu, and F. Blagojevic, "Load balancing on speed," in Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP '10, (New York, NY, USA), pp. 147-158, ACM, 2010.

[34] F. R. Johnson, R. Stoica, A. Ailamaki, and T. C. Mowry, "Decoupling contention management from scheduling," in Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ASPLOS '10, (New York, NY, USA), pp. 117-128, ACM, 2010.

[35] N. Guan, M. Stigge, W. Yi, and G. Yu, "Cache-aware scheduling and analysis for multicores," in Proceedings of the seventh ACM international conference on Embedded software, EMSOFT '09, (New York, NY, USA), pp. 245-254, ACM, 2009.

[36] L. Tang, J. Mars, and M. L. Soffa, "Contentiousness vs. sensitivity: improving contention aware runtime systems on multicore architectures," in Proceedings of the 1st International Workshop on Adaptive Self-Tuning Computing Systems for the Exaflop Era (co-located with PLDI 2011), (New York, NY, USA), pp. 12-21, ACM, 2011.

[37] A. Silberschatz, P. B. Galvin, and G. Gagne, Operating System Concepts. John Wiley & Sons, Inc., 2010.

[38] S. Haldar and A. Aravind, Operating Systems. Pearson Education, 2010.

[39] D. S. Milojicic, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou, "Process migration," ACM Computing Surveys, vol. 32, pp. 241-299, Sept. 2000.
[40] J. K. Ousterhout, "Scheduling techniques for concurrent systems," in Proceedings of the IEEE Distributed Computing Systems, pp. 22-30, 1982.

[41] L. Chai, Q. Gao, and D. K. Panda, "Understanding the impact of multi-core architecture in cluster computing: A case study with Intel dual-core system," in Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, CCGRID '07, (Washington, DC, USA), pp. 471-478, IEEE Computer Society, 2007.

[42] I. A. Moschakis and H. D. Karatza, "Evaluation of gang scheduling performance and cost in a cloud computing system," Journal of Supercomputing, vol. 59, pp. 975-992, Feb. 2012.

[43] D. Burger and T. M. Austin, "The SimpleScalar tool set, version 2.0," tech. rep., 1997.

[44] M. Moudgill, P. Bose, and J. Moreno, "Validation of Turandot, a fast processor model for microarchitecture exploration," in Performance, Computing and Communications Conference, 1999 IEEE International, pp. 451-457, Feb. 1999.

[45] E. Argollo, A. Falcon, P. Faraboschi, M. Monchiero, and D. Ortega, "COTSon: infrastructure for full system simulation," SIGOPS Operating Systems Review, vol. 43, pp. 52-61, Jan. 2009.

[46] M. Rosenblum, E. Bugnion, S. Devine, and S. A. Herrod, "Using the SimOS machine simulator to study complex computer systems," ACM Trans. Model. Comput. Simul., vol. 7, pp. 78-103, Jan. 1997.

[47] AMD Developer Central, AMD SimNow Simulator, http://developer.amd.com/tools/simnow/pages/default.aspx.

[48] A. Batat and D. G. Feitelson, "Gang scheduling with memory considerations," in Proc. of the 14th Intl. Parallel and Distributed Processing Symp., pp. 109-114, 2000.

[49] K. Hyoudou, Y. Kozakai, and Y. Nakayama, "An implementation of a concurrent gang scheduler for a PC-based cluster system," Systems and Computers in Japan, vol. 38, pp. 39-48, Mar. 2007.

[50] H. D. Karatza, "Scheduling gangs in a distributed system," International Journal of Simulation, vol. 7(1), pp. 15-22, 2006.

[51] Z. C. Papazachos and H. D. Karatza, "The impact of task service time variability on gang scheduling performance in a two-cluster system," Simulation Modelling Practice and Theory, vol. 17, no. 7, pp. 1276-1289, 2009.

[52] Z. C. Papazachos and H. D. Karatza, "Gang scheduling in a two-cluster system implementing migrations and periodic feedback," SIMULATION, 2010.

[53] Z. C. Papazachos and H. D. Karatza, "Gang scheduling in multi-core clusters implementing migrations," Future Generation Computer Systems, vol. 27, no. 8, pp. 1153-1165, 2011.

[54] Y. Wiseman and D. G. Feitelson, "Paired gang scheduling," IEEE Transactions on Parallel and Distributed Systems, vol. 14, pp. 581-592, June 2003.

[55] Y. Zhang, H. Franke, J. E. Moreira, and A. Sivasubramaniam, "An integrated approach to parallel scheduling using gang-scheduling, backfilling, and migration," in Revised Papers from the 7th International Workshop on Job Scheduling Strategies for Parallel Processing, JSSPP '01, (London, UK), pp. 133-158, Springer-Verlag, 2001.

[56] S. Frechette and D. R. Avresky, "Method for task migration in grid environments," in Proceedings of the Fourth IEEE International Symposium on Network Computing and Applications, NCA '05, (Washington, DC, USA), pp. 49-58, IEEE Computer Society, 2005.

[57] R. McDougall and J. Mauro, Solaris Internals (2nd Edition). Upper Saddle River, NJ, USA: Prentice Hall PTR, 2006.

[58] C. S. Pabla, Completely Fair Scheduler, http://www.linuxjournal.com/magazine/completely-fair-scheduler, 2009.
[59] I. Stojmenovic, "Simulations in wireless sensor and ad hoc networks: matching and advancing models, metrics, and solutions," Communications Magazine, IEEE, vol. 46, pp. 102-107, December 2008.

[60] K. Sankaralingam and R. H. Arpaci-Dusseau, "Get the parallelism out of my cloud," in Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar '10, (Berkeley, CA, USA), pp. 8-8, USENIX Association, 2010.

[61] G. Ghinea and S. Chen, "Perceived quality of multimedia educational content: A cognitive style approach," Multimedia Systems, vol. 11, pp. 271-279, 2006. 10.1007/s00530-005-0007-8.

[62] R. M. Fujimoto, Parallel and Distributed Simulation Systems. John Wiley & Sons, Inc., 2000.

[63] Completely Fair Scheduling (CFS) Class, Linux v3 kernel source, kernel/sched/fair.c, http://lxr.linux.no/.