Linux Device Drivers


Linux Scheduler


Multitasking operating systems come in two flavors:


     Co-operative multitasking

     Pre-emptive multitasking


In preemptive multitasking, the scheduler decides when a process is to cease running and a new process is to resume running. The act of involuntarily suspending a running process is called preemption. In cooperative multitasking, a process does not stop running until it voluntarily decides to do so. The act of a process voluntarily suspending itself is called yielding.


Linux Scheduler


A new scheduler, called the O(1) scheduler because of its algorithmic behavior, was introduced in Linux kernel 2.6. It solved the shortcomings of the previous Linux scheduler and introduced powerful new features and performance characteristics. Here we discuss the fundamentals of scheduler design and how they apply to the O(1) scheduler: its goals, design, implementation, algorithms, and related system calls.


Scheduling Policy


The set of rules used to determine when and how to select a new process to run is called the scheduling policy. The scheduling algorithm must fulfill several objectives: fast process response time, good throughput for background jobs, avoidance of process starvation, and reconciliation of the needs of low- and high-priority processes.


Scheduling Policy


Processes are traditionally classified as


     I/O bound or

     Processor (CPU) bound



An alternative classification:

     Interactive processes

     Batch Processes

     Real-time Process


Process Priority


A common type of scheduling algorithm is priority-based scheduling. The idea is to rank processes based on their worth and need for processor time. Processes with a higher priority run before those with a lower priority, whereas processes with the same priority are scheduled round-robin (one after the next, repeating).

Linux provides dynamic priority-based scheduling. This concept begins with an initial base priority and then enables the scheduler to increase or decrease the priority dynamically to fulfill scheduling objectives.

For example, a process that spends more time waiting on I/O than running is clearly I/O bound; under Linux, it receives an elevated dynamic priority. A process that continually uses up its entire time-slice is processor bound; it receives a lowered dynamic priority.


Process Priority

Linux implements two separate priority ranges. The first is the nice value, a number from -20 to +19 with a default of 0. A larger nice value corresponds to a lower priority. Processes with a lower nice value (higher priority) run before processes with a higher nice value.


A process with a nice value of -20 receives the maximum time-slice, whereas a process with a nice value of +19 receives the minimum possible time-slice.


The second range is the real-time priority. The values are configurable but by default range from 0 to 99. All real-time processes are at higher priority than normal processes.




The time-slice is the numeric value that represents how long a task can run until it is preempted. Too long a time slice causes the system to have poor interactive performance. Too short a time slice causes significant amounts of processor time to be wasted on the overhead of switching processes.


I/O bound processes do not need longer time-slices (although they do like to run often), whereas processor bound processes crave long time-slices (to keep their caches hot). The Linux scheduler bumps the priority of interactive tasks, enabling them to run more frequently. It dynamically determines the time-slice of a process based on its priority.



Process preemption


The Linux operating system is preemptive. When a process enters the TASK_RUNNING state, the kernel checks whether its priority is higher than the priority of the currently executing process. If it is, the scheduler is invoked to preempt the currently executing process and run the newly runnable process.

When a process's time slice reaches zero, it is preempted and the scheduler is again invoked to select a new process.


The scheduling policy in action


Consider a system with two runnable tasks: a text editor and a video encoder. The text editor is I/O bound because it spends nearly all its time waiting for user key presses. But when the text editor receives a key press, the user expects the editor to respond immediately. The video encoder is processor bound. It spends all its time applying the video codec to the raw data.

In this scenario, the scheduler gives the text editor a higher priority and a larger time-slice than the video encoder receives, because the text editor is interactive. This ensures that the text editor has plenty of time-slice available. Because the text editor has a higher priority, it is able to preempt the video encoder when the user presses a key. The video encoder can monopolize the remaining time. This optimizes the performance of both applications.


The Linux Scheduling Algorithm


The Linux scheduler is defined in kernel/sched.c. The new scheduler was designed to accomplish specific goals:


     Implement fully O(1) scheduling. Every algorithm completes in constant time.

     Implement perfect SMP scalability. Each processor has its own locking and individual runqueue.

     Implement improved SMP affinity.

     Provide good interactive performance.

     Provide fairness. No process should be starved of time-slice.

     Optimize for the common case of only one or two runnable processes, yet scale to multiple processors, each with many processes.




The basic data structure in the scheduler is the runqueue. The runqueue is defined in kernel/sched.c as struct runqueue. The runqueue is the list of runnable processes on a given processor; there is one runqueue per processor.


struct runqueue {
    spinlock_t lock;              /* spinlock that protects this runqueue */
    unsigned long nr_running;     /* number of runnable tasks */
    unsigned long nr_switches;    /* context switch count */
    struct prio_array *active;    /* active priority array */
    struct prio_array *expired;   /* expired priority array */
};


A group of macros are used to obtain the runqueue associated with a given processor or process.

cpu_rq(processor): runqueue of the given processor

this_rq(): returns the runqueue of the current processor

task_rq(task): returns a pointer to the runqueue on which task is queued


Before a runqueue is manipulated, it must be locked. The functions task_rq_lock() and task_rq_unlock() are used for this.


The priority arrays


Each runqueue contains two priority arrays, the active array and the expired array. Priority arrays are defined in kernel/sched.c as struct prio_array. A priority bitmap is used to efficiently discover the highest-priority runnable task in the system.


struct prio_array {
    int nr_active;                      /* number of tasks in the queue */
    unsigned long bitmap[BITMAP_SIZE];  /* priority bitmap */
    struct list_head queue[MAX_PRIO];   /* priority queues */
};


MAX_PRIO is the number of priority levels in the system.

With 140 priorities and 32-bit words, BITMAP_SIZE is five (5 x 32 = 160 bits, enough to cover the 140 priorities).


Recalculating time-slices


The scheduler maintains two priority arrays for each processor: an active array and an expired array. The active array contains all the tasks in the associated runqueue that have time-slice left. The expired array contains all the tasks in the associated runqueue that have exhausted their time-slice. When a task's time-slice reaches 0, its time-slice is recalculated before it is moved to the expired array. Recalculating all the time-slices is then as simple as switching the active and expired arrays.


struct prio_array *array = rq->active;
if (!array->nr_active) {
    /* switch the active and expired arrays */
    rq->active = rq->expired;
    rq->expired = array;
}



The schedule() function


The act of picking the next task to run and switching to it is implemented via the schedule() function. This function is called explicitly by kernel code that wants to sleep and it is invoked whenever a task is to be preempted.


It is invoked, directly or in a lazy (deferred) way, by several kernel mechanisms.




Direct invocation


The scheduler is invoked directly when the current process must be blocked.


1.     Inserts current in the proper wait queue.

2.     Changes the state of current either to TASK_INTERRUPTIBLE or to TASK_UNINTERRUPTIBLE.

3.     Invokes schedule().

4.     Checks whether the resource is available; if not, goes to step 2.

5.     Once the resource is available, removes current from the wait queue.


Lazy invocation


The scheduler is invoked in a lazy way by setting the need_resched field of current to 1.


1.     When current has used up its quantum of CPU time.

2.     When a process is woken up and its priority is higher than that of the current process.

3.     When a sched_setscheduler() system call is issued.


Calculating priority and time-slice


Processes have an initial priority called the nice value. This value ranges from -20 to +19 with a default of 0; +19 is the lowest and -20 is the highest priority. This value is stored in the static_prio member of the process's task_struct. The dynamic priority is stored in the prio member; it is calculated as a function of the static priority and the task's interactivity.


The method effective_prio() returns a task's dynamic priority. The method begins with the task's nice value and computes a bonus or penalty in the range -5 to +5 based on the interactivity of the task.


Linux keeps a running tab on how much time a process spends sleeping versus how much time it spends in a runnable state. Based on this, it decides whether a task is I/O bound or processor bound.


Scheduler time-slices


Type of task         Nice value    Time-slice duration
Initially created    parent's      half of parent's
Minimum priority     +19           5 ms   (MIN_TIMESLICE)
Default priority     0             100 ms (DEF_TIMESLICE)
Maximum priority     -20           800 ms (MAX_TIMESLICE)


Sleeping and Waking up


Tasks that are sleeping (blocked) are in a special non-runnable state. A task sleeps for a number of reasons, but always while it is waiting for some event. The event can be a specified amount of time, more data from file I/O, or another hardware event. A task can also go to sleep when it tries to obtain a contended semaphore in the kernel. The task marks itself as sleeping, puts itself on a wait queue, removes itself from the runqueue, and calls schedule() to select a new process to execute. Waking back up is the inverse: the task is set as runnable, removed from the wait queue, and added back to the runqueue.

Two states are associated with sleeping: TASK_INTERRUPTIBLE and TASK_UNINTERRUPTIBLE.



Sleeping and Waking up


Sleeping is handled via wait queues. A wait queue is a simple list of processes waiting for an event to occur. Wait queues are represented in the kernel by wait_queue_head_t.

A wait queue is created statically via DECLARE_WAITQUEUE() or dynamically via init_waitqueue_head().

Waking is handled via wake_up(), which wakes up all the tasks waiting on the given wait queue.




The load balancer


The Linux scheduler implements separate runqueues and locking for each processor on an SMP system. The load balancer works to ensure that the runqueues are balanced.

The load balancer is implemented in kernel/sched.c as load_balance(). It has two methods of invocation. It is called by schedule() whenever the current runqueue is empty. It is also called via timer: every 1 millisecond when the system is idle and every 200 milliseconds otherwise.



Preemption and Context switching


Context switching, the switching from one runnable task to another, is handled by the context_switch() function defined in kernel/sched.c. It is called by schedule() when a new process has been selected to run. It does two basic jobs:


     Calls switch_mm(), which is defined in <asm/mmu_context.h>, to switch the virtual memory mapping from the previous process to the new process.

     Calls switch_to(), defined in <asm/system.h>, to switch the processor state. This involves saving and restoring stack information and the processor registers.


The kernel provides the need_resched flag to signify whether a reschedule should be performed.


Preemption and Context switching


Functions for accessing and manipulating need_resched


set_tsk_need_resched()    Set the need_resched flag in the given process.

clear_tsk_need_resched()  Clear the need_resched flag in the given process.

need_resched()            Test the value of the need_resched flag; returns true if set and false otherwise.

Upon returning to user space or returning from an interrupt, the need_resched flag is checked. If it is set, the kernel invokes the scheduler before continuing.


User Preemption


User preemption can occur:


     When returning to user space from a system call

     When returning to user space from an interrupt handler.


Kernel Preemption


Kernel preemption can occur


     When an interrupt handler exits, before returning to kernel space

     When kernel code becomes pre-emptable again

     If a task in the kernel explicitly calls schedule()

     If a task in the kernel blocks (which results in a call to schedule())


Data structures used by scheduler


Each process descriptor includes several fields related to scheduling:


need_resched A flag checked by ret_from_intr() to decide whether to invoke the schedule() function.

policy scheduling classes: SCHED_FIFO, SCHED_RR, SCHED_OTHER

rt_priority The static priority of a real-time process.

priority The base time quantum (or base priority) of a process.

counter The number of ticks of CPU time left to the process before its quantum expires.




Linux provides two real-time scheduling policies, SCHED_FIFO and SCHED_RR. The normal, non-real-time scheduling policy is SCHED_NORMAL.


The difference between SCHED_FIFO and SCHED_RR is that a process of the SCHED_FIFO class runs until it relinquishes control or until a process with a higher real-time priority wishes to run. A process of the SCHED_RR class, in contrast, is also preempted when its time-slice expires, so that other processes at the same real-time priority get a chance to run.


Both real-time scheduling policies implement static priorities. The kernel does not calculate dynamic priority values for real-time tasks. This ensures that a real-time process at a given priority always preempts a process at a lower priority.


Real time does not mean "hard real time" with guaranteed process switching and reaction times, but "soft real time"


Real-time priorities range from 0 to 99. This priority space is shared with the nice values of SCHED_NORMAL tasks, which use the space from MAX_RT_PRIO to (MAX_RT_PRIO + 40). By default, this means the -20 to +19 nice range maps directly onto the priority space from 100 to 139.


System calls related to scheduling


nice()                 Change the priority of a conventional process.

setpriority()          Set the priority of a conventional process.

sched_setscheduler()   Set the scheduling policy and priority of a process.

sched_setparam()       Set the priority of a process.