Questions: Q1, Q2, Q3, Q4, Q5, Q6
In the Linux kernel, processes are defined as task_struct structures in include/linux/sched.h, line 281. This structure contains all the relevant information about a process.
The first field of the task_struct structure identifies the state of a process:
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
There are five possible states for a process, described at line 86 of sched.h. These states are the following:
#define TASK_RUNNING            0   /* 00000000b */
#define TASK_INTERRUPTIBLE      1   /* 00000001b */
#define TASK_UNINTERRUPTIBLE    2   /* 00000010b */
#define TASK_ZOMBIE             4   /* 00000100b */
#define TASK_STOPPED            8   /* 00001000b */
TASK_RUNNING identifies a process that is executing on a CPU, or waiting to be executed.
TASK_INTERRUPTIBLE identifies a process that is suspended (sleeping) until some condition becomes true. Raising an interrupt, releasing a system resource the process is waiting for, or delivering a signal are examples of conditions that might wake up the process, that is, put its state back to TASK_RUNNING.
TASK_UNINTERRUPTIBLE identifies a process that is suspended as in the TASK_INTERRUPTIBLE state, except that delivering a signal will not wake up the process. This process state is seldom used.
TASK_STOPPED identifies a process whose execution has been stopped. The process enters this state upon delivery of a SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU signal (see signal(7)). Notice that the SIGSTOP signal cannot be ignored or caught. These signals are useful for debugging a program. Process execution can be continued by sending a SIGCONT signal to the process.
TASK_ZOMBIE identifies a process whose execution has terminated, but whose parent process has not yet issued a wait()-like system call (see wait(2)) to retrieve information about the dead process. Until the wait() is issued, the kernel cannot discard the data contained in the dead process descriptor because the parent could need it.
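To make the sleeping states more concrete, here is a minimal sketch of the classic 2.4-style wait-queue pattern: a piece of kernel code marks the current process TASK_INTERRUPTIBLE and calls schedule(); a later wake_up() on the queue (or a delivered signal) puts the process back to TASK_RUNNING. The wait queue my_queue and the variable condition are hypothetical names used only for illustration.

#include <linux/sched.h>
#include <linux/wait.h>

static DECLARE_WAIT_QUEUE_HEAD(my_queue);   /* hypothetical wait queue */
static volatile int condition;              /* hypothetical wake-up condition */

static void wait_for_condition(void)
{
    wait_queue_t wait;

    init_waitqueue_entry(&wait, current);
    add_wait_queue(&my_queue, &wait);
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);   /* about to sleep          */
        if (condition)                           /* ...unless already done  */
            break;
        schedule();                              /* give up the CPU         */
    }
    set_current_state(TASK_RUNNING);
    remove_wait_queue(&my_queue, &wait);
}

/* Some other context would later do:  condition = 1; wake_up(&my_queue);  */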
Also notice in the task_struct the process identifiers, such as the PID and the PPID, and the process credentials, such as the user and the group to which the process belongs (UID and GID):
uid_t uid,euid,suid,fsuid;
gid_t gid,egid,sgid,fsgid;
gid_t groups[NGROUPS];
The meaning of these fields is explained below:
Name | Description |
---|---|
uid, gid | User and group real identifiers |
euid, egid | User and group effective identifiers |
fsuid, fsgid | User and group effective identifiers for file access |
groups | Supplementary group identifiers |
suid, sgid | User and group saved identifiers |
Real and effective identifiers are usually the same. They may differ only when a user executes a setuid program, that is, an executable file whose setuid flag is on. In this case, the effective user ID of the new process becomes the UID of the owner of the file. The following example shows how to set the setuid flag in a program:
[ealtieri@italia os]$ id
uid=500(ealtieri) gid=500(engineer) groups=500(engineer)
[ealtieri@italia os]$ ls
total 0
-rwx--x--x    1 ealtieri engineer        0 Jul  9 11:03 hello
[ealtieri@italia os]$ sudo chown root:root hello
[ealtieri@italia os]$ ls
total 0
-rwx--x--x    1 root     root            0 Jul  9 11:03 hello
[ealtieri@italia os]$ sudo chmod u+s hello
[ealtieri@italia os]$ ls
total 0
-rws--x--x    1 root     root            0 Jul  9 11:03 hello
The chown command changes the owner and group of the file "hello" to root. The chmod command sets the setuid flag of the file (u stands for user, and +s turns on the setuid flag). Notice the "s" character in the file permission codes of the last listing.
At this point, every time the "hello" file is executed, the resulting process will run with root privileges (EUID=0), independently of the user who started the program.
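As a quick check, a program could print its real and effective user IDs; this is only a hypothetical sketch of what a file like "hello" might contain, not the actual program used above. When the binary is installed setuid-root as shown, geteuid() returns 0 while getuid() still returns the ID of the invoking user.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* getuid():  real UID, the user who started the program      */
    /* geteuid(): effective UID, used for permission checks       */
    /*            (0 when the file is setuid-root)                */
    printf("real UID:      %d\n", (int) getuid());
    printf("effective UID: %d\n", (int) geteuid());
    return 0;
}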
The counter, nice, policy and processor fields are grouped together and are all involved in process scheduling, as explained later. The rt_priority field is also used for scheduling, but it only applies to real-time processes.
To allow an efficient search through processes of a given type (for instance, all processes in a runnable state), the kernel creates several lists of processes. Each list consists of pointers to process descriptors. A list pointer (struct task_struct *) is embedded in the process descriptor's data structure.
A circular doubly linked list links together all existing process descriptors. This list is called the process list. The prev_task and next_task fields of each process descriptor are used to implement this list. The head of the list is the init_task descriptor: it is the ancestor of all processes and it is called process 0 or swapper.
The Linux kernel defines a useful macro to traverse the process list:
for_each_task(p) { ... }
The macro expands to a for loop where, in every iteration, p points to a task descriptor of the process list. for_each_task() is defined in include/linux/sched.h, line 870:
#define for_each_task(p) \
    for (p = &init_task ; (p = p->next_task) != &init_task ; )
Notice that the head of the list, init_task, is skipped during the traversal because it does not correspond to any real process.
The tasks.c module creates the /proc/tasks entry. Reading from this file will output the process list. Notice the use of the for_each_task() macro in the tsk_proc_read() function.
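The actual tasks.c is provided with the lab; purely as an illustration, a read handler for such a /proc entry might look like the hypothetical sketch below, which walks the process list with for_each_task() while holding tasklist_lock.

#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/sched.h>

/* Hypothetical read handler: one "PID  command" line per process. */
static int tsk_read(char *page, char **start, off_t off,
                    int count, int *eof, void *data)
{
    struct task_struct *p;
    int len = 0;

    read_lock(&tasklist_lock);            /* protect the process list     */
    for_each_task(p) {
        if (len > PAGE_SIZE - 80)         /* stop before the page is full */
            break;
        len += sprintf(page + len, "%5d  %s\n", p->pid, p->comm);
    }
    read_unlock(&tasklist_lock);

    *eof = 1;
    return len;
}

static int __init tasks_init(void)
{
    create_proc_read_entry("tasks", 0, NULL, tsk_read, NULL);
    return 0;
}

static void __exit tasks_exit(void)
{
    remove_proc_entry("tasks", NULL);
}

module_init(tasks_init);
module_exit(tasks_exit);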
The kernel keeps track of the runnable processes (those in the TASK_RUNNING state) using another circular, doubly-linked list called the runqueue. The process descriptor includes the next_run and prev_run fields to implement the runqueue list. As in the previous case, the init_task process descriptor plays the role of the header.
Q3. Show how you would implement a loop that goes through the runqueue.
Q4. Modify tasks.c so that it only shows the runnable processes. You can use the loop from the previous question, or add a few lines to the existing loop to check the process state at each iteration.
Process scheduling is the part of an operating system that controls the order in which processes are executed and how long they can execute. A particular scheduling method is called a scheduling policy, and its implementation a scheduling algorithm. Linux offers three scheduling policies, defined in sched.h, line 115: SCHED_OTHER, SCHED_FIFO and SCHED_RR. All of these scheduling policies are preemptive: if a process with a higher priority gets ready to run, the current process will be preempted. The scheduling policy in effect for a process is given by the policy field of the process descriptor.
SCHED_OTHER is the default scheduling policy on Linux. The operating system assigns a certain number of clock ticks (the quantum) to each process. This value is stored in the counter field of the process descriptor. The number of ticks assigned to each task is based on the nice field of the process:
p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
where p is the process descriptor (task_struct) of the new task. The value of nice is usually 0 (the default), which identifies a process with normal priority. nice can range from -20 (highest priority) to 19 (lowest priority). With nice < 0, the value of counter will increase with respect to a normal-priority process. With nice > 0, the value of counter will decrease.
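To get a feel for the numbers, the following user-space sketch evaluates the recharge formula for a process whose counter has already dropped to 0. It assumes HZ = 100 and the 2.4 definition of NICE_TO_TICKS, which for that HZ reduces to ((20 - nice) >> 2) + 1; the exact scaling depends on HZ.

#include <stdio.h>

/* Assumed 2.4 expansion of NICE_TO_TICKS for HZ = 100. */
#define NICE_TO_TICKS(nice) (((20 - (nice)) >> 2) + 1)

int main(void)
{
    int nices[] = { -20, 0, 19 };
    int i;

    for (i = 0; i < 3; i++)
        /* counter was 0, so the new quantum is just NICE_TO_TICKS(nice) */
        printf("nice %3d -> quantum %2d ticks (%d ms at HZ=100)\n",
               nices[i], NICE_TO_TICKS(nices[i]),
               NICE_TO_TICKS(nices[i]) * 10);
    return 0;
}

Under these assumptions a nice 0 process gets 6 ticks (about 60 ms), a nice -20 process gets 11, and a nice 19 process gets only 1.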
The "niceness" of a process can be seen with the
ps
command:
[ealtieri@italia ealtieri]$ ps -l
  F S   UID   PID  PPID  C PRI  NI ADDR    SZ WCHAN  TTY          TIME CMD
000 S   500  1694  1693  0  75   0    -   644 11cf00 pts/2    00:00:00 bash
000 S   500  2207  1694 12  69   0    -  2755 147d0d pts/2    00:00:00 emacs
000 R   500  2209  1694  0  78   0    -   662 -      pts/2    00:00:00 ps
All of the processes above have the default nice value (0). To change the niceness of emacs, for example, use the renice command:
[ealtieri@italia ealtieri]$ renice 10 -p 2207
2207: old priority 0, new priority 10
[ealtieri@italia ealtieri]$ ps -l
  F S   UID   PID  PPID  C PRI  NI ADDR    SZ WCHAN  TTY          TIME CMD
000 S   500  1694  1693  0  75   0    -   644 11cf00 pts/2    00:00:00 bash
000 S   500  2207  1694 12  69  10    -  2755 147d0d pts/2    00:00:00 emacs
000 R   500  2209  1694  0  78   0    -   662 -      pts/2    00:00:00 ps
Only the super-user can decrease the niceness of a process.
At every clock tick the operating system decrements the counter field of the running process. This is done in kernel/timer.c, line 586:
void update_process_times(int user_tick)
{
    struct task_struct *p = current;    /* current running process */
    int cpu = smp_processor_id(), system = user_tick ^ 1;

    update_one_process(p, user_tick, system, cpu);
    if (p->pid) {
        if (--p->counter <= 0) {
            p->counter = 0;
            p->need_resched = 1;
        }
        if (p->nice > 0)
            kstat.per_cpu_nice[cpu] += user_tick;
        else
            kstat.per_cpu_user[cpu] += user_tick;
        kstat.per_cpu_system[cpu] += system;
    } else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
        kstat.per_cpu_system[cpu] += system;
}
When p->counter reaches 0, the time quantum of the current process has expired and another task is selected for running. The process will not be selected again for running until all of the other runnable processes have exhausted their time quantum. At that point the counter field of every process is recalculated according to the formula above.
SCHED_FIFO is a First-In-First-Out scheduling policy. A process with this scheduling policy, when selected for running, will continue to use the CPU until either it blocks on an I/O request, it is preempted by a higher-priority process (FIFO or RR), or it calls sched_yield().
SCHED_FIFO tasks belong to the category of real-time (RT) processes. Real-time processes are in charge of critical tasks whose execution cannot be interrupted. These tasks are usually involved in multimedia processing. Because RT processes prevent any other SCHED_OTHER process from running, their lifetime must be very short.
The priority of SCHED_FIFO processes is determined by the rt_priority field of the task descriptor ("rt" stands for real-time). This value ranges from 1 (lowest priority) to 99 (highest priority).
SCHED_RR identifies a Round-Robin real-time process. Everything described for SCHED_FIFO also applies to SCHED_RR, except that each process is only allowed to run for a maximum time quantum. When the quantum expires, the RR process is moved to the end of the run-queue and another RR process is selected. The length of a quantum is given by NICE_TO_TICKS(p->nice).
The priority of a SCHED_RR process is also given by rt_priority.
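From user space, a process can request one of the real-time policies with the sched_setscheduler() system call (root privileges are required). The sketch below asks for SCHED_RR with an rt_priority of 50; passing SCHED_FIFO instead selects the FIFO policy.

#include <stdio.h>
#include <sched.h>

int main(void)
{
    struct sched_param param;

    param.sched_priority = 50;    /* becomes the rt_priority field (1..99) */

    /* 0 means "the calling process"; the call fails without root privileges. */
    if (sched_setscheduler(0, SCHED_RR, &param) == -1) {
        perror("sched_setscheduler");
        return 1;
    }
    printf("now running with the SCHED_RR policy\n");
    return 0;
}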
On Linux, the schedule() function is in charge of scheduling processes, that is, deciding which process to run next. It is defined in kernel/sched.c, line 549.
The function starts by copying the value of current into the prev pointer. current is the descriptor of the last running process.
asmlinkage void schedule(void)
{
    struct schedule_data * sched_data;
    struct task_struct *prev, *next, *p;
    struct list_head *tmp;
    int this_cpu, c;

    ...

    prev = current;
    this_cpu = prev->processor;
At this point the function checks whether the scheduling policy of the current process is SCHED_RR and whether its time quantum has expired. If both of these conditions are true, the process quantum is recalculated and its descriptor is moved to the end of the run-queue. This allows another round-robin process to be selected for running next. If the process descriptor were not moved to the end of the run-queue list, the same process would be selected for running again; this is exactly what happens with the SCHED_FIFO policy.
    /* move an exhausted RR process to be last.. */
    if (unlikely(prev->policy == SCHED_RR))
        if (!prev->counter) {
            prev->counter = NICE_TO_TICKS(prev->nice);
            move_last_runqueue(prev);
        }
The scheduling proper starts a bit further into the function. It consists of a loop that goes through all the processes in the run-queue. Each process in the queue is assigned a "goodness" value, which determines how good a candidate the process is for running. This value is calculated by the goodness() function. The process with the highest goodness will be the next process to run. If no process is available for running, the operating system selects a special idle task.
repeat_schedule:
    next = idle_task(this_cpu);
    c = -1000;
    list_for_each(tmp, &runqueue_head) {
        p = list_entry(tmp, struct task_struct, run_list);
        if (can_schedule(p, this_cpu)) {
            int weight = goodness(p, this_cpu, prev->active_mm);
            if (weight > c)
                c = weight, next = p;
        }
    }
Notice from the code above that the order of processes in the run-queue is important. Given two or more processes with the same priority, the one that comes first in the queue will be selected for running. This is because the next pointer, which points to the task descriptor of the next process to run, is only changed if the goodness of process p is strictly greater than the best goodness found up to that point in the loop.
It may happen that all of the processes in the run-queue have a goodness of zero. As we will see in the next section, this means that all of the processes in the queue have exhausted their quantum (SCHED_OTHER policy). If this happens, the quanta are recalculated and the processes are reevaluated for running.
    /* Do we need to re-calculate counters? */
    if (unlikely(!c)) {
        struct task_struct *p;

        spin_unlock_irq(&runqueue_lock);
        read_lock(&tasklist_lock);
        for_each_task(p)
            p->counter = (p->counter >> 1) + NICE_TO_TICKS(p->nice);
        read_unlock(&tasklist_lock);
        spin_lock_irq(&runqueue_lock);
        goto repeat_schedule;
    }
From this point on nothing can prevent the scheduler from switching to the next task, which is stored in the next pointer. The task switch is carried out by the following line at the end of the function:
switch_to(prev, next, prev);
The goodness of a process is calculated by the goodness() function, in kernel/sched.c, line 144. This value indicates how good a process is for running. The goodness() function can return one of the following values:
Return Value | Description |
---|---|
-1000 | Never select this process |
-1 | The process wants to yield |
0 | The process is out of time (its quantum has expired); its counter field must be recalculated |
0 < x < 1000 | Goodness value for a SCHED_OTHER process; the larger, the more desirable the process is for running |
> 1000 | Goodness value for a real-time process (SCHED_FIFO or SCHED_RR) |
Because the scheduler selects the process with the highest goodness value for running, you can see from the return values above that real-time processes (SCHED_FIFO or SCHED_RR) will always be selected before normal processes (SCHED_OTHER).
The goodness() function starts by testing the SCHED_YIELD bit in the policy field of the process descriptor. This bit is set by the sched_yield() function and indicates that the process wants to yield. In the code below, p is a pointer to the task descriptor of the process being evaluated.
static inline int goodness(struct task_struct * p, int this_cpu, struct mm_struct *this_mm)
{
    int weight;

    /*
     * select the current process after every other
     * runnable process, but before the idle thread.
     * Also, dont trigger a counter recalculation.
     */
    weight = -1;
    if (p->policy & SCHED_YIELD)
        goto out;
Next, the function calculates the goodness value in the case of a normal process (SCHED_OTHER). A first approximation is calculated from the number of ticks left in the process quantum. This means that the goodness of a process decreases with time, until it becomes zero when the time quantum of the task expires.
    if (p->policy == SCHED_OTHER) {
        /*
         * Give the process a first-approximation goodness value
         * according to the number of clock-ticks it has left.
         *
         * Don't do any other calculations if the time slice is
         * over..
         */
        weight = p->counter;
        if (!weight)
            goto out;
        ...
Still in the SCHED_OTHER case, the final goodness is calculated as a function of the niceness of the process. Recall that the value of nice can range from -20 (highest priority) to +19 (lowest priority).
        weight += 20 - p->nice;
        goto out;
    }

    /* code for real-time processes goes here */

out:
    return weight;
In the case of real-time processes, things are a bit simpler because the function skips the whole if statement described above and jumps directly to the following lines:
    /*
     * Realtime process, select the first one on the
     * runqueue (taking priorities within processes
     * into account).
     */
    weight = 1000 + p->rt_priority;
The goodness of a real-time process does not depend on nice but on the rt_priority field of the task descriptor. The minimum value of rt_priority is 1 and the maximum is 99. Therefore, the goodness of a real-time process ranges from 1001 to 1099.
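Putting the ranges together, the goodness values described above can be restated as the small user-space sketch below; it ignores the SMP and memory-map bonuses that the real kernel function also adds for SCHED_OTHER processes, and the helper name is hypothetical.

#include <stdio.h>
#include <sched.h>

/* Hypothetical restatement of the goodness() ranges described above. */
static int goodness_sketch(int policy, int counter, int nice, int rt_priority)
{
    if (policy != SCHED_OTHER)
        return 1000 + rt_priority;   /* real-time: 1001 .. 1099 */
    if (counter == 0)
        return 0;                    /* quantum expired         */
    return counter + 20 - nice;      /* normal process: < 1000  */
}

int main(void)
{
    printf("nice 0, 6 ticks left : %d\n", goodness_sketch(SCHED_OTHER, 6, 0, 0));
    printf("nice 19, 1 tick left : %d\n", goodness_sketch(SCHED_OTHER, 1, 19, 0));
    printf("SCHED_RR, rt_prio 50 : %d\n", goodness_sketch(SCHED_RR, 0, 0, 50));
    return 0;
}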
Understanding the Linux Kernel, by Daniel P. Bovet & Marco Cesati, published by O'Reilly.