On Intel processors, Linux uses a page size of 4 KB (4096 bytes). This means that the address of the first byte of a page frame is always a multiple of 0x1000 (4096).
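As a quick illustration (a minimal user-space sketch, not part of the original lab code), the low 12 bits of an address select a byte within its page, so an address marks the beginning of a page frame exactly when those bits are all zero:

```c
#include <stdio.h>

int main(void)
{
	unsigned long addr = 0x6000;	/* arbitrary example address */

	/* A page frame begins wherever the 12 low-order bits are zero,
	   i.e. the address is a multiple of 0x1000 (4096). */
	if ((addr & 0xFFF) == 0)
		printf("0x%lX is the first byte of a page frame\n", addr);
	else
		printf("0x%lX falls inside a page frame\n", addr);

	return 0;
}
```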
**Q1.** Assuming a page size of 4096 bytes (4 KB), indicate which of the following physical addresses mark the beginning of a page: 0x0, 0x5F001, 0x7A0B00, 0x9D000, 0x860000. Indicate the page frame to which each of the following addresses belongs: 0x178A67, 0x10101, 0x8FF000. Using the C shift-right operator (>>), give a C expression that quickly calculates the page frame number of a given address.
The kernel keeps track of the current status of each page
frame in memory using an array of page
structures. Each element of this array corresponds to a page
frame in physical memory. The array is defined as follows:
```c
struct page *mem_map;
```
For example, mem_map[0] contains information about the first page frame in memory (address range 0x0-0xFFF). The exact location in memory of the mem_map array depends on the size of the kernel. The total number of page frames in memory is given by max_mapnr.
A page structure is defined in include/linux/mm.h, line 151, as follows:
```c
struct page {
	struct list_head list;		/* ->mapping has some page lists. */
	struct address_space *mapping;	/* The inode (or ...) we belong to. */
	unsigned long index;		/* Our offset within mapping. */
	struct page *next_hash;		/* Next page sharing our hash bucket in
					   the pagecache hash table. */
	atomic_t count;			/* Usage count, see below. */
	unsigned long flags;		/* atomic flags, some possibly
					   updated asynchronously */
	struct list_head lru;		/* Pageout list, eg. active_list;
					   protected by pagemap_lru_lock !! */
	wait_queue_head_t wait;		/* Page locked? Stand in line... */
	struct page **pprev_hash;	/* Complement to *next_hash. */
	struct buffer_head *buffers;	/* Buffer maps us to a disk block. */
	void *virtual;			/* Kernel virtual address (NULL if
					   not kmapped, ie. highmem) */
	struct zone_struct *zone;	/* Memory zone we are in. */
};
```
In this document we will only consider three of these fields: flags, count, and zone.

The flags field determines the properties of a page. Flags are described in mm.h, line 275. For example, the PG_reserved flag indicates that a page is reserved for kernel use or for some other special purpose and cannot be touched.

The count field indicates whether the page is free or in use. If count is zero, the page is free; otherwise the page is in use.

The zone field indicates the memory zone to which the page belongs. Zones are described below.

On Linux 2.4.18, physical memory is divided into zones. Each zone contains a fixed number of contiguous pages having some specific property. On i386 machines there are only three possible memory zones:
- ZONE_DMA (4096 pages) contains pages suitable for DMA (Direct Memory Access) operations. Because of a limitation in ISA DMA controllers, only the first 16 MB of physical memory can be used for direct memory access. Therefore, ZONE_DMA contains 4096 pages (4096 * 4 KB = 16 MB), identified by mem_map[0] through mem_map[4095].
- ZONE_NORMAL contains pages available for normal (kernel and user) use. The size of this zone depends on the amount of RAM in your computer. For example, on a machine with 512 MB of RAM, the number of pages in ZONE_NORMAL is:

      NormalPages = (TotalMemory - DMAMemory) / PageSize
                  = (512 MB - 16 MB) / 4 KB
                  = 496 MB / 4 KB
                  = 507,904 KB / 4 KB
                  = 126,976 pages

  These pages range from mem_map[4096] to mem_map[131071].
- ZONE_HIGHMEM is generally not used; it contains 0 pages.

Upon an allocation request, the buddy system algorithm attempts to allocate pages within a goal zone, where the goal zone depends on the properties of the requested pages. If allocation from this zone fails, the other zones are also considered. The buddy system will be explained later.
**Q2.** Given a page size of 4096 bytes (4 KB), calculate the number of pages belonging to the normal zone (ZONE_NORMAL) if the total amount of memory is 128 MB and 256 MB. Also calculate the page index range in mem_map[] for each case.
A zone data structure is defined in include/linux/mmzone.h, line 36, as follows:
```c
struct zone_struct {
	/*
	 * Commonly accessed fields:
	 */
	spinlock_t		lock;
	unsigned long		free_pages;
	unsigned long		pages_min, pages_low, pages_high;
	int			need_balance;

	/*
	 * free areas of different sizes
	 */
	free_area_t		free_area[MAX_ORDER];

	/*
	 * Discontig memory support fields.
	 */
	struct pglist_data	*zone_pgdat;
	struct page		*zone_mem_map;
	unsigned long		zone_start_paddr;
	unsigned long		zone_start_mapnr;

	/*
	 * rarely used fields:
	 */
	char			*name;	/* "DMA", "Normal" or "HighMem" */
	unsigned long		size;	/* size in pages */
};
```
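The commonly accessed fields can be read directly once you have a pointer to a zone. The following sketch assumes a UMA machine, where all zones live in the global contig_page_data node (the same symbol used by the parsing code later in this section); show_normal_zone() is a hypothetical name:

```c
#include <linux/kernel.h>
#include <linux/mmzone.h>

/* Hypothetical helper: print a summary of ZONE_NORMAL. */
static void show_normal_zone(void)
{
	struct zone_struct *zone = &contig_page_data.node_zones[ZONE_NORMAL];

	printk("zone %s: %lu pages total, %lu free\n",
	       zone->name, zone->size, zone->free_pages);
}
```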
The function in charge of allocating pages is __alloc_pages(), implemented in mm/page_alloc.c, line 311:
```c
struct page *__alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)
```
The function attempts to allocate 2^order pages. The number of pages allocated is always a power of two, in order to reduce external fragmentation.
The gfp_mask argument specifies the priority of the request and the intended use of the pages. This parameter can be one of the GFP_xxx symbols defined in include/linux/mm.h, line 539. A high-priority request (e.g. GFP_ATOMIC) will cause the kernel to do everything possible to satisfy the request immediately, including swapping out used pages if no free pages are available. On the other hand, a low-priority request, such as GFP_USER, could put the current process to sleep if no free pages are available.
The zonelist
argument contains a list of memory
zones where the algorithm will look for free pages. The
function will try to allocate the requested pages in the first
zone of the list, the goal zone. If this fails, the
other zones in the list are considered. In this document we
will always assume ZONE_NORMAL
to be the goal
zone.
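As an aside, the kernel also provides the alloc_pages(gfp_mask, order) helper, which builds the zonelist from the mask and calls __alloc_pages(). A minimal sketch (the order value 3 is an arbitrary example, and alloc_demo() is a hypothetical name):

```c
#include <linux/mm.h>

static void alloc_demo(void)
{
	/* Try to allocate 2^3 = 8 contiguous page frames. */
	struct page *page = alloc_pages(GFP_KERNEL, 3);

	if (page != NULL) {
		/* ... use the pages ... */
		__free_pages(page, 3);	/* release all 8 pages */
	}
}
```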
When looking for free pages in a given zone, the function examines the zone's free_area[] array. Each element free_area[K] of this array contains a pointer to a list of areas of 2^K free contiguous pages in the memory zone. K is called the order of the free area list. For example:

```c
zone->free_area[2].free_list
```

is a list of free areas of 4 (2^2) pages.
*Figure: a list of free areas of 4 pages.*
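An area of order K therefore always holds 2^K pages. As a small sketch of the arithmetic (the helper name area_pages() is chosen here for illustration):

```c
/* Number of pages in a single area of order K:
   order 0 -> 1 page, order 1 -> 2 pages, order 2 -> 4 pages, ... */
static unsigned long area_pages(unsigned int K)
{
	return 1UL << K;
}
```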
If we go back to the __alloc_pages() function, we can now see that to satisfy a request for 2^order pages the function must look for a free area in the zone->free_area[order].free_list list. If a free area of that order cannot be found, the function looks in the free area list of the next order.

The algorithm can be summarized as follows:
```c
struct list_head *head, *curr;
struct page *page;

do {
	head = &zone->free_area[order].free_list;	/* the free area list */
	curr = list_next(head);				/* first element in the list */

	if (curr != head) {	/* check if the free area list contains elements */
		page = list_entry(curr, struct page, list);
		list_del(curr);	/* remove area from the free list */
		/* ... */
		return(page);
	}

	/* the free area list of this order is empty,
	   try the list of the following order */
	order++;
} while (order < MAX_ORDER);

/* cannot find an appropriate area in this zone */
return(NULL);
```
**Q3.** The free_area[4].free_list list of a zone contains 13 areas. What is the size (in pages) of a single area? What is the total number of pages referenced by this list?
**Q4.** What is the maximum size (in pages) of a free area supported by the Linux kernel?
The following code can be used to parse a free area list of
order order
in the ZONE_NORMAL
zone:
```c
struct list_head *head, *curr;
struct zone_struct *zone;
...
zone = &contig_page_data.node_zones[ZONE_NORMAL];	/* ZONE_NORMAL */
head = &zone->free_area[order].free_list;		/* head of the list */

list_for_each(curr, head) {
	struct page *page = list_entry(curr, struct page, list);
	/* "page" is now the first page of the current area */
	...
}
```
The list_for_each()
macro expands to a
for
loop which goes through every element of the
list. In each iteration, the current position in the list is
represented by the curr
pointer. To retrieve a
page
structure from this element, we need to use
the list_entry()
macro.
**Q5.** To which zone does page 4096 belong? What about page 2304? (You may want to review the "Keeping track of pages" section of this document.)
**Q6.** pageinfo-skel.c is a kernel module that creates a /proc entry called pageinfo. Reading from /proc/pageinfo will output general information about memory and the normal zone (ZONE_NORMAL). Extend the pageinfo_proc_read() function in this module so that it also outputs the number of areas in each of the zone's free area lists (use the code above as a reference). Also print the total number of pages referenced by each list. The final output should be something similar to this:

```
[ealtieri@italia os]$ cat /proc/pageinfo
...
FREE AREA LISTS:
order 0:  0 areas (0 pages)
order 1: 72 areas (144 pages)
order 2: 55 areas (220 pages)
order 3: 31 areas (248 pages)
order 4: 17 areas (272 pages)
order 5:  4 areas (128 pages)
order 6:  3 areas (192 pages)
order 7:  2 areas (256 pages)
order 8:  0 areas (0 pages)
order 9:  3 areas (1536 pages)
```

In order to compile the device driver, you will need to add the line EXPORT_SYMBOL(contig_page_data); to kernel/ksyms.c and recompile the kernel using "make clean dep bzImage".
Unfortunately, the kernel does not have a built-in page fault
counter. In order to keep track of the number of page faults
on the system, we need to implement our own counter. This can
be done with a few simple modifications to the page fault
handler, do_page_fault()
, in arch/i386/mm/fault.c
, line
150.
We declare a counter page_faults
to keep track of
the number of page faults. The counter must be initialized to
0 and incremented at every page fault. The right place to
increment page_faults
is at the beginning of
do_page_fault()
, as shown below:
```c
/* in arch/i386/mm/fault.c */
...
unsigned long page_faults = 0;	/* page fault counter */
...
asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
	...
	tsk = current;

	/* increment page fault counter */
	page_faults++;
	...
```
At this point the kernel is correctly counting every page
fault. However, the use of page_faults
is limited
to the current file, fault.c. We now want to make the counter accessible from everywhere in the kernel using the extern directive. Add the code shown below to include/linux/mm.h:
```c
/* include/linux/mm.h */

#ifndef _LINUX_MM_H
#define _LINUX_MM_H
...
#ifdef __KERNEL__

#include <linux/config.h>
#include <linux/string.h>
#include <linux/list.h>
#include <linux/mmzone.h>
#include <linux/swap.h>
#include <linux/rbtree.h>

/* Page fault counter (implemented in arch/i386/mm/fault.c) */
extern unsigned long page_faults;
...
```
Finally, we need one more small change to the kernel. Even
with the extern directive, the page_faults
counter is not accessible to a kernel module. This is because
our kernel modules are injected into the operating system at
run-time (dynamic linking), instead of being statically
linked to the kernel at compile-time.
In order to make our counter accessible to kernel modules and
device drivers, we need to export
page_faults
using a special macro called
EXPORT_SYMBOL(name)
. The file kernel/ksyms.c
contains a
list of all of the exported symbols in the kernel. Add the
line shown below to this file, possibly in the "internal
kernel memory management" section:
```c
EXPORT_SYMBOL(page_faults);
```
At this point you are ready to recompile the new kernel with "make clean dep bzImage".
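Once the symbol is exported, any module that includes <linux/mm.h> can read the counter directly. A minimal sketch (the function name report_page_faults() is hypothetical):

```c
#include <linux/kernel.h>
#include <linux/mm.h>	/* declares: extern unsigned long page_faults; */

/* Hypothetical helper: report the current page fault count. */
static void report_page_faults(void)
{
	printk(KERN_INFO "page faults since boot: %lu\n", page_faults);
}
```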
**Q7.** Add a page fault counter to the kernel as explained in this section, then recompile and install the kernel. To test the new kernel, modify pageinfo-skel.c so that it prints the number of page faults that have occurred since system reset.