CSC 262

Lab Exercise: Memory Manager Instrumentation

Extra Credit

Overview

In this lab you will add code to the Linux kernel that will display information about the current status of the memory manager. We will display information about the number and sizes of free memory blocks available, and also add a counter that keeps track of the number of page faults. Such a counter could be useful for monitoring system behavior.

Changing the behavior of the kernel has two steps. The code we will run to actually print the desired information will be implemented as a simple kernel module. But before our program can see the information it needs to see, we have to make it visible. (The linux kernel code embodies moderate levels of modularity and encapsulation, so not all the variables used by the memory manager are visible to other parts of the kernel. However, we can change thatby modifying the kernel source we can make variables visible on a global scale.) The other thing we will need to do is create a new variable for our page fault counter, and make it visible as well. All these changes will require that we recompile the kernel. (See the lab on kernel compilation for more details.) Because compiling the kernel is a time-consuming process, you should make all the changes suggested first, and then recompile everything before going on. The changes required are summarized below.

  1. Before starting, you should have an up-to-date copy of the kernel source. Remember that you can get this with the commands yum install kernel and yum install kernel-devel.
  2. Add the line EXPORT_SYMBOL(contig_page_data); to the end of arch/um/kernel/ksyms.c. (This will make the variable visible to the kernel module we plan to write. Note that the file paths given here and below are relative to the kernel source tree root, which should be something like /usr/src/redhat/BUILD/kernel-x.y.z/linux-x.y.z.i686.)
  3. Add the line EXPORT_SYMBOL(page_faults); right after the line above. (This variable doesn't exist yet, but we will define it below.)
  4. Add two lines to arch/i386/mm/fault.c, after the included headers
        unsigned long page_faults = 0; /* page faults counter */
    
    and inside function do_page_fault() right after the line tsk = current;
       /* increment page fault counter */
       page_faults++;
    
  5. Finally, in include/linux/mm.h add the following lines right after the includes:
    /* Page Faults Counter (implemented in arch/i386/mm/fault.c) */
    extern unsigned long page_faults;
    

Once you have completed the modifications above, you should do a full recompile of the kernel and modules. The steps to do so are summarized below. Note that with this sequence, the makefile handles most of the details of installation, putting files in the right place and updating grub.conf.

# cd /usr/src/redhat/BUILD/kernel-2.6.20/linux-2.6.20.i686
# cp .config .config-save
# make mrproper
# cp .config-save .config
# make oldconfig
# make clean dep bzImage
# make modules
# make modules_install
# make install

Once you have compiled the kernel, you can work on the pageinfo.c module. Your goal is to create a module that satisfies questions Q6 and Q7.

Contents

  1. Keeping track of pages
  2. Allocating pages
  3. Parsing a free area list
  4. Page faults

Keeping track of pages

On Intel processors, Linux uses a page size of 4 KB (4096 bytes). This means that the address of the first byte of a page frame is always a multiple of 0x1000 (4096).

question Q1. Assuming a page size of 4096 bytes (4 KB), indicate which of the following physical addresses mark the beginning of a page: 0x0, 0x5F001, 0x7A0B00, 0x9D000, 0x860000. Indicate the page frame to which the following addresses belong: 0x178A67, 0x10101, 0x8FF000.

Using the shift right C operator (>>), give a C expression that quickly calculates the page frame number of a given address.

The kernel keeps track of the current status of each page frame in memory using an array of page structures. Each element of this array corresponds to a page frame in physical memory. The array is defined as following:

struct page *mem_map;

For example, mem_map[0] contains information about the first page frame in memory (address range 0x0 - 0xFFF). The exact location in memory of the mem_map array depends on the size of the kernel. The total number of page frames in memory is given by max_mapnr.

A page structure is defined in include/linux/mm.h, line 151, as following:

struct page {
struct list_head list;/* ->mapping has some page lists. */
struct address_space *mapping;/* The inode (or ...) we belong to. */
unsigned long index;/* Our offset within mapping. */
struct page *next_hash;/* Next page sharing our hash bucket in
   the pagecache hash table. */
   atomic_t count;  /* Usage count, see below. */
   unsigned long flags;  /* atomic flags, some possibly
   updated asynchronously */
struct list_head lru;/* Pageout list, eg. active_list;
   protected by pagemap_lru_lock !! */
wait_queue_head_t wait;/* Page locked?  Stand in line... */
struct page **pprev_hash;/* Complement to *next_hash. */
struct buffer_head * buffers;/* Buffer maps us to a disk block. */
void *virtual;/* Kernel virtual address (NULL if
   not kmapped, ie. highmem) */
   struct zone *zone;  /* Memory zone we are in. */
}

In this document we will only consider the fields in bold.

In Linux, physical memory is divided into zones. Each zone contains a fixed number of continuous pages having some specific property. On i386 machines there are only three possible memory zones:

Upon allocation request, the buddy system algorithm attempts to allocate pages within a goal zone, where the goal zone depends on the properties of the requested pages. If this zone fails, the other zones are also considered. The buddy system was explained in class; more details appear below.

question Q2. Given a page size of 4096 bytes (4 KB), calculate the number of pages belonging to the normal zone (ZONE_NORMAL) if the total amount of memory is 128 and 256 MB. Also calculate the page index range in memmap[] for each case.

A zone data structure is defined in include/linux/mmzone.h, line 36, as following:

struct zone {
/*
 * Commonly accessed fields:
 */
spinlock_t lock;
unsigned long free_pages;
unsigned long pages_min, pages_low, pages_high;
int need_balance;

/*
 * free areas of different sizes
 */
free_area_t free_area[MAX_ORDER];

/*
 * Discontig memory support fields.
 */
struct pglist_data *zone_pgdat;
struct page *zone_mem_map;
unsigned long zone_start_paddr;
unsigned long zone_start_mapnr;

/*
 * rarely used fields:
 */
char *name;  /* "DMA", "Normal" or "HighMem" */
unsigned long size;   /* size in pages */
};

Allocating pages

The function in charge of allocating pages is __alloc_pages(), implemented in mm/page_alloc.c, line 311:

struct page * __alloc_pages(unsigned int gfp_mask, unsigned int order, zonelist_t *zonelist)

The function attempts to allocate 2order pages. The number of pages allocated is always a power of two, in order to reduce external fragmentation.

The gfp_mask argument specifies the priority of the request and use of the page. This parameter can be one the GFP_xxx symbols, defined in include/linux/mm.h, line 539. A high priority request (e.g. GFP_ATOM) will cause the kernel to do everything possible to satisfy the request immediately, including swapping out used pages if no free pages are available. On the other side, a low priority, such as GFP_USER, could put the current process to sleep if no free pages are available.

The zonelist argument contains a list of memory zones where the algorithm will look for free pages. The function will try to allocate the requested pages in the first zone of the list, the goal zone. If this fails, the other zones in the list are considered. In this document we will always assume ZONE_NORMAL to be the goal zone.

When looking for free pages in a given zone, the function examines the zone's free_area[] array. Each element free_area[K] of this array contains a pointer to a list of areas of 2K free continuous pages in the memory zone. K is called the order of the free area list. For example:

zone->free_area[2].free_list

is a list of free areas of 4 (22) pages.

free area list

A list of free areas of 4 pages.

If we go back to the __alloc_pages() function, we can now see that to satisfy a request of 2order pages the function must look for a free area in the zone->free_area[order].free_list list. If a free area of such order cannot be found, the function looks in the free area list of the next order.

The algorithm can be summarized as following:

struct list_head *head, *curr;
struct page *page;

do {
   head = &zone->free_area[order].free_list;   /* the free area list */
   curr = list_next(head); /* first element in the list */
   
   if (curr != head) {/* check if the free area list contains elements */
 page = list_entry(curr, struct page, list);
 list_del(curr); /* remove area from the free list */

 /* ... */

 return(page);  
   }

   /* the free area list of this order is empty, 
   try the list of the following order */
   order++;

} while(order < MAX_ORDER);

/* cannot find an appropriate area in this zone */
return(NULL);
question Q3. The free_area[4].free_list list of a zone contains 13 areas. What is the size (in pages) of a single area? What is the total number of pages referenced by this list?
question Q4. What is the maximum size (in pages) of a free area supported by the Linux kernel?

Parsing a free area list

The following code can be used to parse a free area list of order order in the ZONE_NORMAL zone:

struct list_head *head, *curr;
struct zone *zone;

...

zone = &contig_page_data.node_zones[ZONE_NORMAL];

head = &zone->free_area[order].free_list;   /* head of the list */
list_for_each(curr, head) {
struct page *page = list_entry(curr, struct page, list);

/* "page" is now the first page of the current area */
...
}

The list_for_each() macro expands to a for loop which goes through every element of the list. In each iteration, the current position in the list is represented by the curr pointer. To retrieve a page structure from this element, we need to use the list_entry() macro.

question Q5. pageinfo.skel.c is the skeleton of a kernel module that creates a /proc entry called pageinfo. Reading from /proc/pageinfo will output general information about memory and the normal zone (ZONE_NORMAL). Rename the file to pageinfo.c and extend it with the following changes:
  • Modify the pageinfo_proc_read() function in this module so that it also outputs the number of areas in each of the zone's free area lists (you will need to loop over the number of orders and the number of pages in each order; use the code above as a reference). Also print the total number of pages referenced by each list. The final output should be something similar to this:
    [ealtieri@italia os]$ cat /proc/pageinfo 
    ...
    FREE AREA LISTS:
    order 0: 0 areas (0 pages)
    order 1: 72 areas (144 pages)
    order 2: 55 areas (220 pages)
    order 3: 31 areas (248 pages)
    order 4: 17 areas (272 pages)
    order 5: 4 areas (128 pages)
    order 6: 3 areas (192 pages)
    order 7: 2 areas (256 pages)
    order 8: 0 areas (0 pages)
    order 9: 3 areas (1536 pages)
    
  • Treat the other two memory zones the same way. (i.e., print out the same pieces of information for ZONE_DMA and ZONE_HIGHMEM
In order to compile this device driver, you will need to add the line EXPORT_SYMBOL(contig_page_data); to the end of arch/um/kernel/ksyms.c before recompiling the kernel.

Counting page faults

Unfortunately, the kernel does not have a built-in page fault counter. In order to keep track of the number of page faults on the system, we need to implement our own counter. This can be done with a few simple modifications to the page fault handler, do_page_fault(), in arch/i386/mm/fault.c, line 218.

We declare a counter page_faults to keep track of the number of page faults. The counter must be initialized to 0 and incremented at every page fault. The right place to increment page_faults is at the beginning of do_page_fault(), as shown below in bold:

/* in arch/i386/mm/fault.c */

...

unsigned long page_faults = 0; /* page faults counter */

...

asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
   ...
tsk = current;

/* increment page fault counter */
   page_faults++;

...

At this point the kernel is correctly counting every page fault. However, the use of page_faults is limited to the current module: fault.c. We now want to make the counter accessible from everywhere in the kernel using the extern directive. Add the code shown in bold below to include/linux/mm.h:

/* include/linux/mm.h */

#ifndef _LINUX_MM_H
#define _LINUX_MM_H

...

#ifdef __KERNEL__

#include <linux/config.h>
#include <linux/string.h>
#include <linux/list.h>
#include <linux/mmzone.h>
#include <linux/swap.h>
#include <linux/rbtree.h>

/* Page Faults Counter (implemented in arch/i386/mm/fault.c) */
extern unsigned long page_faults;

...

Finally, we need one more small change to the kernel. Even with the extern directive, the page_faults counter is not accessible to a kernel module. This is because our kernel modules are injected into the operating system at run-time (dynamic linking), instead of being statically linked to the kernel at compile-time.

In order to make our counter accessible to kernel modules and device drivers, we need to export page_faults using a special macro called EXPORT_SYMBOL(name). The file arch/um/kernel/ksyms.c contains a list of exported symbols in the kernel. Add the line shown below to the bottom of this file:

EXPORT_SYMBOL(page_faults);

At this point you are ready to recompile the new kernel with "make clean dep bzImage".

question Q7. Add a page fault counter to the kernel as explained in this section, then recompile and install the kernel. To test the new kernel, modify pageinfo.skel.c so that it prints the number of page faults occurred since system reset.