CSC 262

Ext2 File System

 

The purpose of this lab is to help you to understand the data structures that comprise the ext2 file system. This is a relatively simple file system used in Linux before the development of the ext3 file system, and still used for floppy disks and other small volumes. Most Linux partitions on hard drives use the more sophisticated ext3 file system, which is more efficient for larger disks and offers other advantages. However, for our purposes the ext2 file system is perfect because it exemplifies all of the basic technology we have been studying.

This lab used to be completed using a file system on a floppy disk. With floppy drives going the way of the dinosaur, it now uses USB memory sticks (to be provided by the instructor, and returned at the end of the lab).

Contents

  1. Lab Overview
  2. How to create an ext2 filesystem
  3. Structure of an ext2 filesystem
  4. The super block
  5. The group descriptors
  6. The inodes and blocks bitmaps
  7. The inode table
  8. Directory inode entries
  9. Locating a file

Overview

In the very first part of this lab, you will create an ext2 file system on a USB memory stick using built-in Linux utilities. Most of the remainder of this page explains the data structures that are set up on the disk during this process. There are also sample programs that read and display parts of these data structures. You should read through the lab description, pausing when you get to a program to compile and run it. Make sure that you understand what each program is doing, and how. Your ultimate goal is to be able to write a program that will read and interpret the data structures on the drive, by parsing the raw block data returned by the device driver (see Question 9). Your program will return a listing of the drive's contents without relying on built in utilities like ls. The rest of this web page is designed to guide you through this process.

One more note may be helpful as you read the information below. The memory stick looks to the operating system just like an external hard drive, with information organized into tracks and sectors. The remainder of this lab will adopt this metaphor, and treat the stick as though it is an actual drive. The sectors on the drive are ordered, so that all the storage can be treated as though it is a sequence of addressable bytes (rather like our view of main memory). The primitive commands from the disk device driver that we are interested in at this level consist of seek (move the read/write head into position over a specified data address) and read (transfer a number of bytes from the disk, starting from the current data address). Thus a number of the questions below ask you to compute "offsets" -- that is, the data address where a certain piece of information resides on the disk, to which you would seek before reading. You can do this by computing the size in bytes of all the information before that which is desired. Often this will amount to multiplying the number of preceding blocks by the block size. In certain cases, such as finding the offset of a particular inode, the computation will be more complicated. In any case, understanding the "blocky" nature of disk storage is important for efficiency when dealing with real hard drives -- the fewer disk blocks you have to read, the better!

How to create an Ext2 filesystem

A new memory stick comes formatted for Windows. That means that it uses the vfat filesystem, which is simpler and less capable than ext2. Our first step will be to replace the vfat format with ext2. (Note: if for some reason the stick came without formatting, we could create a formatted partition on it using fdisk. If we were working with a floppy, we would format it first using fdformat.) Insert the memory stick into one of the USB ports on the left side of the monitor screen.

Creating the ext2 file system is simple using the mke2fs command. However, we first have to make sure that the vfat drive is not mounted (connected to the file system directory tree). Modern Linux systems automatically detect insertion of a memory stick and mount it for you. We can see the devices currently mounted using the mount command. The result should look somewhat like the following:

/dev/hda5 on / type ext3 (rw,noatime)
none on /proc type proc (rw)
none on /proc/bus/usb type usbfs (rw)
none on /sys type sysfs (rw)
/dev/hda7 on /boot type ext3 (rw,noatime)
/dev/hda1 on /mnt/windows type ntfs (ro,umask=0022,nls=iso8859-1)

The memory stick should be the first (and only) SCSI drive, referred to by Unix as /dev/sda. (Additional SCSI drives would be named /dev/sdb, /dev/sdc, etc. The floppy drive is /dev/fd0.) If you see a reference to /dev/sda1, that means that the memory stick has already been mounted, probably at /media/disk. To unmount it, use umount*. You may then proceed with the ext2 format, using mke2fs*. Finally, remount the device and change the permissions on it to be writable by anyone*. (The items marked with a * need root privileges, so you should use the sudo command.) These steps are summarized below.

$ mount
...
/dev/sda1 on /media/disk type ext2 (rw,noexec,nosuid,nodev,sync,users,user=nhowe)
$ sudo umount /media/disk
Password:
$ mount
$ sudo /sbin/mke2fs /dev/sda1
mke2fs 1.35 (28-Feb-2004)
Filesystem label=
OS type: Linux
Block size=1024 (log=0)
Fragment size=1024 (log=0)
31488 inodes, 125828 blocks
6291 blocks (5.00%) reserved for the super user
First data block=1
16 block groups
8192 blocks per group, 8192 fragments per group
1968 inodes per group
Superblock backups stored on blocks: 
	8193, 24577, 40961, 57345, 73729

Writing inode tables:  done                            
Writing superblocks and filesystem accounting information: done

This filesystem will be automatically checked every 21 mounts or
180 days, whichever comes first.  Use tune2fs -c or -i to override.
$ sudo mount /dev/sda1 /media
$ ls -al /media
total 17
drwxr-xr-x  3 root root  1024 Feb  4 15:23 .
drwxr-xr-x  5 root root  4096 Feb  4 15:01 ..
drwx------  2 root root 12288 Feb  4 15:23 lost+found
$ sudo chmod 777 /media
$ ls -al /media
total 17
drwxrwxrwx  3 root root  1024 Feb  4 15:23 .
drwxr-xr-x  5 root root  4096 Feb  4 15:01 ..
drwx------  2 root root 12288 Feb  4 15:23 lost+found

Note that the mount and umount commands require root access, and require a certain amount of system knowledge to accomplish successfully. The root requirement can be waived by including a line in a system file called /etc/fstab, specifying which filesystems are allowed to be mounted where. Still, using this method a user had to know enough to issue a mount command when a new piece of media was inserted -- one of the reasons Linux has a reputation as less user-friendly than Windows or Macintosh. As you have seen, current systems now have an autodetect/automount daemon to handle everything automatically.

Structure of an Ext2 Filesystem

On disk, the Ext2 filesystem is organized as shown in the picture below:

Ext2

The first 1024 bytes of the disk, the "boot block", are reserved for the partition boot sectors and are unused by the Ext2 filesystem. The rest of the partition is split into block groups, each of which has the layout shown in the figure above. On a 1.44 MB floppy disk, there is only one block group. However, a 128 MB memory stick will have several (16, according to the example above).

The Ext2 data structures above are defined in include/linux/ext2_fs.h. In the following sections we will go through each of these data structures.

The SuperBlock

The superblock is defined in struct ext2_super_block, line 339 of include/linux/ext2_fs.h. It contains information such as the total number of blocks on disk, the size of a block (usually 1024 bytes), the number of free blocks, etc. The meaning of each field in the ext2_super_block structure is explained in ext2_fs.h. Part of this structure has been reported below:

struct ext2_super_block {
__u32 s_inodes_count;       /* Inodes count */
__u32 s_blocks_count;       /* Blocks count */
...
__u32 s_free_blocks_count;  /* Free blocks count */
__u32 s_free_inodes_count;  /* Free inodes count */
__u32 s_first_data_block;   /* First Data Block */
__u32 s_log_block_size;     /* Block size */
...
__u32 s_blocks_per_group;   /* # Blocks per group */
...
__u16 s_magic;  /* Magic signature */
...

The __u32, __u16 and __u8 data types denote unsigned 32-, 16- and 8-bit integer numbers.

s_inodes_count and s_blocks_count store the number of inodes and blocks on disk. If you look back at the output of mke2fs, you will see the total number of blocks and inodes is displayed. Also notice from the same output that the size of a block is 1024 bytes. Multiplying the total number of blocks by the block size gives you the total storage on the disk.

The size of a block is given by s_log_block_size. This value expresses the size of a block as a power of 2, using 1024 bytes as the unit. Thus, 0 denotes 1024-byte blocks, 1 denotes 2048-byte blocks, and so on. To calculate the size in bytes of a block:

unsigned int block_size = 1024 << super.s_log_block_size;   /* block size in bytes */

The super block also tells us the number of blocks per group with s_blocks_per_group. Using mke2fs, we can see that this value is 8192.  (The reason for this number will be explained below.)  Because there are 1440 blocks on a floppy, there can only be one group on a floppy disk. However, drives with larger capacities (e.g., our memory sticks) will have more groups.

question Q1. What is the difference in terms of wasted space and average file size between using a small block size (1024) and a big block size (e.g. 8192)?
question Q2. What is the size in bytes of a block if s_log_block_size=3 in the super-block?

The superblock is located at offset 1024 of the disk. The code to read the superblock from a disk is shown below. This code also checks the magic number of the super block (EXT2_SUPER_MAGIC) to confirm that we are reading from an Ext2 filesystem. For simplicity, other error checking has been omitted.

#include <linux/ext2_fs.h>

...

int main()
{
int sd;
struct ext2_super_block super;

sd = open("/dev/sda1", O_RDONLY);    /* open sda device */

lseek(sd, 1024, SEEK_SET);    /* position head above super-block */
read(sd, &super, sizeof(super));      /* read super-block */

if (super.s_magic != EXT2_SUPER_MAGIC)
exit(1); /* bad file system */

block_size = 1024 << super.s_log_block_size;  /* calculate block size in bytes */

...
ext2super.c ext2super.c reads the superblock from a disk and prints it to screen.

Group Descriptors

In the blocks immediately following the super-block reside the list of block-group descriptors. This list contains a descriptor for each block group on the disk. In the case of a floppy, there is only one block group and therefore one group descriptor. For a bigger disk, we have to calculate the size of this list by using s_blocks_count and s_blocks_per_group in the superblock:

/* calculate number of block groups on the disk */
unsigned int group_count = 1 + (super.s_blocks_count-1) / super.s_blocks_per_group;

/* calculate size of the group descriptor list in bytes */
unsigned int descr_list_size = group_count * sizeof(struct ext2_group_descr);

A group descriptor is defined by the ext2_group_descr structure, line 148 of ext2_fs.h. This structure is reported below:

struct ext2_group_desc
{
__u32 bg_block_bitmap;       /* Blocks bitmap block */
__u32 bg_inode_bitmap;       /* Inodes bitmap block */
__u32 bg_inode_table;        /* Inodes table block */
__u16 bg_free_blocks_count;  /* Free blocks count */
__u16 bg_free_inodes_count;  /* Free inodes count */
__u16 bg_used_dirs_count;    /* Directories count */
__u16 bg_pad;
__u32 bg_reserved[3];
};

A 1.44 MB floppy has one group descriptor only, which can be read using the following code (the same code will return the group descriptor for the first block group of any disk, which is sufficient to read the root directory of the disk):

struct ext2_group_descr group_descr;
...
lseek(sd, 1024 + block_size, SEEK_SET);  /* position head above the group descriptor block */
read(sd, &group_descr, sizeof(group_descr));
ext2group.c ext2group.c reads the first group descriptor of a disk and prints it to screen.

The group descriptor tells us the location of the block/inode bitmaps and of the inode table (described later) through the bg_block_bitmap, bg_inode_bitmap and bg_inode_table fields. These values indicate the blocks where the bitmaps and the table are located. It is handy to have a function to convert a block number to an offset on disk, which can be easily done by knowing that all blocks on disk have the same size of block_size bytes (calculated earlier from the super-block):

#define BASE_OFFSET 1024  /* location of the super-block in the first group */

#define BLOCK_OFFSET(block) (BASE_OFFSET + (block-1)*block_size)

Blocks are numbered starting from 1. Block 1 is the superblock of the first group, block 2 contains the group descriptors, and so on. Block 0 is the NULL block and does not correspond to any disk offset.

The blocks and inodes bitmaps

A bitmap is a sequence of bits. Each bit represents a specific block (blocks bitmap) or inode (inode bitmap) in the block group. A bit value of 0 indicates that the block/inode is free, while a value of 1 indicates that the block/inode is being used. In practice, the first few blocks of a group will always be in use, because they store the header information for the group. A bitmap always refers to the block-group it belongs to, and its size must fit in one block.

blocks bitmap

Limiting the size of a bitmap to one block also limits the size of a block-group, because a bitmap always refers to the blocks/inodes in the group it belongs to. Consider the blocks bitmap: given a block size of 1024 bytes, and knowing that each byte is made of 8 bits, we can calculate the maximum number of blocks that the blocks bitmap can represent: 8 * 1024 = 8192 blocks. Therefore, 8192 blocks is the size of a block-group using a 1024-byte block size, as we also see from the output of mkfs.ext2 in the first section.

question Q3. Calculate the number of blocks in a block-group given a block size of 4096 bytes.
question Q4. Consider how the blocks and inodes bitmaps might be used to create "secret" files on a disk not visible to the ordinary operating system. Explain how inconsistencies in these data structures could be detected through the other header information. (Obviously, this would be a slow and time-consuming process.)

The following code fragment reads the block bitmap from disk:

struct ext2_super_block super;  /* the super block */
struct ext2_group_desc group;        /* the group descritopr */
unsigned char *bitmap;

/* ... [read superblock and group descriptor] ... */

bitmap = malloc(block_size);    /* allocate memory for the bitmap */
lseek(sd, BLOCK_OFFSET(group->bg_block_bitmap), SEEK_SET);
read(sd, bitmap, block_size);   /* read bitmap from disk */

...

free(bitmap);

The inode table

The inode table consists of a series of consecutive blocks, each of which contains a predefined number of inodes. The block number of the first block of the inode table is stored in the bg_inode_table field of the group descriptor. To figure out how many blocks are occupied by the inode table, divide the total number of inodes in a group (stored in the s_inodes_per_group field of the superblock) by the number of inodes that fit on a block:

/* number of inodes per block */
unsigned int inodes_per_block = block_size / sizeof(struct ext2_inode);

/* size in blocks of the inode table */
unsigned int itable_blocks = super.s_inodes_per_group / inodes_per_block;

In the case of a floppy disk, there are 184 inodes per group and a block size of 1024 bytes. The size of an inode is 128 bytes, therefore the inode table will take 184 / (1024/128) = 23 blocks.

question Q5. How many blocks will the inode table take up on each block group of our 128 MB memory stick? What is the total size of the group header information, both in absolute terms and relative to the number of data blocks?

The inode table contains everything the operating system needs to know about a file, including the type of file, permissions, owner, and, most important, where its data blocks are located on disk. It is no surprise therefore that this table needs to be accessed very frequently and its read access time should be minimized as much as possible. Reading an inode from disk every time it is needed is usually a very bad idea. However, in this context we will adopt this method to keep the example code as simple as possible. We provide a general function to read an inode from the inode table:

static void read_inode(
     int sd,                                /* the floppy disk file descriptor */
     const struct ext2_super_block *super,  /* the super block for the disk */
     const struct ext2_group_desc *group,   /* the block group to which the inode belongs */
     int inode_no,                          /* the inode number to read  */
     struct ext2_inode *inode)              /* where to put the inode */
{
    int group_no = inode_no/super->s_inodes_per_group;
    lseek(sd, BLOCK_OFFSET(group[group_no].bg_inode_table)+(inode_no-group_no*num_inodes_per_group-1)*sizeof(struct ext2_inode),SEEK_SET);
    read(sd, inode, sizeof(struct ext2_inode));
}

The offset of the inode to read is calculated by adding together the absolute offset of the inode table and the distance of the desired inode from the beginning of the inode table.

question Q6. Calculate the offset of inode 93 on disk. Assume a block size of 1024 bytes and bg_inode_table=5. To which block does the inode belong to? (The size of an inode structure is 128 bytes)

Inodes are numbered starting from 1. An inode is defined as a struct ext2_inode, in ext2_fs.h. The most important fields of this structure have been reported below:

struct ext2_inode {
__u16   i_mode;           /* File type and access rights */
__u16   i_uid;            /* Low 16 bits of Owner Uid */
__u32   i_size;           /* Size in bytes */
__u32   i_atime;          /* Access time */
__u32   i_ctime;          /* Creation time */
__u32   i_mtime;          /* Modification time */
__u32   i_dtime;          /* Deletion Time */
__u16   i_gid;            /* Low 16 bits of Group Id */
__u16   i_links_count;    /* Links count */
__u32   i_blocks;         /* Blocks count */
__u32   i_flags;          /* File flags */
...
__u32   i_block[EXT2_N_BLOCKS];  /* Pointers to blocks */
...
};

Most of the fields in this structure are self-explanatory. Particular atttention should be paid to the fields in bold, explained in the following paragraphs.

i_mode determines the type and access rights of a file. Possible file types are listed below. For each file type is defined a macro (sys/stat.h) that can be used to test for that specific file type.

TypeMacro
Regular fileS_ISREG(m)
Directory S_ISDIR(m)
Character Device  S_ISCHR(m)
Block DeviceS_ISBLK(m)
FifoS_ISIFO(m)
SocketS_ISSOCK(m)
Symbolic LinkS_ISLNK(m)

The file permissions are also stored in i_mode. These permissions can be tested by ANDing i_mode with a set of symbols defined in sys/stat.h:

DomainReadWriteExecAll
UserS_IRUSRS_IWUSRS_IXUSRS_IRWXU
GroupS_IRGRPS_IWGRPS_IXGRPS_IRWXG
AllS_IROTHS_IWOTHS_IXOTHS_IRWXO

For example, to test if a file has user-execute permissions:

if (inode.i_mode & S_IXUSR) ... 

The i_blocks field of the inode structure counts the number of blocks used by the file. Pointers to the actual data blocks of the file are stored in the i_block[EXT2_N_BLOCKS] array. The EXT2_N_BLOCKS symbol is defined in ext2_fs.h (line 177) as following:

#define EXT2_NDIR_BLOCKS12                          /* number of direct blocks */
#define EXT2_IND_BLOCKEXT2_NDIR_BLOCKS              /* single indirect block   */
#define EXT2_DIND_BLOCK(EXT2_IND_BLOCK + 1)         /* double indirect block   */
#define EXT2_TIND_BLOCK(EXT2_DIND_BLOCK + 1)        /* trible indirect block   */
#define EXT2_N_BLOCKS(EXT2_TIND_BLOCK + 1)          /* total number of blocks  */

In total there are 15 pointers in the i_block[] array. The meaning of each of these pointers is explained below:

question Q7. Given a block size of 4096 bytes, what is the maximum size (in both blocks and bytes) of a file without indirection? What is the total maximum size of a file if all pointers (direct and indirect) are used? (Remember that each "block pointer" is declared as __u32, that is 4 bytes)

The following example prints the blocks used by the root directory of a disk, which is always on inode 2 of the inode table:

struct ext2_inode inode;
struct ext2_group_desc group;

/* ... [ read superblock and group descriptor ] ... */

read_inode(sd, &super, &group, 2, &inode);   /* read root inode (#2) from the floppy disk */

for (i=0; i<EXT2_N_BLOCKS; i++)
if (i < EXT2_NDIR_BLOCKS) /* direct blocks */
printf("Block %2u : %u\n", i, inode.i_block[i]);
else if (i == EXT2_IND_BLOCK)     /* single indirect block */
printf("Single   : %u\n", inode.i_block[i]);
else if (i == EXT2_DIND_BLOCK)    /* double indirect block */
printf("Double   : %u\n", inode.i_block[i]);
else if (i == EXT2_TIND_BLOCK)    /* triple indirect block */
printf("Triple   : %u\n", inode.i_block[i]);
ext2root.c ext2root.c reads the root inode (#2) from the inode table of a disk.

Directory entries in the inode table

Directory entries in the inode table require special attention. To test if an inode refers to a directory file we can use the S_ISDIR(mode) macro:

if (S_ISDIR(inode.i_mode)) ... 

In the case of directory entries, the data blocks pointed by i_block[] contain a list of the files in the directory and their respective inode numbers.

Directory Entry

Directory inode

The list is composed of ext2_dir_entry_2 structures:

struct ext2_dir_entry_2 {
__u32 inode;              /* Inode number */
__u16 rec_len;            /* Directory entry length */
__u8 name_len;            /* Name length */
__u8 file_type;
charname[EXT2_NAME_LEN];  /* File name */
};

The file_type field indicates what kind of file the entry is referring to. Possible values are:

file_typeDescription
0Unknown
1Regular File
2Directory
3Character Device
4Block Device
5Named pipe
6Socket
7Symbolic Link

Each entry has a variable size depending on the length of the file name. The maximum length of a file name is EXT2_NAME_LEN, which is usually 255. The name_len field stores the length of the file name, while rec_len stores the total size of the entry and is used to locate the next entry in the list.

dir entries

Example of EXT2 directory

question Q8. Can you explain when and why NULL characters ("\0") are appended to the end of the file name?

The following code reads the entries of a directory. Assume that the inode of the directory is stored in inode:

struct ext2_dir_entry_2 *entry;
unsigned int size;
unsigned char block[BLOCK_SIZE];

...

lseek(sd, BLOCK_OFFSET(inode->i_block[0]), SEEK_SET);
read(sd, block, block_size); /* read block from disk*/

size = 0;    /* keep track of the bytes read */
entry = (struct ext2_dir_entry_2 *) block;   /* first entry in the directory */
while(size < inode->i_size) {
char file_name[EXT2_NAME_LEN+1];
memcpy(file_name, entry->name, entry->name_len);
file_name[entry->name_len] = 0;      /* append null char to the file name */
printf("%10u %s\n", entry->inode, file_name);
entry = (void*) entry + entry->rec_len;      /* move to the next entry */
size += entry->rec_len;
}

The code above reads the first data block of a directory from disk. The block is stored in the block array. As explained earlier, this block contains a list of the directory's entries. The problem is that entries in this list have a variable size. This is the reason why we cannot just read the data block into an array of struct ext2_dir_entry_2 elements.

The entry pointer points to the current entry in the directory. The file name of the entry is copied from entry->name into file_name so that a null character can be appended to it. The inode and name of the file is then displayed.

Finally, the position of the following entry in the list is given by entry->rec_len. This field indicates the length in bytes of the current entry record. Therefore, the next entry is located at address (void*) entry + entry->rec_len.

Notice that the code above only works if the size of the directory is less than a block. If the entries list takes more than one block, the program will crash. The program in the link below includes a function called read_file that will read the entire directory contents into a memory buffer in order to solve this problem. (This may also fail if a directory is so large that an adequate block of memory cannot be allocated.)

source ext2list.c lists the contents of the root directory of a disk.

question Q9. To speed up deletion of an entry from a directory, the Linux kernel sets the inode field of the entry to be deleted to 0, and suitably increments the value of the rec_len field of the previous valid entry. Redraw the figure above after deletion of the test.txt entry.
question Q10. Modify ext2list.c so that it shows the contents of the whole disk. Start by assuming that all files are on the first block group, then modify your code to deal with multiple block groups.  (Hint:  because of the presence of symbolic links, you will need to introduce a data structure to keep track of which inode numbers have already been seen and processed.)

Locating a file

Locating the data blocks belonging to a file implies locating its inode in the inode table first. The inode of the desired file is generally not known at the time the open operation is issued. What we know is the path of the file. For example:

int sd = open("/home/ealtieri/hello.txt", O_RDONLY);

The desired file is hello.txt, while its path is /home/ealtieri/hello.txt.

To find out the inode belonging to the file we first need to descend through its path, starting from the root directory, until we reach the file's parent directory. At this point we can locate the ext2_dir_entry_2 entry corresponding to hello.txt and then its inode number.

locate

Once the inode of the file is known, the data blocks belonging to the hello.txt are specified by the inode.block[] array.

The root directory is always located at inode 2.

question Q11. Illustrate the process of looking up the file /usr/bin/gcc similarly to the figure above. You are free to assign any number to the inodes as long as your numbers are consistent and do not exceed the maximum inode number (see mkfs.ext2).  (You can view the actual inode numbers using stat, if you wish.)
question Q12. [Extra credit] Write a program that will take a path name as a command line argument, and will then locate the specified file on the disk (if it exists) and output its contents character by character. (Note that your program will bypass any permission settings on the file, so it would be unwise to allow any user other than root to run such a process.)