File System Background

Background

This page covers filesystem basics and the storage hardware present on the Netbook. The current filesystem design document for Netbook LX is File_System_Paper.

Unix/Linux FS basics

Some details for those perhaps unfamiliar with the basics:

Unix filesystems are unified, i.e. there is only one 'tree' of files and thus only one 'root' from which all files are referenced. This is unlike DOS/Windows/EPOC, where you first select a device/disk (C: D: Z:) and then navigate into the tree of files below that. In Unix each extra disk/device is 'mounted' somewhere in the tree. Doing this overlays (hides) anything that was already accessible at that location. In kernel 2.6 this has changed so that it is possible to see both the new files (on the newly mounted disk) and the old files (on the old disk) that appear in the tree at the same place.
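As a rough illustration of how one tree can be served by several devices, the sketch below resolves a path to the device whose mount point is the longest matching prefix. The mount table and device names are made up for the example and are not the Netbook's actual layout.

```python
# Hypothetical mount table: mount point -> device name (illustrative names only).
MOUNTS = {
    "/": "internal-flash",
    "/tmp": "ram",
    "/mnt/cf": "compact-flash",
}


def device_for(path, mounts=MOUNTS):
    """Return the device serving `path`: the longest mount point that is a prefix of it."""
    best = "/"  # the root mount always matches, so it is the fallback
    for mount_point in mounts:
        prefix = mount_point.rstrip("/") + "/"
        matches = path == mount_point or path.startswith(prefix)
        if matches and len(mount_point) > len(best):
            best = mount_point
    return mounts[best]


print(device_for("/etc/passwd"))
print(device_for("/tmp/scratch.dat"))
print(device_for("/mnt/cf/photos/a.jpg"))
```

Anything that lived under /tmp on the root device is hidden while the RAM filesystem is mounted there, because the longer mount point always wins.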

Although everything appears as one tree, each device can have its own filesystem, so quite complex arrangements of storage types and characteristics can be set up. This is one reason why Unix installs files by their type of use (config files into /etc, user files into /home, temporary files into /tmp, local binaries into /bin). These could potentially be on different filesystems/partitions on different disks, in RAM, or over the network, shared between many machines. This works well except where you want to store a whole set of files for different applications on a different medium - then it works against you.

Hardware Options

A discussion of the characteristics of, and possible filesystems for, internal flash, external Compact Flash and RAM.

Internal Flash

This is NAND flash. Conventional disk filesystems will work but are not very efficient, and some will wear the flash out very quickly because they write data in the same place for every write (e.g. FAT-based filesystems update the File Allocation Table on every write, so this bit of the flash wears out much faster than the rest). Filesystems designed for flash also deal with its other characteristics, e.g. NOR flash has extremely slow erases, all flash can only erase in much larger chunks than it can write, and a page can only be written a limited number of times before it must be erased and rewritten.
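The hot-spot effect can be seen in a toy model that simply counts writes per block. This is only a sketch under simplifying assumptions (one block per file, one table block, no erase or garbage-collection behaviour); the function names are invented for the example.

```python
from collections import Counter


def in_place_writes(n_files, table_block=0):
    """FAT-like layout: every file write also rewrites one fixed table block."""
    writes = Counter()
    for i in range(n_files):
        writes[table_block] += 1  # allocation table updated on every write
        writes[1 + i] += 1        # the data itself goes to a fresh block
    return writes


def append_only_writes(n_files, n_blocks):
    """Log-style layout: data and metadata are both appended, wrapping round the device."""
    writes = Counter()
    head = 0
    for _ in range(n_files):
        for _ in range(2):        # one data record plus one metadata record
            writes[head % n_blocks] += 1
            head += 1
    return writes


fat_like = in_place_writes(100)
log_like = append_only_writes(100, 50)
print("busiest block, in-place table:", max(fat_like.values()))
print("busiest block, append-only:   ", max(log_like.values()))
```

In the first layout one block takes as many writes as there are files, while in the second the same work is spread evenly over the device.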

Flash is accessed through the MTD (Memory Technology Devices) layer in Linux. This deals with the details of talking to each chip (different manufacturers and types have different command sets). This is fundamentally different from the block device layer, which is used for talking to disks. So far as I know we do not currently have working MTD support for the Netbook LX - this is necessary for using JFFS2 or YAFFS.

NOR flash is like memory/rewritable ROM (word access for reads, block access for erases), whilst NAND flash is like disks (reads and writes in pages, which are 512 bytes on current hardware).

The two filesystems that support NAND flash are JFFS2 and YAFFS. JFFS2 was originally designed for NOR but now also supports NAND well. YAFFS supports NAND (and RAM). Both are log-structured (journalling) filesystems where each entry is written 'on the end', so there is no extra wear on any part of the device. There is no 'index' on the drive as in a conventional disk-based filesystem - the index is constructed in RAM on startup, when the whole filesystem is read through. This means that boot times can be quite long due to the necessary scanning. YAFFS scans faster than JFFS2 because only headers rather than all the data need to be read.
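The idea of appending every change and rebuilding the index by scanning at mount time can be sketched as below. This is a toy model, not how JFFS2 or YAFFS are actually implemented: the 'flash' is just a Python list of records, and the class and method names are invented for the example.

```python
class LogStore:
    """Toy append-only store: the 'flash' is a list of (name, data) records."""

    def __init__(self, flash=None):
        self.flash = flash if flash is not None else []
        self.index = {}
        self.scan()

    def scan(self):
        """Rebuild the in-RAM index by reading every record, as at mount time."""
        self.index = {}
        for position, (name, data) in enumerate(self.flash):
            if data is None:
                self.index.pop(name, None)   # a deletion marker
            else:
                self.index[name] = position  # later records supersede earlier ones

    def write(self, name, data):
        """Append a new version of the file; the old version is never overwritten."""
        self.flash.append((name, data))
        self.index[name] = len(self.flash) - 1

    def delete(self, name):
        """Append a deletion marker rather than erasing anything in place."""
        self.flash.append((name, None))
        self.index.pop(name, None)

    def read(self, name):
        return self.flash[self.index[name]][1]


store = LogStore()
store.write("notes.txt", b"first draft")
store.write("notes.txt", b"second draft")
store.write("old.txt", b"x")
store.delete("old.txt")

remounted = LogStore(store.flash)  # a 'reboot': only the log survives
print(remounted.read("notes.txt"))
```

The scan touches every record in the log, which is why mount time grows with the amount of data written.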

Note that the wear-levelling provided by the journalling nature of the FS is limited under some circumstances: if the device is nearly full and most of the files are never deleted, then disproportionate wear will occur in the remaining segment of flash where files are being written and deleted.

JFFS2 has built-in compression, which typically means you get about twice as much stuff in as you were expecting. This results in a larger memory footprint in RAM. YAFFS is simpler and very low-resource. To get compression you need to explicitly use compressed files or add a compression filesystem on top (see Compression filesystems).

As we have quite a lot of RAM on the Netbook LX but are short of flash, it seems that JFFS2's compression is much more important than its overheads, so this seems likely to be the best filesystem. It is widely used in portable devices and has been very well tested, although the NAND support is relatively new (about 2 years old). There should be no problems using this.

Compact Flash cards

Compact Flash is a slightly confusing technology. It looks like IDE (hard drives) to the machine and can be used as a direct replacement, but inside the device is flash. This used to be NOR flash, but these days is NAND flash. A controller is included in the device to convert between IDE (ATA commands) and the flash chips. We have no control over this controller and manufacturers don't usually tell you much about what it does. The important thing from our point of view is how the controller distributes writes over the flash. If it just uses a direct mapping from disk sector to flash page then filesystems that repeatedly write to the same area will wear out those flash zones quickly. The controllers have a stock of 'spare' flash which they use to fill holes in the device when sectors go bad. It is a feature of NAND flash that bad sectors are expected, even from new. Smarter controllers should attempt to distribute writes round the device, and it would be smart to see if it is worth buying devices that do this. However, we do not have control of the CF cards people use, so we need to try to minimize any problems of flash wear by filesystem choice and overall system design.

Because the CF card appears as an IDE device (i.e. through the block layer) it is difficult to use filesystems designed explicitly for NAND flash - we have to use filesystems designed for disks. There is a module called blkmtd which allows an IDE device to be accessed as if it were a real MTD device, which is normally a crazy thing to do but may make sense for CF cards. This can be used with JFFS2, but not with YAFFS (because the OOB information for the flash chips is not exposed). This will improve wear-levelling, but cannot deal with flash errors.

We are also constrained if we want to be able to boot from the CF with the existing BooSt - this only recognizes a FAT filesystem so any card which we can boot from needs a FAT partition at the start of the device. This may only be necessary during development or for system restore. The same consideration affects transfer of files to other devices that read CF cards - nearly all will expect to see VFAT.

Our experience of ext3 on the SanDisk CF cards shows that these devices do the simple direct mapping described above, and because of the way the ext3 journaling works (all writes go through a 32MB section at the end of the drive), this part of the device wears out extremely fast (within a few days/reboots) if atime modification is enabled. With atime updates disabled it is probably a viable choice, although the journaling does increase the number of writes to the device.

Common filesystems for disk use are: vfat, ext2, ext3, reiserfs, jfs, xfs, ntfs.

Vfat and ntfs are not native Unix filesystems and do not support important features like user and group IDs, access permissions and links, so are not practical for most systems. Linux NTFS write support is also not well tested. We can ignore these for storing Unix system files, but VFAT is useful for storing files that need to be read on other devices and for compatibility (cameras, BooSt loading etc.). VFAT is famously unreliable as a frequently-updated filing system.

Ext2 is very reliable and well tested. Its main problem is that it must be shut down properly, otherwise it has to do automatic filesystem integrity checks ('fsck') on startup. This can take a long time (3 minutes on a 512MB CF?). The importance of this depends on how often the system is going to be rebooted without having gone through a proper shutdown. As this is a battery-powered device, that seems quite likely to happen, so this filesystem is not ideal. It also makes no special effort to distribute writes, although it doesn't concentrate writes at the start of the device as much as FAT does [Is this true?].

Ext3, reiserfs, jfs and xfs are all journaled, so they do not lose filesystem integrity on sudden power loss (unwritten bits of data may be lost, of course). However, these journals are not necessarily done the same way as the flash filesystem journals in JFFS2 and YAFFS, so they don't necessarily reduce flash wear.

Ext3 uses a 32MB journal at the end of the drive and all writes go through this section, then are rewritten to the right place later (by kjournald) so in fact it concentrates all writes in one part of the device and doubles the number of writes overall.
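The effect described above can be put into numbers with a small counting model. This is only a sketch of the behaviour as stated in the text (every write passes through a journal region at the end of the device); real ext3 behaviour depends on the journaling mode in use, and the block counts here are arbitrary.

```python
from collections import Counter


def journaled_writes(targets, n_blocks, journal_blocks):
    """Count block writes when every update goes to an end-of-device journal first."""
    writes = Counter()
    journal_start = n_blocks - journal_blocks
    head = 0
    for target in targets:
        writes[journal_start + head] += 1  # first copy: into the circular journal
        head = (head + 1) % journal_blocks
        writes[target] += 1                # second copy: to the final location
    return writes


# 900 updates, each to a different data block, on a 1000-block device
# whose last 100 blocks hold the journal.
counts = journaled_writes(range(900), n_blocks=1000, journal_blocks=100)
print("total block writes:", sum(counts.values()))
print("writes to one journal block:", counts[900])
print("writes to one data block:", counts[0])
```

Even when the data blocks are each written once, the journal blocks are rewritten many times over, which is where the wear concentrates on a card that maps sectors directly to flash.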

Reiserfs uses a very different strategy of balanced B-trees or (in reiser4) 'dancing trees'. It stores small files very efficiently and as the oldest journaled FS (reiser3) is very reliable. There is a new Reiser4 version out which is even faster and more wonderful, but very new. I still haven't found out how either version distributes writes though, and thus whether it would really help.

JFS and XFS are both designed for large (datacenter style) storage and thus probably don't offer useful advantages.

MMC

These cards contain NAND but (like CF cards) do not expose it directly, presenting a disk-like read/write sector interface to the machine. This means we can use the same filesystems as on CF, and the same considerations apply, which probably leaves us choosing between VFAT (for compatibility) and ext2/ext3/reiserfs (for reliability). It seems that options other than FAT may simply not be supported by the hardware - which makes that choice easy.

RAM

RAM can be used via a filesystem as well as being used as plain memory. This makes sense especially for the storage of temporary files, and potentially for large chunks of the system.

RAM is much faster than flash so there are significant speed gains to be had from having heavily-used parts of the filesystem in RAM. The tradeoff is that RAM is also needed by applications and the kernel.

Ramdisk - this is a simple fixed-size filesystem in RAM, created by emulating a hard disk in memory. The size of the FS cannot be increased or reduced and the pages used are not swappable. It is useful for booting because bootloaders support it, but it is no longer a sensible RAM-based filing system for general use.

ramfs - this filesystem can shrink and grow as required.

tmpfs - this is like ramfs but you can also specify a maximum size for it and the pages are swappable. It is always present in any kernel since 2.3 as it is used internally - only the user-visible part is optional. You cannot loop mount into a tmpfs.

Filesystem Layout

This is a simplified description of the main directories, by purpose, and what they are used for.

/ - the root of the filesystem
/bin - system executables
/etc - system config files
/usr - where most stuff beyond the base system goes, so we get /usr/bin for application binaries, /usr/lib for application libraries, /usr/share for application files that are not architecture specific (i.e. they would be the same on x86 or arm machines - docs, examples, colour schemes etc.)
/home - the conventional place for user files and config, but can be moved fairly easily.
/tmp - temporary files which will not be needed over a reboot - thus an ideal candidate for a ram-based filesystem
/var - application files, e.g. package database info, newly-arrived email, log messages, things waiting to be printed.

We can class directories by if/when they are changed (i.e are written to)

/bin - never (or only when system-utilities are installed)
/etc - whenever system config is changed - this could be whenever new network connections are set up, for example.
/home - whenever users save files or change their own personal config (e.g. mail client config, colour schemes)
/tmp - whenever any app needs a temporary file - i.e. very often
/var - whenever any app needs to save system data, such as new mail arriving or log messages being logged - i.e. fairly often.

Note that Unix supports 3 timestamps on files - atime, mtime and ctime. atime is the time of the last access to the file, mtime is the last change to the file itself and ctime is the last change of the file attributes (owner, permissions). Maintaining atime means that every time a file is read this timestamp is updated, so every read actually generates a write. For flash-based filesystems, and low-power systems, this feature is not worth the cost, so it should be turned off (using the 'noatime' option on mount). This is assumed in the above description of how often directories are written to.
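The three timestamps can be inspected from Python with os.stat. The sketch below, which assumes a Linux system and uses a throwaway temporary file, changes only a file's permissions and then compares the timestamps before and after.

```python
import os
import tempfile


def timestamps(path):
    """Return the three Unix timestamps of `path` in nanoseconds."""
    st = os.stat(path)
    return {"atime": st.st_atime_ns, "mtime": st.st_mtime_ns, "ctime": st.st_ctime_ns}


# Create a scratch file to experiment on.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"hello")
    path = handle.name

before = timestamps(path)
os.chmod(path, 0o644)  # attribute change only: the contents are untouched
after = timestamps(path)
os.remove(path)

print("mtime changed by chmod:", before["mtime"] != after["mtime"])
print("ctime not earlier than before:", after["ctime"] >= before["ctime"])
```

Whether atime moves on a read depends on the mount options of the filesystem holding the file, which is exactly what 'noatime' controls.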

Power Management

With kernel 2.4 there is not a good device dependency model, so it is tricky for suspend/resume events to be made to happen in the best order. This is fixed in kernel 2.6. Thus, with 2.4, shutting down the CF slot whilst it contains a filesystem that is being actively written-to causes lockups. This may be fixable by kernel changes, but it is probably more appropriate to address it by making sure that there are not lots of actively-written files on the CF at shutdown time. Such writes cannot be completely eliminated (a user may be saving their important file and closing the lid at the same time unless we simply don't allow writes to the CF at all) so we do need to ensure that we are actually fixing the problem rather than making it unlikely (as this would be guaranteed to come and bite people at bad times).

Journalled disk filesystems (e.g. ext3) have implications for power consumption. Typically there is a journal daemon that kicks off every 5 seconds to move stuff from the journal to its final resting place on disk. This uses power. You can use noflushd to stop the kernel flushing things to disk more often than necessary. This is primarily aimed at laptop users, but may be worth looking at for optimising CF writes, so that they only happen when necessary.

Keeping most things that change in RAM (which has to be powered all the time anyway) is one way of minimising power consumption.

Other considerations

BooSt only understands FAT (16 or 12?), so at least the kernel must be booted from this. It manages a spare blocks section for this FAT partition.

The squashfs filesystem offers much better compression than JFFS2, but is read-only (at the moment).

Adding applications from removable media

Zeroinstall

(Not very Debian, and not much help to us.)

This is an interesting scheme to allow users to install and run packages. They are downloaded the first time (normally from the net, but could be from a CF card) and then cached in /var/uri/packagename so that next time they are instantly available. This doesn't help us much because things are copied off the install medium, and there isn't room in our system. Also everything has to work from local directories, as opposed to where it is normally installed, and packages are installed on a per-user basis rather than for the system. This last item is not a problem given that the machine is intended for single-user use. Some apps can cope with being installed to odd places, but I suspect a lot can't and would have to be recompiled with a suitable prefix at least. This we could do, but it is a significant deviation from the standard Debian ways we are hoping to benefit from.

UnionFS

This provides a neat way of adding applications from external media so that they appear transparently in the right places in the filesystem. It requires kernel 2.6. However at this stage it seems that the code is no more than a prototype, so we probably don't have the time or resources to make it work.

Familiar scheme

[TODO - links present, which are dangling when CF not present]

So there is no obvious way that we can transparently have applications on the CF appear in the filesystem and package manager when the CF is inserted. The closest we can come is to pretend that the apps are always there, but they won't work unless the CF is present.
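If links into the CF are used in this way, a launcher or menu could at least check which of them are currently dangling before offering the application. The following is a minimal sketch of such a check; it is not part of the Familiar scheme itself, and the function name is invented for the example.

```python
import os


def dangling_links(directory):
    """Return symlinks directly inside `directory` whose targets do not currently exist."""
    missing = []
    for entry in os.scandir(directory):
        # os.path.exists() follows the link, so it is False when the target is absent.
        if entry.is_symlink() and not os.path.exists(entry.path):
            missing.append(entry.name)
    return sorted(missing)
```

Calling it on a directory of application links while the card is removed would list the applications that cannot run at the moment.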

Auditeon
