|
Talk Overview
- Title: eXplode: a Lightweight, General System for Finding Serious Storage System Errors
- High bit: Junfeng described his work on a system called eXplode, which uses model checking techniques to find serious errors in a large number of storage systems. The system works on real code, finding bugs without requiring the source. It found bugs in every storage system checked (17 total), some leading to a complete loss of data.
Why Chase Storage System Errors?
- Storage systems are among the most important pieces of systems code
- Large variety: file systems, databases, version control repositories
- Bugs in storage systems are especially serious
- Hard disks crash and data can be lost or corrupted (hard disks can be volatile because they're made of mechanical moving parts)
- People often trust storage systems more than other systems
- People get very angry when their data is lost (e.g., Gmail mass data loss)
- Implementors have conflict between speed and reliability
- Storage systems are difficult to test
- Complicated code with bugs that are often difficult to diagnose by code inspection or manual testing
- Data corruption errors often arise from rare events that only occur sporadically and are difficult to reproduce
- Very difficult to test hardware failures (plus repeatedly pulling the power cord isn't a very scalable solution)
- Anecdote about JFS where the developer admitted that they had most likely been bitten years earlier by a bug that loses an entire filesystem. eXplode was able to find that bug in minutes, but the developers could not find the bug in 3 years via manual testing.
Proposed Solution
- Key idea: Explore all choices (this is a software model checking idea)
- Drive execution into tricky corner cases
- At every point in the program where there are multiple possible actions, try all the actions (each such action is called a choice).
- End result: Rare events appear as frequently as common ones, so the probability of hitting a rare corner case increases dramatically
- The Algorithm:
- Generate a new state by executing storage system code with some set of choices (the storage system is exercised by a user-provided driver)
- Simulate a crash
- On crash, generate all possible "crash disks"
- After a crash, some number of disk blocks might have been in the buffer cache (all those written since the last sync)
- Each crash disk corresponds to some subset of all these blocks (that is, each crash disk corresponds to the disk that results if that particular subset of dirty blocks WAS written to disk)
- Run fsck (the file system recovery utility) to attempt to recover the disk. Maintain the invariant that the disk state should at least correspond to the disk state after the last sync.
Comparison with Past Work
- At a high-level, eXplode basically does what its predecessor FiSC did
- The focus of eXplode is making the ideas from FiSC more general, lighter-weight, and easier to apply to real storage systems
- eXplode's intrusiveness is much smaller than FiSC's, meaning it can more easily be applied to new systems
- FiSC based on running a modified version of Linux in user mode to create a fake checking environment. Extremely complex.
- eXplode runs storage system on regular Linux system sitting on real hardware and achieves required behavior with a small kernel module (EKM) and a user-space runtime
- eXplode subsumes FiSC, as it's both simpler and more effective
eXplode's Exploration of New States
- choose(N) is a function that forks N copies of the current process, returning a number [0, N) in each copy, so that multiple decisions can be made at each state
- eXplode uses choose() to explore new states starting from an initial pristine state:
- External choices: To write a custom checker, call choose() to decide what storage system operations to perform at each step (e.g., create a file, remove a directory)
- Internal choices: In addition, instrument a dozen kernel functions with choose(2) where one branch returns an error code (e.g., NULL for kmalloc) and the other branch executes normally
- The intuition here is that we want to drive execution down paths that rarely execute under normal testing conditions
- These few functions are not specific to each storage system, so these changes have to only be made once for every OS you run on (you will likely check much fewer OSes than storage systems, so this is not much of a burden)
- A state can be the configuration of a filesystem (e.g., '/foo/bar/baz.txt')
- Performs breadth-first search, trying to perform all possible actions at one state before moving onto the next state
- Hash states in order to remove duplicates
eXplode's Checkpointing
- Because eXplode runs as one process, it can only explore one choice at a given time. Thus, it must have a way of checkpointing and restoring states so that it can systematically explore all choices.
- Many systems do checkpoint-and-restore by storing the state to memory or disk
- Instead, eXplode restores states by replaying a sequence of choices from a pristine initial state
- Less memory-intensive than storing disk images
- Lighter-weight than using a full-blown VM
- In practice, choice sequences are quite short and fast to replay, so performance has not been a major concern thus far
- To checkpoint a state:
- Make note of choices which took execution to current state
- To restore a state:
- Unmount the storage system
- Run mkfs (initialize the storage system)
- Replay choices
- Some tricky implementation details:
- Set priorities of kernel threads to ensure scheduling determinism
- Use RAM disks to avoid interrupt non-determinism
More Detailed Checking: Crashes after Crashes
- Crashes are often highly correlated, so the system might crash AGAIN during recovery
- For all crash disks CD, generate all possible crash-crash disks CCD (that is, possible disks resulting from crash during recovery)
- Check invariant: fsck(CD) == fsck(CCD) when CCD is a crash-crash-disk of the crash-disk (CD)
- Optimization: Use dynamically-determined write dependencies to throw away redundant disks
Why eXplode Works Well in Practice
- Intuition: If you have a way to generate all possible states during a crash (crash disks) and a way to systematically execute storage system operations (explore new states), then you can simulate a crash at every step and automatically check for corruption errors
- Most bugs found were pretty shallow, meaning that they took less than a dozen state transitions
- However, these transitions would have been very difficult to make by manual testing since they often rely on rare events occurring in some strange order
- Most bugs found within 5 to 20 minutes of testing, so time performance of doing replay to restore states didn't matter too much in practice
- Found 36 bugs in a wide range of storage systems on Linux and FreeBSD
- Filesystems: ext2, ext3, ReiserFS, Reiser4, JFS, XFS, MSDOS, VFAT, HFS, HFS+
- Version control: CVS, Subversion, large proprietary system
- Database: Berkeley DB
- Network file system: NFS
- RAID
- VMware GSX server
- Results impressive because of the breadth of systems checked, and the ability for eXplode to check systems that were layered on top of each other (e.g., CVS on top of ext3)
- The results from eXplode subsume the results from FiSC (previous work)
Writeup Authors
- Philip Guo - pg at cs.stanford.edu
- Peter Pawlowski - piotrek at cs.stanford.edu
|
|