The Design and Implementation of a Log-Structured File System
Rosenblum, Ousterhout

Introduction

Trends: CPU speed has increased drastically while disk access time has improved only a little, so applications become disk-bound (the disk is the bottleneck).
* CPU speed increases exponentially every year
* disk improvements are mostly in capacity and cost, not performance
  - bandwidth can be improved via parallelism
  - access time (seek) is hard to improve because it is limited by mechanics
* main memory is increasing exponentially in size
  - allows larger read caches & write buffers

Workload (office environment): the average file is small (3-4 KB), which results in many small disk I/Os; file creation and deletion times are dominated by metadata updates.

Problems with existing filesystems:
- data is spread out too much (directory, file inode, and data blocks in three separate places) --> many seeks and poor disk bandwidth utilization
- too many synchronous writes with long latency

Assumptions: caches are good, so most reads are serviced by main-memory caches. Writes will therefore dominate disk traffic.

LFS: writes are batched and written to the log sequentially. Advantages: eliminates disk seeks and speeds up crash recovery (only the last portion of the log needs to be examined, not the entire disk as with fsck). Reads go through index structures that point to where data lives in the log.

How to read? Just use inodes as usual, with single/double/etc. indirect blocks. How to find the inodes? They are stored in an inode map; the checkpoint region points to the inode map blocks, and the inode map stays in memory most of the time. (A toy sketch of this lookup path appears a little further below.)

Differences w/ old filesystems: the log is the only structure on disk (older filesystems used a log only for metadata or as temporary space).

Challenge: LFS must always have large free segments/extents to write into. A segment cleaner eliminates unused data from segments. What should the cleaning policy be? The paper proposes a segment cleaning algorithm that performed well in simulations.

Prototype: Sprite LFS. It achieves 70% disk bandwidth utilization for writes, compared to 5-10% for old UNIX filesystems. Performance is better in all cases except reading files sequentially after they were written randomly.

Managing free space

Two approaches: threading and copying. LFS uses both: the log is threaded through fixed-size segments on disk, and live data within segments is compacted by copying. The segment size is chosen so that reading or writing a whole segment takes much longer than a disk seek (e.g., 512 KB or 1 MB), which amortizes the cost of the seek.

Segment cleaning: copy live data out of a number of segments and write it into a smaller number of clean segments. Inode pointers must be updated to point to the data blocks in the clean segments.

Segment cleaning questions:
* When to run the segment cleaner?
* How many segments to clean at a time?
Start cleaning when the number of free segments drops below a few tens; stop when 50-100 segments are free. Performance wasn't very sensitive to these thresholds.

Major questions (the performance-sensitive issues):
* Which segments to clean?
* How to group live blocks when writing them into clean segments?

Metric: write cost = the multiple of the time that would be needed if there were no cleaning overhead. Write cost = 1 is perfect; write cost = 10 means only 1/10 of the bandwidth can be used for *actual* writing because of cleaning overhead.

    write cost = 2 / (1 - u)    [where u is the utilization of the segments being cleaned]

Question: why does the paper say that if a segment to be cleaned has no live blocks (u = 0) the write cost is 1.0 rather than the 2.0 the formula gives? Answer: the 2/(1-u) formula assumes the segment must be read in its entirety to recover its live blocks; when u = 0 there is nothing to read or copy, so only the new data is written and the cost is 1.0 (the graph and the text agree once this special case is accounted for).

Important trade-off between utilized disk space and performance: the less space that is utilized, the higher the performance, in LFS as well as in UNIX FFS.
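A quick worked example of the write-cost formula, following the paper's model (clean a segment: read the whole thing, write back the live fraction u, and the freed 1-u becomes room for new data). The code itself is just my own sketch; the function name and the sample utilizations are made up for illustration.

    # Write cost as a function of cleaned-segment utilization u.
    # Model: per segment cleaned, I/O = 1 (read) + u (copy live) + (1-u) (new data),
    # and only the (1-u) of new data is "useful", giving 2/(1-u) for u > 0.
    def write_cost(u):
        if u == 0.0:
            return 1.0            # empty segment: no read needed, just write new data
        read = 1.0                # read the whole segment
        write_live = u            # copy the live blocks into a clean segment
        write_new = 1.0 - u       # new data that fits in the space freed
        return (read + write_live + write_new) / write_new

    for u in (0.0, 0.2, 0.5, 0.8, 0.9):
        print(f"u = {u:.1f}  write cost = {write_cost(u):.1f}")
    # u = 0.5 gives a write cost of 4; u = 0.8 gives 10, i.e. only 1/10 of the
    # disk bandwidth ends up carrying new data.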
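And here is the toy sketch of the read path mentioned above (checkpoint region -> inode map -> inode -> data blocks). All of the names and structures (disk, Inode, inode_map, read_block) are invented for illustration; the real Sprite LFS on-disk layout is more involved.

    # Hypothetical, simplified picture of how LFS locates data in the log.
    disk = {}        # pretend log: disk address -> block contents

    class Inode:
        def __init__(self, block_addrs):
            self.block_addrs = block_addrs   # log addresses of the file's data blocks
                                             # (real inodes also use indirect blocks)

    inode_map = {}   # inode number -> log address of the inode's latest copy
                     # (the checkpoint region on disk points at the map blocks;
                     #  the map itself is cached in memory most of the time)

    def read_block(ino, block_index):
        inode_addr = inode_map[ino]                    # 1. consult the inode map
        inode = disk[inode_addr]                       # 2. read the inode from the log
        return disk[inode.block_addrs[block_index]]    # 3. read the data block

    # tiny demo: one file whose two blocks were appended to the "log"
    disk[100] = b"hello "
    disk[101] = b"world"
    disk[102] = Inode([100, 101])
    inode_map[7] = 102
    print(read_block(7, 0) + read_block(7, 1))         # b'hello world'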
UNIX FFS, however, offers only a fixed point on this utilization-vs-performance curve: roughly 10% of the available bandwidth at up to 90% disk utilization. LFS performance varies with the utilization of the segments that are selected for cleaning.

LFS has a security disadvantage compared to FFS: a file delete is just logged by writing updated directory and inode information, so the old data hangs around in the log until its segment is cleaned. A secure delete would require cleaning that segment immediately.

Cleaning policies

The authors initially tried a greedy cleaning policy (clean the lowest-utilization segments first) on both uniform and "hot-and-cold" (10% of the files receive 90% of the accesses) workloads. The greedy policy provided no improvement under the hot-and-cold workload compared to the uniform one. The problem was that free space in cold segments is more "valuable", yet cold segments did not get cleaned until their utilization had dropped very low.

To exploit locality, an algorithm that evaluated the benefit vs. the cost of cleaning a segment was used. The benefit is based on the amount of free space that cleaning the segment would generate *and* the age of the data in the segment (how likely the data is to stay untouched; the older/colder the data, the more benefit there is to compacting it and setting it aside). The cost is 1 + u (read the whole segment, then write out the utilized portion). Segments are cleaned in order of the ratio benefit/cost = (1 - u) * age / (1 + u). (A small sketch of this selection policy appears at the end of these notes.)

The benefit/cost policy produced the expected bimodal distribution: old (cold) segments were cleaned at high utilization, while young (hot) segments were cleaned only at low utilization. Information about a segment's age (the modification time of the most recently modified block in the segment) and its utilization (the number of live blocks in the segment) is kept in a segment usage table in the prototype.

Sprite LFS writes a checkpoint once every 30 seconds, although the authors comment that this might be too often. Roll-forward (scanning the log segments written after the last checkpoint to recover recent writes) can also be implemented.

Performance

LFS outperforms or matches FFS in every case except sequential reads after random writes (LFS has to do multiple seeks). Traditional UNIX FFS exploits LOGICAL LOCALITY: it spends effort at write time organizing data on disk to optimize later reads. LFS exploits TEMPORAL LOCALITY: data written at around the same time ends up at the same place in the log. When logical and temporal locality coincide (sequential read after sequential write), the two systems perform about the same; when they differ, so does performance.

Experience

The average write cost was between 1.2 and 1.6, as opposed to 2.5-3 in the simulations. Reasons the real system did better than the simulations:
- files in the simulations were one block long; real files are longer, and deleting a real file frees much more space in its segments -> lower utilization -> smaller write cost
- in real file systems some files are never rewritten at all, producing "colder" segments than in the simulations, where every file was eventually rewritten at some point

Crash recovery is amazingly fast in LFS, since only the last checkpoint plus a possible roll-forward needs to be examined. Recovery times are seconds, as opposed to tens of minutes or longer for fsck, and as disks get larger fsck times will only continue to grow.
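The promised sketch of the cost-benefit segment selection, using the paper's ratio (1 - u) * age / (1 + u). The SegmentInfo record, the function names, and the age units are my own simplifications of the segment usage table, not Sprite LFS code.

    # Sketch of cost-benefit segment selection for cleaning.
    # Assumes a per-segment record in the spirit of the segment usage table:
    # utilization u in [0, 1] and the age of the youngest block in the segment.
    from dataclasses import dataclass

    @dataclass
    class SegmentInfo:
        segment_id: int
        utilization: float   # fraction of live data in the segment (u)
        age: float           # time since the youngest block in the segment changed

    def cost_benefit(seg):
        """Paper's ratio: free space generated * data age, divided by cleaning cost (1+u)."""
        return (1.0 - seg.utilization) * seg.age / (1.0 + seg.utilization)

    def greedy_score(seg):
        """The simpler greedy policy: just prefer the emptiest segments."""
        return 1.0 - seg.utilization

    def pick_segments_to_clean(segments, count, policy=cost_benefit):
        """Return the `count` segments with the highest score under `policy`."""
        return sorted(segments, key=policy, reverse=True)[:count]

    # Example: a cold, fairly full segment beats a hot, mostly empty one under
    # cost-benefit, but loses under greedy -- which is exactly the behavior the
    # hot-and-cold workload needed.
    segs = [SegmentInfo(1, utilization=0.75, age=10_000),   # cold, fairly full
            SegmentInfo(2, utilization=0.30, age=50)]       # hot, mostly empty
    print([s.segment_id for s in pick_segments_to_clean(segs, 1)])                # [1]
    print([s.segment_id for s in pick_segments_to_clean(segs, 1, greedy_score)])  # [2]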