The Design and Implementation of a Log-Structured File System
Rosenblum, Ousterhout

Introduction

Trends: CPU speed has increased drastically while disk access time has improved only a little, so applications become disk-bound (the disk is the bottleneck).
* CPU speed increases exponentially every year
* disk improvements are mostly in capacity and cost, not performance
  - bandwidth can be improved via parallelism
  - access time (seek) is hard to improve because it is limited by mechanics
* main memory is increasing exponentially in size
  - allows larger read caches & write buffers

Workload (office environment): the average file is small (3-4 KB), which results in many small disk I/Os; file creation and deletion times are dominated by metadata updates.

Problems with existing filesystems:
- data is spread out too much (directory, file inode, and data blocks in three separate places) --> many seeks and poor disk bandwidth utilization
- too many synchronous writes with long latency

Assumptions: caches are good, so most reads are serviced by main-memory caches. Writes will therefore dominate disk traffic.

LFS: writes are batched and written to the log sequentially. Advantages: eliminates disk seeks and speeds up crash recovery (only the last portion of the log needs to be examined, not the entire disk as with fsck). Reads go through index structures that point to where data lives in the log.

How to read? Just use inodes as usual, with single/double/etc. indirect blocks. How to find the inodes? They are stored in an inode map; the checkpoint region points to the inode map blocks, and the inode map stays in memory most of the time. (A toy sketch of this lookup path appears a little further below.)

Differences w/ old filesystems: the log is the only structure on disk (older filesystems used a log only for metadata or as temporary space).

Challenge: LFS must always have large free segments/extents to write into. A segment cleaner eliminates unused data from segments. What should the cleaning policy be? The paper proposes a segment cleaning algorithm that performed well in simulations.

Prototype: Sprite LFS. It achieves 70% disk bandwidth utilization for writes, compared to 5-10% for old UNIX filesystems. Performance is better in all cases except reading files sequentially after they were written randomly.

Managing free space

Two approaches: threading and copying. LFS uses both: the log is threaded through fixed-size segments on disk, and live data within segments is compacted by copying. The segment size is chosen so that reading or writing a whole segment takes much longer than a disk seek (e.g., 512 KB or 1 MB), which amortizes the cost of the seek.

Segment cleaning: copy live data out of a number of segments and write it into a smaller number of clean segments. Inode pointers must be updated to point to the data blocks in the clean segments.

Segment cleaning questions:
* When to run the segment cleaner?
* How many segments to clean at a time?
Start cleaning when the number of free segments drops below a few tens; stop when 50-100 segments are free. Performance wasn't very sensitive to these thresholds.

Major questions (the performance-sensitive issues):
* Which segments to clean?
* How to group live blocks when writing them into clean segments?

Metric: write cost = the multiple of the time that would be needed if there were no cleaning overhead. Write cost = 1 is perfect; write cost = 10 means only 1/10 of the bandwidth can be used for *actual* writing because of cleaning overhead.

    write cost = 2 / (1 - u)    [where u is the utilization of the segments being cleaned]

Question: why does the paper say that if a segment to be cleaned has no live blocks (u = 0) the write cost is 1.0 rather than the 2.0 the formula gives? Answer: the 2/(1-u) formula assumes the segment must be read in its entirety to recover its live blocks; when u = 0 there is nothing to read or copy, so only the new data is written and the cost is 1.0 (the graph and the text agree once this special case is accounted for).

Important trade-off between utilized disk space and performance: the less space that is utilized, the higher the performance, in LFS as well as in UNIX FFS.
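A quick worked example of the write-cost formula, following the paper's model (clean a segment: read the whole thing, write back the live fraction u, and the freed 1-u becomes room for new data). The code itself is just my own sketch; the function name and the sample utilizations are made up for illustration.

    # Write cost as a function of cleaned-segment utilization u.
    # Model: per segment cleaned, I/O = 1 (read) + u (copy live) + (1-u) (new data),
    # and only the (1-u) of new data is "useful", giving 2/(1-u) for u > 0.
    def write_cost(u):
        if u == 0.0:
            return 1.0            # empty segment: no read needed, just write new data
        read = 1.0                # read the whole segment
        write_live = u            # copy the live blocks into a clean segment
        write_new = 1.0 - u       # new data that fits in the space freed
        return (read + write_live + write_new) / write_new

    for u in (0.0, 0.2, 0.5, 0.8, 0.9):
        print(f"u = {u:.1f}  write cost = {write_cost(u):.1f}")
    # u = 0.5 gives a write cost of 4; u = 0.8 gives 10, i.e. only 1/10 of the
    # disk bandwidth ends up carrying new data.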
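And here is the toy sketch of the read path mentioned above (checkpoint region -> inode map -> inode -> data blocks). All of the names and structures (disk, Inode, inode_map, read_block) are invented for illustration; the real Sprite LFS on-disk layout is more involved.

    # Hypothetical, simplified picture of how LFS locates data in the log.
    disk = {}        # pretend log: disk address -> block contents

    class Inode:
        def __init__(self, block_addrs):
            self.block_addrs = block_addrs   # log addresses of the file's data blocks
                                             # (real inodes also use indirect blocks)

    inode_map = {}   # inode number -> log address of the inode's latest copy
                     # (the checkpoint region on disk points at the map blocks;
                     #  the map itself is cached in memory most of the time)

    def read_block(ino, block_index):
        inode_addr = inode_map[ino]                    # 1. consult the inode map
        inode = disk[inode_addr]                       # 2. read the inode from the log
        return disk[inode.block_addrs[block_index]]    # 3. read the data block

    # tiny demo: one file whose two blocks were appended to the "log"
    disk[100] = b"hello "
    disk[101] = b"world"
    disk[102] = Inode([100, 101])
    inode_map[7] = 102
    print(read_block(7, 0) + read_block(7, 1))         # b'hello world'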
UNIX FFS, however, offers only a fixed point on this utilization-vs-performance curve: roughly 10% of the available bandwidth at up to 90% disk utilization. LFS performance varies with the utilization of the segments that are selected for cleaning.

LFS has a security disadvantage compared to FFS: a file delete is just logged by writing updated directory and inode information, so the old data hangs around in the log until its segment is cleaned. A secure delete would require cleaning that segment immediately.

Cleaning policies

The authors initially tried a greedy cleaning policy (clean the lowest-utilization segments first) on both uniform and "hot-and-cold" (10% of the files receive 90% of the accesses) workloads. The greedy policy provided no improvement under the hot-and-cold workload compared to the uniform one. The problem was that free space in cold segments is more "valuable", yet cold segments did not get cleaned until their utilization had dropped very low.

To exploit locality, an algorithm that evaluated the benefit vs. the cost of cleaning a segment was used. The benefit is based on the amount of free space that cleaning the segment would generate *and* the age of the data in the segment (how likely the data is to stay untouched; the older/colder the data, the more benefit there is to compacting it and setting it aside). The cost is 1 + u (read the whole segment, then write out the utilized portion). Segments are cleaned in order of the ratio benefit/cost = (1 - u) * age / (1 + u). (A small sketch of this selection policy appears at the end of these notes.)

The benefit/cost policy produced the expected bimodal distribution: old (cold) segments were cleaned at high utilization, while young (hot) segments were cleaned only at low utilization. Information about a segment's age (the modification time of the most recently modified block in the segment) and its utilization (the number of live blocks in the segment) is kept in a segment usage table in the prototype.

Sprite LFS writes a checkpoint once every 30 seconds, although the authors comment that this might be too often. Roll-forward (scanning the log segments written after the last checkpoint to recover recent writes) can also be implemented.

Performance

LFS outperforms or matches FFS in every case except sequential reads after random writes (LFS has to do multiple seeks). Traditional UNIX FFS exploits LOGICAL LOCALITY: it spends effort at write time organizing data on disk to optimize later reads. LFS exploits TEMPORAL LOCALITY: data written at around the same time ends up at the same place in the log. When logical and temporal locality coincide (sequential read after sequential write), the two systems perform about the same; when they differ, so does performance.

Experience

The average write cost was between 1.2 and 1.6, as opposed to 2.5-3 in the simulations. Reasons the real system did better than the simulations:
- files in the simulations were one block long; real files are longer, and deleting a real file frees much more space in its segments -> lower utilization -> smaller write cost
- in real file systems some files are never rewritten at all, producing "colder" segments than in the simulations, where every file was eventually rewritten at some point

Crash recovery is amazingly fast in LFS, since only the last checkpoint plus a possible roll-forward needs to be examined. Recovery times are seconds, as opposed to tens of minutes or longer for fsck, and as disks get larger fsck times will only continue to grow.
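The promised sketch of the cost-benefit segment selection, using the paper's ratio (1 - u) * age / (1 + u). The SegmentInfo record, the function names, and the age units are my own simplifications of the segment usage table, not Sprite LFS code.

    # Sketch of cost-benefit segment selection for cleaning.
    # Assumes a per-segment record in the spirit of the segment usage table:
    # utilization u in [0, 1] and the age of the youngest block in the segment.
    from dataclasses import dataclass

    @dataclass
    class SegmentInfo:
        segment_id: int
        utilization: float   # fraction of live data in the segment (u)
        age: float           # time since the youngest block in the segment changed

    def cost_benefit(seg):
        """Paper's ratio: free space generated * data age, divided by cleaning cost (1+u)."""
        return (1.0 - seg.utilization) * seg.age / (1.0 + seg.utilization)

    def greedy_score(seg):
        """The simpler greedy policy: just prefer the emptiest segments."""
        return 1.0 - seg.utilization

    def pick_segments_to_clean(segments, count, policy=cost_benefit):
        """Return the `count` segments with the highest score under `policy`."""
        return sorted(segments, key=policy, reverse=True)[:count]

    # Example: a cold, fairly full segment beats a hot, mostly empty one under
    # cost-benefit, but loses under greedy -- which is exactly the behavior the
    # hot-and-cold workload needed.
    segs = [SegmentInfo(1, utilization=0.75, age=10_000),   # cold, fairly full
            SegmentInfo(2, utilization=0.30, age=50)]       # hot, mostly empty
    print([s.segment_id for s in pick_segments_to_clean(segs, 1)])                # [1]
    print([s.segment_id for s in pick_segments_to_clean(segs, 1, greedy_score)])  # [2]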