Analysis of Transaction Management Performance

Dan Duchamp

 

Goal: investigate the performance of a distributed transaction processing manager that could be included in an OS.

 

TranMan:

-        supports begin, abort, and commit transaction

-        supports arbitrary nesting and distribution

-        transactions should be cheap enough to use as a primitive

-        multicast used to make distributed commit less expensive

 

 

 

Two commit protocols implemented:

1)     Optimized Two-Phase Commit
standard 2-phase commit, plus:
- participants in a transaction don’t force a commit record to disk upon commit.  This saves one disk write.  Instead, the commit record at the coordinator is used as a proxy.  Once the coordinator commits, it write a log record to disk, and if a participant is ever confused about whether a transaction committed or not, it just asks the coordinator.  As a result, the coordinator must not forget about whether a particular transaction committed or aborted until all the participants acknowledge they know the result of it.  To avoid extra messages, these acknowledgements are piggybacked onto existing communication.
- participants drop their locks before committing transactions, for performance.  (While this sacrifices recoverability, the idea is that in practice transactions will usually commit.)  This allows greater concurrency, since locks are released a bit earlier.
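The "coordinator as proxy" idea above can be sketched roughly as follows.  This is an illustrative model, not TranMan's actual code: all class and method names (`Coordinator`, `resolve`, `ack`, etc.) are invented.

```python
# Sketch of optimized 2PC's coordinator-as-proxy idea: only the
# coordinator forces a commit record; participants who are unsure
# of an outcome ask it, and it may forget a transaction only after
# every participant has (piggybacked) an acknowledgement.

class Coordinator:
    def __init__(self):
        self.outcomes = {}        # txid -> "commit" | "abort"
        self.pending_acks = {}    # txid -> participants yet to ack

    def commit(self, txid, participants, votes):
        outcome = "commit" if all(votes) else "abort"
        self.force_log(txid, outcome)          # the one forced disk write
        self.outcomes[txid] = outcome
        self.pending_acks[txid] = set(participants)
        return outcome

    def force_log(self, txid, outcome):
        pass  # stand-in for a synchronous log write

    def resolve(self, txid):
        # A confused participant consults the coordinator's record
        # instead of its own (never-forced) log.
        return self.outcomes[txid]

    def ack(self, txid, participant):
        # Acks arrive piggybacked on later messages; once all are in,
        # the coordinator may forget the transaction.
        self.pending_acks[txid].discard(participant)
        if not self.pending_acks[txid]:
            del self.pending_acks[txid]
            del self.outcomes[txid]
```

The key constraint the sketch encodes is that `outcomes` shrinks only via `ack`, never on commit itself.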

2)     Non-Blocking (Three-Phase) Commit
can deal with loss of the coordinator and/or a network partition
participants don’t end up blocking forever with locks held if the coordinator fails or becomes inaccessible.
Phase 1 – coordinator sends the list of involved participants as part of the prepare message
If a coordinator dies, the participants just don’t hang out holding locks forever.  Instead, after a timeout, any participant can become a coordinator.  Multiple coordinators are OK, as long as one of them is able to get commit/abort messages from all the participants, and then inform all participants.
If a coordinator dies after it has received commit messages from all the participants, but before it has communicated the commit to all the participants, one or more participants will elect themselves to be coordinator(s), and continue phase 2 to complete the transaction.

Phase 2 – replication phase
coordinator collects info from participants about whether they want to commit or abort, and replicates this information at other participants.  Participants write prepare and commit records in their log.  A commit or abort is not allowed until enough information is replicated at other sites to exclude the other outcome.  No participant forgets about a transaction until all participants have committed or aborted.
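The decision rule above ("not allowed until enough information is replicated to exclude the other outcome") can be sketched as follows.  All names here are illustrative assumptions, not TranMan's interfaces.

```python
# Sketch of the non-blocking commit's key rule: a participant decides
# only once the replicated votes exclude the other outcome, and any
# participant can elect itself coordinator after a timeout.

class Participant:
    def __init__(self, name, participants):
        self.name = name
        self.participants = participants   # learned from the prepare message
        self.replicated_votes = {}         # name -> True (commit) / False

    def receive_votes(self, votes):
        # Replication phase: the coordinator forwards the votes it has
        # collected, so survivors can finish without it.
        self.replicated_votes.update(votes)

    def decision(self):
        votes = self.replicated_votes
        if not all(votes.values()):
            return "abort"                  # any no-vote excludes commit
        if set(votes) == set(self.participants):
            return "commit"                 # all yes-votes exclude abort
        return None                         # not enough info yet: wait

    def on_coordinator_timeout(self):
        # Multiple self-elected coordinators are harmless; one of them
        # will gather the votes and inform everyone.
        return f"{self.name} acting as coordinator"
```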

Log Batching = Group Commit is used to improve throughput.  When a transaction commits, a log record is not written to disk immediately.  Instead, once a group of transactions have all committed and there are enough commit records to fill an entire disk block, all the transactions are committed as a group by writing their commit records together.  The disadvantage of group commit is that if the system dies, more than one “committed” transaction can be lost, since their log records never got to disk.  The fundamental trade-off is throughput vs. durability.
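A minimal sketch of group commit, assuming an invented 4 KB disk block and 512-byte commit records (both sizes are illustrative, not from the paper):

```python
# Sketch of log batching (group commit): commit records accumulate in
# a buffer and are forced in one disk write once they fill a block.

BLOCK_SIZE = 4096
RECORD_SIZE = 512   # assumed size of one commit record

class GroupCommitLog:
    def __init__(self):
        self.buffer = []
        self.forced_batches = []   # each entry: txids written together

    def commit(self, txid):
        # The transaction is reported committed, but its record is not
        # yet on disk; a crash now loses every buffered "commit".
        self.buffer.append(txid)
        if len(self.buffer) * RECORD_SIZE >= BLOCK_SIZE:
            self.force()

    def force(self):
        if self.buffer:
            self.forced_batches.append(self.buffer)  # one disk write
            self.buffer = []
```

The durability window is exactly `buffer`: everything in it is a "committed" transaction the system could still lose.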

 

TranMan uses Mach’s external pager facility (it can specify which pages should get swapped out when) to help implement write-ahead logging.  (Make sure that a page gets written to disk before it gets swapped out?)

 

A Communication Manager is inserted in between the client application and Mach’s NetMsgServer, and it spies on all communication going by to determine when new clients become involved in a given distributed transaction.

 

 

Multi-Threading

To leverage multiprocessors, TranMan creates a pool of threads that can generally execute any transaction code.  Threads are never destroyed.

 

Because obtaining locks is a synchronous activity that takes time and involves RPCs across nodes, multi-threading is very beneficial—a thread can be put to sleep until the appropriate lock is granted and only woken up then.
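The sleep-until-granted pattern above might look like the following.  This is a generic condition-variable sketch, not TranMan's thread model, and all names are invented.

```python
# Sketch of why multi-threading helps: a worker blocks on a condition
# variable while its (possibly remote) lock request is outstanding,
# so other threads can run; it is woken only when the lock is granted.

import threading

class LockTable:
    def __init__(self):
        self.cond = threading.Condition()
        self.held = set()

    def acquire(self, name):
        with self.cond:
            while name in self.held:   # sleep until the lock is granted
                self.cond.wait()
            self.held.add(name)

    def release(self, name):
        with self.cond:
            self.held.discard(name)
            self.cond.notify_all()     # wake any waiting threads
```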

 

 

Duchamp gripes a lot about Mach’s bad kernel performance.  Mach RPC performance sucks (28.5 ms for an RPC in 1989).

 

Performance

 

Log force takes 15ms, so it was important to minimize the number of them.

 

Empirically, the time to run a distributed transaction increased sub-linearly with the number of participants, even though it may seem that many of the operations in a distributed transaction (such as multiple participants sending prepare messages) could happen concurrently and keep the time roughly flat.  In reality, the coordinators became bottlenecks.

 

Multicasting resulted in much better performance, showing that the coordinator’s multiple sends to participants, in addition to its repeated receives, were responsible for the degraded performance.  However, Duchamp remarks that this might be an artifact of OS scheduling policies.

 

Also, it took 5 ms to drop a remote lock???  I am not quite sure I buy this… I thought RPC time was 28.5 ms?  How do you drop a remote lock any faster than an RPC?

 

2-Phase Commit:

 

What is the latency of distributed commitment? 

3 datagrams + 2 log forces for a 2-site write.  Predicted = 66.5 ms; Measured = 77.5 ms.
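Back-of-the-envelope check on those numbers, assuming the predicted figure decomposes exactly as 3 datagrams plus 2 log forces at the 15 ms log-force cost given earlier (the per-datagram cost below is derived from that assumption, not stated in the paper):

```python
# Decompose the predicted 2-site commit latency: 3 datagrams + 2 log
# forces (15 ms each), and see how far the measurement overshoots it.

LOG_FORCE_MS = 15.0
PREDICTED_MS = 66.5
MEASURED_MS = 77.5

datagram_total = PREDICTED_MS - 2 * LOG_FORCE_MS   # 36.5 ms for 3 datagrams
per_datagram = datagram_total / 3                  # ~12.2 ms per datagram
overhead = MEASURED_MS - PREDICTED_MS              # 11 ms unaccounted for
```

So under this decomposition a datagram costs far less than the 28.5 ms RPC, which is plausible since a datagram is one-way with no reply.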

 

What is the effect of the “read-only optimization”?  (Read-only sites are excluded from second phase of protocol.)

 

What is the effect of delayed-commit (using the commit record at the coordinator as a proxy)?