Contents

Speaker background

  • Article about Mike Burrow's background
  • PHD in distributed file systems
    • Gave up on that quickly
  • Worked on:
    • authentication schemes
    • Local Area networks
    • Original search engine for altavista
    • Worked on search engine for Microsoft
      • Left to work at Google
    • Low-level stuff, like mutexes and auto-finding race conditions
  • At Google, tries to prevent disasters from use of concurrency
    • Lots of new Google employees don't have this experience

Why Concurrency?

  • Uniprocessors aren't speeding up enough
  • Only solution is multiprocessors
    • Google gets more and more queries each year

What are good concurrency primitives?

  • Threads and locks are the way to go
    • Overview of Locks
    • General purpose
    • Simple enough to write and maintain (although not simple)
    • Efficient
    • Provides abstraction, which reduces complexity

Alternatives to locks and their problems

  • Separate Address Spaces
    • Threads that don't communicate via memory
    • Can fail independently
    • Fine if it fits the problem
    • Problems:
      • Slower
      • Just a subspace of general threading - what if threads need to share data?
  • Events
    • "People who promote this should be killed"
    • Involves specifying what should execute after what
    • Uses continuations
    • Problems:
      • Destroys abstraction!
      • Debugging is hard
        • Stack trace is useless - who knows what sequence of events actually led to that state
      • Complicated to use
        • Variables that should be local often get put in global structures
        • code is split up
  • Cooperative Multitasking
    • The programmer explicitly states where in the code a thread can be descheduled
    • If a thread doesn't deschedule itself, all other threads will appear to freeze
    • No preemption
    • No real concurrency
    • However, things CAN preempt during printg
      • Makes debugging a nightmare
    • Atomicity is fragile
  • Some of these techniques might work for small cases, but in real world, systems get bigger and have more and more features. The only technique that scales are general-purpose locks.
    • Explicit synchronization is maintainable

Message Passing

  • Overview
  • The dual of locks (Needham and Lauer link)
    • Acquire/Release corresponds to Request/Reply
    • Race = out-of-order messages
    • "Any bug you can create in one you can create in the other."
  • Messages are used when:
    • Hardware or code structure suits it
    • Unit of work is large compared to cache miss
    • Using batching to improve throughput
  • Locks are used when:
    • Unit of work is small
    • Doing work inline is cheaper

Lock-free Data Structures

  • Extremely complicated to write
    • Only recently (2002) have efficient lock-free data structures been made link
    • Require garbage collection
      • It's too hard to know when programming whether another thread can touch the data
      • especially hard in c++
    • Structures are slower than locked data structures unless there's a single point where threads block on a lock
      • Can usually split locked structures to increase throughput and concurrency
        • i.e., instead of locking an entire hash table, lock the individual buckets

Transactional Memory

  • Surround code that should be executed atomically in an atomic block:
  atomic {
  ...
  }
  • Works like a database: If there is interference during an atomic operation, the entire operation needs to be rolled back and re-executed
    • Requires buffering
    • Atomic regions need to be relatively small
    • Current hardware not at a state yet where this is a feasible solution to be used generally
    • Software implementations are too slow
  • Will eventually become a useful construct
  • Can still have races, although some are easier to prevent
  • Locks are still needed
    • e.g., two threads that use a lock to notify each other when to start/stop

Races

  • A flaw in a system which is dependent on the timing or ordering of events Overview
  • Not too hard to find if you have the right tool
  • High-level races need different tools
    • i.e., Protecting invariants across variables

Deadlocks

  • Easy to detect
    • A deadlock corresponds to a cycle in the acquired-before graph
  • Easy to debug
    • Threads stop
    • Can examine state of program at point of deadlock
  • Better to have a deadlock than a race!
  • Create a partial order on the locks to prevent all deadlocks
  • Would a user rather have a race than a deadlock?
    • Maybe it's better to have no service rather than the wrong service
    • Can restart the program

Things threads interact poorly with

  • Signal handlers
    • use sigwait()
  • fork()
    • locks held/invariants broken
    • cuts across abstraction boundaries
    • child threads share things with parent, like file descriptors
    • child threads change some things, like the pid
    • When using fork(), no concurrent code can depend on these basic invariants!

Programming philosophy

  • Brittle code is good!
    • An error should crash the system, so it is easy to detect and fix
    • Even things like assertions should stay in deployed code
      • In a time of ubiquitous networks, updates to fix problems can be pushed out fast
  • The only case where code should be redundant and recover from errors is critical code
    • Spacecraft
    • Life support systems

The Mutex Specification

  • Minimally implement lock(mu) and unlock(mu)
  • Beyond that:
    • Should mutexes be FIFO?
    • Should mutexes be re-entrant?
    • Reader-writer locks?
    • Some assumptions might not hold.
      • For example: In a C++ 'unref' call (decrementing an object's refcount), you need the property that, after the atomic moment of the unlock, you never touch an object's memory again.
      • DEC had a bug along these lines. (Random note: Burrows was annoyed with the way they released the patch, which broke code without incrementing any version number).

FIFO mutexes

  • Used in real-time system.
  • Poor performance under contention. Does this matter?
    • More CPUs -> more contention
    • Overload should not reduce throughput (or you get stuck)
    • Programmers make mistakes
  • Linux manpage for fast mutexes (futexes) says (approx.) "We made the fast path fast and didn't worry about the slow case, because 'any sane design avoids mutex contention.'" But you can't predict how mutexes will be used.

Mutex performance

  • Shows table of mutex time per acquire-release on a 2GHz 4-CPU Opteron. See notes.
 for (;;) { lock(mu); unlock(mu); }
  • linuxthreads mutexes are FIFO, so really bad under contention. Even the fast path is slow for some reason.
  • NPTL is the new POSIX thread library: Fast path is good. Explicitly ignored slow path, and it shows.
  • Mike's Mutex (throughput-tuned): Fast and slow path are the same. Google went with this based on their needs.
  • Mike's Mutex (latency-tuned): Fast path is good, slow path almost as good.
  • Spinlocks are faster than all, but performance is more fragile.

Lock annotations

  • Like type-checking for locks.
  • Idea: Annotate locks with the data they protect. Annotate routines with the locks that must be held a-priori.
  • Ensures lock invariants by local inspection, and really good with static analyzer.
  • Highly recommended. More complex lock structures (maybe trees) require some extra work.
 mutex mu;  // < mu2; protects x
 int x;     // under mu;
 mutex mu2; // > mu; protects y
 int y;     // under mu2;
 // L < mu, mu2
 void foo() {
   lock(&mu);
   bar();
   locK(&mu2);
   y += x;
   unlock(&mu2);
   unlock(&mu);
 }
 // L >= mu
 void bar() {
   x++;
 }

Mutex uses

  • Protect invariants, including atomicity of assignments.
  • Serialize access, including memory reordering and compiler access reordering.
  • Don't rely on 32-bit instruction atomicity, especially when assuming something like 'size_t' is 32-bits (it could change).
  • Don't use 2 different mutexes for 2 values in a bitfield, since their accesses are linked.

Re-entrant mutexes

  • Quick definition: when a thread can acquire a lock more than once.
  • Generally "evil." Requires extra caution.
  • The problem:
    • A mutex acquire ensures a particular invariant. A nested acquire might assume an invariant that is violated somewhere within the outer acquire.
  • This is particularly problematic in OOP, where we can override a routine without knowing all the locking details.
  • Burrows says: "Never subclass, though subtyping is fine". Some subclasses are trivial, but in general you must communicate all locking details to subclasser or risk invariant violations.

Condition variables

  • Often associated with monitors.
  • Locking variant that uses wait/signal/broadcast to synchronize threads based on a condition.
    • wait(condvar, lock) blocks the current thread.
    • signal(condvar) wakes one thread that is waiting on a condition variable.
    • broadcast(condvar) wakes all threads waiting on a condition variable.
  • A lock is associated with a condition variable. Waiting on a condition implicitly releases the lock. When a thread is signaled, the lock is implicitly re-acquired.
  • Sample:
 void wait_for_cond(mutex *mu, condvar *cv, bool *cond) {
   lock(mu);
   while (!*cond) {
     wait(cv, mu); // wait on cv, temporarily unlocking mu
   }
   // *cond is true here
   unlock(mu);
 }
 void make_cond_true(mutex *mu, condvar *cv, bool *cond) {
   lock(mu);
   *cond = true;
   broadcast(cv);     // wake waiters
   unlock(mu);
 }
  • Note the while loop around the wait, which checks the condition each time this thread has been signaled. This is necessary since threads can be signaled out-of-order.

Hoare Condition Variables

  • The original condition variable, as described by C.A.R. Hoare in the 70's.
  • Hoare condition variables are FIFO. So there are stronger guarantees on invariants after waking from a wait.
  • No need for the while loop that checks the condition after waiting. This seems nice, but...
    • Slower, for the same reason that FIFO thread scheduling is slow.
    • Loses the benefits of the 'while-loop' condition variables:
      • Ensure the state of the condition by local inspection.
      • Allows spurious wakeups to any waiting thread.
      • Trades performance for a cleaner abstraction.
      • Allows for non-FIFO thread scheduling (ie. faster).

Conditional Critical Sections

  • Key idea:
    • Hide the condition variable in the mutex itself.
    • Pass the actual condition (as a closure, or function pointer) to wait.
    • Signal automatically when we unlock (though technically anytime is fine). So we never have to signal or broadcast.
    • The thread that makes the condition true can do all the condition checks, so we don't have to context switch, and we only have to signal/broadcast when we're sure a condition is true.
  • Takeaway points:
    • Burrows believes this is a better abstraction than standard condition variables.
    • Performance improves due to fewer context switches and spurious wakeups.
    • When Burrows implemented this variant, the size of the mutex library went up by 10x.

Summary

  • We use thread/mutexes because:
    • With proper tools and annotations, they work well enough.
    • Current alternatives are complex or slow.
  • But, mutexes can be complicated:
    • Conditional critical sections seem a decent compromise at this time.

Fun before-lecture and after-lecture rants (loosely accurate)

  • Python is a bad language.
    • Whitespace shouldn't be significant. This causes source control problems.
    • Hard to know the types of objects, which makes it difficult to maintain large projects.
    • Python should have static types.
  • "C++ is a piece of junk. Since 1985 it's gotten worse." Full of pitfalls.
  • "One feature of C++ that was an improvement: the '//' commenting."
  • Burrows prefers C to other languages.
  • Google has a C++ style guide that's super long and rules out huge parts of the language. It's unfortunately necessary.
  • A whole Google data-center went down because of unfairness in thread scheduling from a spinlock. This spinlock was in Chubby (written by Burrows), Google's distributed locking service that manages internal naming services for their data centers. Quirks of the processor were to blame.

Write-up authors

  • Nathan Marz - nathanm at stanford.edu
  • Josh Wiseman - joshwise at stanford.edu
Last modified May 2, 2007 7:26 am / Skin by Kevin Hughes
MediaWiki