Talk Overview: Watts, Faults, and Threads
- Google views their computing environment in a holistic sense:
- "Hardware" is the entire data center, not individual devices
- "Application" (e.g. web search, Gmail, Docs & Spreadsheets) is dozens of binaries running across different machines in the data center
- Cost-efficiency is a key success metric for the business; number of machines, power consumption, fault recovery, etc. contribute to this metric
- A Google data center is a "softer" computer because much of the functionality that traditionally resides at the operating system and hardware level is implemented at the application software level; examples of such functionality include:
Faults
- Faults are a key performance issue for the Google environment
- With a large amount of hardware and software under management in a single application, faults and recovery are a normal part of everyday operation; for example, GFS is typically busy recovering failed disks while serving live traffic
- Consequently, fault processing is a standard part of the overall performance profile of Google infrastructure and applications
- Software-level fault processing is critical but is not the entire answer; for example, an application served on a cluster of three machines typically cannot handle its design throughput gracefully if more than one machine fails, so faults need attention at the hardware level too
- At the hardware level, faults are becoming more significant as components miniaturize
- CMOS-level errors are becoming more frequent as components shrink: smaller components are more likely to fail due to environmental interference, and the sheer number of components and shorter clock cycle times give more opportunities for failure, even aside from increases in the underlying failure rate per bit
- Bit error rates could increase by about 10x between current semiconductor technology and 16nm technology (about 8% degradation per bit per generation)
- For disk drives, capacities are growing faster than I/O throughput, so drive failure recovery is becoming more expensive over time
- Google has developed an infrastructure for monitoring system health
- Logs and collects system health data using BigTable
- Example application: disk drive failure study
- Disk drive failure study (see paper)
- Google data center health monitoring offered a special opportunity for assessing disk drive failure rates
- Large population
- Real application loads over long period, rather than simulated stress in testing environment
- System health infrastructure provided data about performance and self-monitoring metrics during device lifetime
- Key findings:
- Little relationship noted between overall failure rates and operating temperatures or activity levels
- While some disk drive self-monitoring parameters were predictive of impending failures, many drives fail without showing abnormalities in their self-monitoring data in advance
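The three-machine example above can be made concrete with a little arithmetic. The sketch below is only an illustration, not anything shown in the talk: it assumes the design throughput is spread evenly across machines and that survivors absorb the failed machines' share evenly. The function name `surviving_load_factor` is hypothetical.

```python
def surviving_load_factor(machines: int, failed: int) -> float:
    """Load on each surviving machine, relative to its normal share.

    Assumption (not from the talk): load is spread evenly over all
    machines, and survivors evenly absorb the share of failed ones.
    """
    if not 0 <= failed < machines:
        raise ValueError("need 0 <= failed < machines")
    return machines / (machines - failed)


# A cluster of three machines, as in the example from the talk.
for failed in range(3):
    factor = surviving_load_factor(3, failed)
    print(f"{failed} failed: each survivor carries {factor:.1f}x its normal load")
```

Under these assumptions one failure raises per-machine load to 1.5x, while two failures triple it, which is one way to read why software-level recovery alone does not keep a small cluster at its design throughput.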
Watts
- Why should we care?
- Datacenter power consumption includes power for the computers themselves, power required to cool them, and losses in the power conversion equipment; reducing computer power consumption reduces all of these components
- Business cost of data center power consumption includes both the cost of buying the power, as well as the capital cost of building the data center's power and cooling infrastructure
- Year after year, hardware cost stays roughly constant while performance improves, but energy consumption keeps growing. If the trend continues, energy could come to dominate the cost of the hardware itself. The 2005 article "Power could cost more than server" already discusses this issue.
- Cost of energy is rising.
- Multi-core: a solution to watts and faults?
- A chip multiprocessor (CMP) consumes less power than a 'traditional' superscalar out-of-order processor (see "The Price of Performance" and "Comparing power consumption of an SMT and a CMP DSP for mobile phone workloads")
- Very power efficient: smaller system, smaller wires, less speculation (less wasted execution).
- Google-like applications generally do not get a good performance boost from superscalar enhancements; it is more efficient to spend that silicon and power budget on more cores to improve throughput
- Correctness is simpler to establish, and fault tolerance is easier to add.
- But are there enough threads? And can we expect programmers to build efficient and correct concurrent programs?
- How do machines, clusters, and data centers use power?
- Google found that their servers typically operate at about 35% of full computing capacity
- Unfortunately, most servers are not optimized to reduce power consumption for below-peak computing loads. A typical server with decent energy usage consumes 50% of maximum power even when idle (some are closer to 80%!).
- At 35% of full computing capacity, such servers are operating at only 50% of their peak energy efficiency.
- Widening the dynamic power range
- Making machines use less power when idle would raise efficiency at typical loads.
- If a machine at idle used 10% of its maximum power instead of 50%, then power efficiency would be close to 80% at typical load.
- As part of improving overall power efficiency, Google has advocated a simplification of computer power supply standards to optimize AC=>DC conversion efficiency (see paper)
- Current PC power supplies operate at about 60-70% efficiency
- Much of the inefficiency of current power conversion is due to the requirement for multiple output voltages, a legacy of technical needs from early PCs that is no longer relevant for contemporary hardware
- Google proposes standardizing on a single 12V output voltage; by simplifying the design of the power supply and avoiding the over-provisioning required for multiple output voltages, conversion efficiency would be closer to 85%, or 90% with higher-quality components
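The server efficiency figures quoted in this section (about 50% of peak efficiency at 35% load, and close to 80% with a 10% idle draw) are consistent with a simple linear power model. The sketch below assumes, as a simplification not stated in the talk, that power rises linearly from the idle draw to maximum power as load goes from 0 to 1; the function name `relative_efficiency` is hypothetical.

```python
def relative_efficiency(load: float, idle_fraction: float) -> float:
    """Energy efficiency at `load`, relative to efficiency at full load.

    Assumption: power (as a fraction of maximum) rises linearly from
    `idle_fraction` at load 0 to 1.0 at load 1. `load` must be > 0.
    """
    power = idle_fraction + (1.0 - idle_fraction) * load
    return load / power


typical_load = 0.35  # typical utilization quoted in the talk
for idle in (0.5, 0.1):
    eff = relative_efficiency(typical_load, idle)
    print(f"idle at {idle:.0%} of max power: {eff:.0%} of peak efficiency")
```

With this model the two cases come out at roughly 52% and 84% of peak efficiency, in line with the rounded numbers above.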
Developer Productivity
- Need to maintain developer productivity in the face of:
- More thread-level parallelism (=> more concurrency issues in software)
- More complex computing environment (multiple machines, etc.)
- Higher failure rates
- Power/energy concerns
- To maintain productivity, we need breakthrough improvements
- Simplified programming environment--e.g. MapReduce framework greatly simplifies writing parallel programs for a cluster of machines by handling distribution of work, failure recovery, etc.
- New generation of tools, e.g. performance monitoring, error handling, tracing, etc., for data center environment
- Example of such a tool: Google-wide CPU profiler
- Collects CPU data from all machines to aggregate data on all executables x all machines
- Generates results for applications, individual executables, and shared library code
- Would like to see more academic work in computing issues for large clusters
- Working on funding/arranging large clusters to share among multiple universities
- Could hold application contests to create load on these clusters for study
- Would like to share development tools, such as MapReduce, so that researchers in other fields, e.g. chemists, could use this sort of cluster in their work
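To give a feel for why the MapReduce model mentioned above simplifies parallel programming, here is a toy, single-process sketch of its shape. It is not Google's MapReduce API: the names `run_mapreduce`, `map_words`, and `reduce_counts` are hypothetical, and the parts a real framework handles (distributing work across machines, recovering from failures) are deliberately absent. The point is only that the programmer supplies two small functions and the framework does the rest.

```python
from collections import defaultdict


def map_words(document: str):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: list) -> int:
    """Reduce step: combine all values emitted for one key."""
    return sum(counts)


def run_mapreduce(inputs, mapper, reducer):
    """Toy driver: run the mapper, group values by key, run the reducer."""
    groups = defaultdict(list)
    for item in inputs:
        for key, value in mapper(item):
            groups[key].append(value)  # "shuffle": group values by key
    return {key: reducer(key, values) for key, values in groups.items()}


docs = ["watts faults threads", "faults and more faults"]
print(run_mapreduce(docs, map_words, reduce_counts))
```

In a cluster setting the driver would partition `inputs` across machines and rerun failed map or reduce tasks, without any change to the two user-supplied functions.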
Writeup Authors
- Adam Ginsburg - reverse(bgmada) at stanford.edu
- Pascal-Louis Perez - reverse(lacsap) at cs.stanford.edu