Projects:Leslie Wu
Shattr: Group Testing in Large-Scale Software IntegrationAbstractLarge-Scale Software Integration is an increasingly important, costly, and critical piece of modern software development and operations, yet software testing -- specifically large-scale integration testing -- remains largely under researched, according to researchers in the field, such as Amitabh Srivastava, VP of Software Quality at Microsoft. In this report, we consider a theoretical social network model of developer teams that undergird software-as-a-service, and define two cost metrics to consider when organizing software integration and testing pipelines at Internet scale. We compare several integration models and propose "Shattr" a technique inspired directly by work on group testing for (k-)complexes (subsets of size k). Using a simple simulation model, we demonstrate how "Shattr" may reduce the time required to isolate integration defects without necessarily requiring more testing hardware resources, and propose a modified software staging pipeline that incorporates Shattr-style permutative group testing. MotivationMany of today's consumer and enterprise applications (Google Mail, Salesforce, Amazon.com) are now deployed through the Web, specifically through software-as-a-service. These web applications are often expected to operate continually, on a 24x7 basis, and in order to innovate and remain agile, companies such as Google and Amazon.com expose dozens of web services externally *and* internally. For example, according to Google fellow Jeff Dean, a Google pageview may trigger more than fifty internal web services. According to the CTO of Amazon.com, "If you hit the Amazon.com gateway page, the application calls more than 100 services to collect data and construct the page for you." Vogel also notes that "Testing in a very large-scale distributed setting is a major challenge" (Vogel, ACM Queue, see #citations), and we try to address this issue on a theoretical level in this report. Social NetworksWe first note that the service model of software development is not simply a technical choice that supports agile development practices, but in addition a *social* choice in how people are organized. It is notable that two Web companies at the vanguard of software development at scale, GOOG and AMZN, organize their corporate engineers in a way that matches the technical organization of their internal and external services. GOOG reportedly has a very flat organizational structure, and AMZN's Jeff Bezos is known to be a proponent of a "2-pizza team" model of software development (citation here), where each team owns a service or a number of related software services. As an increasing number of corporations, governments, and other large institutions evolve towards service-oriented architectures, it becomes important not only to consider the software quality of individual services and features, but of the end-to-end software quality of the aggregate set of services. Data-Center Scale IntegrationUnlike previous work (as found in traditional applications of group testing, see Du and Hwang 2000, #citations), we try to identify the *interaction* between components that causes an integration defect, rather than just identifying defective singletons in isolation. We also more explicitly consider the social aspects of software development in the large, as practiced by Internet-facing consumer and enterprise Web application companies in the last decade. In this way, we differentiate our focus from previous testing research, such as Delta Debugging, which offers a similar approach algorithmically but focuses its analysis on a single user without considering the social complexity, and testing-in-the-small rather than integration testing-in-the-large. In "Delta Debugging" the researchers gave a logarithmic search technique but did not focus on the constants involved, or on the ways testing can or can not be made parallel. While this is fine for debugging applications that run on a single host, such as Mozilla or gcc, constant factors and the parallel nature of integration testing start to become important when integrating services or feature sets on a Operating System or Internet-scale web application scale. Thus, we argue that such a data-center-scale integration requirement necessitates deeper theoretical grounding beyond the analysis found in previous debugging research, which we concretely find in the theory of "group testing". Group TestingHistory of the MethodThe idea for group testing is usually credited to Robert Dorfman, an economist who worked for the United States government during World War II and later wrote a well-known work on linear programming. In the introduction to Du and Hwang's comprehensive book on combinatorial group testing (Du and Hwang 2000), Robert Dorfman explains his recollection -- fifty years later -- of the events leading to the development of group testing.
Dorfman then wrote a seminal report titled "The detection of defective members of large populations" in the Ann. of Mathematical Statistics, 1943. Although group testing was not ultimately applied towards the problem of detecting syphilis, Dorfman's blood testing problem found its way into Feller's treatise on probability in the 1950's and beyond. Later, group testing was brought back to find leaks in a chemical apparatus (Sobel and Groll 1959) and the testing of condensers and resistors. In the last decade, group testing has found further application in pooling designs, and nonadaptive group testing have become "Important Tools for DNA Sequencing" (Du and Hwang 2006, see #citations). Categories of Group TestingDorfman's original blood testing analysis described "probabilistic group testing", where one knows or estimates the probability that an item (or in his case, a person!) would be found to be defective. Later work, describes "combinatorial group testing" (C. H. Li, "A sequential method for screening experimental variables", 1962), wherein instead of considering probability distributions as in probabilistic group testing, one assumes that the defective set is a combinatorial subset of the items in question. (In this report, we consider combinatorial group testing, and leave probabilistic group testing for future work.) As Du and Hwang point out, combinatorial group testing has been studied more recently in "complexity theory, graph theory, learning models, communication channels and fault tolerant computing" and a brief survey of work on group testing by computer scientists reveals a recent body of work in theoretical computer science. I learned of group testing from (S.) Muthu Muthukrishnan, a CS professor at Rutgers, who mentioned it briefly during a talk at the Stanford Workshop on Modern Massive Data Sets 2006.
Beyond combinatorial vs probabilistic group testing, one can also categorize group testing into "adaptive" and "nonadaptive" group testing. In "adaptive" group testing, one can use the test results from one stage in testing and adaptively design and then algorithmically chose the appropriate set of tests to be run in the next stage. In "nonadaptive" group testing, the testing sequence may be staged, but the specific tests to be run are fixed in advance. Nonadaptive group testing is common in biology, specifically for DNA sequencing, whereas for software integration, we opt for an "adaptive" group testing approach, as such a scheme can reduce the number of tests ultimately required. Finally, we make a distinction between classical group testing, wherein the defective set consists mainly of singletons -- a faulty person or a leaky pipette for example -- and "group testing for complexes" where the defective set may furthermore contain defective subsets which are *not* specifically closed under subset. As described by Macula (2006) and earlier by Torney (1999), we let [t] represent a finite population with t elements, and suppose we have an unknown number of k-subsets S of [t], where d=|S| and S is known as the "set of positive k-complexes". In the "group testing for complexes problem" we must identify S by performing binary containment tests on subsets (pools) of [t]. More informally, imagine a set of items, and define a hypergraph that has those items as the vertex set. Then in the group testing for complexes problem, we want to exactly identify those hyperedges -- subsets of vertices -- that test "positive". More simply, instead of a hypergraph, consider a simple graph over those items (restricting our model from k-complexes to 1-complexes) and let the problem be the identification of defective (the term used often in traditional group testing) or positive (the term often used in pooling designs for DNA sequencing and the like) edges. Summarizing, the scope of the algorithms in this report is confined to combinatorial and adaptive group testing for 1-complexes where d is furthermore fixed to be one. We leave generalization of these assumptions to further work. A Theoretical ModelAs in traditional group testing, we try to minimize the number of tests required. Given this, we also assume that there exists a comprehensive set of automated integration tests, or at least that there is some more-or-less repeatable way to determine whether or not a subset of components can be integrated successfully without defect. In more recent work on pooling designs, researchers also design tests in a way that does not assume the robustness of the underlying testing mechanism, but for our purposes, we do not generalize in this fashion. The second metric we define is the depth of the recursion tree (or perhaps dag, in general) required to identify and isolate the integration defect. This depth is meant to correspond roughly to the overall time required to complete an integration test cycle. We suggest this metric -- also mentioned by theoretical researchers in Knill and Muthu (1995) and probably before then -- as development cycles can matter greatly when the units involved are hours, days, and weeks. That is, an integration testing cycle that takes 4 hours rather than 8 hours, or one day rather than three, may make a big difference in an organization that supports thousands of software engineers that support interdependent software processes. We abstract a development organization using a graph-theoretic social model, where each item in the vertex set corresponds to a single service ("feature" or "component", if you like). We assume that each service is owned by an independent team of developers, and that services can thusly be deployed and upgraded independently of other services. We also assume that individual services are well-tested, but the integration of new or upgraded services is *not* well-tested. A set of $n$ vertices then corresponds to a possible set of $n$ \choose 2 interactions, an edge for each possible interaction in the complete graph $k_n$. An edge is then said to be positive (or "defective") if when the two service endpoints, when integrated or upgraded together, cause a failure. In this model, the integration problem is seen as simply finding a defective edge in a simple graph. Integration MethodsA standard integration method for development on a non-trivial scale is to use code branches, forking code to develop new feature sets. However, this integration method often does not apply to the service model of development, where two services may or may not be sourced from the same code repository. In a CS343 lecture, Microsoft VP Srivastava hinted that Vista used a bottom-up integration technique. In the absence of specific details, we simply propose the obvious bottom-up integration technique whereby sets of services/features are incrementally aggregated, integrated, and tested, until all new services/features are enabled. Figure 1 shows a simple example where services (n=8) are pairwise integrated. In figure 2, services A and B together form the defective edge, but this is not found until the final integration step at the root node. Note that without further testing, one would not be able to discern this defective edge (is it a conflict between A and C, for example?). More generally, note that the probability is high that integration defects are not found until close to the top of the integration tree. On the flip side, if the branching factor is higher and defects are rare, an integration tree may be an ideal approach. Again, in the absence of any specific or methodical approach to integration testing at scale, we simply propose an obvious top-down approach -- that of divide-and-conquer. Given n=64 services, for example, we can divide the services into four groupings, and test all pairwise interactions between these groupings. Once a defect is found, we can then recurse on that "edge". The drawback to this approach is that it may not scale if we need to run more tests in parallel, as the number of pairwise interactions will grow quadratically. In this report, we propose a simple integration method that is similar in spirit to recent work that employs almost disjunct matrices for two-stage integration testing on complexes (Macula 2004) Instead of testing all pairwise interactions, we try to test larger, overlapping subsets. The intuition is that we can try large chunks of new stuff, together, throwing them "against the wall" to see which ones shatter (break). Given the subcomponents that break, we can then recurse on the intersection of those positive subsets. For example, consider again the example of n=64 services, where we want to subdivide into four groups of 16 services each. Figure 3 shows that it suffices to perform tests on three "triangles" of this tetrahedron. For example, if our groups are identified as A, B, C, and D, it suffices to test A+B+C, B+C+D, and C+D+A. Looking again at figure three, if two subsets test positive, we know that the inner edge must be defective. If only one subset tests positive, then we know that the "outer edge" for that subset must be defective. In both cases, we simply recurse on that edge as long as it contains more than two actual services. Software StagingThe example in figure 3 of three Shattr sets could also be applied to the software deployment/staging pipeline. In a traditional software staging pipeline, software begins in an "alpha" stage, and then is successively promoted to a "beta" or "development" stage, an optional gamma stage, and then finally to production. Such a staging pipeline serves to reduce the risk involved in deploying new software to production. However, developing software at scale has the added risk of social complexity. To be specific, consider an example where a developer "breaks the build" or "hoses [the] test [stage]". If the build is that of Windows Vista Service Pack X or the testing stage is host to a hundred services with dozens of feature teams, then breaking a build or hosing a test stage with a bad, unexpected interaction between services could waste hours of developer time. Instead of this star model, where all new services feed into test, the Shattr sets suggest an alternate approach. Instead of a single test/beta stage, one could imagine three different beta stages, $b_1$, $b_2$, and $b_3$, where services are divided into one of four as described above. Then, most services are deployed to two out of three beta stages, and some services are deployed to all beta stages. In the case of integration failure, then, using the same technique of intersecting sets, developers could potentially identify integration defects much more quickly. Experiments / ImplementationIn order to make sure I understood the problem domain, I developed a very simple simulator of integration testing, written in Ruby. The Shattr-style algorithm tested, for example, can be expressed in the following Ruby code: For those unfamiliar with Ruby code, the comments also describe the algorithm. Let *arr* be an array of elements (services/features/components). Then,
To get an idea of the effect of choosing a different integration method, consider a sample simulation result for n=128 services, Shattr with k=8, Tree k=4:
These time results are of course unitless, as they simply refer to the maximum depth of the recursion tree, but to put things in perspective, imagine the units to be hours or perhaps even days. From this perspective, the time difference may be notable. The entire simulation code is available at Projects:Leslie Wu:Shattr sim, along with a few basic unit tests of the integration testing simulation. The divide-and-conquer differs slightly from the algorithm described above, but only in a minor way (short-circuit recursion). Rather than go through a complete randomized analysis, we simply point out that the use of overlapping "Shattr-style" sets simply increases the branching factor of our "n-ary" search, in the same way that external memory data structures often play with branching factors in order to make sorting or searching more efficient, given a certain concrete model of computation. Software build and deployment engineers can then chose to trade-off the amount of testing resources required versus the need for iteration speed. Related WorkEarly on in the literature review, I came across work by Tsai and colleagues in COMPSAC and AWCC '04 (#citations). In Tsai (COMPSAC '04), they suggest that a service-oriented architecture provides an impetus to move from "Independent Verification" to "Collaborative Verification and Validation". In order to evaluate the reliability of a web service, they test a group of alternative web services and use a voting mechanism. We note that they use testing to find a faulty web service, rather than to find actual integration defects between and across web services. Furthermore, by "group testing" they seem to simply mean testing multiple versions of the same web service. In Tsai AWCC '04, they mainly test to find best WS. By employing progressive test case generation, they can identify test dependencies and use topsort to prune web service testing. In "Delta debugging", Zeller et al. typically consider a single user and try to whittle down a defect-causing input to the minimal set. In this sense, their work is similar, as they also propose a simple logarithmic divide-and-conquer approach. However, they do not seem to make the explicit connection to the subfield of combinatorial and adaptive group testing on k-complexes, and furthermore do not consider the software integration problem at scale. Thus, we regard this work as a small step towards a deeper understanding of how to test and integrate more efficiently for software development in-the-large. On a simpler level, "Delta debugging" is about the case of debugging where a failure case has already been identified, but the root cause is unknown, whereas this work focuses on software integration testing where we assume that defects generally exist, but hope to shake out the worst ones as quickly as possible. For integration testing, we may be confident that a permutative group testing pipeline with three parallel stages finds the "important" defects with high probability, even if it does not cover all possible interaction corner cases. In other words, we can chose to design our integration testing staging pipelines and methods to trade off between developer cost, machine cost, time-to and ease-of deployment, probability of risk, test engineer burden, and so on, whereas debugging is more specifically a technique for removing (and/or introducing!) defects, and tends not to bring these trade offs to the forefront of software design. On a theoretical level, Macula and colleagues have been continuing theoretical work on understanding two-stage algorithms for combinatorial group testing on k-complexes, by using almost disjunct and random matrices. In contrast, our short project focuses more on a practical approach that does not require as many tests. I was surprised to find that group testing on k-complexes to be a fairly new area of study, and thus spent longer trying to figure out what had actually been done and what could actually be done on an algorithmic level. Future WorkThe Shattr-style algorithm we described here is only a small step in actually making a large-scale integration test process more efficient. Future work might investigate the tradeoffs involved in actually changing a large-scale deployment pipeline from a single pipe to multiple pipes in parallel, or actually gather data on the prevalence of integration defects, and the tradeoffs involved in devoted more resources to integration testing. From a cognitive level, we also would like not to overburden the programmer by offering her too many variants in the testing pipeline. Thus, this would call for better testing matrices, dashboards, and ways of controlling permutative deployment schemes. On a practical level, it would be useful to find middle ground between the simplified model we describe here and the generalized group testing for k-complexes problem studied by theoreticians such as Muthu and Macula. A more specific graph model could be employed, as interactions between components are not typically a complete graph. And once again, we find that the core theory behind practical group testing on complexes is relatively new, and there is a lack of simple guidelines for developers to follow if they wish to implement Shattr-style or similarly parallelized permutative combinatorial group testing methods. Finally, is group testing in the wild actually practical? Would it speed up the development process of Microsoft Windows Vista, or the iteration cycles at Amazon.com or Google? Is permutative group testing needed to scale the agile practice of continuous integration (also see Mozilla's tinderbox to data-center software integration scale? Build processes tend to be neglected as necessary evils, leading to little delays across a software organization that, potentially, add up to millions of wasted moments in time. It remains to be seen how critical of a problem integration testing will be as service-styled software organizations scale even further up and out, and come to some level of maturity. AcknowledgmentsI pitched the initial concept for applying CGT to software integration to Amitabh Srivastava, "Corporate Vice President of the Windows Live Core and a Microsoft Technical Fellow", who thought the idea seemed novel, and encouraged me to pursue the approach. Thanks to S. Muthu for telling me a bit about group testing. Thanks also to Monica Lam and Martin Rinard for organizing CS343, Dawson E, Ted K, Phil G and that whole crew. ConclusionsIn this work, we proposed the novel application of the theory of “Combinatorial Group Testing for Complexes” to Software Integration at scale. Specifically, we analyzed a simple graph-theoretic model for Software Integration as a social network system, defined several software service integration cost metrics, and described various integration techniques. We simulated these techniques in Ruby, demonstrating a potential way to find integration defects more quickly. Finally, we also we provided a literature review, with pointers to previous and related academic work. Citations
Proposed Milestones (as of 7 May 2007)
Status report (20 May 2007)
Status report (24 May 2007)
Status report (27 May 2007)
Last modified June 11, 2007 1:11 am / Skin by Kevin Hughes
![]() |