Contents

spa4j: Structural Pattern Analysis for Java

Abstract

Common practices in software development often show up in varying projects, forming patterns. Usage of patterns are encouraged, since it helps write good code. In this project, we search for patterns in real world code and see how prevalent they are in practice. We first give a description of our complete tool chain that analyzes Java bytecode and searches for certain patterns. In-depth descriptions of the patterns and rules that we formulated are also given. Then, we show our results for the Java Runtime Environment, and two libraries -- Jena and Lucene. Finally, we conclude with a direction for future work to build upon and add additional value to our system.

Motivation

Software design patterns are standard solutions to a set of common problems that frequently arise in software development. Usage of design patterns is encouraged as it increases the scalability and re-usability of software. These propertties, scalability and re-usability, seem to be evident in good software.

In this project, we wanted to develop this observation and determine if there is any correlation between good software and prevalent usage of design patterns. Primarily, we focused on object-oriented, structural patterns that layout the relationships between entities in the system. In addition, we explored several creational patterns that deal with the creation of objects in the system.

Approach

System Structure Overview

The system works by analyzing a collection of Java bytecode, extracting the object-oriented, structural features of the code in the form of datalog predicates. The analysis portion outputs a set of datalog files that are required for executing pattern rules and queries. The rules and queries, written in datalog+, are first syntactically translated to datalog then fed into bddbddb (a bdd-based datalog implementation), which outputs query results. At a high-level, the system looks like this:

image:spa4j_system_overview.png

Java Bytecode Analysis (spa4j,joeq)

The analysis component is initialized by specifying classes or packages to analyze and providing those classes via some ClassLoader. Using joeq to scrape the bytecode, spa4j then looks at all associated classes and methods to extract the following structural, object-oriented features:

  • subClass(class, super)
    • specifies class/interface subtyping via the "extends" or "implements" keywords
  • abstractClass(class)
    • specifies that a particular class is abstract (but not an interface)
  • interface(class)
    • specifies that a particular class is actually an interface
  • classVisibility(class, visibility)
    • specifies the visibility and access-level for a particular class
  • field(class, fieldType)
    • specifies that a particular class has a field of a particular type
  • method(method, definingClass)
    • specifies that a particular method instance is defined in a particular class
  • subMethod(method, super)
    • specifies method overriding (for inherited methods) and method implementation (for required interface methods)
  • returns(method, returnType)
    • specifies the return type of a particular method
  • param(method, paramType, position)
    • specifies the type and position of each parameter to a particular method
  • staticMethod(method)
    • specifies that a particular method is static
  • abstractMethod(method)
    • specifies that a particular method is abstract
  • constructor(initMethod)
    • specifies that a particular method is a constructor
  • finalMethod(method)
    • specifies that a particular method is final
  • call(caller, callee)
    • specifies that the caller method calls the callee method somewhere in its body
  • methodVisibility(method, visibility)
    • specifies the visibility and access-level for a particular method
  • sameSig(method1, method2)
    • specifies that two particular methods have identical signatures; note that this expands upon the subMethod relationship by a large degree

In addition to generating predicates for each of the extracted features, the spa4j analysis front-end also creates the appropriate relations, domains, and domain map files for use during the execution phase. The relations are used to define the predicates at an abstract, schematic level. The domains are used to abstract the relation parameters as integer elements. Finally, the domain map files are used to map those integers back to meaningful values (class names, method names, visibility descriptions, etc).

The collection of analysis output files, along with the datalog+ rules and queries--covered in the next section--are then sent into the execution component, covered below.

Rules and Query Language (datalog+)

  • Datalog is a query and rule language for deductive databases. It is also the input language for bddbddb. There are two types of datalog rules: ground rules such as
    method(foo, bar).
    that are facts retrieved by the analysis phase, and complex relations such as
    nonPublicMethod(m) :- methodVisibility(m, v), v>0.
    that can be deduced from other rules. Queries end in question marks(?) instead of the period(.) for rules.
  • Limitations of (bddbddb) datalog includes limited support for quantifiers and the lack of parentheses or the disjunction operator. They can be circumvented by using multiple rules of the same name for disjunction and simulating other quantifiers with the existential quantifier and negation. However the rules tend to get long and hard to understand as multiple levels are required to write certain rules.
  • Datalog+ is an extension to datalog which overcomes the aforementioned limitations and helps write rules in a more natural manner. The grammar for datalog+ is:
    • rule := predicate :- expr .
    • predicate := IDENT ( IDENT (, IDENT)* )
    • expr := predicate
| expr | expr (disjunction)
| expr , expr (conjunction)
| expr & expr ( " )
| expr -> expr (implication)
| expr = expr (equality)
| expr != expr (disequality)
| expr < expr (less than)
| expr > expr (greater than)
| ! expr (negation)
| @ IDENT expr (universal quantifier)
| # IDENT expr (existential quantifier)
| #1 IDENT expr (unique existential quantifier)
| ( expr ) (parentheses)
  • The following rules are applied recursively to form pure datalog rules:
    • @ x e ==> !(#x !e)
    • #1 x e(x) ==> # x (e(x), !(# v (e(v), x!=v))) (for fresh variable v)
    • e1 -> e2 ==> !(e1, !e2)
    • # x e, e1 | e2, and !e are broken into multiple rules if e is not atomic (a predicate or comparison).
  • The additional operators and quantifiers allows us to specify rules more naturally. For example, we can require some properties for all methods of a class, or universally quantify over all subclasses, etc. With the parentheses, we can specify most patterns in one datalog+ rule instead of multiple datalog rules.

Pattern Execution Engine (spa4j,bddbddb)

The execution component is split into two phases: the pre-processor, and the query execution engine.

  • Pre-Processor
The pre-processing phase is responsible for converting any datalog+ rules into vanilla datalog syntax (as described in the previous section) and pruning from the predicate set any relations that are unused for the specified rules/queries.
Predicate pruning is important to perform up front due to inefficiencies in the version of bddbddb that we were working with. While it is true that bddbddb performs its own pruning prior to query execution, the pruning involves walking several linked lists, comparing objects at each location; while theoretically this shouldn't be too bad, in practice it was found to take time in the order of hours to prune out unused predicates for simple queries from input sets with sizes in the millions.
  • Query Execution Engine
The execution phase is fairly straight-forward. Using the pre-processed input parameters, spa4j sets up the appropriate temporary files, then pipes them into bddbddb for analysis. bddbddb applies the specified rules to the set of input predicates until a fixed point is found, then queries the dataset for the requested patterns. The results are then dumped to a file for offline inspection.

Patterns

In this section, we give a brief description and datalog rules of each design pattern.

Structural Design Patterns

  1. Single Method Pattern
    • Description
      • A class contains only one method that is not a constructor.
    • Datalog Rule
      • SingleMethodClass(c) :- !SingleMethodClass2(c), method(_,c).
      • SingleMethodClass2(c) :- method(a, c), method(b,c), a!=b, !constructor(a), !constructor(b).
  2. Proxy Pattern
    • Description
      • In Proxy Pattern, a proxy class acts as an interface to another resource such as a network connection, a large object in memory, a file, or something that is expensive or hard to duplicate. In general, one instance of the complex object is created, and multiple porxy objects are created, all of which contain a reference to the single original complex object. Any operations performed on the proxies are forwarded to the orginial object. Once all instances of the proxy are out of scope, the complex object's memory is deallocated.
      • UML Diagram
    • Datalog Rule
      • subMethod(a,c) :- subMethod(a,b), subMethod(b,c).
      • subClass(a,c) :- subClass(a,b), subClass(b,c).
      • proxyMethod(m,d) :- subMethod(m,d), call(m,d).
      • ExistsProxyInClass(m,class) :- method(m2,class), proxyMethod(m2,m).
      • NoProxyMethodOfSuper(class,super) :- method(m,super), !ExistsProxyInClass(m,class).
      • proxyClass(class,super) :- subClass(class,super), !NoProxyMethodOfSuper(class,super), method(m,super), ExistsProxyInClass(m,class).
  3. Adapter Pattern
    • Description
      • A wrapper class adapts one interface for a class into one that a client expects. An adapter pattern is used where classes could not work together because of incompatible interfaces by wrapping its own interface around that of an already existing class. The adapter class also handles any logic that is necessary to transform data into a form that is useful for the client. For instance, if multiple boolean values are stored as a single integer but the client expects True/False, the adapter class would be responsible for mapping single integer value to true/false.
      • UML Diagram
    • Datalog Rule
      • Adapter1(a, m) :- call(m, m3), method(m3, a).
      • Adapter2(a, c, i) :- method(m, c), subMethod(m, m2), method(m2, i), !Adapter1(a, m).
      • Adapter(a,c,i) :- subClass(c, a), subClass(c, i), !Adapter2(a, c, i), a!=i.
  4. Template Method Pattern
    • Description
      • Template Method Pattern allows separation of algorithm and implementation. A template method defines the skeleton of an algorithm and the subclasses of the method override the abstract methods to provide concrete behavior. An abstract class provides the basic step of an algorithm design implemented using abstract methods. Then, subclasses override abstract methods to implement specific actions. The overall algorithm is in one abstract class and implementation details lie in subclasses.
      • UML Diagram
    • Datalog Rule
      • TemplateMethod(m) :- abstractClass(c), method(m,c), !abstractMethod(m), call(m,m2), abstractMethod(m2), method(m2,c).
  5. Non-Virtual Interface Pattern
    • Description
      • NVI (Non-Virtual Interface) Pattern is very strongly related to the template method. However the NVI pattern identifies advantages of a non-abstract method invoking the subordinate abstract methods. The NVI pattern is detected if a final method in an abstract class calls some other abstract method in the class.
    • Datalog Rule
      • NonVirtualInterface(m) :- abstractClass(c), method(m,c), !abstractMethod(m), call(m,m2), abstractMethod(m2), method(m2,c), finalMethod(m).

Creational Design Patterns

  1. Factory Method Pattern
    • Description
      • The factory method pattern manages the problem of creating objects (products) without specifying the exact class of object that will be created. In this pattern, a separate method for creating the objects is defined in Creator and subclasses override this method to specify the derived type of product that will be created. The factory method is often used to refer to any method whose main purpose is to create objects.
      • UML Diagram
    • Datalog Rule
      • nonPublicMethod(m) :- methodVisibility(m, v), v>0.
      • nonPublicClass(c) :- classVisibility(c,v), v>0.
      • nonPublicConstructor(init,c) :- constructor(init), nonPublicMethod(init), method(init,c).
      • nonPublicConstructor(init,c) :- constructor(init), nonPublicClass(c), method(init,c).
      • FactoryMethod(f,c) :- nonPublicConstructor(init,c), call(f,init), method(f,_), returns(f,s), subClass(c,s).
  2. Abstract Factory Pattern
    • Description
      • The abstract factory pattern allows encapsulation of a group of individual factories that have a common theme. The client creates a concrete implementation of the abstract factory and then uses the generic interfaces to create the concrete objects. The client does not know about which concrete objects it gets from each of these factories since it uses only the generic interfaces of their products. This pattern separates the details of implementation of a set of objects from its general usage. Also it insulates the creation of objects from their usage allowing new derived types to be introduced with no change to the code that uses the base object.
      • Abstract Factory UML
    • Datalog Rule
      • subClass(a,b) :- subClass(a,_), subClass(b,_), a=b.
      • subClass(a,b) :- subClass(_,a), subClass(_,b), a=b.
      • subClass(a,c) :- subClass(a,b), subClass(b,c).
      • strictSubClass(a,b) :- subClass(a,b), a!=b.
      • AbstractFactory(abstractFactory,forClass) :- abstractClass(abstractFactory), strictSubClass(forClass,abstractFactory), method(g,abstractFactory), returns(g,abstractFactory), staticMethod(g), call(g,init), constructor(init), method(init,forClass).
  3. Concrete Factory Pattern
    • Description
      • Concrete Factory Pattern centralizes decision of which implementation of product to instantiate. This is similar to Abstract Factory Pattern except this pattern reveals concrete factories that manage instantiation of products.
    • Datalog Rule
      • nonPublicMethod(m) :- methodVisibility(m, v), v>0.
      • nonPublicClass(c) :- classVisibility(c,v), v>0.
      • nonPublicConstructor(init,c) :- constructor(init), nonPublicMethod(init), method(init,c).
      • nonPublicConstructor(init,c) :- constructor(init), nonPublicClass(c), method(init,c).
      • ConcreteFactory1(c, m) :- call(m, init), nonPublicConstructor(init, c).
      • ConcreteFactory3(c, f, s) :- method(m, f), returns(m, s), !ConcreteFactory1(c, m).
      • ConcreteFactory(factory, forClass) :- method(m2, factory), returns(m2, forClass), subClass(c, forClass), !ConcreteFactory3(c, factory, forClass).

Behavioral Design Patterns

  1. Visitor Pattern
    • Description
      • The visitor pattern allows adding new operations to exsting object structures without modifying those structures. Visitee classes have an accept method that takes a visitor object as an argument. Visitor is an interface that has a visit() method for each element class. The accept() method of Visitee class calls back the visit() method for its class.
    • Datalog Rule
      • Visitor(visitor, visitee) :- method(m, visitee), param(m, visitor, _), call(m, m2), method(m2,or), visitee != visitor.
  2. Template Method Pattern (see above)
  3. Non-Virtual Interface Pattern (see above)

Results

JRE 1.5.0_12 Results

Analysis of the Java Runtime Environment 1.5.0_12 code base (java.**.* and javax.**.*) with spa4j covered 4878 classes, 46,322 methods, and 17,427 fields, creating 1,277,749 total structural predicates. Out of all the classes, 1506 were marked as single-method classes, leaving 3372 classes of interest. For each set of query results, up to 15 results were randomly sampled (15 with replacement) and manually verified by inspecting the source files; the false positive rates listed below are derived from such sampling.

Pattern Matches False Positive Rate
Adapter 31
classes
70%'*
(7/10)
Factory Method 1434
methods
8.3%
(1/12)
Concrete Factory 870
classes
54.5%
(6/11)
Abstract Factory 43
classes
0%
(0/10)
Overridden Vis 269
methods
0%
(0/15)
Proxy 77
classes
0%
(0/14)
Template Method 542
methods
0%
(0/15)
Non Virtual Interface 37
methods
0%
(0/13)
Visitor 3428
classes
92.9%
(13/14)

Notes:

  • The false positives for the adapter pattern could have been substantially reduced if the case where the adaptee is java.lang.Object were removed.
  • Some of the results were not verifiable because the source was unavailable.

Jena 2.5.2 Results

Analysis of the Jena 2.5.2 code base with spa4j covered 2325 classes, 18287 methods, and 7278 fields, creating 282427 total structural predicates. Out of all the classes, 729 were marked as single-method classes, leaving 1596 classes of interest. For each set of query results, 15 results were randomly sampled (with replacement) and manually verified by inspecting the source files; the false positive rates listed below are derived from such sampling.

Pattern Matches False Positive Rate
Adapter 22
classes
0%
(0/15)
Factory Method 225
methods
14%
(2/14)
Concrete Factory 155
classes
43%
(6/14)
Abstract Factory 28
classes
9%
(1/11)
Overridden Vis 135
methods
0%
(0/14)
Proxy 45
classes
0%
(0/13)
Template Method 157
methods
0%
(0/14)
Non Virtual Interface 29
methods
0%
(0/11)
Visitor 1478
classes
93%
(14/15)

Lucene Results

Analysis of the Lucene 2.1.0 code base with spa4j covered 324 classes, 2460 methods, and 1226 fields, creating 34218 total structural predicates. Out of all the classes, 88 were marked as single-method classes, leaving 236 classes of interest. For each set of query results, up to 15 results were randomly sampled (15 with replacement) and manually verified by inspecting the source files; the false positive rates listed below are derived from such sampling.

Pattern Matches False Positive Rate
Adapter 0
classes
0%
(0/0)
Factory Method 84
methods
60%
(9/15)
Concrete Factory 62
classes
33%
(5/15)
Abstract Factory 1
class
0%
(0/1)
Overridden Vis 12
methods
0%
(0/12)
Proxy 9
classes
22%
(2/9)
Template Method 48
methods
0%
(0/15)
Non Virtual Interface 8
methods
0%
(0/8)
Visitor 143
classes
80%
(13/15)

Overall Results

Overall, we saw excellent precision with our queries for structural patterns. This can be accounted for by recognizing that the object-oriented features extracted from the bytecode during the analysis phase were primarily structural features.

Precision dropped on the queries for creational patterns, largely due to a lack of semantic analysis. In most of the creational false positives, intermediate objects were being created in some method of interest, but not directly returned. Including some notion of pointer analysis into spa4j would be a boon for the precision of creational pattern queries.

Finally, precision was horrible for our only pure behavioral pattern, the visitor pattern. Like creational patterns, the visitor pattern is much more semantic in nature than it is structural. Due to the loose structural definition of the pattern, a large majority of the results were false positives. Again, the inclusion of a pointer analysis would greatly reduce these errorneus results.

Future Work

It is apparent from our results that future work should focus on the inclusion of semantic analyses in spa4j. While structural analysis has its benefits, semantic analysis would allow us to find a wider ranger of patterns with greater precision. It also opens the possibility to write more patterns.

Additionally, integration of our structural features into PQL, which already supports pointer analysis, would be the next logical step. While datalog+ takes some of the sting out of writing complex pattern rules, writing those rules in the Java-like PQL syntax would be much more natural. Extending the PQL grammar to support our feature set is a wise direction in which to proceed.

Also an interesting extension would be to look for anti-patterns or bugs in code using our tool set.

Conclusion

In this work, we proposed to extract structural design patterns from Java code and check the correlation between those patterns and quality of the code. Specifically, we analyzed the Java Runtime Environment v1.5.0_12, Apache's Lucene v2.1.0, and HP Labs' Jena 2.5.2.--all of which are heavily used, quality APIs--and extracted numerous design patterns. After the analysis, we estimated the correctness of our results by randomly sampling the results and manually inspecting the source files.

Although some false positives existed in the results, many of these structural patterns are used throughout these three code bases. The results provide strong evidence of a correlation between design patterns and quality code. As a part of future work, we need to extend the analysis furtherand integrate semantic analysss to better support existing patterns and open up the door for new patterns. Only then will the answer to the question--whether the prevalence of design patterns implies better code with fewer bugs--be revealed.

Citations/References

  1. John Whaley and Monica S. Lam, CloningBased ContextSensitive Pointer Alias Analysis Using Binary Decision Diagrams.
  2. Monica S. Lam, John Whaley, V. Benjamin Livshits, Michael C. Martin, Dzintars Avots, Michael Carbin, Christopher Unkel, ContextSensitive Program Analysis as Database Queries.
  3. Amnon H. Eden Formal Specification of Object-Oriented Design].
  4. LEPUS Language for Patterns Uniform Specification.

Old Material

Our original proposal and rough project plan (deprecated) are preserved below.

Problem

  • To determine correlation between usage of design patterns and good code
    • Analyze the Java library
    • Analyze large Java open source projects
  • A general method to detect structural patterns in code

Milestones

  1. Set up development environment (joeq, bddbddb, Sourceforge, etc.) (Complete)
  2. Extract the Proxy pattern using the interactive bddbddb shell and hand-coded datalog rules (Complete)
  3. Extract more advanced patterns (abstract factory pattern, factory method pattern, etc)
    • Complete:
      • Added methodVisibility(M,V) predicate for visibility analysis. To exercise this, a rule and query where added to detect sub-methods overriding protected visibility with public visibility.
      • Added sameSig(M,M) predicate to detect methods with identical signatures but that are not hierarchically related. Usage of this predicate is planned for non-interface based proxy detection and interface extraction (where no interfaces exist)
      • Added field(C,C) predicate for pattern analyses that depend on class fields (an anticipated use case).
      • Added pruning infrastructure to prune out predicates that aren't used in the specified pattern rules prior to feeding the predicates into bddbddb. bddbddb's predicate pruning performs poorly when dealing with a half million predicates.
      • Factory Method pattern extraction (Results for the Java library)
      • Template Method pattern extraction (Results for the Java library)
      • Adapter pattern extraction (Results for the Java library)
  4. Automate datalog rule generation using PQL or some other expression package
    • Complete:
      • datalogplus expression package developed in OCaml


We are managing our project in Sourceforge. Take a look at our site: http://sourceforge.net/projects/spa4j/

Last modified June 13, 2007 6:27 am / Skin by Kevin Hughes
MediaWiki