Projects:Source Code Search with Syntax-Based Heuristics
BackgroundMany software development organizations have large inventories of source code to maintain and extend. These code bases are typically based on contributions from many authors, including staff within the organization as well as third-party contributions such as open-source libraries. A key challenge for developers in such settings is efficiently navigating these code bases. Code-oriented search engines can be an important tool for managing this complexity. Code search systems cover a wide range of use cases across the software engineering lifecycle. Examples include:
In addition, as with other search technologies, once source code content has been indexed and made accessible, users may find other applications for the system that were not originally envisioned. A number of code search systems are available today, with a range of different applications. As an over-simplified taxonomy, some of the key types include the following. Public search engines for open-source codeExamples of such systems include Koders, Codase, Krugle, and Google Code Search. These systems seem to adopt principles of large-scale web search engines, with large index sizes, relevance analysis based on information-retrieval practices, and (presumably) sophisticated crawlers that discover and acquire open source distributions for indexing. They have result displays specialized for code search, with keyword-in-context highlighting, code-oriented formatting, and navigation for indexed code based around package structure. In addition, these systems typically have a degree of understanding of the syntax of the code they index, which allows them to offer query features such as targeting search terms to specific regions of the code, e.g. variable or function names, comments, etc., and presumably may play a role in their relevance determinations. As a result, these systems have some degree of customization for particular languages, although the more prominent systems nevertheless manage to support a wide variety of languages. IDE-based toolsAt the other end of the range, full-featured development environments, such as the Eclipse IDE or IntelliJ IDEA, often provide their own search capabilities. These systems can support a deep level of integration with the code under indexing, since they typically support a small set of languages, and have detailed knowledge of the build environment. This level of integration can support more detailed query features, such as searches based on code structure, and result formats are often correspondingly more structured than those for web-based code search. On the other hand, this sort of integrated search is not typically designed for the indexing scale a large organization might require. Keyword-style intranet search enginesA variety of products are available for indexing source code within an organization, with features similar to those offered on public open-source code search systems. In fact, some of these products are offered by vendors (e.g. Koders, Krugle) that operate web-based open-source search. Products for general intranet search that are not specifically oriented to source code, such as the unmodified Lucene system, could also be deployed for this purpose, although their interfaces are not specialized for this application. Tag systemsTags systems, notably Ctags and similar systems (etags, exuberant ctags, and the client-server gtags), provide indexing of source code entities such as function names for efficient access within editors. While this is also a type of source code search, it does not have the same information-retrieval character as the keyword-based search engines outlined above, and functions in some ways more as a navigation system than a search engine tasked with finding approximate matches. find ... -exec "grep ..."While grep can also be used to search source code, and can be a good solution for many installations, it does not offer competitive query performance for code bases of the size considered here, and does not easily incorporate information based on code structure, matches across different parts of a file, etc. Project OverviewThe goal of this project was to develop a prototype for an organization-scale source code search engine, and to explore applications of code syntax and parsing to improve search relevance estimation. Conceptually, this type of product could fill a gap between large-scale source code search, with relatively lightweight parsing-based heuristics, and IDE-style code search, with smaller scale and a heavier level of configuration. The goal of integrating syntax with search relevance was to consider heuristics that are somewhat more experimental than the technique of assigning text to fields based on syntactic role, which is deployed in many code search engines today (see the ONJava.com article "Using Lucene to Search Java Source Code" for an example of this). At the same time, the search system needed to maintain a configuration requirement low enough to be able to index code automatically from third-party distributions without deep knowledge of the build environment. The intention was to build on the advantages of information-retrieval-style code search, such as large index sizes, lexical relevance functions, and minimal per-project configuration, while also incorporating deeper heuristics relevant to code structure. The prototype focused on indexing Java code, for two main reasons. First, a robust, accessible parser for Java was available as part of the Eclipse Java Development Tools (JDT). Second, the more regular structure of the package system used in Java simplified the process of analyzing inter-file dependencies. Even if the build environment is not modeled completely or accurately, it is often possible to guess linkages across files or even distributions due to package naming conventions. It should be possible to extend the ideas discussed here to other languages, although the requirement for parsing makes it difficult to support more than a handful of languages, at least with the same level of heuristics. ImplementationOverviewThe prototype search engine for this project was implemented in Java, using the Apache Lucene search engine toolkit. Indexes were stored on and queried from disk. To parse Java source code files, the prototype relied on the parser component of the Eclipse JDT, which can be operated independently of the compiler back-end, and is adaptable to compliance with various versions of the Java specification. As sample content, the prototype indexed source code for the Java-based projects from the Apache Software Foundation, as well as the Eclipse IDE; this amounted to approximately 46,000 files and 316MB of source code. Lexical AnalysisSource code presents different lexical issues than natural language text. In addition to offering syntactic structure that can be reconstructed by parsing, detailed later, source code also has particular tokenization requirements. Entity names in code are often formed from individual words joined together by punctuation, such as underscores (e.g. To address this use case, the tokenizer in the prototype system was designed to split terms on all punctuation, transitions between letters and digits, and case transitions (but with logic to preserve initial capitals with their words). Hence, the method name Aside from tokenization, lexical processing was fairly conventional. Stopwords were removed using a typical English-language stoplist. (It is not clear whether common Java "stopwords" should be removed as well, since common reserved words in Java do have important and concrete influence on the meaning of the code, as opposed to stopwords in the usual information retrieval sense, which have vague influence on meaning and are often optional or implicit in text.) Words were stemmed using the rule-based Porter stemmer, which processes noun plurals and well as verb conjugations and many derivations. This may be too aggressive a transformation for the code search context, but it is useful as a first approximation. The inclusion of stemming in the analysis pipeline allowed queries to match despite some differences in word endings with the index. For example, the query "item sizes" could match the method Keyword-in-ContextAn important feature of web search engines, which is also important in code search, is "keyword-in-context," meaning the inclusion of a small excerpt from the text of each matching document, focusing on how the query terms relate to the document. This display format provides important information to the user about the relevance of the resulting documents that is often not apparent from their titles, filenames, or other summary data, and the quality of the summaries provided is an important component of the user’s overall perception of search quality. The code search prototype used a simple code-oriented method for constructing a keyword-in-context for a file given a particular query. For each query term, the first line in the file matching the term would be displayed, along with the lines above and below it. As a result, several disjoint excerpts of code could potentially be displayed as part of the summary, although these were merged if they overlapped. In addition, all matches for the query terms were highlighted in the results. This methodology worked well enough for example purposes, although there was clear room for improvement in this area. For example, taking into account the co-occurrence of query terms, or the syntax-driven weighting of different term occurrences with the document, would likely improve the quality of the excerpt selection. RelevanceThe relevance function for the prototype used the standard Lucene similarity metric, which is based on the vector space model, where similarity between the document and query is computed as a cosine of term vectors, with a variety of attachment points for term weightings. The prototype used these attachment points to specify higher weighting for particular fields of text, such as the filename and path within a distribution, and to particular syntactic components, such as type or method names. In addition, these weighting methods were used to implement popularity-based heuristics (described below), which scale the value of particular terms in particular documents according to estimates for the relevance of these terms to the documents. The Lucene indexing model supports indexing of additional text fields that are not used for display. The prototype used these fields to supplement the content of a source file with additional text from related files that may be tangentially relevant, but is designated with low relevance weight, as well as to manipulate the weights of key terms, such as type names defined within a file. HeuristicsThe main area of exploration in this project was developing heuristics for search relevance based on syntactic analysis of the underlying source code. The project considered three methods in particular, all inspired to some extent by heuristics used in web search engines. These methods begin with the abstract syntax tree constructed by the Eclipse parser from each input file. Files are considered independently from each other, without build information such as dependencies on other distributions. Estimating "popularity" by counting usage instancesSome queries use terms common to multiple files or even multiple packages, and hence generate a large number of nominal matches, more than users expect to evaluate manually. As a result, relevance-based sorting is a critical factor in determining the quality of results for these queries. One criterion that can be used to predict relevance is the "popularity" of particular source code files for specific query terms. This is an approximate notion, but intuitively it might include preferring the class definition over an instance of the class in response to a search on the class name, or, when multiple class definitions match a query, preferring the ones that are more widely used across the code base. For example, the query "string" might return the definition of To estimate which source code files could be popular for particular terms, based on these type-centric intuitions, the prototype indexer conducted a simple linkage analysis across all of the source code files to be included in the index. For each file, the indexer determined which types are defined in the file, and which types are used as instance or local variables within the file. This data was compiled into a popularity index for each type, where popularity is determined by the number of users of each type. Rather than taking raw occurrence counts across the entire corpus at face value, the indexer assumed that multiple usages of a type within a single file are highly correlated, and hence should be discounted. The indexer transformed these raw counts per file into log counts per file, under the assumption that orders of magnitude of usage within a file were relevant to the popularity estimate, but that small differences were not. Similar reasoning applied at the package level: multiple uses were presumed to be correlated, so the sum of the individual per-file scores within a package were themselves scaled by a log function. This second tier of re-scaling would probably be best implemented at the distribution level, since this is a more natural division within the source code corpus, but was calculated here at the package level for convenience. The result of this calculation was a score for each type based on how often the type is instantiated, with preference to types instantiated in a variety of different source files and packages, rather than to types instantiated many times in a very local area of the index. This heuristic is inspired by the link graph analysis performed by web search engines as part of their relevance calculations, although in this case the heuristic is much more simplistic because it analyzes only direct links between types and their instantiations, not paths along multiple links. This simplification is not entirely unreasonable in this application because the context is not adversarial, and there is no need to account for "spam" in the index. While the prototype conducted this sort of linkage analysis for classes only, it may also make sense for methods and other entities that are defined in one place and accessed repeatedly. Increased weight for key fieldsA second application of syntactic analysis was to identify regions of the code that seemed particularly important within the file, and mark these for increased term weight in the relevance function. In particular, class, method, and package names were given increased weight at indexing time in the files that define them (as opposed to files that use them). This heuristic helped ensure that searches on class or method names were directed to the files that define them, everything else being equal. Inclusion of supplemental keywordsThe third relevance heuristic in the prototype targeted recall, rather than relevance per se. Some queries may describe code using terms not actually present within the code itself, such as synonyms or other related terms. The idea behind this heuristic is that these terms may be used in other code nearby a reference to the code in question. For example, the word "text" is fairly related to the class By introducing text that is likely to be unrelated to a file, this heuristic is potentially quite detrimental to relevance, although the very low weight of this secondary text reduces this impact (but also reduces the benefit). As a filter to focus this secondary text more closely on terms likely to be meaningful, it would be interesting to focus on terms that are popular near instantiations of a type across different files and distributions, perhaps discarding the terms that appear infrequently or only locally within the corpus (e.g. in a single distribution). At the same time, terms that appear quite often in this associated text across many different types could also be eliminated, as they are likely to be general terms that are not specific to a particular type, such as common program terminology or reserved words. If this heuristic proves useful, it would also be interesting to extend it to methods and other entities that are shared across files, although methods in particular lack an analogy to instance names, which seem to be a particularly important source of this text. Type ResolutionOne of the challenges of implementing cross-file heuristics is that types cannot be globally resolved by the parser, given the lack of build information. As a work-around, the prototype indexer approximated type linkages by conjecturing that each non-qualified type used in a file might have come from any on-demand package import listed in the file, unless it was specifically imported individually. These conjectures were matched with the types actually defined in other files (qualified with their package information). For example, the use of the class ResultsIt is difficult to meaningfully evaluate a search engine in the absence of actual user queries. Without a clear understanding of how users actually employ the service, it is hard to know how to optimize relevance or extend the feature set. For these reasons, it is not possible at this point to offer a solid evaluation of the prototype search quality, or to quantify the value of the particular heuristics explored here. However, there are some basic results worth noting. First, the parsing approach was technically feasible. The Eclipse JDT parser proved to be very robust, and was able to analyze the entire input corpus without incident. The approximation described above for cross-file type resolution seemed to work well in the author’s informal observation, and the type-instance popularity metric generated a sensible list of top classes that included many examples from the Java standard library, as well as key classes from JUnit, Eclipse, and others (see examples). Queries where the heuristics were relevant generally worked as expected; for example, a search for “index” would return some of the most important Lucene classes related to indexing. To really understand the performance of the prototype, it will be necessary to either deploy a version of it in an organizational setting, or to simulate usage by indexing a real corpus and reviewing results from actual user queries. This type of data would guide the development of syntactic heuristics. For example, perhaps local variable names should receive amplified term weighting, like method and type names do in prototype, but it is difficult to evaluate this conjecture in the absence of usage data. With such data, we could simulate query performance, isolate examples where results differ depending on the term weighting heuristic, and compare these difference to get an overall idea of whether higher weights for local variable names actually improves or hurts usability overall. At a higher level, the heuristics and other features present in the prototype are designed around a set of use cases that seem plausible, but there may be other important use cases as well that are not well-served by these features. By assessing performance in actual use, we can discover these other use cases and design features to address them. Future DirectionsWhile future work in this area should be guided by actual user requirements and experiences, there are many possible extensions that could be interesting directions for the prototype phase. Within the area of syntax-based features, some ideas include:
While parsing-based features are the focus of this project, there are many other features that would be required to develop a truly usable code search engine from the current prototype; these could include:
The intention of this project was to explore applications to code search within a single organization, which limits scale challenges and eliminates the need for sophisticated content acquisition methods; any extension to open-source search in the style of existing large-scale public open-source search services would have to consider extensions in these areas, which can be very difficult challenges in their own right, as well as the ways to manage the impact of the resulting scale on the relevance function and index-time processing steps.
Last modified June 13, 2007 6:38 pm / Skin by Kevin Hughes
![]() |