Contents

Type Inference in Python

Abstract

Types of variables in Python [1] are not statically declared. This provides the developer with more flexibility, and goes hand-in-hand with the philosophies of agile software development. However, it also means that the curious developer cannot determine the types of variables without manual inspection of several lines of code that could possibly cut across multiple functions and modules. This effort will be amplified if the code is written by someone else. Worse yet, unexpected type errors could potentially happen at runtime, which could never occur in a fully statically typed program.

In this paper, I propose the use of static type inference at the source code level in Python as a way to provide the developer (or any interested agent) with the most accurate type information about variables. My first attempt at developing an algorithm led to a local flow-sensitive analysis which provides type information about local variables at every program point.

The rest of this paper is organized as follows. In Section 1, I provide some background information about type inference in general. Section 2 describes the previous work done in the area and how this work is different. I explain my approach and results in Section 3, and in Section 4 point out what directions future work should take. Section 5 consists of my concluding remarks about the topic. The code I've so far written at the time of writing this paper is provided in Appendix A.

1. Background

Type inference is the process of finding the most accurate types of variables in a program without explicit declaration of those types in the source. Perhaps it is best explained by some examples. Consider the following code:

 import random
 def f():
	x = [1]
	if random.randint(0,1):
		x = 1
	print len(x)
 f()

This code is fine prior to runtime and compiles fine. However, the condition of the if statement is true 50% of the time in runtime. So, half the time this code is run, the type of x before the print statement is int. The built-in function len expects a list as its argument. When it is invoked on an int, it generates a type error.

In this simple example, the error can be detected by about two seconds of manual code inspection. However, you can imagine that if this situation occurred over hundreds of lines of code, across function and modules, finding the error would be a very unpleasant experience.

A related second example is when a Python developer inherits someone else's code and needs to figure out the type of some variable passed as argument to some function. The developer would have to find all call sites of the function, and trace back what possible types the argument could take. This track-back process grows exponentially with the number of nested call sites.

Type inference could help the developer keep his or her sanity in these situations. In the code example above, prior to runtime, the type inference engine could detect that it is possible for x to take on the integer type when len is called and inform the developer of the potential error. In the second example, a development environment that reports the possible types of variables, as determined by inter-procedural type inference could be very handy.

2. Previous Work

With the exception of ATI (described in Section 2.5), all the previous effort in Python type inference that I have come across is focused on inferring types in order to eliminate runtime type checks and ultimately improve performance. This is an interesting cause, but one that I am not interested in. The goal of this project is using type information to make the development process more convenient. As such, information obtained by type inference should be fed back into the development environment, and so must be coupled tightly with the source code -- a property that is missing in the current efforts.

2.1 Brett Cannon's thesis

In his thesis [5], Brett Cannon uses a combination of localized flow-sensitive and flow-insensitive analysis to infer types of atomic variables in a Python program. At the time of writing his thesis, the _ast module didn't exist in Python so he had to write custom code for parsing. According to an email conversation I had with him, he would've done things differently if he had to write his analyzer again today. This work is mainly concerned with changing the Python compiler to take into account inferred type information when emitting bytecode for enhancing runtime performance.

2.2 Starkiller

Starkiller [6] is another static type inferecer for Python. It is flow-insensitive and inter-procedural. Much like Brett Cannon's work, Starkiller is also concerned with removing type checks and dynamic dispatches from runtime by inferring types of expressions in compile time.

2.3 Psyco

Psyco [7] is a just-in-time (JIT) compiler for Python that infers integral and string types statically and emits machine code for operations on them directly to improve runtime performance.

2.4 PyPy

PyPy [8] is a Python interpreter written entirely in Python. In its interpretation tool-chain, it creates a control-flow-graph for the Python program and then uses flow-insensitive type inference to annotate variables with type information. PyPy could have been a good basis for this project to build on, but due to time constraints and a large mapping differential between the CFG produced by PyPy and the source code, it did not get used.

2.5 ATI

The author of Agressive Type Inference (ATI) [9] developed his approach because he wanted to convert Python source to Perl which is not possible without some type inference, since Perl, unlike Python, uses type qualifiers. It is a flow-insensitive algorithm and works completely on source code with no intention of improving runtime performance. ATI could have potentially been an interesting basis for this project to build on, but at the time of writing this paper, the source of ATI was unavailable.

3. Approach

As mentioned before, this project is concerned with inferring types in such a way that the type information can be fed back to the user. We define user as the actor that sees and modifies the source code prior to runtime, and controls the inputs and outputs of the program in runtime. Common users include the developer, the development environment, etc.

Figure 1

Figure 1 shows a common Python development setting. The developer writes code, compiles it, and runs it in the Python interpreter. The user of the program then runs it, provides input to it, and consumes the output and errors. The components contained in the dashed rectangle are ones that are visible to and under the control of the user.

Figure 2

Figure 2 shows the contribution of this project to the development setting. A type inference engine analyzes the source code (continually or when triggered by the user) and produces type information that goes directly back to the user. The user then decides what to do with this information. One example scenario is when the user is a development environment and it uses the type information to populate the "onHover" boxes that it pops up when the user hovers the mouse over variables in the source.

3.1 Parsing

Starting in Python 2.5, Python comes with a standard built-in abstract syntax tree (AST) library called _ast [2]. I used this library to convert Python source code into an AST. In order to do flow-sensitive analysis, the next step was to convert the AST into a control flow graph (CFG).

3.2 AST to CFG

Python doesn't provide standard support for creating a CFG from Python source. After getting an AST from source, I implemented the data structures and algorithm for representing the code in a CFG without loss of information.

The main data structure is a basic block that contains a single statement (copied from the AST) and has type information at its entry and exit points. Basic blocks are linked together via directed control links. Each basic block can have one or more basic blocks as its successors, and one or more basic blocks as its predecessors. The only exceptions to this rule are the entry and exit blocks of the CFG. The entry block does not have any predecessors and the exit block does not have any successors. There could be multiple exit blocks in a CFG but only one entry block.

Control-flow constructs have to be treated with care. An If-statement creates a fork in the control flow graph. The condition of the If is evaluated a basic block, which has two successors: the start block of the true branch, and the start block of the false (else) branch. The two branches join again at the statement after the If block.

While-statement and For-statement create loops (or back edges) in the control flow graph. In the current implementation, the semantics of For result in overgeneralization of the type of the bound variable, because it is not possible to know prior to runtime what values the bound variable could potentially take. As a result, the bound variable's type is set to the most general type (i.e. the Any type as described in the next section).

The CFG data structures are very similar to those of Joeq [3].

3.3 Data-flow analysis

The data-flow analysis used is of the style described in Chapter 9 of the Dragon Book [4]. Assuming the reader is familiar with this style, a description of the algorithm follows.

The direction of the analysis is forward.

The data-flow value at each program point is a map. Each entry maps a variable name to a list of possible types that variable could take on some path of execution. For example, the following value:

{x: [int], y: [int, list]}

means variable x could be an int, and variable y could be an int or a list.

The semi-lattice for each entry in the map is represented in Figure 3:

Image:Proj_naeim_fig3.PNG

The meet operator is union. So in terms of sets, the top value (Unknown) is the empty set and the bottom value (Any) is the universal set. By default, all possible variable names are assigned Unknown. When a variable is defined, its type becomes a value other than Unknown. If a variable is Unknown, it will not have an entry in the map.

The transfer function is defined as follows:

  • Let IN[b, x] be the set of types variable x could have at the entry of basic block b
  • Let OUT[b, x] be the set of types variable x could have at the exit of basic block b
  • If b contains an atomic (primitive) assignment of type T to variable x, then
    • OUT[b, x] = T
  • If b contains an eval or exec statement then
    • OUT[b, x] = Any
  • Otherwise
    • OUT[b, x] = IN[b, x]

The values are initialized as follows:

  • OUT[entry, a] = Any, for all arguments passed to the function
  • OUT[entry, x] = Uknown, for all other variables x
  • OUT[b, x] = Uknown, for all basic blocks b and all variables x

3.4 Examples

I ran the algorithm on several toy examples. Below is the code for these test cases and the type-annotated control flow graphs generated for them (thumbnails on the right.) In these graphs, the first line in each block is the type information at the entry of the block, and the last line is the type information at the exit. The block's statements are in the middle. Symbol U in the type information signifies the universal set, i.e. Any, i.e. the bottom of the semi-lattice.

test_assign
Enlarge
test_assign
test_if
Enlarge
test_if
test_while
Enlarge
test_while
test_for
Enlarge
test_for


 class MyClass:
    pass

 def test_assign():
    a = 1
    b = 2.0
    c = True
    d = None
    e = "str"
    f = 'str'
    g = [1,2,3]
    h = {'a':1}
    i = (1,2,3)
    j = 8j+3
    k = iter([1,2,3])
    l = MyClass()

 def test_if():
    x = 1
    if x == 1:
        x = [1,2,3]
    return x

 def test_while():
    x = 1
    while True:
        z = 1
        if True:
            x = "one"
        else:
            k = 1
    else:
        y = 1
    q = 1

 def test_for():
    y = 1
    for x in [0,1,2]:
        print x
    else:
        print y
    z = 1

4. Future Directions

This work has only scratched the surface on the topic of source level type inference in Python. There are several directions future work on this topic can take.

The data-flow analysis described in Section 3.3 can be improved significantly to handle more common cases. Type information propagation through variable assignments is an important addition that must be made. For example, if we have x = 1 followed by y = x, the integral type of x must be propagated to y as well.

It is also important to make the analysis inter-procedural and infer and propagate the types of function arguments and function return values. For example, if we have def func(): return 1 followed by x = func(), the type of x should be inferred to be an integer. Currently in such a circumstance, the type of x is set to Any (i.e. bottom of the semi-lattice) which is too general. Two famous algorithms, namely the Hindley-Milner algorithm [10] or the Cartesian Product algorithm [11] could be employed for this purpose.

In the current implementation, container types such as lists or dictionaries are considered as atomic types themselves. While this is a useful abstraction in its own right, it would also be useful to infer information about the values that are contained inside a container. For example, a list of integers' type could be inferred as list/int and a list of integers and strings could be list/[int,str].

The object-oriented features of Python are completely ignored in this implementation. One could get much accurate type results if the types of objects were annotated as their classes, as opposed to the generic "object" type. This would require adding support for detecting class declarations and object assignments in the code. This has several implications. First of all, the number of types will become unbounded (i.e. the semi-lattice grows indefinitely horizontally). Secondly, object-oriented concepts such as inheritance and polymorphism must be taken into account.

Finally, while the current implementation outputs a graph of the Python code with type annotations, the end goal is to use the inferred type information for making the development process more pleasant. Perhaps an integrated development environment (IDE), or a plug-in for an existing IDE, could be developed that uses the type information in the background and provides a usable interface to the developer that maps source code to the type information.

5. Conclusions

I have presented an approach for making the development process in Python easier by providing inferred type information to the developer. With the emergence of the web as an application platform which has led to the popularity of scripting languages, more and more applications are being written in these languages and the code and developer bases are expanding quickly. It is in the nature of most of these languages to lack static type information. Python is an extreme case with no static type declarations. There have been proposals [12, 13] for adding static typing, optional or otherwise, to the Python language. In my opinion, static typing should not become part of Python. If one wants static typing, they should look at other language options, such as Java. Static typing would only clutter up the source code and take away the freedoms and flexibilities that Python was designed to provide as a language in the first place.

However, I think providing inferred type information to the developer during the development process is useful if it requires no effort at all on the part of the developer. I have only scratched the surface in this paper as to what can be achieved by very simple static analysis techniques. By studying a large body of Python code and figuring out the typical kinds of mistakes that lead to runtime type errors, one can expand on these techniques and achieve higher and higher rates of static error detection. At the same time, the inferred type information is there for the developer to see if they so wish. This makes the development process more pleasant and saves time in manual code inspection and manual type inference.

Why flow-sensitive? The nature of scripting languages such as Python requires flow-sensitivity. The same variable can take on multiple types in its lifetime. A flow-insensitive analysis would yield many false positives. I have shown that a flow-sensitive algorithm is not difficult to implement and is by nature more accurate than a flow-insensitive counterpart. Aycock in [9] argues that while Python allows different types of values to be assigned to the same variable, this is not a common practice. My counter-argument to this claim is that many Python developers come from a statically typed background (e.g. C++, Java, etc.) In those statically typed languages, they were confined within the walls of the static type jail to not assign a different type to the same variable. They carry these habits over with them when they start programming with Python. However, Python as a language does not put any such restrictions on the developer, so it is very conceivable that a young developer whose first language is Python will use this freedom, and perhaps even come up with useful ways to exploit this freedom to do interesting things -- things that were not possible in a statically typed language. Therefore, a flow-sensitive tool, like the one I have presented would help this developer without producing many false positives, like the flow-insensitive tool would.

References

  1. Python Programming Language, http://www.python.org
  2. Python Library Reference, Abstract Syntax Trees, http://docs.python.org/lib/ast.html
  3. An Introduction to the Joeq Compiler System, http://suif.stanford.edu/~courses/cs243/joeq/index.html
  4. A. Aho, M. Lam, R. Sethi, J. Ullman, Compilers: Principles, Techniques, & Tools, 2nd edition.
  5. B. Cannon, Localized Type Inference of Atomic Types in Python. [pdf]
  6. M. Salib, Faster than C: Static Type Inference with Starkiller. [pdf]
  7. Psyco, http://psyco.sourceforge.net/
  8. PyPy, http://pypy.org/
  9. J. Aycock, Agressive Type Inference. [html]
  10. Wikipedia, Hindley-Milner Type Inference Algorithm, http://en.wikipedia.org/wiki/Type_inference#Hindley.E2.80.93Milner_type_inference_algorithm
  11. O. Agesen, The Cartesian Product Algorithm: Simple and Precise Type Inference of Parametric Polymorphism. [postscript]
  12. G. v. Rossum, Adding Optional Static Typing to Python. [html]
  13. G. v. Rossum, Adding Optional Static Typing to Python Part II. [html]

Appendix A: Code

Usage: Write any Python functions you like in the string that is passed to compile in main(). Then, run the script. A CFG will be created and type annotated. The output is in dot format that can be passed to Graphviz dot to generate a graph.

#! /usr/bin/python

import _ast

### Main Function ###
def main():
    ast = compile("""
def test_if():
    x = 1
    if x == 1:
        x = [1,2,3]
    return x
# insert other functions here as desired
""", "<string>", 'exec', _ast.PyCF_ONLY_AST)


    function_graphs = []

    # process functions and populate function_graphs with name->start_block mappings
    for node in ast.body:
        if type(node) == _ast.FunctionDef:
            start_block = BasicBlock()
            process_nodes(node.body, start_block)
            function_graphs.append( (node.name, node, start_block) )
    
    # at this point function_graphs has a CFG for each function

    for fg in function_graphs:
        annotate_graph(fg)

    print to_dot(function_graphs)


def annotate_graph(fg):
    name = fg[0]
    func_def = fg[1]
    start_block = fg[2]
    
    func_args = func_def.args.args
    for a in func_args:
        if type(a) == _ast.Name:
            start_block.before[a.id] = "U"

    repeat = True
    while repeat:
        repeat = False
        for blk in block_iterator(fg):
            old_after = blk.after.copy()
            process_block(blk)
            new_after = blk.after.copy()

            if repeat == False and old_after != new_after:
                repeat = True

            for n in blk.next_blocks:
                n.before = meet(n.before, blk.after)
        
def process_block(blk):
    blk.after = blk.before.copy()
    s = blk.statement
    if type(s) == _ast.Assign:
        val = s.value
        for t in s.targets:
            if type(t) == _ast.Name:
                ty = set()
                universal = False

                if type(val) == _ast.Num:
                    if type(val.n) == int:
                        ty.add("int")
                    elif type(val.n) == float:
                        ty.add("float")
                    else:
                        ty.add("int") # catch-all
                elif type(val) == _ast.Name:
                    if val.id == "True" or val.id == "False":
                        ty.add("bool")
                    elif val.id == "None":
                        ty.add("none")
                    else:
                        universal = True # TODO do assignment propagation
                elif type(val) == _ast.Str:
                    ty.add("str")
                elif type(val) == _ast.List:
                    ty.add("list")
                elif type(val) == _ast.Dict:
                    ty.add("dict")
                elif type(val) == _ast.Tuple:
                    ty.add("tuple")
                else:
                    universal = True
                
                if universal:
                    blk.after[t.id] = "U"
                else:
                    blk.after[t.id] = ty

# fg = (name, func_def, start_block)
def block_iterator(fg):
    start_block = fg[2]

    to_process = [start_block]
    processed = set()
    while len(to_process) > 0:
        blk = to_process.pop()
        processed.add(blk)
        for n in blk.next_blocks:
            if n not in processed:
                to_process.append(n)
    return iter(processed)
    

def meet(ti1, ti2):
    m = ti1.copy()
    for v in ti2.iterkeys(): # vars
        if v in m:
            if ti2[v] == "U" or m[v] == "U": # universal set
                m[v] = "U"
            else:
                for t in ti2[v]: # types
                    m[v].add(t)
        else:
            if type(ti2[v]) == set:
                m[v] = set(ti2[v])
            elif type(ti2[v]) == str:
                m[v] = str(ti2[v])
    return m
    
def ti_to_string(ti):
    s = ""
    s += "{"
    a1 = False
    for a in ti.iteritems():
        a1 = True
        v = a[0]
        ts = a[1]
        s += v + ": "
        if type(ts) == str:
            s += ts
        elif type(ts) == set:
            s += "["
            a2 = False
            for t in ts:
                a2 = True
                s += (t + ", ")
            if a2:
                s = s[0:-2]
            s += "]"
        s += ", "
    if a1:
        s = s[0:-2]
    s += "}"
    return s

        
def to_dot(fgs):
    s = ""
    s += "digraph cfg {\n"
    s += "  node [\n"
    s += "    shape = \"box\"\n"
    s += "  ];\n\n"

    for fg in fgs:
        name = fg[0]
        startblk = fg[2]
        s += "  \"" + name + "\" -> BB" + str(startblk.uid) + "\n"
        s += startblk.to_dot()

    s += "}\n"
    
    return s


def process_nodes(nodes, current_block):
    next_block = current_block
    for node in nodes:
        next_block = process_node(node, next_block)
    return next_block

### Meat of the AST->CFG algorithm ###
def process_node(node, current_block):
    next_block = current_block

    if type(node) == _ast.FunctionDef:
        pass

    elif type(node) == _ast.If:
        current_block.statement = node.test
        t_block = BasicBlock()
        end_t_block = process_nodes(node.body, t_block)
        f_block = BasicBlock()
        end_f_block = process_nodes(node.orelse, f_block)
        next_block = BasicBlock()

        current_block.link_next_block(t_block)
        current_block.link_next_block(f_block)
        next_block.link_prev_block(end_t_block)
        next_block.link_prev_block(end_f_block)

    elif type(node) == _ast.While:
        test_block = BasicBlock()
        end_test_block = process_node(node.test, test_block)
        t_block = BasicBlock()
        end_t_block = process_nodes(node.body, t_block)
        f_block = BasicBlock()
        end_f_block = process_nodes(node.orelse, f_block)
        next_block = BasicBlock()

        current_block.link_next_block(test_block)
        end_test_block.link_next_block(t_block)
        end_test_block.link_next_block(f_block)
        end_t_block.link_next_block(test_block) #loop
        next_block.link_prev_block(end_f_block)

    elif type(node) == _ast.For:
        assign_node = _ast.Assign()
        assign_node.col_offset = node.target.col_offset
        assign_node.lineno = node.target.lineno
        assign_node.targets = [node.target]
        assign_node.value = "UNKNOWN"
        test_block = BasicBlock()
        test_block.statement = assign_node
        
        t_block = BasicBlock()
        end_t_block = process_nodes(node.body, t_block)
        f_block = BasicBlock()
        end_f_block = process_nodes(node.orelse, f_block)
        next_block = BasicBlock()

        current_block.link_next_block(test_block)
        test_block.link_next_block(t_block)
        test_block.link_next_block(f_block)
        end_t_block.link_next_block(test_block) #loop
        next_block.link_prev_block(end_f_block)

    else:
        current_block.statement = node
        next_block = BasicBlock()
        current_block.link_next_block(next_block)


    return next_block


       

### Data structure of a basic block ###
class BasicBlock:
    uid = 0
    def __init__(self):
        self.statement = None
        self.prev_blocks = []
        self.next_blocks = []
        BasicBlock.uid = BasicBlock.uid + 1
        self.uid = BasicBlock.uid
        self.before = {}
        self.after = {}

    def link_prev_block(self, b):
        self.prev_blocks.append(b)
        b.next_blocks.append(self)

    def link_next_block(self, b):
        self.next_blocks.append(b)
        b.prev_blocks.append(self)

    def to_string(self, printed=set()):
        s = ""
        s += "----------------------" + "\n"
        s += "BB:" + str(self.uid) + "\n"
        s += "    " + str(self.statement) + "\n"
        for b in self.prev_blocks:
            s += "PREV = " + "BB:" + str(b.uid) + "\n"
        for b in self.next_blocks:
            s += "NEXT = " + "BB:" + str(b.uid) + "\n"

        printed.add(self)
        for b in self.prev_blocks:
            if b not in printed:
                s += b.to_string(printed)
        for b in self.next_blocks:
            if b not in printed:
                s += b.to_string(printed)
        
        return s

    def to_dot(self): # TODO make this use block_iterator
        s = ""
        to_process = [self]
        processed = set()
        while len(to_process) > 0:
            blk = to_process.pop()
            processed.add(blk)
            for n in blk.next_blocks:
                s += "  BB" + str(blk.uid) + " -> BB" + str(n.uid) + ";\n"
                if n not in processed:
                    to_process.append(n)

        for blk in processed:
            s += "  BB" + str(blk.uid) + " [label=\""
            s += str(ti_to_string(blk.before)) + "\\n"
            s += str(blk.statement) + "\\n"
            s += str(ti_to_string(blk.after)) + "\\n"
            s += "\"];\n"

        return s


### Entry point ###
if __name__ == "__main__":
    main() 


























Milestones

Week 1 Plan

  • Learn about type inference algorithms
  • Look at existing projects and find a code base
  • Start doing the preliminary stuff (parsing, generating ASTs, etc.)

Week 1 Progress

  • Brett Cannon replied and pointed me to the _ast module in standard Python
  • This page has good info on getting a CFG from Python code
  • The plan is to go from source to AST to CFG, then type annotate the CFG and somehow translate annotation back to the source realm to output useful information (such as type errors) to the user with respect to the source code
  • I've been looking at two type-inference algorithms:
    • Hindley-Milner
    • Cartesian Product
  • Starkiller is interesting
  • More links:
  • Got Benjamin Pierce's book titled "Types and Programming Languages" and the sequel "Advanced Topics in Types and Programming Languages" from the library. Chapter 22 of the former and chapter 10 of the latter are relevant.

Week 2 Plan

  • Implement a suitable type inference algorithm
  • Implement type error detection
  • Create test cases

Week 2 Progress

  • I used the PyPy framework to generate a CFG (control-flow graph) for any Python program.
  • Next step is to analyze the CFG and annotate types of variables. This will probably be a forward data-flow analysis with a fixed-point solution.
  • Met with Ted again and discussed progress so far, and future directions
    • The goal of this project is to add type information to Python code at the source level, i.e. at a level to which the developer has visibility. Previous work have focused on adding type information to interpreter/bytecode layer for enhancing performance mostly.
  • Since the PyPy-generated CFG has a mapping differential to source code, I decided to stay away from it.
  • Currently, I have this infrastructure in place:
    • I use the _ast module to get one AST per function definition in arbitrary Python code
    • I wrote some code that turns these ASTs into CFGs. A CFG consists of basic blocks that contain statements and are linked according to the static control flow information. These control-flow related structures are handled: If, While, For. Try/Catch statements aren't supported yet.
    • I used Graphviz DOT to generate a visual representation of the CFGs (helpful for my own debugging purposes)
    • I wrote a flow-sensitive analyzer that takes the CFG and annotates program points (basic block inputs/outputs) with type information of local variables. The algorithm is very simple. It detects assignment of primitives to variables and propagates this information through code.
  • I tested this on numerous test cases and it seems to work fine.
  • Need to enhance algorithm a little:
    • Maybe support try/catch?
    • Support del statement
    • Handle eval/exec by basically throwing out all type information collected up to the point they are seen
  • Next step is to go from this annotated AST to annotated/augmented source
  • Also TODO: Prepare presentation for Wednesday

Week 3 Plan

  • Run algorithms on test cases
  • Run on real Python programs and hope for the best

Week 3 Progress

...

Last modified June 13, 2007 1:37 am / Skin by Kevin Hughes
MediaWiki