nowave.it

Python bytecode analysis (1)

Thu 21 January 2021

This article documents some experiments in Python bytecode analysis. Relevant code and slides of a recent meetup talk can be found at https://github.com/gmodena/pycdump. This work is an evolution of Ned Batchelder's 2008 The structure of .pyc files. CPython has changed quite a bit since that blog post was written. In the examples I'll use CPython 3.7 as the reference platform.

CPython 101

Fig 1. CPython code execution

Python is usually referred to as an interpreted language. To be more precise, CPython, the reference implementation of the Python language, is a bytecode interpreter. Each time a script (a .py file) is executed, a compilation step generates bytecode, which is then interpreted and executed by a virtual machine.
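The two steps can be reproduced by hand with the built-in compile() and exec(). This is only a minimal sketch: the source string and the "<example>" file name are made up for illustration.

```python
# Step 1: compile source text into a code object (this is where bytecode is generated).
source = "a, b = 1, 0\nresult = a or b\n"
code_obj = compile(source, "<example>", "exec")

# Step 2: hand the code object to the virtual machine for execution.
namespace = {}
exec(code_obj, namespace)

print(type(code_obj).__name__)  # code
print(namespace["result"])      # 1
```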

Bytecode generation

When a .py file is imported, the interpreter generates a .pyc file for each imported module. These files contain the compiled bytecode of the imported modules. Their purpose is to avoid recompiling the source at each subsequent import, as long as the .pyc is newer than the corresponding .py file.

The standard library ships with several modules and utility functions to generate, analyse and manipulate bytecode. The compileall module, for instance, can be used as a script to compile sources. Let's borrow Ned Batchelder's example.py:

$ cat example.py
a, b = 1, 0
if a or b:
    print("Hello", a)

We can compile it with:

$ python -m compileall example.py

The resulting bytecode is found under the __pycache__ directory.

$ cat __pycache__/example.cpython-37.pyc
B
?S"^=?@s*d\ZZeserede?iZded<dS))??ZHellorrN)?a?b?print?c?rr?
example.p

Bytecode looks like... bytes.
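The same compilation can also be driven from Python. The sketch below is one possible way to do it, not what compileall does internally: it writes a copy of example.py to a temporary directory (an assumption made to keep the example self-contained), compiles it with py_compile, and checks that the file starts with the interpreter's magic number.

```python
import importlib.util
import os
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")
    with open(src, "w") as f:
        f.write('a, b = 1, 0\nif a or b:\n    print("Hello", a)\n')

    # Where the import system expects the cached bytecode for this source file.
    expected = importlib.util.cache_from_source(src)

    # Compile explicitly; py_compile returns the path of the file it wrote.
    pyc_path = py_compile.compile(src, doraise=True)

    # The first four bytes of a .pyc identify the bytecode version.
    with open(pyc_path, "rb") as f:
        magic = f.read(4)

print(pyc_path == expected)
print(magic == importlib.util.MAGIC_NUMBER)
```

The magic number changes between CPython releases, which is why a .pyc written by one version is not loaded by another.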

Bytecode execution

CPython is a stack-based virtual machine. Values such as functions, names and constants are pushed onto a stack; the interpreter performs operations by popping operands from the stack and pushing back results. When a function is called, a new frame is pushed onto the call stack. A frame is an area of memory that holds the function's name, its arguments and the line number at which to resume execution once the function returns. Every time a function returns, its frame is popped. The inspect module can be used to inspect the stack of a Python script (or REPL session). The statements below have been executed in IPython:

>>> import inspect 
>>> print(inspect.stack())
[FrameInfo(frame=<frame at 0x7f95806d1ba8, file '<ipython-input-1-80e4091818df>', line 2, code <module>>, filename='<ipython-input-1-80e4091818df>', lineno=2, function='<module>', code_context=['print(inspect.stack())\n'], index=0), FrameInfo(frame=<frame at 0x7f958051af48, file '/Users/gmodena/miniconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py', line 3417, code run_code>, filename='/Users/gmodena/miniconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py', lineno=3417, function='run_code', code_context=['                    exec(code_obj, self.user_global_ns, self.user_ns)\n'], index=0), 

 ...

 FrameInfo(frame=<frame at 0x7f957f0d3bd8, file '/Users/gmodena/miniconda3/lib/python3.7/site-packages/IPython/__init__.py', line 126, code start_ipython>, filename='/Users/gmodena/miniconda3/lib/python3.7/site-packages/IPython/__init__.py', lineno=126, function='start_ipython', code_context=['    return launch_new_instance(argv=argv, **kwargs)\n'], index=0), FrameInfo(frame=<frame at 0x7f957b7889f8, file '/Users/gmodena/miniconda3/bin/ipython', line 10, code <module>>, filename='/Users/gmodena/miniconda3/bin/ipython', lineno=10, function='<module>', code_context=['    sys.exit(start_ipython())\n'], index=0)]

The output shows a list of frames (wrapped in FrameInfo objects), with the most recent call first, print(inspect.stack()), all the way back to the IPython session startup at the end of the list.
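The ordering is easier to see outside IPython, where the stack is short. A small sketch, with inner and outer as made-up function names:

```python
import inspect

def inner():
    # Snapshot of the call stack as seen from inside inner().
    stack = inspect.stack()
    # Keep only the function name recorded in each FrameInfo.
    return [frame_info.function for frame_info in stack]

def outer():
    return inner()

names = outer()
print(names[:2])  # ['inner', 'outer']
```

The first entry is the frame that called inspect.stack(); each following entry is its caller.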

More details on bytecode execution can be found in ceval.c.

Structure of .pyc files

In CPython 3.7 a pyc file contains a 16-byte header (four 32-bit words) and a variable-size payload.

Fig 2. pyc file layout

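From byte 16 onwards the payload stores a marshalled code object. Code objects expose, among others, attributes such as co_code (the raw bytecode), co_consts (the constants used by the bytecode, including nested code objects), co_names, co_filename and co_firstlineno.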
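The layout can be checked against a freshly compiled file. The following is a rough sketch, assuming the 16-byte header used by CPython 3.7 and later; the one-line source file and the names word_3 and word_4 are placeholders introduced for this example.

```python
import marshal
import os
import py_compile
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "example.py")
    with open(src, "w") as f:
        f.write("a, b = 1, 0\n")
    with open(py_compile.compile(src, doraise=True), "rb") as f:
        data = f.read()

# 16-byte header: magic number, bit field, then two more 32-bit words
# (modification time and source size when the bit field is 0).
magic, bit_field, word_3, word_4 = struct.unpack("<4sLLL", data[:16])

# Everything after the header is the marshalled code object.
code_obj = marshal.loads(data[16:])

print(code_obj.co_name)    # <module>
print(code_obj.co_names)   # ('a', 'b')
print(code_obj.co_consts)
```

marshal data is only guaranteed to be readable by the interpreter version that wrote it, so the file has to be produced and read by the same Python.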

Code Objects

Everything in Python is an object and, once compiled, functions (as well as modules and class bodies) carry their own bytecode in a code object. Let's run a simple example in the REPL:

>>> def sum(a, b):
         return a + b

>>> type(sum)
function

Functions are objects too. The bytecode of the sum function is accessible via the __code__ attribute.

>>> sum.__code__ # a code object to be executed
<code object sum at 0x7fc7fa7ca5d0, file "<ipython-input-1-5c0b117d5737>", line 1>

__code__ is a code object. We can inspect its raw bytecode, exposed as a bytes object, with:

>>> print(sum.__code__.co_code)
b'|\x00|\x01\x17\x00S\x00'
>>> print([co for co in sum.__code__.co_code])
[124, 0, 124, 1, 23, 0, 83, 0]

This byte string encodes a sequence of opcodes and their arguments (if any), which will be interpreted and executed by the evaluation loop in ceval.c. The dis module can be used to disassemble bytecode into human-readable form.
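The pairing of opcodes and arguments can be made visible by decoding co_code by hand. This is a rough sketch, assuming the two-byte instruction format used since CPython 3.6; the function is named add here only to avoid shadowing the built-in sum. On interpreters newer than 3.7 the listing may differ and can include bookkeeping entries such as RESUME or CACHE.

```python
import dis

def add(a, b):
    return a + b

raw = add.__code__.co_code

# Each instruction is two bytes: an opcode followed by a one-byte argument.
# (Arguments larger than 255 are built up with EXTENDED_ARG prefixes, ignored here.)
decoded = [(dis.opname[raw[i]], raw[i + 1]) for i in range(0, len(raw), 2)]

for name, arg in decoded:
    print(f"{name:<20} {arg}")
```

dis.opname maps an opcode number to its mnemonic, which is the same table the dis module uses for its own output.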

>>> import dis
>>> dis.dis(sum)
  2           0 LOAD_FAST                0 (a)
              2 LOAD_FAST                1 (b)
              4 BINARY_ADD
              6 RETURN_VALUE

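sum has been compiled into the following operations: LOAD_FAST pushes the local variable a onto the stack, a second LOAD_FAST pushes b, BINARY_ADD pops both values and pushes their sum, and RETURN_VALUE returns the value on top of the stack to the caller.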
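The effect of these instructions on the value stack can be mimicked with a toy interpreter. This is an illustration only and bears no resemblance to the real loop in ceval.c; run and program are names invented for the example, and only three opcodes are supported.

```python
def run(instructions, local_vars):
    """Evaluate a tiny subset of CPython-style instructions on a value stack."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            stack.append(local_vars[arg])   # push a local variable
        elif opname == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)      # pop two operands, push the result
        elif opname == "RETURN_VALUE":
            return stack.pop()              # hand the top of the stack to the caller
        else:
            raise ValueError(f"unsupported instruction: {opname}")
    raise ValueError("program ended without RETURN_VALUE")

# The four instructions dis reports for `return a + b`; locals are indexed by position.
program = [
    ("LOAD_FAST", 0),
    ("LOAD_FAST", 1),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
print(run(program, [2, 3]))  # 5
```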

Putting it all together: a basic disassembler

dump.py contains a simple and very limited disassembler implemented using standard library modules. It loads a pyc file and extracts the header field by field, reading 4 bytes at a time.

import binascii
import dis
import marshal
import struct
import time

FIELD_SIZE = 4  # each header field is a 32-bit word: 32 // 8 bytes

def main(fname):
    with open(fname, "rb") as infile:
        # Header: bytes 0 - 3
        magic_number = binascii.hexlify(infile.read(FIELD_SIZE))
        # Header: bytes 4 - 7
        bit_field = infile.read(FIELD_SIZE)
        # Header: bytes 8 - 11
        moddate = infile.read(FIELD_SIZE)
        # Header: bytes 12 - 15
        source_size = infile.read(FIELD_SIZE)
        # pyc header fields are stored little-endian
        modtime = time.asctime(time.localtime(struct.unpack("<L", moddate)[0]))
        # struct.unpack returns a 1-tuple
        source_size = struct.unpack("<L", source_size)

From byte 16 onwards it extracts the payload and reconstructs the program structure as a list of code objects (frames). Code objects in pyc files are serialised using an internal binary format. The marshal module comes with utility functions to manipulate it.

        # Payload : bytes 16 - ...
        code_obj = marshal.load(infile)
        frames = dump(code_obj)

Finally, it loops over the code objects and disassembles the binary.

        for tpl in frames:
            dis.disassemble(tpl[2])

Unmarshall and dump the code object

dump() is a utility function that recursively (and naively) traverses the code objects nested in co_consts and returns a list of them sorted by their first line in the Python source code (code_obj.co_firstlineno).

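from types import CodeType

def dump(code_obj):
    frames = []

    def ddump(code_obj):
        # Nested functions and classes are stored as code objects among the constants.
        for const in code_obj.co_consts:
            if isinstance(const, CodeType):
                ddump(const)
        frames.append((code_obj.co_filename, code_obj.co_firstlineno, code_obj))

    ddump(code_obj)
    # Order the code objects as they appear in the source file.
    frames.sort(key=lambda tpl: tpl[1])
    return frames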

Disassemble

The listing below shows the disassembler output for the compiled example.py script.

$ python dump.py __pycache__/example.cpython-37.pyc
File name: __pycache__/example.cpython-37.pyc
Magic number: b'420d0d0a'
Bit fieldb'\x00\x00\x00\x00'
Modification time: Sat Jan 18 01:55:13 2020
Source size: (46,)


  1           0 LOAD_CONST               0 ((1, 0))
              2 UNPACK_SEQUENCE          2
              4 STORE_NAME               0 (a)
              6 STORE_NAME               1 (b)

  2           8 LOAD_NAME                0 (a)
             10 POP_JUMP_IF_TRUE        16
             12 LOAD_NAME                1 (b)
             14 POP_JUMP_IF_FALSE       26

  3     >>   16 LOAD_NAME                2 (print)
             18 LOAD_CONST               1 ('Hello')
             20 LOAD_NAME                0 (a)
             22 CALL_FUNCTION            2
             24 POP_TOP
        >>   26 LOAD_CONST               2 (None)
             28 RETURN_VALUE

Conclusion

This article gave a high-level overview of the CPython virtual machine and some of the binary analysis tools available in the standard library. There are a ton of use cases and projects that greatly expand on this topic. Some of my favourites are: