Thu 21 January 2021
This article documents some experiments in Python bytecode analysis. Relevant code and slides from a recent meetup talk can be found at https://github.com/gmodena/pycdump. This work is an evolution of Ned Batchelder's 2008 post The structure of .pyc files. CPython has changed quite a bit since that post was written. In the examples I'll use CPython 3.7 as a reference platform.
Python is usually referred to as an interpreted language. To be more precise, CPython, the reference implementation of the Python language, is a bytecode interpreter. Each time a script (a .py file) is executed, a compilation step generates bytecode, which is then interpreted and executed by a virtual machine.
When a .py file is imported, the interpreter generates a bunch of .pyc files. They contain the compiled bytecode of the imported modules. Their purpose is to avoid recompiling the source at each subsequent import, as long as the .pyc is still up to date with the corresponding .py file.
The standard library ships with several modules and utility functions to generate, analyse and manipulate bytecode. The compileall module, for instance, can be used as a script to compile sources. Let's borrow Ned Batchelder's example:
    $ cat example.py
    a, b = 1, 0
    if a or b:
        print("Hello", a)
We can compile it with:
$ python -m compileall example.py
The resulting bytecode is found under the __pycache__ directory:
    $ cat __pycache__/example.cpython-37.pyc
    B ?S"^=?@s*d\ZZeserede?iZded<dS))??ZHellorrN)?a?b?print?c?rr? example.p
Bytecode looks like... bytes.
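The same compilation step can also be driven from Python. The sketch below is not part of pycdump: it assumes a scratch directory created with tempfile, writes a copy of example.py there, and uses py_compile from the standard library to compile it and peek at the first bytes of the result.

```python
import pathlib
import py_compile
import tempfile

# Scratch directory and file name are illustrative, not taken from pycdump.
workdir = pathlib.Path(tempfile.mkdtemp())
source = workdir / "example.py"
source.write_bytes(b'a, b = 1, 0\nif a or b:\n    print("Hello", a)\n')

# py_compile.compile returns the path of the .pyc file it wrote.
pyc_path = pathlib.Path(py_compile.compile(str(source), doraise=True))

raw = pyc_path.read_bytes()
print(pyc_path.name)  # e.g. example.cpython-37.pyc on CPython 3.7
print(raw[:16])       # the first 16 bytes of the file
```

Reading the file with read_bytes() rather than cat avoids the terminal trying to render arbitrary bytes as text.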
CPython is a stack-based virtual machine. Any function, name or symbol is pushed onto a stack. The interpreter performs operations by popping elements from the stack and pushing back results. When a function is called, a new frame is pushed onto the call stack. A frame is an area of memory which contains the function name, arguments and the program's line number at which to resume execution once the function returns. Every time a function returns, its frame is popped. The inspect module can be used to inspect the stack of a Python script (or REPL session). The statements below have been executed in ipython:
    >>> import inspect
    >>> print(inspect.stack())
    [FrameInfo(frame=<frame at 0x7f95806d1ba8, file '<ipython-input-1-80e4091818df>', line 2, code <module>>, filename='<ipython-input-1-80e4091818df>', lineno=2, function='<module>', code_context=['print(inspect.stack())\n'], index=0),
     FrameInfo(frame=<frame at 0x7f958051af48, file '/Users/gmodena/miniconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py', line 3417, code run_code>, filename='/Users/gmodena/miniconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py', lineno=3417, function='run_code', code_context=[' exec(code_obj, self.user_global_ns, self.user_ns)\n'], index=0),
     ...
     FrameInfo(frame=<frame at 0x7f957f0d3bd8, file '/Users/gmodena/miniconda3/lib/python3.7/site-packages/IPython/__init__.py', line 126, code start_ipython>, filename='/Users/gmodena/miniconda3/lib/python3.7/site-packages/IPython/__init__.py', lineno=126, function='start_ipython', code_context=[' return launch_new_instance(argv=argv, **kwargs)\n'], index=0),
     FrameInfo(frame=<frame at 0x7f957b7889f8, file '/Users/gmodena/miniconda3/bin/ipython', line 10, code <module>>, filename='/Users/gmodena/miniconda3/bin/ipython', lineno=10, function='<module>', code_context=[' sys.exit(start_ipython())\n'], index=0)]
The output shows a list of frames (wrapped in FrameInfo objects). The most recent call, print(inspect.stack()), comes first in the list, and the entries then walk back through the callers all the way to the ipython session startup.
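The ordering is easier to see outside ipython, where the list is much shorter. A minimal sketch, assuming two throwaway functions named inner and outer (both hypothetical):

```python
import inspect


def inner():
    # context=0 skips loading source lines; only the frame metadata is needed.
    stack = inspect.stack(context=0)
    return [frame_info.function for frame_info in stack]


def outer():
    return inner()


names = outer()
print(names[:2])  # ['inner', 'outer']
```

The first entry is the frame that called inspect.stack(), followed by its caller, and so on outwards.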
More details of bytecode execution can be found in ceval.c.
A .pyc file contains a 16-byte header (four 32-bit words) and a variable-size payload.
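As a rough illustration of that layout, the sketch below compiles a one-line module in a scratch directory and unpacks the four header words with struct. The names word3 and word4 are placeholders of mine; for the default, timestamp-based .pyc files they hold the source modification time and the source size.

```python
import importlib.util
import pathlib
import py_compile
import struct
import tempfile

# Scratch module; path and contents are illustrative only.
workdir = pathlib.Path(tempfile.mkdtemp())
source = workdir / "example.py"
source.write_bytes(b"a, b = 1, 0\n")
pyc_path = py_compile.compile(str(source), doraise=True)

with open(pyc_path, "rb") as infile:
    header = infile.read(16)

# "<4sLLL": 4 raw bytes, then three little-endian unsigned 32-bit integers.
magic, bit_field, word3, word4 = struct.unpack("<4sLLL", header)

print(magic == importlib.util.MAGIC_NUMBER)  # magic of the running interpreter
print(bit_field)  # 0 means a timestamp-based .pyc
if bit_field == 0:
    print("source mtime:", word3, "source size:", word4)
```

The magic number changes between CPython releases, which is why a .pyc written by one version is not loaded by another.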
Everything in Python is an object and, once compiled, each function carries its own bytecode. Let's run a simple example in the REPL:
    >>> def sum(a, b):
    ...     return a + b
    >>> type(sum)
    function
Functions are objects too. The bytecode of the sum function is accessible via the __code__ attribute:

    >>> sum.__code__  # a code object to be executed
    <code object sum at 0x7fc7fa7ca5d0, file "<ipython-input-1-5c0b117d5737>", line 1>

__code__ is a code object. We can inspect its raw bytecode (a bytes literal) with:
    >>> print(sum.__code__.co_code)
    b'|\x00|\x01\x17\x00S\x00'
    >>> print([co for co in sum.__code__.co_code])
    [124, 0, 124, 1, 23, 0, 83, 0]
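This byte string encodes a sequence of opcodes and their arguments. Since CPython 3.6 every instruction occupies two bytes: the opcode followed by its argument (0 when the opcode takes none). The sequence is interpreted and executed by the evaluation loop in ceval.c.
The dis module can be used to disassemble bytecode into human-readable form.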
    >>> import dis
    >>> dis.dis(sum)
      2           0 LOAD_FAST                0 (a)
                  2 LOAD_FAST                1 (b)
                  4 BINARY_ADD
                  6 RETURN_VALUE
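The link between the raw bytes and the disassembly can also be explored programmatically with dis.get_instructions. This is a small sketch using a function named add (hypothetical, chosen to avoid shadowing the built-in sum). Opcode numbers and names differ between CPython versions, so on an interpreter other than 3.7 the listing will not match the one above.

```python
import dis


def add(a, b):
    return a + b


raw = add.__code__.co_code
print(list(raw))  # the raw bytes as integers

# Each Instruction pairs a numeric opcode with its symbolic name and argument.
opnames = []
for instr in dis.get_instructions(add):
    opnames.append(instr.opname)
    print(instr.offset, instr.opcode, instr.opname, instr.argrepr)
```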
sum has been assembled into the following operations:
LOAD_FAST <index> (opcode 124 in the byte string) pushes the function's arguments (a and b), at index positions 0 and 1, on the stack.
BINARY_ADD (opcode 23) pops two elements from the stack (b and a), adds them together, and pushes the result back on the stack.
RETURN_VALUE (opcode 83) pops an element from the stack (the function's return value) and returns it to the caller.
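The three operations above can be mimicked with a toy interpreter. This is only a sketch of the stack discipline, not how ceval.c is implemented: the run function and its (opname, argument) instruction format are invented for illustration.

```python
def run(instructions, fast_locals):
    """Interpret a tiny subset of CPython 3.7 opcodes on a value stack."""
    stack = []
    for opname, arg in instructions:
        if opname == "LOAD_FAST":
            # Push the local variable stored at index `arg`.
            stack.append(fast_locals[arg])
        elif opname == "BINARY_ADD":
            # Pop the two topmost values, push their sum.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opname == "RETURN_VALUE":
            # Pop the top of the stack and hand it back to the caller.
            return stack.pop()
        else:
            raise ValueError(f"unsupported opcode: {opname}")
    raise RuntimeError("program ended without RETURN_VALUE")


program = [
    ("LOAD_FAST", 0),
    ("LOAD_FAST", 1),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
result = run(program, fast_locals=[2, 3])
print(result)  # 5
```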
dump.py contains a simple, very limited disassembler implemented using standard library modules. It loads a .pyc file and extracts the header field by field, reading 4 bytes at a time.
    import binascii
    import struct
    import time

    FIELD_SIZE = 4  # 32 // 8: each header field is one 32-bit word

    def main(fname):
        with open(fname, "rb") as infile:
            # Header: bytes 0 - 3
            magic_number = binascii.hexlify(infile.read(FIELD_SIZE))
            # Header: bytes 4 - 7
            bit_field = infile.read(FIELD_SIZE)
            # Header: bytes 8 - 11
            moddate = infile.read(FIELD_SIZE)
            # Header: bytes 12 - 15
            source_size = infile.read(FIELD_SIZE)

            # struct.unpack always returns a tuple: take its only element
            modtime = time.asctime(time.localtime(struct.unpack("=L", moddate)[0]))
            # kept as a 1-tuple, e.g. (46,)
            source_size = struct.unpack("=L", source_size)
From byte 16 onwards it extracts the payload and reconstructs the program structure as a list of code objects (frames). Code objects in .pyc files are serialised using an internal binary format. The marshal module comes with utility functions to manipulate it.
            # Payload: bytes 16 - ...
            code_obj = marshal.load(infile)
            frames = dump(code_obj)
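To see what marshal does in isolation, a code object can be serialised and restored without touching the file system. A minimal sketch, in which the source string and the "<example>" file name are made up for illustration:

```python
import marshal

# Compile a tiny module-level program into a code object.
code_obj = compile("a, b = 1, 0\ntotal = a + b\n", "<example>", "exec")

# Serialise to bytes (the same format used for the .pyc payload) and back.
payload = marshal.dumps(code_obj)
restored = marshal.loads(payload)

# The restored code object is executable like the original.
namespace = {}
exec(restored, namespace)
print(namespace["total"])  # 1
```

The marshal format is specific to the Python version that wrote it, so it is not suitable as a general-purpose persistence format.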
Finally, it loops over the code objects and disassembles the binary.
            for tpl in frames:
                # each entry is (file name, first line number, code object)
                dis.disassemble(tpl[2])
dump() is a utility function that recursively (and naively) traverses the nested code objects and returns a list of them sorted by their first line in the Python source code (co_firstlineno).
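    from types import CodeType

    def dump(code_obj):
        frames = []

        def ddump(code_obj):
            # Nested functions and classes are stored as code objects
            # among the constants of their parent.
            for const in code_obj.co_consts:
                if isinstance(const, CodeType):
                    ddump(const)
            frames.append((code_obj.co_filename, code_obj.co_firstlineno, code_obj))

        ddump(code_obj)
        # Sort by first line number in the source
        frames.sort(key=lambda tpl: tpl[1])
        return frames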
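The traversal relies on nested functions being stored as constants of their enclosing code object. The sketch below shows this on a source string compiled in memory. The walk generator and the SOURCE snippet are hypothetical and are not taken from dump.py.

```python
from types import CodeType

SOURCE = """
def outer():
    def inner():
        return 1
    return inner
"""


def walk(code_obj):
    """Yield a code object and, recursively, every code object nested in it."""
    yield code_obj
    for const in code_obj.co_consts:
        if isinstance(const, CodeType):
            yield from walk(const)


module_code = compile(SOURCE, "<example>", "exec")
# Sort by first line number, as dump() does.
found = sorted((c.co_firstlineno, c.co_name) for c in walk(module_code))
for first_line, name in found:
    print(first_line, name)
```

The module-level code object comes first, followed by outer (line 2) and inner (line 3).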
The listing below shows the disassembler output when run on the compiled example.py:
    $ python dump.py __pycache__/example.cpython-37.pyc
    File name: __pycache__/example.cpython-37.pyc
    Magic number: b'420d0d0a'
    Bit fieldb'\x00\x00\x00\x00'
    Modification time: Sat Jan 18 01:55:13 2020
    Source size: (46,)
      1           0 LOAD_CONST               0 ((1, 0))
                  2 UNPACK_SEQUENCE          2
                  4 STORE_NAME               0 (a)
                  6 STORE_NAME               1 (b)

      2           8 LOAD_NAME                0 (a)
                 10 POP_JUMP_IF_TRUE        16
                 12 LOAD_NAME                1 (b)
                 14 POP_JUMP_IF_FALSE       26

      3     >>   16 LOAD_NAME                2 (print)
                 18 LOAD_CONST               1 ('Hello')
                 20 LOAD_NAME                0 (a)
                 22 CALL_FUNCTION            2
                 24 POP_TOP
            >>   26 LOAD_CONST               2 (None)
                 28 RETURN_VALUE
This article gave a high-level overview of the CPython virtual machine and some of the binary analysis tools available in the standard library. There are a ton of use cases and projects that greatly expand on this topic. Some of my favourites are: