Skip to content

Conversation

@ShaharNaveh
Copy link
Collaborator

@ShaharNaveh ShaharNaveh commented Nov 30, 2025

Note

This is very much at a POC stage ATM

The idea here is to auto-generate as much as possible, whether it's oparg field name (namei/consti/target/etc), the oparg type, oparg ID and opcode name.
+
An important feature using the config file is to have the ability to control every attribute in every opcode, we can exclude opcodes that are not supported yet or add new ones that do not appear in the targeted CPython version (not supported yet, but planning to add it).

This will ensure consistency for the enum variant.

lmk what you think:)

cc: @coolreader18 @youknowone

Summary by CodeRabbit

  • New Features

    • Enhanced bytecode instruction support with improved opcode handling and generation capabilities.
  • Chores

    • Integrated Python 3.13.9 compatibility and updated build tooling infrastructure.

✏️ Tip: You can customize this high-level summary in your review settings.

@ShaharNaveh ShaharNaveh marked this pull request as draft November 30, 2025 20:08
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 30, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds infrastructure for auto-generating Rust bytecode instruction enums from CPython opcode definitions. Changes include a Git submodule pointing to CPython 3.13.9, a TOML configuration file specifying opcode variants, a Python generation script that parses the configuration and synthesizes Rust code, and a refactored bytecode module with the generated instruction enum.

Changes

Cohort / File(s) Summary
Git infrastructure
.gitmodules, submodules/cpython-v3.13.9
Adds Git submodule entry for CPython v3.13.9 repository and updates submodule pointer.
Configuration and generation
crates/compiler-core/instructions.toml, crates/compiler-core/generate.py
Introduces TOML configuration file defining opcode variants with metadata (name, oparg properties, type mappings) and marks disabled opcodes. Adds Python generation script that parses the configuration, analyzes CPython instruction data, generates Rust enum variants, and formats output via rustfmt.
Rust bytecode module
crates/compiler-core/src/bytecode.rs, crates/compiler-core/src/bytecode/instruction.rs
Refactors bytecode module to declare a separate instruction submodule and re-exports public Instruction type. Removes in-file enum definition. Adds auto-generated Instruction enum in new instruction.rs with #[repr(u8)], deriving Clone, Copy, Debug, Eq, PartialEq, comprising comprehensive opcode variants with associated data fields using typed argument wrappers.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • crates/compiler-core/generate.py: Verify parsing logic, TOML deserialization, CPython data analysis, and Rust code synthesis correctness; ensure casing conversions and rustfmt integration work as intended.
  • crates/compiler-core/instructions.toml: Validate opcode names, discriminant mappings, oparg type definitions, and consistency with CPython 3.13.9 specification; confirm disabled opcodes list is intentional.
  • crates/compiler-core/src/bytecode.rs and instruction.rs: Confirm structural refactoring is complete; verify all existing references to Instruction still resolve correctly; check enum discriminant values align with CPython bytecode semantics.
  • submodules/cpython-v3.13.9: Confirm commit hash is correct and accessible.

Poem

🐰 Bytecode opcodes dancing free,
Generated from TOML and CPython spree,
From Python scripts to Rust so bright,
Instructions manifest—pure delight! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'POC: Configurable Instruction autogen' accurately describes the main change: introducing a proof-of-concept for auto-generating Instruction definitions with configurable opcodes via a new TOML configuration file.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
crates/compiler-core/src/bytecode.rs (1)

1163-1183: Generated Instruction enum incompatible with existing impl blocks.

The label_arg() method references variants and field names that don't exist in the generated instruction.rs:

  • JumpIfNotExcMatch, JumpIfTrueOrPop, JumpIfFalseOrPop, SetupExcept, SetupAsyncWith, Break, Continue - not in generated enum
  • Field names like target, handler, end don't match the generated fields (e.g., delta is used in generated code)
  • PopJumpIfTrue { target: l } but generated has PopJumpIfTrue { delta: Arg<Label> }

This pattern extends throughout stack_effect() (lines 1220-1344) and fmt_dis() (lines 1413-1522), which reference ~50+ variants not present in the generated enum.

Since this is a POC, consider either:

  1. Keeping the old Instruction enum alongside the generated one until feature parity is reached
  2. Updating the generator to produce all required variants and field names
  3. Adding a note in the PR description about the current compilation status
🧹 Nitpick comments (1)
crates/compiler-core/src/bytecode.rs (1)

15-17: Module declaration order is unconventional.

The re-export (pub use) appears before the module declaration (mod instruction). While this compiles, the conventional order is to declare modules first, then re-export from them.

-pub use crate::bytecode::instruction::Instruction;
-
 mod instruction;
+
+pub use crate::bytecode::instruction::Instruction;
📜 Review details

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8e0a86d and ce01644.

📒 Files selected for processing (6)
  • .gitmodules (1 hunks)
  • crates/compiler-core/generate.py (1 hunks)
  • crates/compiler-core/instructions.toml (1 hunks)
  • crates/compiler-core/src/bytecode.rs (1 hunks)
  • crates/compiler-core/src/bytecode/instruction.rs (1 hunks)
  • submodules/cpython-v3.13.9 (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.py: In most cases, Python code should not be edited; bug fixes should be made through Rust code modifications only
Follow PEP 8 style for custom Python code
Use ruff for linting Python code

Files:

  • crates/compiler-core/generate.py
**/*.rs

📄 CodeRabbit inference engine (.github/copilot-instructions.md)

**/*.rs: Follow the default rustfmt code style by running cargo fmt to format Rust code
Always run clippy to lint Rust code (cargo clippy) before completing tasks and fix any warnings or lints introduced by changes
Follow Rust best practices for error handling and memory management
Use the macro system (pyclass, pymodule, pyfunction, etc.) when implementing Python functionality in Rust

Files:

  • crates/compiler-core/src/bytecode/instruction.rs
  • crates/compiler-core/src/bytecode.rs
🪛 Flake8 (7.3.0)
crates/compiler-core/generate.py

[error] 137-137: undefined name 'analyser'

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Run tests under miri
  • GitHub Check: Run snippets and cpython tests (windows-latest)
  • GitHub Check: Run rust tests (windows-latest)
  • GitHub Check: Check Rust code with clippy
  • GitHub Check: Ensure compilation on various targets
🔇 Additional comments (6)
submodules/cpython-v3.13.9 (1)

1-1: No changes needed to submodule configuration.

The commit hash 8183fa5e3f78ca6ab862de7fb8b14f3d929421e0 is the official CPython 3.13.9 release commit, and pinning to it is an explicit, stable approach. Shallow cloning is appropriate here since code generation typically requires only the current state of opcode definitions, not repository history.

crates/compiler-core/instructions.toml (2)

17-23: LGTM!

The active opcode configurations with overrides are well-documented. The use of r#where for the Rust keyword escape and the comment explaining the 3.13 oparg change for YIELD_VALUE are helpful.


25-61: LGTM!

The disabled opcodes section appropriately excludes specialized optimization variants and instrumentation opcodes that are not yet implemented. This is a reasonable approach for the POC stage.

crates/compiler-core/generate.py (2)

33-66: LGTM!

The Inst class properly handles override configurations and falls back to analyzed properties. The variant generation correctly produces Rust enum syntax with optional oparg fields.


68-100: TODOs noted for future improvement.

The Oparg.from_properties method has several TODO comments for adding proper types (ConstIdx for uses_co_consts, etc.) and raising errors instead of falling back to u32. These are appropriate for a POC but should be tracked for completion before this feature is fully integrated.

Consider opening issues to track:

  1. Replace u32 fallback with proper ConstIdx type for uses_co_consts
  2. Add proper type for uses_locals
  3. Raise error instead of silent u32 fallback in the else branch
crates/compiler-core/src/bytecode/instruction.rs (1)

168-169: Verify unit variants for StoreAttr* instructions.

StoreAttrInstanceValue and StoreAttrSlot are defined as unit variants (no oparg), while similar LoadAttr* variants (lines 103, 111) have idx: Arg<u32> fields. This asymmetry may be intentional based on CPython's opcode definitions, but should be verified.

Comment on lines +137 to +138
def get_analysis() -> analyser.Analysis:
analysis = analyzer.analyze_files([DEFAULT_INPUT])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Typo in type annotation: analyser should be analyzer.

The function's return type annotation uses analyser.Analysis but the import on line 29 is analyzer. This will cause a NameError at runtime if type hints are evaluated.

-def get_analysis() -> analyser.Analysis:
+def get_analysis() -> analyzer.Analysis:
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def get_analysis() -> analyser.Analysis:
analysis = analyzer.analyze_files([DEFAULT_INPUT])
def get_analysis() -> analyzer.Analysis:
analysis = analyzer.analyze_files([DEFAULT_INPUT])
🧰 Tools
🪛 Flake8 (7.3.0)

[error] 137-137: undefined name 'analyser'

(F821)

🤖 Prompt for AI Agents
In crates/compiler-core/generate.py around lines 137 to 138, the return type
annotation mistakenly references analyser.Analysis while the module is imported
as analyzer; update the annotation to analyzer.Analysis to match the import (or
change the import to analyser if that was intended) so the type name resolves
correctly; ensure any other occurrences use the same identifier to avoid
NameError.

Comment on lines +140 to +142
# We don't differentiate between real and pseudos yet
analysis.instructions |= analysis.pseudos
return analysis
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

#[repr(u8)] incompatible with pseudo-instruction IDs.

Line 141 merges analysis.pseudos into analysis.instructions, but pseudo-instructions have IDs > 255 (e.g., Jump = 256). Combined with the hardcoded #[repr(u8)] on line 150, this produces an enum that won't compile.

Either:

  1. Filter out instructions with IDs > 255:
if opcodes.get(name, {}).get("enabled", True) and analysis.opmap.get(name, 0) <= 255
  1. Or change the repr based on max ID:
max_id = max(inst.id for inst in instructions)
repr_type = "u16" if max_id > 255 else "u8"
# Then use {repr_type} in the template

Also applies to: 148-151

🤖 Prompt for AI Agents
crates/compiler-core/generate.py around lines 140-142 and 148-151: after merging
analysis.pseudos into analysis.instructions the enum is generated with a
hardcoded #[repr(u8)] which fails if any instruction id > 255 (pseudo IDs like
256); fix by computing the maximum instruction id after the merge (e.g., max_id
= max(inst.id for inst in analysis.instructions)) and choose repr_type = "u16"
if max_id > 255 else "u8", then inject {repr_type} into the template instead of
the hardcoded u8; alternatively, if you prefer to exclude pseudos from the enum,
filter analysis.pseudos when merging (only include ops with enabled and op id <=
255) before generating the enum.

Comment on lines 176 to 185
output = rustfmt(
f"""
// This file is generated by {script_path}
// Do not edit!

use crate::bytecode::{{Arg, Label, NameIdx}};

{generated}
"""
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Generated imports are incomplete.

The import statement on line 181 only includes Arg, Label, NameIdx, but the generated enum uses additional types from the bytecode module:

  • IntrinsicFunction1 (used in CallIntrinsic1)
  • IntrinsicFunction2 (used in CallIntrinsic2)
  • ResumeKind (used in Resume)

The generator should dynamically collect all oparg types used by instructions and include them in the import:

+def collect_oparg_types(instructions: list[Inst]) -> set[str]:
+    types = set()
+    for inst in instructions:
+        if inst.has_oparg and inst.oparg_typ.name not in ("u32",):
+            types.add(inst.oparg_typ.name)
+    return types

 def main():
     # ... existing code ...
+    oparg_types = collect_oparg_types(instructions)
+    imports = ", ".join(sorted(["Arg", "Label", "NameIdx"] + list(oparg_types)))
     output = rustfmt(
         f"""
     // This file is generated by {script_path}
     // Do not edit!

-    use crate::bytecode::{{Arg, Label, NameIdx}};
+    use crate::bytecode::{{{imports}}};

     {generated}
     """
     )

Committable suggestion skipped: line range outside the PR's diff.

Comment on lines 3 to 7
# The script will try to detect the properties for each opcode, but it can be useful to override some prooerties. So Ideally,
# we want to have this file empty because that means that we are fully aligned with CPython's opcodes.
#
# enabled : bool
# When false the opcode will not apear as a variant.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix typos in documentation comments.

There are two typos in the header comments:

  • Line 3: "prooerties" → "properties"
  • Line 7: "apear" → "appear"
-# The script will try to detect the properties for each opcode, but it can be useful to override some prooerties. So Ideally,
+# The script will try to detect the properties for each opcode, but it can be useful to override some properties. So ideally,
 # we want to have this file empty because that means that we are fully aligned with CPython's opcodes.
 #
 # enabled : bool
-#   When false the opcode will not apear as a variant.
+#   When false the opcode will not appear as a variant.
🤖 Prompt for AI Agents
crates/compiler-core/instructions.toml lines 3-7: fix typos in the header
comment by replacing "prooerties" with "properties" and "apear" with "appear" so
the documentation reads correctly.

Comment on lines 1 to 9
// This file is generated by crates/compiler-core/generate.py
// Do not edit!

use crate::bytecode::{Arg, Label, NameIdx};

/// A Single bytecode instruction.
#[repr(u8)]
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum Instruction {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Missing imports for oparg types used in enum variants.

The generated code uses IntrinsicFunction1, IntrinsicFunction2, and ResumeKind as generic type parameters for Arg<T> (lines 32-33, 154), but these types are not imported. The import statement only includes Arg, Label, and NameIdx.

The generator script should be updated to include these types in the import statement:

-use crate::bytecode::{Arg, Label, NameIdx};
+use crate::bytecode::{Arg, IntrinsicFunction1, IntrinsicFunction2, Label, NameIdx, ResumeKind};

Or the generator should dynamically determine which oparg types are used and include them in the imports.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// This file is generated by crates/compiler-core/generate.py
// Do not edit!
use crate::bytecode::{Arg, Label, NameIdx};
/// A Single bytecode instruction.
#[repr(u8)]
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum Instruction {
// This file is generated by crates/compiler-core/generate.py
// Do not edit!
use crate::bytecode::{Arg, IntrinsicFunction1, IntrinsicFunction2, Label, NameIdx, ResumeKind};
/// A Single bytecode instruction.
#[repr(u8)]
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
pub enum Instruction {
🤖 Prompt for AI Agents
In crates/compiler-core/src/bytecode/instruction.rs around lines 1 to 9, the
generated file imports only Arg, Label, and NameIdx but the enum uses oparg
types IntrinsicFunction1, IntrinsicFunction2, and ResumeKind as generic
parameters for Arg later in the file; update the generator so the use statement
also imports those types (or make the generator detect which oparg types are
referenced and include them) so the produced file contains a single import
including Arg, Label, NameIdx, IntrinsicFunction1, IntrinsicFunction2, and
ResumeKind.

Comment on lines +7 to +8
#[repr(u8)]
#[derive(Clone, Copy, Debug, Eq, PartialEq)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

#[repr(u8)] incompatible with discriminant values exceeding 255.

The enum is declared with #[repr(u8)] but contains multiple variants with discriminant values that exceed the u8 range (0-255):

  • Jump { target: Arg<Label> } = 256
  • JumpNoInterrupt { target: Arg<Label> } = 257
  • LoadClosure { i: Arg<u32> } = 258
  • LoadMethod { namei: Arg<NameIdx> } = 259
  • LoadSuperMethod { namei: Arg<NameIdx> } = 260
  • LoadZeroSuperAttr { namei: Arg<NameIdx> } = 261
  • LoadZeroSuperMethod { namei: Arg<NameIdx> } = 262
  • PopBlock = 263
  • SetupCleanup = 264
  • SetupFinally = 265
  • SetupWith = 266
  • StoreFastMaybeNull { idx: Arg<u32> } = 267

This will fail to compile. The generator should either use #[repr(u16)] or filter out pseudo-instructions with IDs > 255.

Also applies to: 92-96, 114-114, 127-127, 132-134, 143-143, 164-166, 174-174

🤖 Prompt for AI Agents
In crates/compiler-core/src/bytecode/instruction.rs around lines 7-8 (and also
affecting the variant ranges listed at 92-96, 114, 127, 132-134, 143, 164-166,
174), the enum is annotated #[repr(u8)] but contains discriminant values >255
which is invalid; change the representation to #[repr(u16)] or otherwise ensure
generated discriminant values fit in u8 (e.g., filter/remove pseudo-instructions
with IDs >255) so the repr matches the largest discriminant; update the
attribute to #[repr(u16)] if you intend to keep those variants, or adjust the
generator to cap or exclude IDs above 255 to retain #[repr(u8)].

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant