Building a Deterministic Simulation API: Techniques and Technologies Behind This OpenEnv Project

Most engineering blog posts start with the business problem. This one starts with the implementation.

This project is a good example of how to build a simulation system that is:
- Typed end to end
- Deterministic and testable
- API-first
- Easy to run locally and in containers

Below is a practical walkthrough of the core techniques and technology choices used in the codebase.

1. Architecture at a glance

At a high level, the project has five layers:

  1. Contract layer: typed Action, Observation, and State models using Pydantic.
  2. Runtime layer: a stateful environment class that handles reset/step logic.
  3. Evaluation layer: deterministic graders with weighted scoring.
  4. API layer: an OpenEnv/FastAPI server exposing /reset, /step, and /state.
  5. Client/inference layer: an async client and a baseline runner using an LLM endpoint.

Why this matters: each layer has a narrow responsibility, so debugging and testing stay manageable.

2. Typed contracts with Pydantic (and strict validation)

The project uses Pydantic v2 to validate every action and observation shape. A key design choice here is strictness:

  • Payload models use extra="forbid" to reject unknown fields
  • Action-level validators enforce required fields per action type
  • Constrained numeric fields (for example, a bounded sla_minutes range)

Example: action schema behavior

The action model supports classify, set_priority, route, draft_reply, and submit.

For non-submit actions, ticket_id is mandatory. On top of that, each action requires specific payload keys:

  • classify requires category
  • set_priority requires priority
  • route requires both route_queue and sla_minutes
  • draft_reply requires reply_text

This prevents ambiguous partial actions before they ever reach business logic.

Example action JSON

{
  "action_type": "route",
  "ticket_id": "T-MED-2407",
  "payload": {
    "route_queue": "billing-l2",
    "sla_minutes": 120
  }
}

If sla_minutes is missing, validation fails immediately and deterministically.

3. Deterministic fixtures as a source of truth

All task behavior comes from fixtures/tasks.json. This is an important engineering decision:

  • Scenarios are data, not hardcoded logic
  • Allowed values are explicit (categories, priorities, queues)
  • Answer keys and policy hints live in one place
  • Reproducibility is built in

This design lets you add or tune tasks by editing data files instead of rewriting environment code.

4. Grading design: weighted, bounded, and reproducible

The grader applies weighted criteria per difficulty and returns a normalized score in [0, 1].

Two techniques make the scoring robust:

  • Bounded criteria values (0.0 to 1.0)
  • An open-interval clamp (0.001 to 0.999) to avoid brittle exact-edge behavior

Example: weighted partial score

For the medium task, weights are:

  • category: 0.35
  • priority: 0.25
  • route_queue: 0.25
  • sla_minutes: 0.15

If only category and priority are correct:

0.35 + 0.25 = 0.60

That exact behavior is covered by tests (test_partial_medium_score_is_weighted).
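The arithmetic above can be sketched in a few lines; the function names are illustrative, but the weights and the open-interval clamp bounds come from the description.

```python
# Sketch of weighted, clamped scoring; weights mirror the medium task above.
MEDIUM_WEIGHTS = {
    "category": 0.35,
    "priority": 0.25,
    "route_queue": 0.25,
    "sla_minutes": 0.15,
}

def clamp_open(score: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Keep scores away from the exact 0/1 edges to avoid brittle comparisons."""
    return max(lo, min(hi, score))

def weighted_score(criteria: dict[str, float], weights: dict[str, float]) -> float:
    """criteria maps each field to a bounded value in [0.0, 1.0]."""
    raw = sum(weights[k] * min(1.0, max(0.0, v)) for k, v in criteria.items())
    return clamp_open(raw)
```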

Text requirement scoring technique

For hard tasks, reply quality is not binary. Instead, the grader computes phrase coverage:

  • Count the matched required phrases
  • Divide by the total number of required phrases
  • Fold that fraction into the weighted score

This is a simple and explainable approach for policy-style checks.
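The coverage step reduces to a one-liner. This sketch assumes case-insensitive substring matching; the project's matcher may normalize text differently.

```python
def phrase_coverage(reply_text: str, required_phrases: list[str]) -> float:
    """Fraction of required phrases present in the reply (case-insensitive).

    Illustrative sketch; the real grader may normalize whitespace or punctuation.
    """
    if not required_phrases:
        return 1.0
    text = reply_text.lower()
    matched = sum(1 for phrase in required_phrases if phrase.lower() in text)
    return matched / len(required_phrases)
```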

5. Reward shaping beyond final score

Instead of waiting until the end to reward success, the environment uses shaped step rewards:

  • correctness_delta: reward only for incremental progress
  • policy_bonus: small bonus for substantive policy-style replies
  • repeat_penalty: penalize repeated/contradictory/no-progress behavior
  • invalid_penalty: penalize invalid actions
  • terminal_bonus: add a bonus at episode end (0.2 * final_score)

Why this is effective

This pattern encourages useful intermediate behavior and discourages loops. It is especially helpful when evaluating multi-step agents.

Example reward formula

reward =
  correctness_delta
  + policy_bonus
  + repeat_penalty
  + invalid_penalty
  + terminal_bonus

Because each component is explicit (reward_breakdown is returned in observations), tuning and analysis are straightforward.
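The decomposition above can be modeled directly; the dataclass is illustrative, and the assumption (implied by the additive formula) is that penalty components are zero or negative, so plain summation works.

```python
from dataclasses import dataclass, asdict

# Illustrative reward decomposition; component names follow the list above.
# Penalty components are assumed to be zero or negative.
@dataclass
class RewardBreakdown:
    correctness_delta: float = 0.0
    policy_bonus: float = 0.0
    repeat_penalty: float = 0.0
    invalid_penalty: float = 0.0
    terminal_bonus: float = 0.0

    def total(self) -> float:
        return sum(asdict(self).values())

def terminal_bonus(final_score: float, done: bool) -> float:
    """0.2 * final_score, paid only when the episode ends."""
    return 0.2 * final_score if done else 0.0
```

Returning the breakdown alongside the scalar makes each tuning knob observable.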

6. State machine patterns in the environment runtime

The environment class (B2BSupportTriageEnvironment) uses a few practical techniques:

  • Before/after comparison using deepcopy, so score deltas are computed against an unmutated snapshot
  • Action signatures to detect repeated identical actions
  • Contradiction detection when overwriting an existing decision with a different value
  • Stagnation counters with capped penalties
  • Explicit done conditions (submit or max_steps)

It also tracks a complete action history in state, which is useful for audits and debugging.
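Two of those techniques, action signatures and the deepcopy snapshot, can be sketched as follows. All names here are hypothetical; only the techniques themselves come from the list above.

```python
import json
from copy import deepcopy

def action_signature(action: dict) -> str:
    """Stable signature so identical repeated actions can be detected."""
    return json.dumps(action, sort_keys=True)

class EpisodeState:
    """Illustrative fragment of the runtime's bookkeeping, not project code."""

    def __init__(self) -> None:
        self.decisions: dict[str, str] = {}
        self.history: list[str] = []

    def apply(self, action: dict) -> dict:
        sig = action_signature(action)
        repeated = sig in self.history
        before = deepcopy(self.decisions)  # snapshot before any mutation
        field = action.get("field")
        contradiction = (
            field in self.decisions and self.decisions[field] != action.get("value")
        )
        if field:
            self.decisions[field] = action.get("value")
        self.history.append(sig)
        return {"repeated": repeated, "contradiction": contradiction, "before": before}
```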

7. API-first integration with OpenEnv + FastAPI

Server wiring is intentionally minimal:

  • create_app(...) from openenv.core.env_server.http_server
  • typed action and observation classes passed directly
  • concurrency control via max_concurrent_envs=8

The API contract is standard:

  • POST /reset
  • POST /step
  • GET /state
  • plus health and schema endpoints

Example API flow

curl -X POST http://127.0.0.1:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy","seed":1}'
curl -X POST http://127.0.0.1:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action":{"action_type":"classify","ticket_id":"T-EASY-1001","payload":{"category":"billing"}}}'

Typed request/response data keeps client-server integration predictable.

8. Inference pipeline: resilient orchestration, not just model calls

inference.py demonstrates several production-friendly techniques:

  • Asynchronous task loop across fixed tasks/seeds
  • Strict logging format ([START], [STEP], [END]) for downstream parsing
  • Low-temperature model sampling (temperature=0.0) for consistency
  • JSON extraction fallback when model output is noisy
  • Action coercion into typed models with validation
  • Deterministic fallback policy when model action is invalid or mismatched

One subtle but useful guardrail: the runner prefers the deterministic policy's action unless the model proposes the same action type. This keeps execution stable while still exercising model-output parsing.

Example structured log line

[STEP] step=3 action={"action_type":"route","ticket_id":"T-MED-2407","payload":{"route_queue":"billing-l2","sla_minutes":120}} reward=0.40 done=false error=null

This format is machine-friendly and easy to grep.

9. Testing strategy: deterministic assertions over snapshots

The test suite uses pytest and focuses on stable behavioral checks:

  • Reset initializes clean state
  • Invalid ticket IDs produce negative reward and deterministic error
  • Hard scenario can reach near-perfect score
  • Max-step termination behavior
  • Grader weighting math
  • Log format regex checks

This is a strong pattern for simulation systems: test semantics, not visual snapshots.
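The shape of such a test is worth showing. The real test_partial_medium_score_is_weighted presumably calls the project's grader; this sketch inlines the weighting arithmetic it asserts, with hypothetical helper names.

```python
# Illustrative shape of the weighting test named above; the real test
# exercises the project's grader rather than this inline helper.
WEIGHTS = {"category": 0.35, "priority": 0.25, "route_queue": 0.25, "sla_minutes": 0.15}

def grade(criteria: dict[str, float]) -> float:
    return sum(WEIGHTS[k] * v for k, v in criteria.items())

def test_partial_medium_score_is_weighted() -> None:
    # Only category and priority are correct.
    score = grade({"category": 1.0, "priority": 1.0,
                   "route_queue": 0.0, "sla_minutes": 0.0})
    assert abs(score - 0.60) < 1e-9
```

Asserting on exact arithmetic like this stays stable across refactors, unlike snapshot comparisons.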

10. Packaging and runtime operations

The project is packaged for both local Python use and container deployment:

  • pyproject.toml with setuptools build metadata
  • runtime dependencies in server/requirements.txt
  • container image based on python:3.11-slim
  • health check endpoint wired in Dockerfile
  • helper scripts for full validation (run_all_checks.sh) and inference runs (run_inference.sh)

Example container run

docker build -t b2b_support_triage_env-env:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 b2b_support_triage_env-env:latest

The same image can be used for local verification and platform deployment, reducing environment drift.

11. Reusable takeaways for other projects

Even if you are not building an OpenEnv benchmark, these techniques transfer well:

  • Define strict typed contracts early
  • Separate fixtures (data) from execution logic
  • Make scoring deterministic and explainable
  • Expose reward/component breakdowns for tuning
  • Enforce structured logs from day one
  • Keep a deterministic fallback path for reliability
  • Package with health checks and repeatable scripts

Final thoughts

The strongest part of this project is not one framework or one model. It is the combination of:

  • Type safety
  • Deterministic evaluation
  • Explicit reward decomposition
  • Operationally simple deployment

That combination makes the system understandable to humans, stable for automation, and easy to evolve over time.