Building a Deterministic Simulation API: Techniques and Technologies Behind This OpenEnv Project
Most engineering blog posts start with the business problem. This one starts with the implementation.
This project is a good example of how to build a simulation system that is:
- Typed end to end
- Deterministic and testable
- API-first
- Easy to run locally and in containers
Below is a practical walkthrough of the core techniques and technology choices used in the codebase.
1. Architecture at a glance
At a high level, the project has five layers:
- Contract layer: typed `Action`, `Observation`, and `State` models using Pydantic.
- Runtime layer: a stateful environment class that handles reset/step logic.
- Evaluation layer: deterministic graders with weighted scoring.
- API layer: an OpenEnv/FastAPI server exposing `/reset`, `/step`, and `/state`.
- Client/inference layer: an async client and a baseline runner using an LLM endpoint.
Why this matters: each layer has a narrow responsibility, so debugging and testing stay manageable.
2. Typed contracts with Pydantic (and strict validation)
The project uses Pydantic v2 to validate every action and observation shape. A key design choice here is strictness:
- Payload models use `extra="forbid"` to reject unknown fields
- Action-level validators enforce required fields per action type
- Constrained numeric fields (for example, the `sla_minutes` range)
Example: action schema behavior
The action model supports classify, set_priority, route, draft_reply, and submit.
For non-submit actions, ticket_id is mandatory. On top of that, each action requires specific payload keys:
- `classify` requires `category`
- `set_priority` requires `priority`
- `route` requires both `route_queue` and `sla_minutes`
- `draft_reply` requires `reply_text`
This prevents ambiguous partial actions before they ever reach business logic.
Example action JSON
{
"action_type": "route",
"ticket_id": "T-MED-2407",
"payload": {
"route_queue": "billing-l2",
"sla_minutes": 120
}
}
If sla_minutes is missing, validation fails immediately and deterministically.
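A minimal sketch of this validation layer in Pydantic v2 might look like the following. The model and field names are illustrative assumptions, not the project's actual definitions:

```python
# Sketch of a strict action contract; names are illustrative assumptions.
from typing import Literal, Optional

from pydantic import BaseModel, ConfigDict, Field, model_validator

REQUIRED_PAYLOAD_KEYS = {
    "classify": {"category"},
    "set_priority": {"priority"},
    "route": {"route_queue", "sla_minutes"},
    "draft_reply": {"reply_text"},
    "submit": set(),
}

class TriageAction(BaseModel):
    model_config = ConfigDict(extra="forbid")  # reject unknown fields outright

    action_type: Literal["classify", "set_priority", "route", "draft_reply", "submit"]
    ticket_id: Optional[str] = None
    payload: dict = Field(default_factory=dict)

    @model_validator(mode="after")
    def check_required_fields(self):
        # Non-submit actions must target a ticket.
        if self.action_type != "submit" and not self.ticket_id:
            raise ValueError("ticket_id is required for non-submit actions")
        # Each action type requires specific payload keys.
        missing = REQUIRED_PAYLOAD_KEYS[self.action_type] - self.payload.keys()
        if missing:
            raise ValueError(f"missing payload keys: {sorted(missing)}")
        return self
```

With this shape, a `route` action missing `sla_minutes` raises a `ValidationError` at construction time, before any environment logic runs.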
3. Deterministic fixtures as a source of truth
All task behavior comes from fixtures/tasks.json. This is an important engineering decision:
- Scenarios are data, not hardcoded logic
- Allowed values are explicit (`categories`, `priorities`, `queues`)
- Answer keys and policy hints live in one place
- Reproducibility is built in
This design lets you add or tune tasks by editing data files instead of rewriting environment code.
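To make the idea concrete, here is a hypothetical excerpt of a `tasks.json`-style fixture plus a loader. The exact schema of the project's fixture file is assumed, not copied:

```python
# Hypothetical fixture excerpt; the real fixtures/tasks.json schema may differ.
import json

TASKS_JSON = """
{
  "medium": {
    "ticket_id": "T-MED-2407",
    "allowed": {
      "categories": ["billing", "technical", "account"],
      "priorities": ["low", "medium", "high"],
      "queues": ["billing-l2", "tech-l1"]
    },
    "answer_key": {
      "category": "billing",
      "priority": "high",
      "route_queue": "billing-l2",
      "sla_minutes": 120
    }
  }
}
"""

def load_tasks(raw: str) -> dict:
    """Parse fixtures once at startup; scenarios stay data, not code."""
    return json.loads(raw)

tasks = load_tasks(TASKS_JSON)
```

Adding a scenario is then a data change: a new top-level key in the fixture file, with no environment code touched.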
4. Grading design: weighted, bounded, and reproducible
The grader applies weighted criteria per difficulty and returns a normalized score in [0, 1].
Two techniques make the scoring robust:
- Bounded criteria values (`0.0` to `1.0`)
- An open-interval clamp (`0.001` to `0.999`) to avoid brittle exact-edge behavior
Example: weighted partial score
For the medium task, weights are:
- category: `0.35`
- priority: `0.25`
- route_queue: `0.25`
- sla_minutes: `0.15`
If only category and priority are correct:
0.35 + 0.25 = 0.60
That exact behavior is covered by tests (`test_partial_medium_score_is_weighted`).
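The weighted grader can be sketched in a few lines. The weights and clamp bounds below come from the post; the function shape is an assumption:

```python
# Sketch of a weighted grader with an open-interval clamp (bounds from the post).
WEIGHTS = {"category": 0.35, "priority": 0.25, "route_queue": 0.25, "sla_minutes": 0.15}

def grade(criteria: dict[str, float]) -> float:
    """criteria maps each key to a bounded value in [0.0, 1.0]."""
    raw = sum(WEIGHTS[k] * criteria.get(k, 0.0) for k in WEIGHTS)
    # Clamp into (0.001, 0.999) to avoid brittle exact-edge behavior.
    return min(max(raw, 0.001), 0.999)

# Only category and priority correct: 0.35 + 0.25 = 0.60
score = grade({"category": 1.0, "priority": 1.0})
```

A fully wrong submission scores `0.001` rather than exactly `0.0`, and a perfect one scores `0.999` rather than exactly `1.0`, which keeps edge-case assertions stable.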
Text requirement scoring technique
For hard tasks, reply quality is not binary. Instead, the grader computes phrase coverage:
- Count matched required phrases
- Divide by required phrase count
- Fold that fraction into weighted score
This is a simple and explainable approach for policy-style checks.
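A minimal sketch of that coverage computation follows; the case-insensitive substring matching here is an assumption about how phrases are checked:

```python
# Sketch of phrase-coverage scoring; substring matching is an assumption.
def phrase_coverage(reply: str, required_phrases: list[str]) -> float:
    """Fraction of required phrases present in the reply, in [0, 1]."""
    if not required_phrases:
        return 1.0
    text = reply.lower()
    matched = sum(1 for phrase in required_phrases if phrase.lower() in text)
    return matched / len(required_phrases)

coverage = phrase_coverage(
    "We have escalated your ticket and will respond within the SLA window.",
    ["escalated", "sla", "refund"],
)
# 2 of 3 required phrases matched
```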
5. Reward shaping beyond final score
Instead of waiting until the end to reward success, the environment uses shaped step rewards:
- `correctness_delta`: reward only for incremental progress
- `policy_bonus`: small bonus for substantive policy-style replies
- `repeat_penalty`: penalize repeated/contradictory/no-progress behavior
- `invalid_penalty`: penalize invalid actions
- `terminal_bonus`: add a bonus at episode end (`0.2 * final_score`)
Why this is effective
This pattern encourages useful intermediate behavior and discourages loops. It is especially helpful when evaluating multi-step agents.
Example reward formula
reward =
correctness_delta
+ policy_bonus
+ repeat_penalty
+ invalid_penalty
+ terminal_bonus
Because each component is explicit (reward_breakdown is returned in observations), tuning and analysis are straightforward.
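A sketch of that composition, with the breakdown returned alongside the scalar reward (component values below are illustrative):

```python
# Sketch of shaped step-reward composition; component values are illustrative.
def step_reward(correctness_delta: float, policy_bonus: float,
                repeat_penalty: float, invalid_penalty: float,
                terminal_bonus: float = 0.0) -> tuple[float, dict]:
    """Return the scalar reward plus the breakdown exposed in observations."""
    breakdown = {
        "correctness_delta": correctness_delta,
        "policy_bonus": policy_bonus,
        "repeat_penalty": repeat_penalty,
        "invalid_penalty": invalid_penalty,
        "terminal_bonus": terminal_bonus,
    }
    return sum(breakdown.values()), breakdown

# Final step of an episode with final_score = 0.80:
reward, breakdown = step_reward(0.15, 0.05, 0.0, 0.0, terminal_bonus=0.2 * 0.80)
```

Returning the breakdown as data, not just a scalar, is what makes per-component tuning cheap.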
6. State machine patterns in the environment runtime
The environment class (B2BSupportTriageEnvironment) uses a few practical techniques:
- Immutable before/after comparison using `deepcopy` for score deltas
- Action signatures to detect repeated identical actions
- Contradiction detection when overwriting an existing decision with a different value
- Stagnation counters with capped penalties
- Explicit done conditions (`submit` or `max_steps`)
It also tracks a complete action history in state, which is useful for audits and debugging.
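Two of those techniques, the `deepcopy` snapshot and action-signature repeat detection, can be sketched together. The class below is a simplified stand-in, not the project's `B2BSupportTriageEnvironment`:

```python
# Simplified stand-in for the runtime's repeat/contradiction/progress checks.
from copy import deepcopy

class EnvRuntimeSketch:
    def __init__(self):
        self.decisions: dict[str, str] = {}
        self.seen_signatures: set[tuple] = set()

    def step(self, action_type: str, key: str, value: str) -> dict:
        signature = (action_type, key, value)
        repeated = signature in self.seen_signatures
        self.seen_signatures.add(signature)

        before = deepcopy(self.decisions)  # snapshot prior state
        contradiction = key in before and before[key] != value
        self.decisions[key] = value
        progress = self.decisions != before  # did anything actually change?
        return {"repeated": repeated, "contradiction": contradiction, "progress": progress}
```

Comparing against a deep-copied snapshot (rather than mutating in place and guessing) makes "no progress" and "contradiction" cheap, explicit boolean signals for the reward shaper.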
7. API-first integration with OpenEnv + FastAPI
Server wiring is intentionally minimal:
- `create_app(...)` from `openenv.core.env_server.http_server`
- typed action and observation classes passed directly
- concurrency control via `max_concurrent_envs=8`
The API contract is standard:
- `POST /reset`
- `POST /step`
- `GET /state`
- plus health and schema endpoints
Example API flow
curl -X POST http://127.0.0.1:8000/reset \
-H "Content-Type: application/json" \
-d '{"task_id":"easy","seed":1}'
curl -X POST http://127.0.0.1:8000/step \
-H "Content-Type: application/json" \
-d '{"action":{"action_type":"classify","ticket_id":"T-EASY-1001","payload":{"category":"billing"}}}'
Typed request/response data keeps client-server integration predictable.
8. Inference pipeline: resilient orchestration, not just model calls
inference.py demonstrates several production-friendly techniques:
- Asynchronous task loop across fixed tasks/seeds
- Strict logging format (`[START]`, `[STEP]`, `[END]`) for downstream parsing
- Low-temperature model sampling (`temperature=0.0`) for consistency
- JSON extraction fallback when model output is noisy
- Action coercion into typed models with validation
- Action coercion into typed models with validation
- Deterministic fallback policy when model action is invalid or mismatched
One subtle but useful guardrail: the run prefers deterministic policy actions unless the model proposes the same action type. This keeps execution stable while still exercising model output parsing.
Example structured log line
[STEP] step=3 action={"action_type":"route","ticket_id":"T-MED-2407","payload":{"route_queue":"billing-l2","sla_minutes":120}} reward=0.40 done=false error=null
This format is machine-friendly and easy to grep.
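A downstream parser for that `[STEP]` format might look like this; the field set matches the example line above, not necessarily every log variant:

```python
# Sketch of a parser for the structured [STEP] log format shown above.
import json
import re

STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\{.*\}) "
    r"reward=(?P<reward>-?\d+\.\d+) done=(?P<done>true|false) error=(?P<error>\S+)$"
)

def parse_step_line(line: str) -> dict:
    m = STEP_RE.match(line)
    if m is None:
        raise ValueError(f"unparseable log line: {line!r}")
    return {
        "step": int(m.group("step")),
        "action": json.loads(m.group("action")),
        "reward": float(m.group("reward")),
        "done": m.group("done") == "true",
        "error": None if m.group("error") == "null" else m.group("error"),
    }

line = ('[STEP] step=3 action={"action_type":"route","ticket_id":"T-MED-2407",'
        '"payload":{"route_queue":"billing-l2","sla_minutes":120}} '
        'reward=0.40 done=false error=null')
parsed = parse_step_line(line)
```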
9. Testing strategy: deterministic assertions over snapshots
The test suite uses pytest and focuses on stable behavioral checks:
- Reset initializes clean state
- Invalid ticket IDs produce negative reward and deterministic error
- Hard scenario can reach near-perfect score
- Max-step termination behavior
- Grader weighting math
- Log format regex checks
This is a strong pattern for simulation systems: test semantics, not visual snapshots.
10. Packaging and runtime operations
The project is packaged for both local Python use and container deployment:
- `pyproject.toml` with `setuptools` build metadata
- runtime dependencies in `server/requirements.txt`
- container image based on `python:3.11-slim`
- health check endpoint wired in the Dockerfile
- helper scripts for full validation (`run_all_checks.sh`) and inference runs (`run_inference.sh`)
Example container run
docker build -t b2b_support_triage_env-env:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 b2b_support_triage_env-env:latest
The same image can be used for local verification and platform deployment, reducing environment drift.
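For reference, the health check wiring can be sketched as a Dockerfile fragment. This is a hypothetical sketch: the base image comes from the post, but the module path (`server.app:app`) and endpoint path (`/health`) are assumptions:

```dockerfile
# Hypothetical sketch; the real server/Dockerfile may differ in paths and layout.
FROM python:3.11-slim
WORKDIR /app
COPY server/requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
# Fail the container health status if the health endpoint stops responding.
HEALTHCHECK --interval=30s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health')" || exit 1
CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
```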
11. Reusable takeaways for other projects
Even if you are not building an OpenEnv benchmark, these techniques transfer well:
- Define strict typed contracts early
- Separate fixtures (data) from execution logic
- Make scoring deterministic and explainable
- Expose reward/component breakdowns for tuning
- Enforce structured logs from day one
- Keep a deterministic fallback path for reliability
- Package with health checks and repeatable scripts
Final thoughts
The strongest part of this project is not one framework or one model. It is the combination of:
- Type safety
- Deterministic evaluation
- Explicit reward decomposition
- Operationally simple deployment
That combination makes the system understandable to humans, stable for automation, and easy to evolve over time.