Building a Deterministic Simulation API: Techniques and Technologies Behind This OpenEnv Project

Most engineering blog posts start with the business problem. This one starts with the implementation.

This project is a good example of how to build a simulation system that is:
- Typed end to end
- Deterministic and testable
- API-first
- Easy to run locally and in containers

Below is a practical walkthrough of the core techniques and technology choices used in the codebase.

1. Architecture at a glance

At a high level, the project has five layers:

  1. Contract layer: typed Action, Observation, and State models using Pydantic.
  2. Runtime layer: a stateful environment class that handles reset/step logic.
  3. Evaluation layer: deterministic graders with weighted scoring.
  4. API layer: an OpenEnv/FastAPI server exposing /reset, /step, and /state.
  5. Client/inference layer: an async client and a baseline runner using an LLM endpoint.

Why this matters: each layer has a narrow responsibility, so debugging and testing stay manageable.

2. Typed contracts with Pydantic (and strict validation)

The project uses Pydantic v2 to validate every action and observation shape. A key design choice here is strictness:

  • Payload models use extra="forbid" to reject unknown fields
  • Action-level validators enforce required fields per action type
  • Constrained numeric fields (for example, a bounded sla_minutes range)

Example: action schema behavior

The action model supports classify, set_priority, route, draft_reply, and submit.

For non-submit actions, ticket_id is mandatory. On top of that, each action requires specific payload keys:

  • classify requires category
  • set_priority requires priority
  • route requires both route_queue and sla_minutes
  • draft_reply requires reply_text

This prevents ambiguous partial actions before they ever reach business logic.

Example action JSON

{
  "action_type": "route",
  "ticket_id": "T-MED-2407",
  "payload": {
    "route_queue": "billing-l2",
    "sla_minutes": 120
  }
}

If sla_minutes is missing, validation fails immediately and deterministically.

3. Deterministic fixtures as a source of truth

All task behavior comes from fixtures/tasks.json. This is an important engineering decision:

  • Scenarios are data, not hardcoded logic
  • Allowed values are explicit (categories, priorities, queues)
  • Answer keys and policy hints live in one place
  • Reproducibility is built in

This design lets you add or tune tasks by editing data files instead of rewriting environment code.

4. Grading design: weighted, bounded, and reproducible

The grader applies weighted criteria per difficulty and returns a normalized score in [0, 1].

Two techniques make the scoring robust:

  • Bounded criteria values (0.0 to 1.0)
  • An open-interval clamp (0.001 to 0.999) to avoid brittle exact-edge behavior

Example: weighted partial score

For the medium task, weights are:

  • category: 0.35
  • priority: 0.25
  • route_queue: 0.25
  • sla_minutes: 0.15

If only category and priority are correct:

0.35 + 0.25 = 0.60

That exact behavior is covered by tests (test_partial_medium_score_is_weighted).
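The arithmetic above can be sketched in a few lines; the function names are illustrative, but the weights and the open-interval clamp bounds come from the description.

```python
# Sketch of weighted, clamped scoring; weights mirror the medium task above.
MEDIUM_WEIGHTS = {
    "category": 0.35,
    "priority": 0.25,
    "route_queue": 0.25,
    "sla_minutes": 0.15,
}

def clamp_open(score: float, lo: float = 0.001, hi: float = 0.999) -> float:
    """Keep scores away from the exact 0/1 edges to avoid brittle comparisons."""
    return max(lo, min(hi, score))

def weighted_score(criteria: dict[str, float], weights: dict[str, float]) -> float:
    """criteria maps each field to a bounded value in [0.0, 1.0]."""
    raw = sum(weights[k] * min(1.0, max(0.0, v)) for k, v in criteria.items())
    return clamp_open(raw)
```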

Text requirement scoring technique

For hard tasks, reply quality is not binary. Instead, the grader computes phrase coverage:

  • Count the matched required phrases
  • Divide by the total number of required phrases
  • Fold that fraction into the weighted score

This is a simple and explainable approach for policy-style checks.
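The coverage step reduces to a one-liner. This sketch assumes case-insensitive substring matching; the project's matcher may normalize text differently.

```python
def phrase_coverage(reply_text: str, required_phrases: list[str]) -> float:
    """Fraction of required phrases present in the reply (case-insensitive).

    Illustrative sketch; the real grader may normalize whitespace or punctuation.
    """
    if not required_phrases:
        return 1.0
    text = reply_text.lower()
    matched = sum(1 for phrase in required_phrases if phrase.lower() in text)
    return matched / len(required_phrases)
```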

5. Reward shaping beyond final score

Instead of waiting until the end to reward success, the environment uses shaped step rewards:

  • correctness_delta: reward only for incremental progress
  • policy_bonus: small bonus for substantive policy-style replies
  • repeat_penalty: penalize repeated/contradictory/no-progress behavior
  • invalid_penalty: penalize invalid actions
  • terminal_bonus: add a bonus at episode end (0.2 * final_score)

Why this is effective

This pattern encourages useful intermediate behavior and discourages loops. It is especially helpful when evaluating multi-step agents.

Example reward formula

reward =
  correctness_delta
  + policy_bonus
  + repeat_penalty
  + invalid_penalty
  + terminal_bonus

Because each component is explicit (reward_breakdown is returned in observations), tuning and analysis are straightforward.
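The decomposition above can be modeled directly; the dataclass is illustrative, and the assumption (implied by the additive formula) is that penalty components are zero or negative, so plain summation works.

```python
from dataclasses import dataclass, asdict

# Illustrative reward decomposition; component names follow the list above.
# Penalty components are assumed to be zero or negative.
@dataclass
class RewardBreakdown:
    correctness_delta: float = 0.0
    policy_bonus: float = 0.0
    repeat_penalty: float = 0.0
    invalid_penalty: float = 0.0
    terminal_bonus: float = 0.0

    def total(self) -> float:
        return sum(asdict(self).values())

def terminal_bonus(final_score: float, done: bool) -> float:
    """0.2 * final_score, paid only when the episode ends."""
    return 0.2 * final_score if done else 0.0
```

Returning the breakdown alongside the scalar makes each tuning knob observable.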

6. State machine patterns in the environment runtime

The environment class (B2BSupportTriageEnvironment) uses a few practical techniques:

  • Before/after comparison using deepcopy, so score deltas are computed against an unmutated snapshot
  • Action signatures to detect repeated identical actions
  • Contradiction detection when overwriting an existing decision with a different value
  • Stagnation counters with capped penalties
  • Explicit done conditions (submit or max_steps)

It also tracks a complete action history in state, which is useful for audits and debugging.
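Two of those techniques, action signatures and the deepcopy snapshot, can be sketched as follows. All names here are hypothetical; only the techniques themselves come from the list above.

```python
import json
from copy import deepcopy

def action_signature(action: dict) -> str:
    """Stable signature so identical repeated actions can be detected."""
    return json.dumps(action, sort_keys=True)

class EpisodeState:
    """Illustrative fragment of the runtime's bookkeeping, not project code."""

    def __init__(self) -> None:
        self.decisions: dict[str, str] = {}
        self.history: list[str] = []

    def apply(self, action: dict) -> dict:
        sig = action_signature(action)
        repeated = sig in self.history
        before = deepcopy(self.decisions)  # snapshot before any mutation
        field = action.get("field")
        contradiction = (
            field in self.decisions and self.decisions[field] != action.get("value")
        )
        if field:
            self.decisions[field] = action.get("value")
        self.history.append(sig)
        return {"repeated": repeated, "contradiction": contradiction, "before": before}
```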

7. API-first integration with OpenEnv + FastAPI

Server wiring is intentionally minimal:

  • create_app(...) from openenv.core.env_server.http_server
  • typed action and observation classes passed directly
  • concurrency control via max_concurrent_envs=8

The API contract is standard:

  • POST /reset
  • POST /step
  • GET /state
  • plus health and schema endpoints

Example API flow

curl -X POST http://127.0.0.1:8000/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id":"easy","seed":1}'
curl -X POST http://127.0.0.1:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action":{"action_type":"classify","ticket_id":"T-EASY-1001","payload":{"category":"billing"}}}'

Typed request/response data keeps client-server integration predictable.

8. Inference pipeline: resilient orchestration, not just model calls

inference.py demonstrates several production-friendly techniques:

  • Asynchronous task loop across fixed tasks/seeds
  • Strict logging format ([START], [STEP], [END]) for downstream parsing
  • Low-temperature model sampling (temperature=0.0) for consistency
  • JSON extraction fallback when model output is noisy
  • Action coercion into typed models with validation
  • Deterministic fallback policy when model action is invalid or mismatched

One subtle but useful guardrail: the runner prefers the deterministic policy's action unless the model proposes the same action type. This keeps execution stable while still exercising model-output parsing.

Example structured log line

[STEP] step=3 action={"action_type":"route","ticket_id":"T-MED-2407","payload":{"route_queue":"billing-l2","sla_minutes":120}} reward=0.40 done=false error=null

This format is machine-friendly and easy to grep.

9. Testing strategy: deterministic assertions over snapshots

The test suite uses pytest and focuses on stable behavioral checks:

  • Reset initializes clean state
  • Invalid ticket IDs produce negative reward and deterministic error
  • Hard scenario can reach near-perfect score
  • Max-step termination behavior
  • Grader weighting math
  • Log format regex checks

This is a strong pattern for simulation systems: test semantics, not visual snapshots.
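The shape of such a test is worth showing. The real test_partial_medium_score_is_weighted presumably calls the project's grader; this sketch inlines the weighting arithmetic it asserts, with hypothetical helper names.

```python
# Illustrative shape of the weighting test named above; the real test
# exercises the project's grader rather than this inline helper.
WEIGHTS = {"category": 0.35, "priority": 0.25, "route_queue": 0.25, "sla_minutes": 0.15}

def grade(criteria: dict[str, float]) -> float:
    return sum(WEIGHTS[k] * v for k, v in criteria.items())

def test_partial_medium_score_is_weighted() -> None:
    # Only category and priority are correct.
    score = grade({"category": 1.0, "priority": 1.0,
                   "route_queue": 0.0, "sla_minutes": 0.0})
    assert abs(score - 0.60) < 1e-9
```

Asserting on exact arithmetic like this stays stable across refactors, unlike snapshot comparisons.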

10. Packaging and runtime operations

The project is packaged for both local Python use and container deployment:

  • pyproject.toml with setuptools build metadata
  • runtime dependencies in server/requirements.txt
  • container image based on python:3.11-slim
  • health check endpoint wired in Dockerfile
  • helper scripts for full validation (run_all_checks.sh) and inference runs (run_inference.sh)

Example container run

docker build -t b2b_support_triage_env-env:latest -f server/Dockerfile .
docker run --rm -p 8000:8000 b2b_support_triage_env-env:latest

The same image can be used for local verification and platform deployment, reducing environment drift.

11. Reusable takeaways for other projects

Even if you are not building an OpenEnv benchmark, these techniques transfer well:

  • Define strict typed contracts early
  • Separate fixtures (data) from execution logic
  • Make scoring deterministic and explainable
  • Expose reward/component breakdowns for tuning
  • Enforce structured logs from day one
  • Keep a deterministic fallback path for reliability
  • Package with health checks and repeatable scripts

Final thoughts

The strongest part of this project is not one framework or one model. It is the combination of:

  • Type safety
  • Deterministic evaluation
  • Explicit reward decomposition
  • Operationally simple deployment

That combination makes the system understandable to humans, stable for automation, and easy to evolve over time.