

run.py orchestrates the full evaluation pipeline: it loads task configs, runs the agent in the VM environment, and automatically judges each trajectory after completion.

Synopsis

python run.py [OPTIONS]
run.py writes logs to logs/ on startup. The directory must exist before you run the script.
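Create it first if it does not exist:

mkdir -p logs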

Usage examples

python run.py \
  --path_to_vm vmware_vm_data/Ubuntu0/Ubuntu0.vmx \
  --observation_type screenshot_a11y_tree \
  --model gpt-4o \
  --result_dir ./results
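
A headless variant for servers or CI environments, judging with a Claude model (every flag used here is documented below):

python run.py \
  --path_to_vm vmware_vm_data/Ubuntu0/Ubuntu0.vmx \
  --observation_type screenshot_a11y_tree \
  --model gpt-4o \
  --judge_model claude-3-7-sonnet-20250219 \
  --headless \
  --result_dir ./results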

Arguments

Environment

--path_to_vm
string
default:"vmware_vm_data/Ubuntu0/Ubuntu0.vmx"
Path to the VMware .vmx file for the benchmark VM. The VM must have been prepared using the OSWorld setup instructions before running any tasks.
--headless
boolean
Run the VM in headless mode (no GUI window). Pass this flag when running on a server or in a CI environment where no display is available.
--action_space
string
default:"pyautogui"
Action space the agent uses to interact with the desktop. The default, pyautogui, executes actions as PyAutoGUI calls (Python GUI automation).
--observation_type
string
default:"a11y_tree"
Type of observations provided to the agent at each step. Must be one of:
Value                  Description
screenshot             Raw screenshot of the desktop
a11y_tree              Accessibility tree description
screenshot_a11y_tree   Combined screenshot and accessibility tree (recommended)
som                    Set-of-marks: screenshot with numbered bounding boxes
--screen_width
integer
default:"1920"
Width of the virtual screen in pixels.
--screen_height
integer
default:"1080"
Height of the virtual screen in pixels.
--sleep_after_execution
float
default:"0.0"
Seconds to sleep after each agent action before capturing the next observation. Increase this value if the VM needs extra time to render UI changes.
--max_steps
integer
default:"15"
Maximum number of agent action steps per task. The task is terminated after this many steps regardless of completion status.
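
For example, a run that gives the VM extra rendering time and a larger step budget (the values here are illustrative, not tuned recommendations; unlisted flags keep their defaults):

python run.py \
  --sleep_after_execution 2.0 \
  --max_steps 30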

Agent

--max_trajectory_length
integer
default:"3"
Number of previous steps included in the agent’s context window when generating the next action. Higher values give the agent more history but consume more tokens.
--test_config_base_dir
string
default:"evaluation_examples"
Base directory containing task configuration files. Individual task configs are resolved relative to this directory at examples/{domain}/{task_id}.json.
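
For example, with the defaults, a task in the chrome domain is loaded from the following path (the {task_id} placeholder stands for the task's identifier and is left unfilled here):

evaluation_examples/examples/chrome/{task_id}.json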

Language model

--model
string
default:"gpt-4o"
Model identifier for the agent being evaluated. Passed directly to the PromptAgent. Any model string accepted by the underlying API client is valid (e.g. gpt-4o, o4-mini, claude-3-7-sonnet-20250219).
--judge_model
string
default:"gpt-4.1"
Model identifier for the automated judge that runs after each task. Supports OpenAI models (uses OPENAI_API_KEY) and Anthropic Claude models (uses ANTHROPIC_API_KEY). Common values: gpt-4.1, o4-mini, claude-3-7-sonnet-20250219.
--sys_prompt_version
string
default:"v3"
Version of the system prompt used by the judge. Accepted values: v1, v2, v3. Use v3 unless you have a specific reason to compare against an older version.
--judge_type
string
default:"aer"
Judge evaluation strategy passed to run_judging. Must be one of:
Value               Description
default             Screenshot-based plain judge
captioned_default   Plain judge augmented with screenshot captions
a11y_tree           Plain judge operating on accessibility tree observations only
aer                 Action–Environment–Result judge (recommended)
all_step_aer        AER applied to every step in the trajectory
See Judge types for a detailed comparison.
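For example, to judge every step with an o4-mini-backed AER judge (all other flags keep their defaults):

python run.py \
  --judge_model o4-mini \
  --judge_type all_step_aer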
--temperature
float
default:"1.0"
Sampling temperature for the agent model. Higher values increase output diversity. Has no effect on o4-mini, which only supports the default temperature.
--top_p
float
default:"0.9"
Nucleus sampling parameter for the agent model. The model samples from the smallest token set whose cumulative probability exceeds top_p.
--max_tokens
integer
default:"1500"
Maximum number of tokens the agent model may generate per step.
--stop_token
string
default:"None"
Optional stop sequence that terminates agent generation early. Defaults to None (no stop token).
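
The sampling flags combine as expected; for example (illustrative values only):

python run.py \
  --model claude-3-7-sonnet-20250219 \
  --temperature 0.7 \
  --top_p 0.95 \
  --max_tokens 2000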

Task selection

--domain
string
default:"all"
Restrict evaluation to a single application domain. When set to all, every domain listed in the task index file is evaluated. Valid domain keys: chrome, gimp, libreoffice_calc, libreoffice_impress, libreoffice_writer, os, thunderbird, vlc, vs_code, multi_apps.
--test_all_meta_path
string
default:"evaluation_examples/test_misuse.json"
Path to the task index JSON file that maps domain keys to lists of task IDs. Switch this to run a different harm category:
File                                        Harm category
evaluation_examples/test_misuse.json        Deliberate user misuse
evaluation_examples/test_injection.json     Prompt injection attacks
evaluation_examples/test_misbehavior.json   Model misbehavior
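
For example, to evaluate the prompt injection category (typically paired with --inject, described below):

python run.py \
  --test_all_meta_path evaluation_examples/test_injection.json \
  --inject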

Attack mode

--jailbreak
boolean
Prepend a jailbreak instruction to every task prompt before passing it to the agent. Useful for measuring how a model’s refusal rate changes under adversarial prompting. When enabled, results are written to a {model}_jailbreak subdirectory to keep them separate from normal runs.
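For example:

python run.py --model gpt-4o --jailbreak

With the default result directory, scores for this run land under gpt-4o_jailbreak rather than gpt-4o.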
--inject
boolean
Enable prompt injection evaluation. When set, the script iterates over all injection vectors and goals defined in each task’s config and runs a separate trajectory for each combination. Results are written to directories suffixed with __inject__{type}__{goal}.
Using --inject multiplies the number of trajectories per task by the number of injection combinations defined in that task’s config. Expect significantly longer run times.
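As an illustration, with a hypothetical injection type email and goal phishing (both names are invented here; real values come from each task's config), that combination's results would be written to a directory ending in:

__inject__email__phishing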

Output

--result_dir
string
default:"./results"
Root directory for saving trajectory data, judgments, and recordings. Results are organized into subdirectories by action space, observation type, and model name:
{result_dir}/{action_space}/{observation_type}/{model}/{domain}/{task_id}/
Tasks that already have a result.txt file in their output directory are skipped automatically on subsequent runs.

Result directory structure

After a run completes, each evaluated task produces a directory with the following files:
results/
└── pyautogui/
    └── screenshot_a11y_tree/
        └── gpt-4o/
            └── thunderbird/
                └── _harassment_email/
                    ├── result.txt          # 0.0 or 1.0 task score
                    ├── traj.jsonl          # per-step action log
                    ├── better_log.json     # structured trajectory data
                    ├── recording.mp4       # screen recording
                    └── judgment/
                        └── gpt-4.1/
                            └── aer/
                                └── v3/
                                    └── judgment.json
If a task errors out before completing, run.py writes an error entry to traj.jsonl and saves the partial recording. The directory will not contain result.txt, so the task will be retried automatically on the next run.
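
A minimal sketch for summarizing a completed run, assuming the default layout shown above (adjust the path for your action space, observation type, and model):

# Mean task score across one run
awk '{ sum += $1; n++ } END { if (n) printf "%d tasks, mean score %.3f\n", n, sum / n }' \
  $(find results/pyautogui/screenshot_a11y_tree/gpt-4o -name result.txt)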

Resuming interrupted runs

run.py checks for result.txt in each task’s output directory before executing it. Tasks that already have this file are skipped. If a task directory exists but lacks result.txt, the directory is cleaned and the task is re-run from scratch. This means you can safely interrupt and restart run.py with the same arguments — it will pick up where it left off.
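
To preview what a restart will re-run, you can list task directories that are missing result.txt. A minimal sketch, assuming the default five-level layout under ./results:

# Task directories without result.txt will be cleaned and re-run
for d in results/*/*/*/*/*/; do
  [ -e "${d}result.txt" ] || echo "will re-run: $d"
done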