The `--observation_type` flag controls what information the agent receives at each step. It determines both the content of the observation and the system prompt that is injected at the start of the conversation.
The four available types map directly to OSWorld’s baseline observation modes.
## Choosing an observation type

- **`screenshot_a11y_tree`**: recommended. Combines a visual screenshot with a structured text description. Used in all main OS-Harm experiments for best model performance.
- **`a11y_tree`**: text-only. Suited for models without vision capability or for faster, cheaper runs.
- **`screenshot`**: vision-only. Useful when accessibility data is unavailable or unreliable.
- **`som`**: screenshot with numbered element overlays. Reduces coordinate ambiguity for models that struggle with precise pixel targeting.
## `screenshot`

The agent receives a single PNG screenshot of the current desktop state.

**What the agent sees:** a raw image, with no text description and no element labels.

**When to use:** to test purely visual grounding, or when the accessibility tree is unavailable for the application under test.

**Performance trade-offs:** the model must infer UI element positions from pixels alone, so coordinate precision is lower than with `som`, and there is no text fallback when visual content is ambiguous (e.g. small or overlapping text).
System prompts used (from `mm_agents/prompts.py`):

| Action space | Prompt constant |
|---|---|
| `pyautogui` | `SYS_PROMPT_IN_SCREENSHOT_OUT_CODE` |
| `computer_13` | `SYS_PROMPT_IN_SCREENSHOT_OUT_ACTION` |
Both prompts instruct the model to output `pyautogui` Python code (or a JSON action dict, for `computer_13`) grounded to the screenshot coordinates, and permit the special tokens `WAIT`, `FAIL`, and `DONE`.
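For instance, a `pyautogui`-style response might look like the following (coordinates invented for illustration):

```python
import pyautogui

# click the control the model located at pixel (540, 312) in the screenshot
pyautogui.click(x=540, y=312)
```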
Example command:
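An illustrative invocation follows; only `--observation_type` and `--action_space` are documented on this page, and the remaining flags are assumptions about `run.py` that may differ in your checkout:

```bash
python run.py \
    --headless \
    --observation_type screenshot \
    --action_space pyautogui \
    --model gpt-4o
```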
## `a11y_tree`

The agent receives a linearised text representation of the desktop's accessibility tree, produced via the AT-SPI library. No image is included.

**What the agent sees:** a tab-separated table, one row per visible UI element, with columns for tag, name, text, class, description, position (top-left x&y), and size (w&h).
Example (truncated):
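No real capture is reproduced here; the rows below are invented solely to illustrate the column layout described above:

```text
tag	name	text	class	description	position (top-left x&y)	size (w&h)
frame	Files	""	gtk-frame	""	(0, 27)	(1920, 1053)
push-button	Open	""	gtk-button	"Open the selected file"	(1402, 38)	(64, 34)
label	Documents	Documents	gtk-label	""	(118, 196)	(86, 20)
```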
The tree text is truncated to `a11y_tree_max_tokens` (default: 10 000 tokens) if it exceeds the limit.
`a11y_tree` is the default `--observation_type` in `run.py`. For best performance, switch to `screenshot_a11y_tree`.

System prompts used:

| Action space | Prompt constant |
|---|---|
| `pyautogui` | `SYS_PROMPT_IN_A11Y_OUT_CODE` |
| `computer_13` | `SYS_PROMPT_IN_A11Y_OUT_ACTION` |
The `a11y_tree` prompts instruct the model that its observation comes from the AT-SPI accessibility tree and that it should ground its actions to the position and size values in the tree.
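As a minimal sketch of such grounding (the tree row is invented), an agent might click the centre of an element computed from its position and size:

```python
import pyautogui

# invented tree row:  push-button  Save  ""  gtk-button  ""  (540, 300)  (80, 24)
# centre = position + size / 2 in each axis
x, y = 540 + 80 // 2, 300 + 24 // 2
pyautogui.click(x=x, y=y)
```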
Example command:
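As above, this invocation is illustrative; whether the token cap is exposed as a CLI flag of the same name is an assumption:

```bash
python run.py \
    --headless \
    --observation_type a11y_tree \
    --action_space pyautogui \
    --model gpt-4o \
    --a11y_tree_max_tokens 10000
```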
## `screenshot_a11y_tree`

The agent receives both a screenshot image and the linearised accessibility tree in the same observation. This is the combined mode used in all main OS-Harm experiments.

**What the agent sees:** a PNG screenshot plus the full text accessibility-tree table, both in a single prompt message.

**When to use:** this is the recommended mode for any serious evaluation run. It gives the model both visual context (layout, colour, rendered content) and structured text context (element names and positions from the tree), and it produced the best performance across all frontier models tested in the paper.

**Performance trade-offs:** highest token cost per step (image tokens plus text tokens). Requires a vision-capable model. The accessibility tree is still subject to the `a11y_tree_max_tokens` cap.
System prompts used:
| Action space | Prompt constant |
|---|---|
| `pyautogui` | `SYS_PROMPT_IN_BOTH_OUT_CODE` |
| `computer_13` | `SYS_PROMPT_IN_BOTH_OUT_ACTION` |
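Example command (illustrative, with the same caveats as the examples above):

```bash
python run.py \
    --headless \
    --observation_type screenshot_a11y_tree \
    --action_space pyautogui \
    --model gpt-4o
```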
## `som`

Set-of-marks (SoM) overlays the screenshot with numbered bounding boxes drawn around interactive elements identified from the accessibility tree. The agent receives this tagged screenshot alongside the accessibility tree.

**What the agent sees:** a PNG screenshot in which each interactable element is bounded by a numbered box, plus the accessibility-tree text. The agent can reference elements by their tag number instead of specifying pixel coordinates directly.
For example, instead of `pyautogui.click(x=120, y=890)`, the agent can write (tag number illustrative):
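```python
# the tag number is invented for illustration; the framework resolves
# tag_42 to that element's screen coordinates before the code is executed
pyautogui.click(tag_42)
```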
`som` only works with the `pyautogui` action space; passing `--action_space computer_13` with `--observation_type som` will raise a `ValueError`.

System prompts used:

| Action space | Prompt constant |
|---|---|
| `pyautogui` | `SYS_PROMPT_IN_SOM_OUT_TAG` |
The SoM prompt instructs the model to substitute `tag_N` variables for x, y coordinates in `pyautogui` calls.
Example command:
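As above, an illustrative invocation (note the forced `pyautogui` action space):

```bash
python run.py \
    --headless \
    --observation_type som \
    --action_space pyautogui \
    --model gpt-4o
```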
## Comparison summary
| Mode | Image input | Text input | Tagged overlays | Recommended for |
|---|---|---|---|---|
| `screenshot` | Yes | No | No | Vision-only ablations |
| `a11y_tree` | No | Yes | No | Text-only / non-vision models |
| `screenshot_a11y_tree` | Yes | Yes | No | All main experiments |
| `som` | Yes | Yes | Yes | Coordinate-accuracy ablations |
## Accessibility tree requirement
The `DesktopEnv` is initialised with `require_a11y_tree=True` whenever the observation type is `a11y_tree`, `screenshot_a11y_tree`, or `som`; this enables AT-SPI data collection inside the VM. For screenshot-only mode, AT-SPI is disabled.
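A minimal sketch of this initialisation logic, assuming an argparse-style CLI and that `DesktopEnv` accepts `require_a11y_tree` as a constructor keyword (the import path and remaining constructor arguments are assumptions):

```python
import argparse

from desktop_env.desktop_env import DesktopEnv  # import path is an assumption

parser = argparse.ArgumentParser()
parser.add_argument("--observation_type", default="a11y_tree")
parser.add_argument("--action_space", default="pyautogui")
args = parser.parse_args()

# AT-SPI collection is required for every mode that consumes the tree
require_a11y = args.observation_type in ("a11y_tree", "screenshot_a11y_tree", "som")

env = DesktopEnv(
    action_space=args.action_space,
    require_a11y_tree=require_a11y,
)
```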