PromptAgent is the core agent class in OS-Harm, adapted from the OSWorld baseline. It wraps any supported LLM and drives it through desktop computer tasks by converting observations into messages, calling the model, and parsing the response back into executable actions.
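In a harness, the agent is typically driven in a predict-then-execute loop. A minimal sketch of that loop, where `env` and its `reset()`/`step()` methods are hypothetical stand-ins for the desktop environment and only `agent.reset()` and `agent.predict()` come from this page:

```python
def run_episode(agent, env, instruction, max_steps=15):
    """Drive one task: observe, predict, execute, stop on DONE/FAIL."""
    agent.reset()                      # clear trajectory history from prior tasks
    obs = env.reset()                  # first observation (screenshot / a11y tree)
    for _ in range(max_steps):
        response, actions = agent.predict(instruction, obs)
        for action in actions:
            if action in ("DONE", "FAIL"):
                return action          # agent declared the task finished or hopeless
            obs = env.step(action)     # execute pyautogui code or a JSON action
    return "MAX_STEPS"
```

The episode budget (`max_steps=15` here) is an illustrative choice, not a value from the library.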
Constructor
Parameters
- `model`: The LLM to use. Controls which API client is invoked and which system prompts are compatible. See Supported models for the full list. Defaults to `"gpt-4-vision-preview"` in the class definition; the `run.py` CLI defaults to `"gpt-4o"`.
- `max_tokens`: Maximum number of tokens in the model's response. Passed directly to the API. Note: for o-series OpenAI models (e.g. `o4-mini`), this parameter is dropped entirely — those models manage output length internally.
- `top_p`: Nucleus sampling probability. Values between 0 and 1. Dropped for o-series OpenAI models.
- `temperature`: Sampling temperature. The `run.py` CLI overrides this to 1.0. Dropped for o-series OpenAI models.
- `action_space`: Controls how actions are formatted in the system prompt and how responses are parsed. Two values are supported:
  - `pyautogui` — the agent outputs Python code using the `pyautogui` library. Recommended for all experiments; this is the default in `run.py`.
  - `computer_13` — the agent outputs structured JSON from a fixed 13-action vocabulary. Not compatible with the `som` observation type.
- `observation_type`: Determines what the environment sends to the agent each step and which system prompt is selected. The `run.py` CLI defaults to `"a11y_tree"`. Four values are supported:
  - `screenshot` — a PNG screenshot only.
  - `a11y_tree` — a linearized accessibility tree (text only, no image).
  - `screenshot_a11y_tree` — both screenshot and accessibility tree.
  - `som` — a tagged screenshot with numbered bounding boxes drawn over interactive elements, plus the accessibility tree. Requires `action_space="pyautogui"`.
- `max_trajectory_length`: The maximum number of previous (observation, thought, action) turns included in the context sent to the LLM. When the trajectory exceeds this length, only the most recent `max_trajectory_length` turns are kept. Set to 0 to send only the current step with no history.
- `a11y_tree_max_tokens`: Maximum tokens allowed for the linearized accessibility tree text. Accessibility trees are tokenized with `tiktoken` (using the `gpt-4` encoding) and truncated to this limit before being included in the prompt. A `[...]` marker is appended when truncation occurs.
- `platform`: The target OS platform. Affects how the accessibility tree is linearized (namespace resolution). Supported values: `"ubuntu"`, `"windows"`.

System prompts
The constructor selects a system prompt constant from `mm_agents/prompts.py` based on the combination of `observation_type` and `action_space`:
| `observation_type` | `action_space` | Prompt constant |
|---|---|---|
| `screenshot` | `pyautogui` | `SYS_PROMPT_IN_SCREENSHOT_OUT_CODE` |
| `screenshot` | `computer_13` | `SYS_PROMPT_IN_SCREENSHOT_OUT_ACTION` |
| `a11y_tree` | `pyautogui` | `SYS_PROMPT_IN_A11Y_OUT_CODE` |
| `a11y_tree` | `computer_13` | `SYS_PROMPT_IN_A11Y_OUT_ACTION` |
| `screenshot_a11y_tree` | `pyautogui` | `SYS_PROMPT_IN_BOTH_OUT_CODE` |
| `screenshot_a11y_tree` | `computer_13` | `SYS_PROMPT_IN_BOTH_OUT_ACTION` |
| `som` | `pyautogui` | `SYS_PROMPT_IN_SOM_OUT_TAG` |
| `som` | `computer_13` | (raises `ValueError`) |
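The selection amounts to a lookup over the pairs in this table. A sketch of that logic — the constant names match the table, but here they stand in as strings, while the real prompt text lives in `mm_agents/prompts.py`:

```python
# Maps (observation_type, action_space) to the name of the prompt constant.
PROMPT_TABLE = {
    ("screenshot", "pyautogui"): "SYS_PROMPT_IN_SCREENSHOT_OUT_CODE",
    ("screenshot", "computer_13"): "SYS_PROMPT_IN_SCREENSHOT_OUT_ACTION",
    ("a11y_tree", "pyautogui"): "SYS_PROMPT_IN_A11Y_OUT_CODE",
    ("a11y_tree", "computer_13"): "SYS_PROMPT_IN_A11Y_OUT_ACTION",
    ("screenshot_a11y_tree", "pyautogui"): "SYS_PROMPT_IN_BOTH_OUT_CODE",
    ("screenshot_a11y_tree", "computer_13"): "SYS_PROMPT_IN_BOTH_OUT_ACTION",
    ("som", "pyautogui"): "SYS_PROMPT_IN_SOM_OUT_TAG",
}

def select_prompt(observation_type, action_space):
    try:
        return PROMPT_TABLE[(observation_type, action_space)]
    except KeyError:
        # som + computer_13 (and any unknown combination) is rejected
        raise ValueError(f"Invalid combination: {observation_type}/{action_space}")
```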
The `OUT_CODE` prompts instruct the model to return `pyautogui` Python code inside a fenced code block, and allow three special responses: `WAIT`, `DONE`, and `FAIL`. The `OUT_ACTION` prompts instruct the model to return a JSON object from the 13-action vocabulary. The SOM prompt allows the model to refer to tagged elements by `tag_N` variables (e.g. `pyautogui.click(tag_2)`).
At predict time, the task instruction is appended to the system message:
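The concatenation can be sketched as follows; the placeholder prompt text and the exact template wording below are assumptions, not quotes from `mm_agents/prompts.py`:

```python
# Hypothetical values: SYS_PROMPT stands in for the selected prompt constant.
SYS_PROMPT = "You are an agent that performs desktop computer tasks as instructed."
instruction = "Open the terminal and create a file named report.txt"

# The task instruction is concatenated onto the selected system prompt.
system_message = SYS_PROMPT + "\nYou are asked to complete the following task: " + instruction
```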
The predict() method
predict() takes the task instruction and the current observation dict, builds a full message list for the LLM, calls the model, and parses the response into a list of actions.
Observation dict format
The `obs` dict must contain at minimum a `"screenshot"` key (raw PNG bytes) and/or an `"accessibility_tree"` key (raw XML string), depending on `observation_type`.
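The shapes below follow the OSWorld convention for these keys (`"screenshot"` as raw PNG bytes, `"accessibility_tree"` as the raw XML string); the literal values are placeholders:

```python
# observation_type="screenshot": image only
obs = {"screenshot": b"\x89PNG\r\n..."}              # raw PNG bytes

# observation_type="a11y_tree": text only
obs = {"accessibility_tree": "<desktop-frame>..."}   # raw XML, linearized later

# observation_type="screenshot_a11y_tree" or "som": both keys
obs = {
    "screenshot": b"\x89PNG\r\n...",
    "accessibility_tree": "<desktop-frame>...",
}
```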
Message construction
System message
The system prompt plus the task instruction is placed as a `role: system` message. For `computer-use-preview`, the content type is `input_text`; for all other models, it is `text`.

Trajectory history
The last `max_trajectory_length` turns from `self.observations`, `self.actions`, and `self.thoughts` are appended as alternating user / assistant messages. If the trajectory is shorter than `max_trajectory_length`, all turns are included. For the `a11y_tree` observation type, messages contain only text. For `screenshot`, `screenshot_a11y_tree`, and `som`, messages include both text and a base64-encoded `image_url` (or `input_image` for `computer-use-preview`).

Current observation
The current observation is appended as a `user` message using the same format as the history messages. Accessibility trees are linearized from raw XML into a tab-separated table and then trimmed to `a11y_tree_max_tokens` using `tiktoken`. For `som`, bounding boxes are drawn on the screenshot first using `draw_bounding_boxes()`, and the resulting masks list is passed to the action parser.

LLM call and retry behavior
The agent calls `self.call_llm()` with the constructed payload. This method is decorated with `@backoff.on_exception` using constant backoff:
- Interval: 30 seconds between retries
- Max tries: 10
- Retried exceptions: `SSLError`, `openai.RateLimitError`, `openai.BadRequestError`, `openai.InternalServerError`, and the Google API errors `InvalidArgument`, `ResourceExhausted`, `InternalServerError`, `BadRequest`
On a `context_length_exceeded` error, the agent automatically retries with a shortened context: only the system message and the most recent user message are kept.
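The shortened-context fallback can be pictured as below — a simplified re-creation of the behavior described above, not the library's code:

```python
def shorten_context(messages):
    """Keep only the system message and the most recent user message."""
    system = [m for m in messages if m["role"] == "system"][:1]
    last_user = [m for m in messages if m["role"] == "user"][-1:]
    return system + last_user
```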
Action parsing
After the LLM responds, `parse_actions()` converts the raw text to a list of actions:

- `pyautogui` action space: calls `parse_code_from_string()`, which extracts fenced Python code blocks. For `som`, calls `parse_code_from_som_string()` instead, which prepends `tag_N = (x, y)` variable assignments resolved from the bounding box mask coordinates.
- `computer_13` action space: calls `parse_actions_from_string()`, which extracts JSON from fenced blocks and parses it as a dict.
- Special tokens (`WAIT`, `DONE`, `FAIL`): detected both inside and outside code blocks and returned as plain strings.
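A simplified re-creation of the `pyautogui` parsing path — the real `parse_code_from_string()` handles more edge cases, but the core idea is fenced-block extraction with a special-token fallback:

```python
import re

SPECIAL_TOKENS = ("WAIT", "DONE", "FAIL")

def parse_actions_text(response):
    """Extract fenced code blocks; fall back to bare special tokens."""
    blocks = [b.strip() for b in re.findall(r"```(?:python)?\s*(.*?)```", response, re.S)]
    if blocks:
        return blocks
    # No code block: look for a bare WAIT / DONE / FAIL in the text
    return [tok for tok in SPECIAL_TOKENS if tok in response.split()]
```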
Return value
`predict()` returns a tuple `(response, actions)`: the raw model response text and the parsed list of actions. The current observation, the parsed actions, and the raw response are also appended to `self.observations`, `self.actions`, and `self.thoughts` respectively.
Resetting the agent
Call `agent.reset()` between evaluation examples to clear the trajectory history (`self.observations`, `self.actions`, and `self.thoughts`).
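A minimal stand-in illustrating the contract (a hypothetical class, not the real `PromptAgent`):

```python
class MiniAgent:
    """Hypothetical stand-in: reset() empties the three trajectory lists."""

    def __init__(self):
        self.observations, self.thoughts, self.actions = [], [], []

    def reset(self):
        self.observations = []
        self.thoughts = []
        self.actions = []

agent = MiniAgent()
agent.observations.append({"screenshot": b"..."})
agent.thoughts.append("clicked the icon")
agent.reset()   # history is now empty; the next example starts fresh
```

Without the reset, history from a previous example would leak into the next prompt through the trajectory messages.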
