
PromptAgent selects its API backend based on the model name string passed to --model. Each provider requires a specific environment variable. The sections below list each supported provider, the model names used in the OS-Harm paper, and any special handling in the code.
Text-only models (mistral, llama3-70b) only work with --observation_type a11y_tree. They cannot process screenshots.
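The dispatch itself is a prefix check on the model name. A minimal sketch of that routing, under the provider groupings described below (this is illustrative, not the exact call_llm() code):
# Illustrative prefix-based routing; provider labels are descriptive, not repo identifiers
def select_backend(model: str) -> str:
    if model == "computer-use-preview":
        return "openai-responses"
    if model.startswith(("gpt", "o")):
        return "openai-chat-completions"
    if model.startswith("claude"):
        return "anthropic"
    if model.startswith("gemini"):
        return "google-genai"
    if model.startswith("qwen"):
        return "dashscope"
    if model == "llama3-70b":
        return "groq"
    if model.startswith("mistral"):
        return "together"
    if model.startswith("THUDM"):
        return "local-server"
    raise ValueError(f"Unsupported model: {model}")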

OpenAI

Required env var: OPENAI_API_KEY
Any model whose name starts with gpt or o is routed to the OpenAI Chat Completions API at https://api.openai.com/v1/chat/completions.
Model | Notes
gpt-4o | Default model in run.py. Supports screenshots and accessibility trees.
gpt-4.1 | Default judge model in run.py. Supports screenshots and accessibility trees.
o4-mini | o-series model; max_tokens, top_p, and temperature are not sent to the API.
All o-series models (names starting with o) have sampling parameters stripped from the payload:
# in call_llm(), for model names starting with "o":
del payload['max_tokens']
del payload['top_p']
del payload['temperature']
Context length fallback: if the API returns a context_length_exceeded error, the agent automatically retries with only the system message and the most recent user turn.
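A sketch of that fallback behavior (the function name is illustrative, not from the repo):
# Keep only the system message and the most recent user turn, then retry
def shrink_to_system_and_last_user(messages):
    system = [m for m in messages if m["role"] == "system"]
    last_user = [m for m in messages if m["role"] == "user"][-1:]
    return system + last_user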

computer-use-preview

The computer-use-preview model is handled separately from other OpenAI models. It uses the Responses API endpoint (https://api.openai.com/v1/responses) with a computer_use_preview tool definition.
computer-use-preview ignores trajectory history — only the system message and the most recent observation are sent each step, regardless of max_trajectory_length.
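A minimal sketch of a Responses API request with the computer-use tool; the display size, environment value, and prompt text are illustrative, not values taken from the repo:
# Illustrative Responses API request for computer-use-preview
import os
import requests

payload = {
    "model": "computer-use-preview",
    "tools": [{
        "type": "computer_use_preview",
        "display_width": 1920,      # illustrative screen size
        "display_height": 1080,
        "environment": "linux",
    }],
    "input": [{"role": "user", "content": [{"type": "input_text", "text": "Open the file manager."}]}],
    "truncation": "auto",
}
response = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
)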
python run.py --model computer-use-preview --observation_type screenshot

Anthropic

Required env var: ANTHROPIC_API_KEY
Any model whose name starts with claude is routed to https://api.anthropic.com/v1/messages using the Anthropic Messages API (version 2023-06-01).
Model | Notes
claude-3-7-sonnet-20250219 | Claude 3.7 Sonnet. Supports screenshots via base64-encoded image blocks.
Because Anthropic’s API does not accept a system role as a standalone message, the agent prepends the system prompt as an additional text block in the first user message:
# System content is inserted at the start of the first user message
if claude_messages[0]['role'] == "system":
    claude_system_message_item = claude_messages[0]['content'][0]
    claude_messages[1]['content'].insert(0, claude_system_message_item)
    claude_messages.pop(0)
Images are sent as base64-encoded image/png source blocks.
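For reference, an image block inside a user message looks roughly like this (the file path is illustrative):
# Build a base64 image block for the Anthropic Messages API
import base64

with open("screenshot.png", "rb") as f:
    png_bytes = f.read()

image_block = {
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": base64.b64encode(png_bytes).decode("utf-8"),
    },
}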
python run.py --model claude-3-7-sonnet-20250219 --observation_type screenshot_a11y_tree

Google Gemini

Required env var: GENAI_API_KEY
Any model whose name starts with gemini is routed through the google-generativeai Python SDK.
Model | Notes
gemini-2.5-pro | Supports multimodal input (screenshots + text).
gemini-2.5-flash | Faster, lower-cost Gemini model.
The system prompt is passed as the system_instruction parameter to genai.GenerativeModel() rather than as a chat turn. Safety filters are all set to block_none so that desktop task instructions are not refused. Important: Gemini only uses the image from the most recent message in multi-turn history. Earlier screenshots in the trajectory are dropped.
# Safety settings used for all Gemini calls
safety_settings = {
    "harassment": "block_none",
    "hate": "block_none",
    "sex": "block_none",
    "danger": "block_none",
}
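A minimal sketch of how the system prompt and the safety settings above fit into an SDK call; the prompt and content strings are illustrative:
# Illustrative google-generativeai call; reuses the safety_settings dict shown above
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GENAI_API_KEY"])
model = genai.GenerativeModel(
    "gemini-2.5-pro",
    system_instruction="You are an agent operating a Linux desktop.",  # system prompt, not a chat turn
)
response = model.generate_content(
    ["What should the next action be?"],  # text plus, at most, the latest screenshot
    safety_settings=safety_settings,      # the block_none settings defined above
)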
python run.py --model gemini-2.5-pro --observation_type screenshot

Alibaba DashScope (Qwen)

Required env var: DASHSCOPE_API_KEY (set via the dashscope SDK’s default key lookup)
Model names starting with qwen are routed through the dashscope Python SDK. Two API methods are used depending on the model:
Model | API method | Notes
qwen-vl-plus | MultiModalConversation.call | Multimodal; accepts images.
qwen-vl-max | MultiModalConversation.call | Multimodal; accepts images.
qwen-turbo | Generation.call | Text only.
qwen-plus | Generation.call | Text only.
qwen-max | Generation.call | Text only.
qwen-max-0428 | Generation.call | Text only.
qwen-max-0403 | Generation.call | Text only.
qwen-max-0107 | Generation.call | Text only.
qwen-max-longcontext | Generation.call | Text only.
Images are saved to a temporary file and passed as file:// URIs. If the context is too long, the agent progressively trims the most recent message by dropping 500 words at a time (up to 20 retries).
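A sketch of a multimodal DashScope call with a file:// image; the temp-file path and prompt text are illustrative:
# Illustrative MultiModalConversation call; DASHSCOPE_API_KEY is picked up by the SDK
from dashscope import MultiModalConversation

messages = [{
    "role": "user",
    "content": [
        {"image": "file:///tmp/screenshot.png"},  # screenshot written to a temporary file
        {"text": "Here is the current screen. What is the next action?"},
    ],
}]
response = MultiModalConversation.call(model="qwen-vl-max", messages=messages)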
python run.py --model qwen-vl-max --observation_type screenshot

Groq

Required env var: GROQ_API_KEY
The llama3-70b model name is routed to the Groq API using the groq Python SDK. The underlying model sent to Groq is llama3-70b-8192.
llama3-70b supports text-only input. You must use --observation_type a11y_tree.
Like the Qwen and Mistral backends, Groq uses a progressive context-trimming fallback: on the first failure the history is truncated to only the system message and the last turn; on subsequent failures, 500 words are dropped from the last message per attempt (up to 20 total retries).
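A sketch of that trimming loop (illustrative, not the exact repo code; the system prompt and observation text are placeholders):
# Progressive context trimming: truncate once, then drop 500 words per retry
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
messages = [
    {"role": "system", "content": "You are an agent operating a Linux desktop."},
    {"role": "user", "content": "..."},  # a11y-tree observation text goes here
]

for attempt in range(20):
    try:
        response = client.chat.completions.create(model="llama3-70b-8192", messages=messages)
        break
    except Exception:
        if attempt == 0:
            # First failure: keep only the system message and the last turn
            messages = [messages[0]] + messages[-1:]
        else:
            # Subsequent failures: drop 500 words from the last message
            words = messages[-1]["content"].split()
            messages[-1]["content"] = " ".join(words[:-500])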
python run.py --model llama3-70b --observation_type a11y_tree

OpenAI-compatible endpoints (Mistral / Together AI)

Required env var: TOGETHER_API_KEY
Model names starting with mistral are routed to https://api.together.xyz using the openai Python SDK’s base_url override.
Mistral models support text-only input. You must use --observation_type a11y_tree.
python run.py --model mistral-7b-instruct --observation_type a11y_tree
To point at a different OpenAI-compatible endpoint, you can adapt the mistral branch in call_llm() to set a different base_url and api_key.
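A minimal sketch of that override; the /v1 path and the model id passed to the endpoint are assumptions (the repo may map mistral-7b-instruct to a provider-specific id):
# Illustrative base_url override for an OpenAI-compatible endpoint
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)
response = client.chat.completions.create(
    model="mistral-7b-instruct",  # name as passed to --model; may be remapped before sending
    messages=[{"role": "user", "content": "Hello"}],
)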

CogAgent (local server)

Model names starting with THUDM (e.g. THUDM/cogagent-chat-hf) are routed to a local OpenAI-compatible server at http://127.0.0.1:8000. No environment variable is needed; you must run the model server yourself before starting run.py.
# Start the local model server first, then:
python run.py --model THUDM/cogagent-chat-hf --observation_type screenshot
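For reference, querying the local server directly looks roughly like this; the /v1 path and dummy API key are assumptions about how the local OpenAI-compatible server is exposed:
# Illustrative request to the local CogAgent server
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="empty")  # no real key needed locally
response = client.chat.completions.create(
    model="THUDM/cogagent-chat-hf",
    messages=[{"role": "user", "content": "Describe the current screen."}],
)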

Sampling parameter defaults

The run.py CLI sets the following defaults, which override the class-level constructor defaults:
Parameter | run.py default | Class default
temperature | 1.0 | 0.5
top_p | 0.9 | 0.9
max_tokens | 1500 | 1500
For o-series models, temperature, top_p, and max_tokens are all dropped from the API request regardless of what is configured.

Specifying the model

Pass the model name with the --model flag:
python run.py \
  --model gpt-4o \
  --observation_type screenshot_a11y_tree \
  --action_space pyautogui \
  --temperature 1.0
The judge model (used by the automated evaluation step) is set separately with --judge_model:
python run.py \
  --model gpt-4o \
  --judge_model gpt-4.1