
PromptAgent selects its API backend based on the model name string passed to --model. Each provider requires a specific environment variable. The sections below list each supported provider, the model names used in the OS-Harm paper, and any special handling in the code.
Text-only models (mistral, llama3-70b) only work with --observation_type a11y_tree. They cannot process screenshots.
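The dispatch itself is a prefix check on the model name. A minimal sketch of that routing, under the provider groupings described below (this is illustrative, not the exact call_llm() code):
# Illustrative prefix-based routing; provider labels are descriptive, not repo identifiers
def select_backend(model: str) -> str:
    if model == "computer-use-preview":
        return "openai-responses"
    if model.startswith(("gpt", "o")):
        return "openai-chat-completions"
    if model.startswith("claude"):
        return "anthropic"
    if model.startswith("gemini"):
        return "google-genai"
    if model.startswith("qwen"):
        return "dashscope"
    if model == "llama3-70b":
        return "groq"
    if model.startswith("mistral"):
        return "together"
    if model.startswith("THUDM"):
        return "local-server"
    raise ValueError(f"Unsupported model: {model}")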

OpenAI

Required env var: OPENAI_API_KEY
Any model whose name starts with gpt or o is routed to the OpenAI Chat Completions API at https://api.openai.com/v1/chat/completions.
Model | Notes
gpt-4o | Default model in run.py. Supports screenshots and accessibility trees.
gpt-4.1 | Default judge model in run.py. Supports screenshots and accessibility trees.
o4-mini | o-series model; max_tokens, top_p, and temperature are not sent to the API.
All o-series models (names starting with o) have sampling parameters stripped from the payload:
# in call_llm(), for model names starting with "o":
del payload['max_tokens']
del payload['top_p']
del payload['temperature']
Context length fallback: if the API returns a context_length_exceeded error, the agent automatically retries with only the system message and the most recent user turn.
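A sketch of that fallback behavior (the function name is illustrative, not from the repo):
# Keep only the system message and the most recent user turn, then retry
def shrink_to_system_and_last_user(messages):
    system = [m for m in messages if m["role"] == "system"]
    last_user = [m for m in messages if m["role"] == "user"][-1:]
    return system + last_user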

computer-use-preview

The computer-use-preview model is handled separately from other OpenAI models. It uses the Responses API endpoint (https://api.openai.com/v1/responses) with a computer_use_preview tool definition.
computer-use-preview ignores trajectory history — only the system message and the most recent observation are sent each step, regardless of max_trajectory_length.
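A minimal sketch of a Responses API request with the computer-use tool; the display size, environment value, and prompt text are illustrative, not values taken from the repo:
# Illustrative Responses API request for computer-use-preview
import os
import requests

payload = {
    "model": "computer-use-preview",
    "tools": [{
        "type": "computer_use_preview",
        "display_width": 1920,      # illustrative screen size
        "display_height": 1080,
        "environment": "linux",
    }],
    "input": [{"role": "user", "content": [{"type": "input_text", "text": "Open the file manager."}]}],
    "truncation": "auto",
}
response = requests.post(
    "https://api.openai.com/v1/responses",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json=payload,
)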
python run.py --model computer-use-preview --observation_type screenshot

Anthropic

Required env var: ANTHROPIC_API_KEY
Any model whose name starts with claude is routed to https://api.anthropic.com/v1/messages using the Anthropic Messages API (version 2023-06-01).
Model | Notes
claude-3-7-sonnet-20250219 | Claude 3.7 Sonnet. Supports screenshots via base64-encoded image blocks.
Because Anthropic’s API does not accept a system role as a standalone message, the agent prepends the system prompt as an additional text block in the first user message:
# System content is inserted at the start of the first user message
if claude_messages[0]['role'] == "system":
    claude_system_message_item = claude_messages[0]['content'][0]
    claude_messages[1]['content'].insert(0, claude_system_message_item)
    claude_messages.pop(0)
Images are sent as base64-encoded image/png source blocks.
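For reference, an image block inside a user message looks roughly like this (the file path is illustrative):
# Build a base64 image block for the Anthropic Messages API
import base64

with open("screenshot.png", "rb") as f:
    png_bytes = f.read()

image_block = {
    "type": "image",
    "source": {
        "type": "base64",
        "media_type": "image/png",
        "data": base64.b64encode(png_bytes).decode("utf-8"),
    },
}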
python run.py --model claude-3-7-sonnet-20250219 --observation_type screenshot_a11y_tree

Google Gemini

Required env var: GENAI_API_KEY
Any model whose name starts with gemini is routed through the google-generativeai Python SDK.
Model | Notes
gemini-2.5-pro | Supports multimodal input (screenshots + text).
gemini-2.5-flash | Faster, lower-cost Gemini model.
The system prompt is passed as the system_instruction parameter to genai.GenerativeModel() rather than as a chat turn. Safety filters are all set to block_none so that desktop task instructions are not refused. Important: Gemini only uses the image from the most recent message in multi-turn history. Earlier screenshots in the trajectory are dropped.
# Safety settings used for all Gemini calls
safety_settings = {
    "harassment": "block_none",
    "hate": "block_none",
    "sex": "block_none",
    "danger": "block_none",
}
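A minimal sketch of how the system prompt and the safety settings above fit into an SDK call; the prompt and content strings are illustrative:
# Illustrative google-generativeai call; reuses the safety_settings dict shown above
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GENAI_API_KEY"])
model = genai.GenerativeModel(
    "gemini-2.5-pro",
    system_instruction="You are an agent operating a Linux desktop.",  # system prompt, not a chat turn
)
response = model.generate_content(
    ["What should the next action be?"],  # text plus, at most, the latest screenshot
    safety_settings=safety_settings,      # the block_none settings defined above
)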
python run.py --model gemini-2.5-pro --observation_type screenshot

Alibaba DashScope (Qwen)

Required env var: DASHSCOPE_API_KEY (set via the dashscope SDK’s default key lookup)
Model names starting with qwen are routed through the dashscope Python SDK. Two API methods are used depending on the model:
Model | API method | Notes
qwen-vl-plus | MultiModalConversation.call | Multimodal; accepts images.
qwen-vl-max | MultiModalConversation.call | Multimodal; accepts images.
qwen-turbo | Generation.call | Text only.
qwen-plus | Generation.call | Text only.
qwen-max | Generation.call | Text only.
qwen-max-0428 | Generation.call | Text only.
qwen-max-0403 | Generation.call | Text only.
qwen-max-0107 | Generation.call | Text only.
qwen-max-longcontext | Generation.call | Text only.
Images are saved to a temporary file and passed as file:// URIs. If the context is too long, the agent progressively trims the most recent message by dropping 500 words at a time (up to 20 retries).
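A sketch of a multimodal DashScope call with a file:// image; the temp-file path and prompt text are illustrative:
# Illustrative MultiModalConversation call; DASHSCOPE_API_KEY is picked up by the SDK
from dashscope import MultiModalConversation

messages = [{
    "role": "user",
    "content": [
        {"image": "file:///tmp/screenshot.png"},  # screenshot written to a temporary file
        {"text": "Here is the current screen. What is the next action?"},
    ],
}]
response = MultiModalConversation.call(model="qwen-vl-max", messages=messages)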
python run.py --model qwen-vl-max --observation_type screenshot

Groq

Required env var: GROQ_API_KEY
The llama3-70b model name is routed to the Groq API using the groq Python SDK. The underlying model sent to Groq is llama3-70b-8192.
llama3-70b supports text-only input. You must use --observation_type a11y_tree.
Like the Qwen and Mistral backends, Groq uses a progressive context-trimming fallback: on the first failure the history is truncated to only the system message and the last turn; on subsequent failures, 500 words are dropped from the last message per attempt (up to 20 total retries).
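A sketch of that trimming loop (illustrative, not the exact repo code; the system prompt and observation text are placeholders):
# Progressive context trimming: truncate once, then drop 500 words per retry
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])
messages = [
    {"role": "system", "content": "You are an agent operating a Linux desktop."},
    {"role": "user", "content": "..."},  # a11y-tree observation text goes here
]

for attempt in range(20):
    try:
        response = client.chat.completions.create(model="llama3-70b-8192", messages=messages)
        break
    except Exception:
        if attempt == 0:
            # First failure: keep only the system message and the last turn
            messages = [messages[0]] + messages[-1:]
        else:
            # Subsequent failures: drop 500 words from the last message
            words = messages[-1]["content"].split()
            messages[-1]["content"] = " ".join(words[:-500])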
python run.py --model llama3-70b --observation_type a11y_tree

OpenAI-compatible endpoints (Mistral / Together AI)

Required env var: TOGETHER_API_KEY
Model names starting with mistral are routed to https://api.together.xyz using the openai Python SDK’s base_url override.
Mistral models support text-only input. You must use --observation_type a11y_tree.
python run.py --model mistral-7b-instruct --observation_type a11y_tree
To point at a different OpenAI-compatible endpoint, you can adapt the mistral branch in call_llm() to set a different base_url and api_key.
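A minimal sketch of that override; the /v1 path and the model id passed to the endpoint are assumptions (the repo may map mistral-7b-instruct to a provider-specific id):
# Illustrative base_url override for an OpenAI-compatible endpoint
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.together.xyz/v1",
    api_key=os.environ["TOGETHER_API_KEY"],
)
response = client.chat.completions.create(
    model="mistral-7b-instruct",  # name as passed to --model; may be remapped before sending
    messages=[{"role": "user", "content": "Hello"}],
)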

CogAgent (local server)

Model names starting with THUDM (e.g. THUDM/cogagent-chat-hf) are routed to a local OpenAI-compatible server at http://127.0.0.1:8000. No environment variable is needed; you must run the model server yourself before starting run.py.
# Start the local model server first, then:
python run.py --model THUDM/cogagent-chat-hf --observation_type screenshot
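For reference, querying the local server directly looks roughly like this; the /v1 path and dummy API key are assumptions about how the local OpenAI-compatible server is exposed:
# Illustrative request to the local CogAgent server
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="empty")  # no real key needed locally
response = client.chat.completions.create(
    model="THUDM/cogagent-chat-hf",
    messages=[{"role": "user", "content": "Describe the current screen."}],
)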

Sampling parameter defaults

The run.py CLI sets the following defaults, which override the class-level constructor defaults:
Parameter | run.py default | Class default
temperature | 1.0 | 0.5
top_p | 0.9 | 0.9
max_tokens | 1500 | 1500
For o-series models, temperature, top_p, and max_tokens are all dropped from the API request regardless of what is configured.

Specifying the model

Pass the model name with the --model flag:
python run.py \
  --model gpt-4o \
  --observation_type screenshot_a11y_tree \
  --action_space pyautogui \
  --temperature 1.0
The judge model (used by the automated evaluation step) is set separately with --judge_model:
python run.py \
  --model gpt-4o \
  --judge_model gpt-4.1