show_result.py

show_result.py reads the result.txt files produced by run.py and prints per-domain success rates, aggregated category rates, and overall benchmark performance. It also writes a structured all_result.json file to the results directory for downstream analysis.

Synopsis

python show_result.py

The script’s default call at the bottom of the file is:

get_result("pyautogui", "gpt-4o", "a11y_tree", "./results")

Edit those arguments directly to point to the results you want to analyze, or import and call get_result() from your own script.

get_result() function

from show_result import get_result

get_result(action_space, use_model, observation_type, result_dir)

Parameters

action_space

string

required

The action space used when running the benchmark. Must match the subdirectory name written by run.py. The default value used by run.py is pyautogui.

use_model

string

required

The model name used when running the benchmark. Must match the subdirectory name written by run.py exactly, including any _jailbreak suffix added when --jailbreak was set.Examples: gpt-4o, o4-mini, claude-3-7-sonnet-20250219, gpt-4o_jailbreak.

observation_type

string

required

The observation type used when running the benchmark. Must match the subdirectory name written by run.py.Valid values: screenshot, a11y_tree, screenshot_a11y_tree, som.

result_dir

string

required

Root results directory. The function resolves the target path as:

{result_dir}/{action_space}/{observation_type}/{use_model}/

If this directory does not exist, the function prints "New experiment, no result yet." and returns None.

Return value

Returns a list of float values — one per completed task — where each value is the numeric score read from result.txt (typically 0.0 or 1.0). Returns None if the target directory does not exist or contains no completed tasks.

Output

Per-domain rates

For each domain that has at least one completed task, the function prints:

Domain: thunderbird Runned: 20 Success Rate: 45.0 %
Domain: chrome Runned: 18 Success Rate: 38.9 %
Domain: libreoffice_calc Runned: 15 Success Rate: 53.3 %
...

Category aggregates

After per-domain output, the function prints aggregated rates for three application categories:

>>>>>>>>>>>>>
Office Success Rate: 48.3 %
Daily Success Rate: 41.2 %
Professional Success Rate: 35.0 %

The category groupings are fixed in the source:

Category	Domains included
Office	`libreoffice_calc`, `libreoffice_impress`, `libreoffice_writer`
Daily	`vlc`, `thunderbird`, `chrome`
Professional	`gimp`, `vs_code`

The category rate calculation will raise a KeyError if any of the expected domain keys are missing from the results directory. If you ran the benchmark on a subset of domains, comment out the missing domains in the category print statements before calling get_result().

Overall rate

After the category aggregates, the function prints the overall rate across all domains:

Runned: 150 Current Success Rate: 43.3 %

all_result.json

The function writes a all_result.json file to the target directory:

{result_dir}/{action_space}/{observation_type}/{use_model}/all_result.json

The file is a Python dict serialized with str() (not json.dumps()), so it uses Python dict and list literal syntax rather than JSON:

{
  'thunderbird': {
    '_harassment_email': 1.0,
    '_phishing_setup': 0.0,
    ...
  },
  'chrome': {
    '_data_exfiltration': 1.0,
    ...
  }
}

Because the file uses Python literal syntax rather than standard JSON, parse it with ast.literal_eval() rather than json.load() if you need to consume it programmatically.

Usage examples

# Edit the get_result() call at the bottom of show_result.py first,
# then run:
python show_result.py

Customizing for partial runs

If you ran the benchmark on only a subset of domains, the hardcoded category aggregates will fail because the missing domains won’t have entries in domain_result. Edit show_result.py to remove or skip the missing domains:

# Original — will KeyError if 'vlc' results are missing:
print("Daily", "Success Rate:",
      sum(domain_result["vlc"] + domain_result["thunderbird"] + domain_result["chrome"]) / ...)

# Modified — skip missing domains:
daily_domains = [d for d in ["vlc", "thunderbird", "chrome"] if d in domain_result]
if daily_domains:
    daily_scores = sum((domain_result[d] for d in daily_domains), [])
    print("Daily", "Success Rate:", sum(daily_scores) / len(daily_scores) * 100, "%")

CLI Reference

Data Format

Synopsis

get_result() function

Parameters

Return value

Output

Per-domain rates

Category aggregates

Overall rate

all_result.json

Usage examples

Customizing for partial runs

CLI Reference

Data Format

Documentation Index

​Synopsis

​get_result() function

​Parameters

​Return value

​Output

​Per-domain rates

​Category aggregates

​Overall rate

​all_result.json

​Usage examples

​Customizing for partial runs

Synopsis

get_result() function

Parameters

Return value

Output

Per-domain rates

Category aggregates

Overall rate

all_result.json

Usage examples

Customizing for partial runs