What is iOSWorld?

iOSWorld is an open benchmark for personal phone agents. It provides a native iOS simulation environment with apps, data, tasks, and grading rules.

Why does 37 percent on multi-app tasks matter?

Many real smartphone tasks do not live in one app. Travel, calendar, messages, and payments are connected, and the paper shows agents becoming much less reliable once a task spans several of them.

Is this a real iPhone product?

No. It is a research benchmark, not a consumer product. It is meant to measure what phone agents can do and where they fail.

iOSWorld Benchmark: Why Phone Agents Are Still Unreliable

What this is about

A new research benchmark called iOSWorld tests whether AI agents can use a smartphone in a personally meaningful way, not just tap through screens. The paper was posted to arXiv on June 8, 2026 and describes a native iOS simulation environment with 26 newly built apps, connected personal data, and 133 tasks.

That matters because the next major consumer AI push is aimed directly at calendars, messages, travel, shopping, notes, and payments. Demos often show one impressive flow. iOSWorld asks a harder question: what happens when the agent has to combine several apps, memories, and personal preferences?

What iOSWorld actually does

iOSWorld builds a simulated iOS world with a persistent user identity. Its apps contain transactions, messages, travel information, social relationships, and financial activity. The tasks are split into three groups: 27 single-app tasks, 60 multi-app tasks, and 46 tasks that require memory or personalization.

The researchers test models in two settings: vision-only control, and vision plus privileged access to XML structure. According to the paper, the best setup reaches 52 percent overall but only 37 percent on multi-app tasks. Strong frontier models gain up to 26 percentage points from the added XML access, while smaller models benefit less.
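To make these figures concrete, the short Python sketch below checks that the three task groups add up to 133 and converts the quoted success rates into approximate task counts. The dictionary keys and function names are illustrative and do not come from the paper. The solved-task counts are rough estimates, assuming each percentage is a plain share of tasks in its group.

```python
# Task counts as reported for iOSWorld; the key names are illustrative labels.
TASK_COUNTS = {
    "single_app": 27,
    "multi_app": 60,
    "memory_personalization": 46,
}

# Success rates quoted for the best setup.
OVERALL_RATE = 0.52
MULTI_APP_RATE = 0.37


def category_shares(counts):
    """Return each category's share of all tasks as a fraction."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def approx_solved(rate, n_tasks):
    """Rough number of solved tasks implied by a success rate (an estimate)."""
    return round(rate * n_tasks)


total_tasks = sum(TASK_COUNTS.values())
shares = category_shares(TASK_COUNTS)
multi_solved = approx_solved(MULTI_APP_RATE, TASK_COUNTS["multi_app"])
overall_solved = approx_solved(OVERALL_RATE, total_tasks)

print(f"total tasks: {total_tasks}")
for name, share in shares.items():
    print(f"{name}: {share:.0%} of all tasks")
print(f"multi-app solved (approx.): {multi_solved} of {TASK_COUNTS['multi_app']}")
print(f"overall solved (approx.): {overall_solved} of {total_tasks}")
```

Under that assumption, 37 percent corresponds to roughly 22 of the 60 multi-app tasks, so about 38 of them fail even in the best setup.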

Why it matters

Phone agents are useful only if they can handle sensitive, distributed information reliably. A real request might be: “Book the same route as last month, but not with the airline Anna complained about, and put the receipt in the right folder.” That is not one button. It is memory, context, privacy, and action in one task.

The iOSWorld numbers show that current systems still struggle in exactly that zone. For consumers, the lesson is clear: an agent allowed to see personal data must do more than look smart. It must show that it knows when to act, when to ask, and when to stop. For developers, static benchmarks and polished demos are no longer enough.

In plain language

Imagine someone packing your suitcase. A simple task is: “Put in a black T-shirt.” A personal task is: “Pack like you did for the last Berlin trip, but add running shoes because the hotel is near a park.” iOSWorld tests that second kind of task for smartphones. That is where things get hard.

A practical example

A user is planning a business trip. The agent has to infer the preferred train station from old emails, check free times in the calendar, find a previous hotel receipt, and create a new note for accounting. In iOSWorld terms, this is a multi-app task with a memory component.

If a system solves only 37 percent of comparable multi-app tasks, it is still too risky for autonomous bookings. It can gather suggestions as an assistant. It should not buy tickets, trigger payments, or send private messages without confirmation.
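The "assistant, not autonomous buyer" rule can be sketched as a small confirmation gate. This is a minimal illustration rather than anything described in the paper: the action kinds, the `ProposedAction` type, and the `gate` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACT = "act"  # safe to carry out directly
    ASK = "ask"  # pause and request user confirmation first


# Hypothetical action kinds that should never run without confirmation.
SENSITIVE_KINDS = {"purchase", "payment", "send_message"}


@dataclass(frozen=True)
class ProposedAction:
    kind: str         # e.g. "payment" or "collect_suggestions"
    description: str  # human-readable summary shown to the user


def gate(action, user_confirmed=False):
    """Decide whether the agent may act now or must ask the user first."""
    if action.kind in SENSITIVE_KINDS and not user_confirmed:
        return Decision.ASK
    return Decision.ACT


ticket = ProposedAction("purchase", "Buy a train ticket for the business trip")
ideas = ProposedAction("collect_suggestions", "List three hotel options")

print(gate(ticket))                       # Decision.ASK
print(gate(ticket, user_confirmed=True))  # Decision.ACT
print(gate(ideas))                        # Decision.ACT
```

The design choice is that the block list is explicit and sits outside the model: gathering suggestions passes through, while purchases, payments, and outgoing messages wait for the user.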

Scope and limits

iOSWorld is a benchmark, not proof of how a specific Apple, Google, or OpenAI product will behave in daily life.

The environment uses simulated apps and data. That supports reproducible testing, but it cannot capture every real device quirk, app update, or user mistake.

More access to device structure helps strong models, but it raises privacy questions. More visibility for the agent means more responsibility for control and logging.

