iOSWorld Shows How Far Personal Phone Agents Still Have to Go
June 9, 2026

A new iOS benchmark tests agents across 26 apps, personal data, and 133 tasks. The best setup reaches about 52 percent overall, but only 37 percent on multi-app tasks.
What this is about
A new research benchmark called iOSWorld tests whether AI agents can use a smartphone in a personally meaningful way, not just tap through screens. The paper was posted to arXiv on June 8, 2026 and describes a native iOS simulation environment with 26 newly built apps, connected personal data, and 133 tasks.
That matters because the next major consumer AI push is aimed directly at calendars, messages, travel, shopping, notes, and payments. Demos often show one impressive flow. iOSWorld asks a harder question: what happens when the agent has to combine several apps, memories, and personal preferences?
What iOSWorld actually does
iOSWorld builds a simulated iOS world with a persistent user identity. Its apps contain transactions, messages, travel information, social relationships, and financial activity. The tasks are split into three groups: 27 single-app tasks, 60 multi-app tasks, and 46 tasks that require memory or personalization.
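The split described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the paper: the counts are the ones reported in this article, while the dictionary name, the category labels, and the helper function are assumptions made for illustration.

```python
# Task counts as reported in the article; the labels are illustrative, not the paper's identifiers.
TASK_COUNTS = {
    "single_app": 27,
    "multi_app": 60,
    "memory_or_personalization": 46,
}


def task_shares(counts):
    """Return each category's share of all tasks in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


total_tasks = sum(TASK_COUNTS.values())
shares = task_shares(TASK_COUNTS)

print(total_tasks)  # the three groups should add up to the reported 133 tasks
print(shares)
```

Multi-app tasks are the largest group, so the weak multi-app result weighs heavily on the overall picture.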
The researchers test models in two settings: vision-only control, and a privileged setting that combines vision with XML structure. According to the paper, the best setup reaches 52 percent overall but only 37 percent on multi-app tasks. Strong frontier models gain up to 26 percentage points from the added XML access; smaller models benefit less.
Why it matters
Phone agents are useful only if they can handle sensitive, distributed information reliably. A real request might be: “Book the same route as last month, but not with the airline Anna complained about, and put the receipt in the right folder.” That is not one button. It is memory, context, privacy, and action in one task.
The iOSWorld numbers show that current systems still struggle in exactly that zone. For consumers, the lesson is clear: an agent allowed to see personal data must do more than look smart. It must show that it knows when to act, when to ask, and when to stop. For developers, static benchmarks and polished demos are no longer enough.
In plain language
Imagine someone packing your suitcase. A simple task is: “Put in a black T-shirt.” A personal task is: “Pack like you did for the last Berlin trip, but add running shoes because the hotel is near a park.” iOSWorld tests that second kind of task for smartphones. That is where things get hard.
A practical example
A user is planning a business trip. The agent has to infer the preferred train station from old emails, check free times in the calendar, find a previous hotel receipt, and create a new note for accounting. In iOSWorld terms, this is a multi-app task with a memory component.
If a system solves only 37 percent of comparable multi-app tasks, it is still too risky for autonomous bookings. It can gather suggestions as an assistant. It should not buy tickets, trigger payments, or send private messages without confirmation.
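One way to picture that rule is a confirmation gate in front of sensitive actions. The sketch below is an assumption-laden illustration, not part of iOSWorld or any shipping product: the action categories, the `ProposedAction` type, and the `execute` function are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical categories, mirroring the article's examples:
# buying tickets, triggering payments, sending private messages.
SENSITIVE_KINDS = {"purchase", "payment", "send_message"}


@dataclass(frozen=True)
class ProposedAction:
    kind: str          # e.g. "payment" or "create_note"
    description: str   # human-readable summary shown to the user


def requires_confirmation(action: ProposedAction) -> bool:
    """Sensitive actions must be approved by the user before they run."""
    return action.kind in SENSITIVE_KINDS


def execute(action: ProposedAction, confirmed_by_user: bool = False) -> str:
    """Run the action only if it is harmless or the user has confirmed it."""
    if requires_confirmation(action) and not confirmed_by_user:
        return "needs_confirmation"
    return "executed"


note = ProposedAction("create_note", "Draft an expense note for accounting")
ticket = ProposedAction("purchase", "Buy a train ticket for the business trip")

print(execute(note))
print(execute(ticket))
print(execute(ticket, confirmed_by_user=True))
```

In this sketch the agent can still prepare the booking, but the purchase itself waits until the user approves it.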
Scope and limits
- iOSWorld is a benchmark, not proof of how a specific Apple, Google, or OpenAI product will behave in daily life.
- The environment uses simulated apps and data. That supports reproducible testing, but it cannot capture every real device quirk, app update, or user mistake.
- More access to device structure helps strong models, but it raises privacy questions. More visibility for the agent means more responsibility for control and logging.
💡 In plain English
Phone agents often look impressive in demos. iOSWorld shows the colder picture: once an agent has to work across several apps and personal clues, performance drops sharply.
Key Takeaways
- iOSWorld was posted to arXiv on June 8, 2026 and simulates a persistent iOS user identity.
- The benchmark covers 26 newly built apps and 133 tasks.
- The paper reports 52 percent overall for the best configuration, but only 37 percent on multi-app tasks.
- Privileged vision plus XML access improves strong models by up to 26 percentage points.
- The result matters for consumers because personal agents would need access to sensitive data.
FAQ
What is iOSWorld?
iOSWorld is an open benchmark for personal phone agents in a native iOS simulation environment with apps, data, tasks, and grading rules.
Why does 37 percent on multi-app tasks matter?
Many real smartphone tasks do not live in one app. Travel, calendar, messages, and payments are connected; the paper shows agents becoming much less reliable there.
Is this a real iPhone product?
No. It is a research benchmark, not a consumer product. It is meant to measure what phone agents can do and where they fail.
Sources & Context
- arXiv: iOSWorld: A Benchmark for Personally Intelligent Phone Agents
- OpenReview: iOSWorld: A Benchmark for Personally Intelligent Phone Agents
- OSWorld: Benchmarking Multimodal Agents for Computer Tasks
- arXiv: PhoneWorld: Scaling Phone-Use Agent Environments
- MobileWorld: Benchmarking Autonomous Mobile Agents