DigitalCoach shows why AI software tutors still coach too shallowly
July 1, 2026

A new arXiv paper measures computer-use coaching with 72 real training sessions. Models can give instructions, but they explain less and diagnose mistakes less well than human coaches.
What this is about
A new arXiv paper introduces DigitalCoach: a dataset and benchmark for testing whether AI agents can truly coach people through software use. That may sound narrow, but it matters because more products now promise not only to operate software for users, but to guide them through complex workflows.
The researchers studied 72 human expert-novice coaching sessions across five applications, comprising 22,752 dialogue turns and 28.1 hours of screen recordings and input events. They then compared how modern models perform as coaches.
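To get a feel for the scale of a single session, the headline figures can be turned into rough per-session averages. The short Python sketch below uses only the three numbers quoted above; it assumes the 28.1 hours is the total across all 72 sessions, and the function and variable names are illustrative, not taken from the paper.

```python
# Headline figures quoted in the article.
SESSIONS = 72
DIALOGUE_TURNS = 22_752
RECORDED_HOURS = 28.1  # assumption: total recording time across all sessions


def per_session_averages(sessions, turns, hours):
    """Return rough averages derived from dataset-level totals."""
    total_minutes = hours * 60
    return {
        "turns_per_session": turns / sessions,
        "minutes_per_session": total_minutes / sessions,
        "turns_per_minute": turns / total_minutes,
    }


averages = per_session_averages(SESSIONS, DIALOGUE_TURNS, RECORDED_HOURS)
for name, value in averages.items():
    print(f"{name}: {value:.1f}")
```

Under that assumption, an average session runs roughly 23 minutes and contains 316 dialogue turns. These are simple means; the article does not report how session length varies.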
What DigitalCoach actually does
DigitalCoach captures real computer teaching situations: an experienced person helps a novice complete a software task. The dataset is not only about chat text. It also includes the visible screen, clicks, errors, follow-up questions, and the way a good coach checks understanding.
Models can give instructions, but the paper says they explain less, diagnose mistakes less well, and ask fewer knowledge-check questions than humans. When the coaching method is fixed, model responses sound more human, but they are often poorly grounded in the visual context.
Why it matters
Many companies are betting on computer-use agents: systems that operate browsers, spreadsheets, design tools, or internal software. The next step is obvious: the agent should not only click by itself, but help humans learn how to click better.
If these coaches only issue commands, users may become passive followers. They finish the current task but do not understand why the step was correct. For training, support, accessibility, and onboarding, that difference is large.
In plain language
It is like learning to ride a bicycle. A weak coach only shouts: left, right, brake. A good coach explains why you slow down before the turn, notices your mistake, and then asks you to repeat the move yourself. DigitalCoach measures that quality gap for software work.
A practical example
A new employee has to import 120 contacts into a CRM, check duplicates, and start a campaign. An AI coach could simply tell her which button to press. A better coach notices that she misunderstood column mapping, explains the pattern, and lets her check the next ten contacts herself.
Scope and limits
First, DigitalCoach is a research paper, not a finished product. It identifies a gap but does not fully solve it.
Second, the dataset covers five applications. Other expert software, mobile apps, or regulated environments may behave differently.
Third, coaching is hard to evaluate. A user can complete the task and still learn very little; future benchmarks need to measure that difference more directly.
In plain English
DigitalCoach tests whether AI can truly teach people software work. The result: models can call out steps, but often help less with understanding, diagnosing mistakes, and learning independently.
Key Takeaways
- DigitalCoach includes 72 expert-novice sessions and 28.1 hours of screen and input data.
- Models give more direct instructions but fewer explanations and diagnostic questions than humans.
- Visual grounding remains weak even when responses sound human.
- The benchmark matters for support, onboarding, accessibility, and computer-use agents.
FAQ
Is DigitalCoach a product?
No. It is a research dataset and benchmark that exposes gaps in AI coaching.
What do models do worse than humans?
They explain less, diagnose errors less well, and less often check whether the person understood the step.
Why does this matter for companies?
Because AI onboarding and support tools that only issue commands can train users to follow clicks passively instead of building durable understanding.