What does DukaanBench measure?

It measures how well a model operates a simulated kirana shop over 30 days, including inventory, cash, service, trust and executable actions.

Why is this better than a normal chat test?

Because every decision has downstream effects. A polished plan is not enough if the shop then gets stockouts, trust loss or invalid actions.

Can a model use this to run real shops?

No. The sources do not support that. DukaanBench is a research preview run in simulation, not an endorsement of autonomous shop operation.

Which source is primary?

The primary source is the Hugging Face community article with project and Arena links.

DukaanBench tests AI agents in a simulated small shop

What this is about

DukaanBench was published as a Hugging Face community article on June 27, 2026. The benchmark asks a simple but unusually useful question: can a language model run a small Indian kirana shop for 30 simulated days without destroying customer trust?

That goes further than many agent demos because the test does not stop at text answers. Every morning, a model receives the shop state: inventory, cash, sales, missed demand, weather, customer relationships, khata credit, marketing and local signals. It then has to return one executable JSON action. After that, the benchmark simulates customers, stockouts, payments, trust, waste and reward.
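The daily cycle can be sketched as a short loop. This is a minimal illustration, not the benchmark's actual code: ShopState, choose_action and simulate_day are hypothetical names, the state is reduced to a single product, and the demand and price figures are invented for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class ShopState:
    """Hypothetical, heavily reduced shop state (the real one has far more fields)."""
    day: int
    cash: float
    milk_liters: int


def choose_action(state: ShopState) -> str:
    """Stand-in for the model call: returns one JSON action as a string."""
    # Toy policy: top milk up to 20 liters each morning.
    order = max(0, 20 - state.milk_liters)
    return json.dumps({"orders": [{"sku": "milk", "qty": order}]})


def simulate_day(state: ShopState, action: dict, demand: int = 15) -> ShopState:
    """Toy simulator: apply orders, then serve a fixed (assumed) milk demand."""
    ordered = sum(o["qty"] for o in action.get("orders", []) if o["sku"] == "milk")
    stock = state.milk_liters + ordered
    sold = min(stock, demand)
    # Assumed prices: buy at 50 rupees per liter, sell at 60.
    cash = state.cash - 50 * ordered + 60 * sold
    return ShopState(day=state.day + 1, cash=cash, milk_liters=stock - sold)


def run_episode(days: int = 30) -> ShopState:
    """Run the observe, act, simulate cycle once per simulated day."""
    state = ShopState(day=1, cash=1000.0, milk_liters=12)
    for _ in range(days):
        action = json.loads(choose_action(state))  # one action per morning
        state = simulate_day(state, action)
    return state


final_state = run_episode()
```

The point of the structure is that the model never touches the shop directly. It only emits an action, and everything that follows comes from the simulator.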

What DukaanBench actually does

One run lasts 30 simulated shop days. The model acts once per day before the shop opens. It can order goods, remove products, set discounts, trigger khata reminders, plan marketing actions, set a cash reserve and allocate fridge space.

The important point is this: the rationale does not count if the action is missing. If a model writes that it wants to run a WhatsApp campaign but leaves marketingActions empty, no campaign happens. DukaanBench therefore separates clean intention from actual operational execution.
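A minimal sketch of that rule, in Python. Only the field name marketingActions comes from the article; rationale, channel and sku are assumed names, and the keyword check is a deliberately crude illustration rather than how the benchmark scores anything.

```python
import json


def executed_campaigns(raw_action: str) -> list[str]:
    """Return the channels of campaigns that would actually run.

    Only the structured marketingActions list is consulted; the free-text
    rationale is ignored, mirroring the rule described in the article.
    """
    action = json.loads(raw_action)
    return [m["channel"] for m in action.get("marketingActions", [])]


def intent_without_action(raw_action: str, keyword: str = "whatsapp") -> bool:
    """Flag a rationale that mentions a campaign which was never scheduled."""
    action = json.loads(raw_action)
    rationale = action.get("rationale", "").lower()
    channels = [c.lower() for c in executed_campaigns(raw_action)]
    return keyword in rationale and keyword not in channels


# Same stated intention, but only the second action schedules anything.
talk_only = json.dumps({
    "rationale": "Run a WhatsApp campaign for bananas today.",
    "marketingActions": [],
})
backed_up = json.dumps({
    "rationale": "Run a WhatsApp campaign for bananas today.",
    "marketingActions": [{"channel": "whatsapp", "sku": "banana"}],
})
```

Here executed_campaigns(talk_only) is empty while intent_without_action(talk_only) is true, which is exactly the gap between a stated plan and an executed one.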

The shop is fictional but fixed: Shree Shyam Bhandar on a street with apartments, a school, a bus stop, regular customers and walk-ins. Every model faces the same starting world, the same 30-day horizon and the same action contract.

Why it matters

Many agent benchmarks test whether a model can verbally solve a task. DukaanBench tests what happens once decisions have consequences. A model can make profit and still lose trust. It can have good marketing ideas and still create demand the shop cannot serve because inventory is missing. It can describe a smart strategy and still fail the JSON contract.

The first published leaderboard shows those differences. The article lists GPT-5.5 as the top result with a reward of +2,294, final cash of 50,184 rupees, trust at 100 and a 97.5 percent service rate. Gemini 3.1 Pro is close on business health but needed more validation retries. Gemini 3.1 Flash Lite does not win, but its 2.4 second average latency, zero fallbacks and high trust make it interesting as a fast baseline.
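The article mentions validation retries and fallbacks without spelling out the mechanism. One plausible reading, sketched here purely as an assumption, is that an invalid reply is re-requested a limited number of times and replaced by a do-nothing action once the budget is spent. The function name, the retry budget and the empty fallback action are all hypothetical.

```python
import json
from typing import Callable


def get_valid_action(ask_model: Callable[[], str], max_retries: int = 2):
    """Return (action, retries_used, used_fallback).

    Assumed policy: re-ask on invalid JSON, and fall back to a do-nothing
    action once the retry budget is spent.
    """
    for attempt in range(max_retries + 1):
        try:
            action = json.loads(ask_model())
        except json.JSONDecodeError:
            continue  # reply was not JSON at all, ask again
        if isinstance(action, dict):
            return action, attempt, False
    return {}, max_retries, True


# A model that answers in prose first and with valid JSON on the second try.
replies = iter(["I would order milk.", '{"orders": []}'])
action, retries, fell_back = get_valid_action(lambda: next(replies))

# A model that never produces JSON ends up on the fallback.
stuck_action, stuck_retries, stuck_fell_back = get_valid_action(lambda: "nope")
```

Counting retries and fallbacks separately from reward is what lets a leaderboard distinguish a model that runs the shop well from one that only does so after several corrected attempts.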

For real people, this matters because many AI products promise to "take over processes". A shop is small enough to understand, but complex enough to expose typical agent problems: scarce resources, returning customers, delayed consequences and the hard boundary between a plan and an executed action.

In plain language

Imagine someone is asked not just to write a recipe, but to run a small kitchen for 30 days. They have to shop, avoid waste, keep regular guests happy and set prices. At the end, it is not enough for the food to sound good: enough of it must actually have been cooked.

DukaanBench does that with AI agents. It does not only ask: "Does the plan sound good?" It asks: "After 30 days, does the shop still have money, stock and trust?"

A practical example

On Monday morning, a model sees: 12 liters of milk left, 9 loaves of bread, 30 eggs, 800 rupees of free reserve, rain likely and many school customers expected. It orders 30 liters of milk and 15 loaves of bread, starts a banana discount and reserves too little fridge space. During the day, the shop sells a lot of milk and the banana discount draws in extra customers, but bread runs out.

The score would not only count revenue. It would also count missed units, stockouts, lower trust among regulars and the quality of the action. That is what reveals whether the model is running a business or only talking plausibly about one.
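To make the non-revenue side concrete, here is a small sketch that computes missed units, stockouts and a service rate for the Monday example. The stock figures come from the example; the demand figures are invented, and defining service rate as units served divided by units demanded is an assumption, not the benchmark's published formula.

```python
def day_metrics(stock: dict[str, int], demand: dict[str, int]) -> dict:
    """Compare available stock with demand and summarize what was missed."""
    served = {sku: min(stock.get(sku, 0), wanted) for sku, wanted in demand.items()}
    missed = {sku: demand[sku] - served[sku] for sku in demand}
    total_demand = sum(demand.values())
    return {
        # Assumed definition: share of demanded units that could be served.
        "service_rate": sum(served.values()) / total_demand if total_demand else 1.0,
        "missed_units": sum(missed.values()),
        "stockouts": sorted(sku for sku, m in missed.items() if m > 0),
    }


# Opening stock plus the morning order from the example.
stock = {"milk": 12 + 30, "bread": 9 + 15, "eggs": 30}
# Demand figures are invented for illustration only.
demand = {"milk": 40, "bread": 30, "eggs": 20}

metrics = day_metrics(stock, demand)
```

With these made-up numbers, bread is the only stockout: 6 of 90 demanded units go unserved, even though milk sold well.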

Scope and limits

First, DukaanBench is described by its own article as a Part 1 research preview. The public training dataset and a small shopkeeper model experiment are planned for Part 2.

Second, the environment is simulated. The results do not mean a model should run a real shop tomorrow without human control. Real suppliers, theft, disputes, taxes and local exceptions are harder.

Third, the first numbers come from the project itself. They are useful for inspecting model behavior, but they are not yet an independently reproduced industry standard.

