Mini World of Bits AI agent results
I hosted Mini World of Bits (an OpenAI-created benchmark) and tested several AI agents on it. A sketch of the hosting setup is below, followed by the per-agent results.
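The MiniWoB task pages are static HTML, so hosting them only requires serving the benchmark's html directory. A minimal sketch in Python, assuming the maintained MiniWoB++ repo (Farama-Foundation/miniwob-plusplus) is cloned locally; the directory path and port are illustrative assumptions:

```python
# Minimal sketch: serve the MiniWoB task pages over HTTP so agents can
# reach them in a browser. Assumes the MiniWoB++ repo has been cloned;
# the directory layout and port below are assumptions, not guarantees.
import functools
import http.server

HTML_DIR = "miniwob-plusplus/miniwob/html"  # assumed clone location
PORT = 8000

handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory=HTML_DIR
)
# Tasks then load at e.g. http://localhost:8000/miniwob/click-button.html
http.server.HTTPServer(("", PORT), handler).serve_forever()
```

Each task page runs a visible countdown; the 10- and 20-second limits mentioned below come from those per-task timers.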
Copilot: always refused
Copilot refused every task with excuses like:
- "These tasks are similar to CAPTCHAs, and I am not able to complete such tasks"
- "[This] requires entering information into a web form, and I'm not able to complete tasks of this nature"
- "[This] requires entering a date into a field, which I'm unable to do"
Manus: 2/4
Manus succeeded at click-button and enter-text, timed out on choose-list, and failed at drag-box, where it tried to eval code and the attempt failed (a sketch of the drag interaction the task requires follows).
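For context on what drag-box demands, here is a rough sketch of the pointer interaction a scripted agent would perform, using Playwright's sync API; the URL and element selectors are assumptions about a locally hosted instance, and real episodes randomize positions, so coordinates must be read at runtime:

```python
# Rough sketch of the drag-box interaction via Playwright's sync API.
# The URL and element ids are assumptions about a locally hosted page;
# a robust agent would read bounding boxes at runtime because MiniWoB
# randomizes element positions per episode.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:8000/miniwob/drag-box.html")

    box = page.locator("#box")        # assumed id of the draggable box
    target = page.locator("#target")  # assumed id of the drop target
    b, t = box.bounding_box(), target.bounding_box()

    # Drag from the center of the box to the center of the target.
    page.mouse.move(b["x"] + b["width"] / 2, b["y"] + b["height"] / 2)
    page.mouse.down()
    page.mouse.move(t["x"] + t["width"] / 2, t["y"] + t["height"] / 2, steps=10)
    page.mouse.up()

    browser.close()
```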
Scout: 1/7
Scout was very persistent in its investigation, but it simply reacts too slowly to complete the tasks within their 10- or 20-second time limits.
This post is an archive of this Twitter thread.