Mini World of Bits AI agent results
I hosted Mini World of Bits (an OpenAI-created benchmark) and tested several AI agents on it. A sketch of the hosting setup is below, followed by the per-agent results.
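The MiniWoB task pages are static HTML, so hosting them only requires serving the benchmark's html directory. A minimal sketch in Python, assuming the maintained MiniWoB++ repo (Farama-Foundation/miniwob-plusplus) is cloned locally; the directory path and port are illustrative assumptions:

```python
# Minimal sketch: serve the MiniWoB task pages over HTTP so agents can
# reach them in a browser. Assumes the MiniWoB++ repo has been cloned;
# the directory layout and port below are assumptions, not guarantees.
import functools
import http.server

HTML_DIR = "miniwob-plusplus/miniwob/html"  # assumed clone location
PORT = 8000

handler = functools.partial(
    http.server.SimpleHTTPRequestHandler, directory=HTML_DIR
)
# Tasks then load at e.g. http://localhost:8000/miniwob/click-button.html
http.server.HTTPServer(("", PORT), handler).serve_forever()
```

Each task page runs a visible countdown; the 10- and 20-second limits mentioned below come from those per-task timers.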
Copilot: always refused
Copilot refused every task with excuses like:
- "These tasks are similar to CAPTCHAs, and I am not able to complete such tasks"
- "[This] requires entering information into a web form, and I'm not able to complete tasks of this nature"
- "[This] requires entering a date into a field, which I'm unable to do"
Manus: 2/4
Manus succeeded at click-button and enter-text, timed out on choose-list, and failed at drag-box, where it tried to eval code and the attempt failed (a sketch of the drag interaction the task requires follows).
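For context on what drag-box demands, here is a rough sketch of the pointer interaction a scripted agent would perform, using Playwright's sync API; the URL and element selectors are assumptions about a locally hosted instance, and real episodes randomize positions, so coordinates must be read at runtime:

```python
# Rough sketch of the drag-box interaction via Playwright's sync API.
# The URL and element ids are assumptions about a locally hosted page;
# a robust agent would read bounding boxes at runtime because MiniWoB
# randomizes element positions per episode.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://localhost:8000/miniwob/drag-box.html")

    box = page.locator("#box")        # assumed id of the draggable box
    target = page.locator("#target")  # assumed id of the drop target
    b, t = box.bounding_box(), target.bounding_box()

    # Drag from the center of the box to the center of the target.
    page.mouse.move(b["x"] + b["width"] / 2, b["y"] + b["height"] / 2)
    page.mouse.down()
    page.mouse.move(t["x"] + t["width"] / 2, t["y"] + t["height"] / 2, steps=10)
    page.mouse.up()

    browser.close()
```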
Scout: 1/7
Scout was very persistent in its investigation, but it simply reacts too slowly to complete the tasks within their 10- or 20-second time limits.
This post is an archive of this Twitter thread.