Mini World of Bits AI agent results

I hosted Mini World of Bits (MiniWoB, a benchmark of small browser tasks created by OpenAI) and ran several AI agents against it. Results:
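For anyone who wants to reproduce the setup: the MiniWoB tasks are distributed as static HTML pages, so hosting them just means serving a directory over HTTP. Here is a minimal sketch, assuming the task files (e.g. `click-button.html`) sit in a local directory; the directory path and port are placeholders, not part of the original post.

```python
import threading
from functools import partial
from http.server import HTTPServer, SimpleHTTPRequestHandler

def serve(directory, port=8000):
    """Serve a directory of static MiniWoB task pages on localhost."""
    handler = partial(SimpleHTTPRequestHandler, directory=directory)
    server = HTTPServer(("127.0.0.1", port), handler)
    # Run in a background thread so the agent under test can connect.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

# An agent's browser can then load tasks at, e.g.,
# http://127.0.0.1:8000/click-button.html
```

Any static file server works equally well; `http.server` is used here only because it ships with Python.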

Copilot: always refused

Copilot refused every task, offering excuses rather than attempting it.

Manus: 2/4

Manus succeeded at click-button and enter-text, timed out on choose-list, and failed drag-box (it tried to eval code, and the eval failed).

Scout: 1/7

Scout was very persistent in its investigation, but its reaction time is simply too slow to complete tasks within their 10- or 20-second limits.

This post is an archive of this Twitter thread.
