I couldn't RL the problem of coverage

May 23, 2026

Convert

Puffers need to convert to a new color by touching a star that maches their current color.

Observations:

dx and dy for each color star (-1 to 1)

current heading (0 to 1)

current reward (0 or 1)

x (0 to 1)

y (0 to 1)

current star goal (one hot encoded)

Actions:

heading (9 choices from -1/3 radians to +1/3 radians)

speed (5 choices from -2 to +2)

Rewards:

+1 upon hitting star

Has objectively correct solution:

Yes

That's info about the RL environment I used in the background image. PufferLib has many envs like it (albeit they're trying to feature easy envs less), and for good reason - being able to turn "whether or not I hit the star" to "I always hit the star" is important. But what happens when you make an env with no objectively correct solution? What happens when you try to make an environment where the agent is a Roomba, the world is a dirty floor, and the reward is coverage? Well, unfortunately and similarly to my diffusion SLAM efforts, things went wrong for me.

Going in a circle

Loop de loop

It's absolutely clueless, driving away from the dirt

It's worth mentioning that I got the absolute simplest case to work, where it was snapped to a grid with discrete outputs, and all it had to do was call the combination "left and go forward" command at the bottom and the combination "right and go forward" command at the top.

Demo of the simple boustrophedon case

But then I couldn't get the second simplest case to work, where it had discrete inputs (what the optimal algorithm would do) and outputs (what it does) on a continuous surface, and just needed to mimic the optimal algorithm. It just went off course because nothing immediately stopped it from that. From an old journal:

I’m truly perplexed. A manual algorithm simple as
env->actions[0] = 1;
  env->actions[1] = 1;
  if (env->y < ROBOT_RADIUS+60 && env->bearing < PI/2) {
      env->actions[1] = -0.5;
  }
  if (env->y > env->height - (ROBOT_RADIUS+60) && env->bearing > -PI/2) {
      env->actions[0] = -0.5;
  }
reaches perf=1, coverage=0.919, and total_reward=10.8, but when I provide those two if conditions as the ONLY observations it goes no further than perf=0.799, coverage=0.913, and total_reward=6.067. Perhaps this is due to my reward shaping or the fact that it’s directly driving the wheels, because if you remember my simple boustrophedon experiment that worked fine.

I also couldn't just run an LLM on loop to find a key fix - it was able to get to high levels of coverage, but not 100% (see image). It also wouldn't sim2real properly.

If you plan to take this on: coverage is a very open ended, planning-heavy problem. I tried to apply the bitter lesson many times, increasing network size or switching to RNNs, to no avail. If you have a larger budget than me, you can always throw compute at the problem (eg get your LLM to implement stricter rewards, get your LLM to try alternative approaches like full 2D movement, and run giant sweep runs).