As I wrote in the introduction section, the evaporation of the distinction between a benchmark and an eval foreshadows the evaporation of the distinction between an eval and an RL env.
Notes:
The 1/e thing: one day, I was street preaching in the Applebee’s parking lot about formal verification agents when someone walked up and offered free advice. “Always remember,” he said, “RL environments are best when about a third of the problems are currently solvable. It’s based on a sample complexity result that showed 1/e is optimal.” I’ve asked language models for citations or other ways to verify this several times, with no luck. So we will take this mysterious stranger’s word with a grain of salt.
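For what it's worth, 1/e ≈ 0.368, which is where "about a third" comes from. If you wanted to check an environment against the stranger's number, a minimal sketch might look like the following. The names `TARGET`, `solve_rate`, and `gap_from_target` are made up for illustration, the pass/fail list is toy data, and the target itself is still the unverified claim, not an established result.

```python
import math

# The stranger's unverified figure: 1/e, roughly 0.368.
TARGET = 1 / math.e


def solve_rate(results):
    """Fraction of problems the current policy solves (results: list of bools)."""
    if not results:
        raise ValueError("need at least one problem result")
    return sum(results) / len(results)


def gap_from_target(results, target=TARGET):
    """Signed distance of the observed solve rate from the target rate."""
    return solve_rate(results) - target


# Toy data: one pass/fail outcome per problem in a hypothetical environment.
results = [True, False, False, True, False, False, False, True]

rate = solve_rate(results)
gap = gap_from_target(results)
print(f"solve rate {rate:.3f}, target {TARGET:.3f}, gap {gap:+.3f}")
```

A positive gap would mean the environment is easier than the stranger recommends, a negative one harder. Whether that matters is exactly the part I couldn't verify.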