2025-01-27

Reflections on Agents and RL

Value Functions and Personal Long-Termism

I have been trying to think about agents less as a buzzword and more as a useful analogy for behavior.

The part that interests me is not the full architecture. It is the split between immediate reward and a value function. A system can react well to local prompts and still fail if it is not optimizing for a larger trajectory.

That distinction exposed something uncomfortable in my own planning.

For a long time, I was moving between short-term feedback loops: messages, project feedback, periodic reflections, other people's reactions. These signals were useful, but they were local. They told me whether a step felt good or looked good. They did not define the larger system I was trying to move toward.

I had not actually written the value function.

That made some of my behavior predictable. In places where I did not have a clear top-down objective, social feedback filled the gap. People-pleasing could override what I wanted because there was no stronger value function to resist it. Supportive environments can make this harder to notice: nothing looks obviously wrong, but the policy is still being trained by the nearest reward.

This week, I removed a few social media surfaces that distort self-perception and replaced them with denser information sources. The change was immediate: less ambient comparison, more room to think. Under external pressure, I also started taking long-term planning more seriously, not as a productivity exercise but as a way to define what I want my decisions to serve.

For humans, "objective function" is too crude. Biology gives us some broad drives, but they are not enough to explain a life. The interesting part is the value function we choose to build: what we repeatedly privilege, what we refuse, what future state we are willing to trade for.

That is why the agent/RL metaphor still matters to me. Not because people are literally RL systems, and not because the analogy is complete. It gives me a sharper question:

When I act, am I optimizing for the closest reward signal, or for the system I actually want to become?