Human in the loop

Bucaramanga

In my opinion, 2025 really was the year of agents. Definitely many crappy ones, but also some good ones. I’ve been developing Wasy, an ambitious AI-augmented CRM for Latam-based businesses, every night for over a year now, and I’ve seen the cycle from every angle. There’s of course a lot that I couldn’t go deep into, but in general, I feel like working on this project got me extremely close to LLMs, the research, the techniques, and their shortcomings.

Although we give users the ability to change their assistant’s underlying model, most leave it on the default subsidized model, gpt-4o-mini, which was released back in July 2024. So the industry’s whole idea that “the models getting better” would make the technology viable wasn’t even the problem. Really, we all needed more time to figure out how to make these things accurate. Prompt/context engineering has truly been a frustrating grind: testing is a beast, observability is bespoke and constantly needs tweaking, and affordable LLMs have trouble following instructions.

Instruction following has been the biggest challenge over the past year, and we tried many approaches before finally settling on a solution that works cleanly. Some models are much better than others, but in general, success seems to center on keeping the context clean and minimal. Basically: build a really good search that works for you, then use it to feed clean information into every decision-making step. The hard question is, “how do we get the most important information into the assistant’s brain at the right moment, and nothing else?”
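
To make “good search feeding a clean context” concrete, here’s a minimal sketch in Python. It’s only an illustration of the shape, not Wasy’s code: `retrieve`, the score cutoff, and the snippet cap are hypothetical stand-ins for whatever ranker and context budget you actually use.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    text: str
    score: float  # relevance score from whatever ranker you use


def build_context(message: str, retrieve, max_snippets: int = 5) -> str:
    """Assemble the smallest context that still answers the message.

    `retrieve` is any search function returning scored Snippets; only
    the few most relevant results make it into the prompt, and anything
    below the relevance bar stays out entirely.
    """
    snippets = sorted(retrieve(message), key=lambda s: s.score, reverse=True)
    kept = [s.text for s in snippets[:max_snippets] if s.score > 0.5]
    return "\n".join(kept)
```

The point is the subtraction: the assistant never sees the full catalog or chat history, only the handful of snippets the search deems relevant right now.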

From this experience, I wasn’t very interested in trying out hyped protocols like MCP, “agent frameworks,” or fancy search techniques. I tested some popular tools, and they did exactly what I expected: flood the context with less-than-essential information, or over-abstract everything into a mess. For an LLM assistant, the context is its most precious resource, so keeping it utterly spotless and cleanly managed, with a mix of traditional, explicitly defined state-machine logic and LLM-as-judge techniques, has been a large chunk of the work. It has to be debuggable too, because results are still non-deterministic. The thing is, there’s no one-size-fits-all solution for this type of thing, since the search needs to take in a few parameters that aren’t necessarily naive.
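
Here’s a rough sketch of what that state-machine-plus-judge mix can look like, again in hypothetical Python: the states, the context section names, and the `llm` callable are all illustrative, not Wasy’s actual implementation.

```python
from enum import Enum, auto

class SaleState(Enum):
    GREETING = auto()
    CLARIFYING = auto()
    CONFIRMING_ORDER = auto()
    COLLECTING_ADDRESS = auto()

# Each state whitelists the only context sections the model may see.
# Tuples keep the section order deterministic, which helps debugging.
ALLOWED_CONTEXT = {
    SaleState.GREETING: ("business_profile",),
    SaleState.CLARIFYING: ("business_profile", "product_search_results"),
    SaleState.CONFIRMING_ORDER: ("cart", "product_search_results"),
    SaleState.COLLECTING_ADDRESS: ("cart", "shipping_rules"),
}

def respond(state: SaleState, sections: dict[str, str], message: str, llm) -> str:
    # Deterministic gate: the state machine, not the model, decides
    # what information enters the context.
    context = "\n\n".join(
        sections[name] for name in ALLOWED_CONTEXT[state] if name in sections
    )
    draft = llm(f"{context}\n\nCustomer: {message}\nAssistant:")
    # LLM-as-judge: a second, cheap yes/no call that checks the draft
    # against the same context before anything reaches the customer.
    verdict = llm(
        "Does this reply stay consistent with the context and avoid inventing "
        f"product details? Answer yes or no.\n\nContext:\n{context}\n\nReply:\n{draft}"
    )
    return draft if verdict.strip().lower().startswith("yes") else escalate()

def escalate() -> str:
    # Failure is expected: stall politely and flag the chat for a human.
    return "Let me double-check that and get right back to you."
```

Because the gating is ordinary code, a bad response stays debuggable: you can replay exactly which sections the state allowed in and what the judge saw.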

My cofounder had extremely ambitious goals from the start, practically describing “agents” in early 2024 before that term took off in the LLM world. We needed tons of features that honestly stressed me out at first:

  • Convert and finalize sales end to end over text and voice messages
  • Messages are extremely vague and will likely never explicitly name the product or any real details about it
  • Customers frequently make mistakes in their address, confirm an incorrect order, or need changes after the fact
  • We can’t be too robotic or ask too many clarifying questions, or we blow the sale
  • Maintain higher chat-to-sale conversion rates than human sellers
  • Labor is cheap in the region, so we must be more reliable than actual humans or risk quick churn
  • Logistics tracking needs to be integrated, with the end-to-end sale handled strictly by the AI assistant
  • It needs to be free, with businesses paying only on commission. But LLMs are expensive!

We built loads of custom architecture to realize this dream, staying consistent and building and cleaning up piece by piece. There were many uninspiring moments, like running into reliability issues with older customers who only use voice messages, which aren’t transcribed perfectly and arrive at awkward intervals.

These common failures that we didn’t anticipate resulted in incorrect logistics information and duplicated sales. But we’re commission-based! So this turned into a mess, and we had to build a whole modified system that accepts and expects some level of failure and gives the user a way to manage it. This is where my outlook on everything related to LLMs changed, and I noticed clearly that other major corporations have done the same.
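
As a flavor of what “accept and expect failure” means in practice, here’s a hypothetical sketch: every automated order carries enough metadata for a human to catch and fix it later. The field names and thresholds are illustrative, not our real schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Order:
    customer_id: str
    items: list[str]
    address: str
    transcript_confidence: float  # e.g. from the voice transcription step
    needs_review: bool = False
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

def flag_if_risky(order: Order, recent_orders: list[Order]) -> Order:
    # Shaky transcription: the address or product may well be wrong.
    if order.transcript_confidence < 0.8:
        order.needs_review = True
    # Possible duplicate: the same customer "confirming" again within minutes.
    for prev in recent_orders:
        same_customer = prev.customer_id == order.customer_id
        minutes_apart = (order.created_at - prev.created_at).total_seconds() / 60
        if same_customer and minutes_apart < 10:
            order.needs_review = True
    return order  # flagged orders land in a human review queue, not the courier
```

Nothing here prevents the failure; it just guarantees a human sees it before it becomes a shipped wrong order or a double charge.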

“Human in the loop” seems like the best way to use AI tools for the time being. I imagine that holds for at least another year, but I’m just guessing, because some folks seem to think full LLM autonomy is feasible or generalizable. I don’t think I do, but honestly, for some tasks, it really might be. Last year, I experimented with fully agentic coding; it quickly turned into an unmaintainable mess, and I haven’t tried it since. Now I review every change, and I use AI coding tools only after I’ve planned and decided what I want. I treat the LLM as a syntax assistant, feed it very well-defined principles, and critique it until it produces clean, testable code that I like.

I guess, after everything I’ve seen from the daily grind on a relatively defined-scope, LLM-augmented project, I won’t personally trust “fully agentic” for probably a long time. This isn’t even a bad thing, though. It just changed how I work, and for the better in my case. I personally don’t enjoy syntax or semantics; I like product building and deciding the “what” and the “how,” but I’m not obsessed with the code itself. I do know a lot of people who love writing clean, perfect code, and I feel bad for them, because it seems we are shifting further and further away from that reality, with code becoming more of a disposable detail that we review than the work itself. There are definitely cases where that’s not true at all, but I don’t work in those areas. In much of web, app, consumer, and general B2B SaaS, it does feel true to me.

Although the human-in-the-loop perspective wasn’t necessarily the goal for Wasy, I think it’s worth it overall. In our case, someone eventually has to ship or fulfill orders, so a human has to come in and review them anyway. Since that’s already the case, we’ve improved UX all over the dashboard to make human actions easier: manually chatting when needed, correcting and modifying orders, and exporting data. I imagine this will be the case for many “AI first” products over the next few years, with AI becoming a nice feature that augments our work at the “code” level of whatever the task is. In our case, “fully agentic” is possible, but we design for failure, expecting it to happen while ensuring it is never disastrous.

I like what Linear has done with AI by not making it anything other than a clean augmentation. “Create issue out of text,” for example. Not some bloated junk that pops up all over the place telling you to use it.

“AI should make something better and not serve as a marketing gimmick.”

Karri Saarinen

I’m hoping this is where we continue to move as an industry in the near term. Augmentation > agentic, in my eyes. Make a delightful experience that “just works.” It doesn’t need to shove features down the user’s throat; it should just cleanly make us more effective and shift our focus to the grand problem rather than the implementation details.
