

What’s cool about this is you don’t need to do this many, many times. For most p

People have been burned by evals in the past. People have done evals badly, so t

A term that you used in your posts that I love is this idea of a benevolent dict

Thank you for having us.

Sure. Evals is a way to systematically measure and improve an AI application, an

So just to make it very real, so imagining this real estate agent, maybe they’re he

Okay. I like what you said first, which is we had a very broad definition. Evals

And so we have all the different components and pieces and information that the

Amazing. That’s so cool.

Yeah. Yeah, it’s really cool. And you see all of these different sort of feature

Hamel Husain (00:18:32):

Yeah, and you don’t have to do it for all of your data. You sample your data and
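
Editor's sketch: the sampling step mentioned here could look roughly like the following. This is a minimal illustration, not the speakers' tooling. The trace structure, the function name `sample_traces`, and the default sample size are assumptions made for the example.

```python
import random

def sample_traces(traces, k=100, seed=0):
    """Return up to k traces chosen uniformly at random, without replacement."""
    rng = random.Random(seed)  # fixed seed so the same sample can be revisited later
    k = min(k, len(traces))    # never ask for more traces than exist
    return rng.sample(traces, k)

# Hypothetical trace log; in practice this would come from your logging tool.
all_traces = [{"id": i, "conversation": f"trace {i}"} for i in range(1000)]
to_review = sample_traces(all_traces, k=100)
print(len(to_review), "traces selected for manual review")
```

A fixed seed is a design choice for reproducibility; drop it if you want a fresh sample on every run.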

Yeah, so what do you do with something like that?

This is more of, “Hey, we’re not handling this interaction correctly. This is mo

It’s amazing you’re catching that, too, here. Otherwise, you’d have no idea this

Yeah, it’s supposed to be chill. Just don’t overthink it. And there’s a way to d

One common question that we get from people at this stage is, “Okay, I understan

Do you think we’ll get to a place where an agent can do this, where it has that

Lenny Rachitsky (00:25:17):

Okay, maybe Hamel cover that, actually.

And so benevolent dictator is just a catchy term for the fact that when you’re d

Hamel Husain (00:27:42):

Yeah. It’s going to be okay. It’s not perfection. You’re just trying to make pro

Hamel Husain (00:29:14):

So, okay. So let’s say we do, and Shreya and I, we recommend doing at least 100

… in data analysis and qualitative analysis called theoretical saturation. So
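
Editor's sketch: one way to make theoretical saturation concrete is to stop reviewing once a run of consecutive traces surfaces no new failure category. The function name, the `patience` threshold, and the category labels below are assumptions for illustration only.

```python
def reviews_until_saturation(codes_per_trace, patience=20):
    """Return how many traces were reviewed when `patience` consecutive traces
    had surfaced no new failure category, or None if that never happened."""
    seen = set()
    since_new = 0
    for n, codes in enumerate(codes_per_trace, start=1):
        new = set(codes) - seen          # categories not observed before
        seen |= new
        since_new = 0 if new else since_new + 1
        if since_new >= patience:
            return n
    return None

# Hypothetical review notes: each inner list holds the categories seen in one trace.
notes = [["a"], ["b"], ["a"], [], ["c"], ["a"], ["b"], []]
stop_at = reviews_until_saturation(notes, patience=3)
print("saturation reached after", stop_at, "traces")
```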

Shreya Shankar (00:31:34):

Yeah. Okay. So you did 100 of these. Now you have all these notes. So this is wh

Just reviewing traces. At least there’s one job left for now. Great.

Yes. Creating axial codes, so what it does is-

Lenny Rachitsky (00:34:39):

And this is what LLMs are really good at, taking a bunch of information and synt

Lenny Rachitsky (00:36:43):

It’s interesting the tools don’t do this, or do they try and they just don’t do

“… do this.” I do think it’s a little bit hard, right? Part of this whole expe

Amazing. Okay. What’s funny about you guys doing this is I just want to go do th

Yeah. So I pulled up a video just to drive home Shreya’s point. We are not inven

… be really fun. Two, I love that my podcast episode just came out today is in

Okay. So you can do this through anything, and the same thing works just fine in

And so basically, what you could do is you can categorize your traces into one o
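
Editor's sketch: once each reviewed trace carries a category, counting how often each one occurs shows which failure modes dominate. The category names and trace records below are made up for illustration.

```python
from collections import Counter

# Hypothetical labeled traces: one failure category per reviewed trace.
labeled_traces = [
    {"id": 1, "category": "human handoff issue"},
    {"id": 2, "category": "formatting error"},
    {"id": 3, "category": "human handoff issue"},
    {"id": 4, "category": "conversational flow"},
    {"id": 5, "category": "human handoff issue"},
    {"id": 6, "category": "formatting error"},
]

counts = Counter(t["category"] for t in labeled_traces)

# Most frequent categories first, so the biggest problems are at the top.
for category, n in counts.most_common():
    print(f"{category}: {n}")
```

The same tally can be produced with a spreadsheet pivot table; the code is just one way to get there.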

Yeah. Or have it with 10 other words.

Yeah, okay. What are some of those other words that people often use that you th

Lenny Rachitsky (00:43:17):

It’s in the loop. Still space for us. Great.

Lenny Rachitsky (00:44:04):

Yeah. It’s absurd to feel like you wouldn’t know this is happening. Watching thi

Okay. So here’s sort of the big unveil. This is the magic moment right now. So w

So just to try to mirror back what you’re describing, you want to test what your

Absolutely. You nailed it.

And the goal here is just to have a suite of tests that run before you ship to p
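
Editor's sketch: a pre-ship suite can be as simple as a set of check functions run over stored replies, with the release blocked if any check fails. The checks, the stored cases, and all names here are hypothetical and only show the shape of such a suite.

```python
import re

def not_empty(reply: str) -> bool:
    """The assistant should always say something."""
    return bool(reply.strip())

def no_raw_markdown(reply: str) -> bool:
    """Hypothetical rule: plain-text channels should not receive markdown markup."""
    return re.search(r"(\*\*|^#{1,6}\s)", reply, flags=re.MULTILINE) is None

def run_suite(cases, checks):
    """Run every check on every case; return a list of (case_id, check_name) failures."""
    failures = []
    for case in cases:
        for check in checks:
            if not check(case["reply"]):
                failures.append((case["id"], check.__name__))
    return failures

# Hypothetical stored replies from the application under test.
cases = [
    {"id": 1, "reply": "Your tour is confirmed for 3 pm."},
    {"id": 2, "reply": "**Tour** confirmed"},
]
failures = run_suite(cases, [not_empty, no_raw_markdown])
print("failures:", failures)
```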

Lenny Rachitsky (00:52:04):

Awesome. Okay. Hamel’s got an example of an actual LLM as a judge eval here, so

It’s wild how much drama there is in the evals space. We’re going to get to that

You’re going through manually, you do that.

As a product manager or someone, even if you’re not doing this calculation yours
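
Editor's sketch: the calculation referred to here is not visible in this excerpt. One common way to check an LLM judge against human labels, assumed here rather than taken from the speakers, is to compute how often the judge agrees with humans on passing traces and on failing traces separately. The labels below are invented.

```python
def judge_agreement(human, judge):
    """Compare binary labels (True = pass). Returns (tpr, tnr):
    tpr = share of human-passed traces the judge also passed,
    tnr = share of human-failed traces the judge also failed."""
    if len(human) != len(judge):
        raise ValueError("label lists must have the same length")
    tp = sum(h and j for h, j in zip(human, judge))
    tn = sum((not h) and (not j) for h, j in zip(human, judge))
    pos = sum(human)
    neg = len(human) - pos
    tpr = tp / pos if pos else float("nan")
    tnr = tn / neg if neg else float("nan")
    return tpr, tnr

# Hypothetical labels for eight traces.
human_labels = [True, True, True, False, False, False, False, True]
judge_labels = [True, True, False, False, False, True, False, True]
tpr, tnr = judge_agreement(human_labels, judge_labels)
print(f"agreement on passes: {tpr:.2f}, agreement on failures: {tnr:.2f}")
```

Reporting the two rates separately avoids a misleadingly high overall agreement number when failures are rare.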

Lenny Rachitsky (01:00:56):

That is interesting. Your advice is not skip straight to evals and LLM as judge

This is one of the coolest research reports you can possibly read if you want to

That’s the best name for a researcher.

We did this super fun study when we were doing user studies with people who were

Yeah, okay, great. You still got to do product the same way, but now you have th

It’s not that many, because a lot of the failure modes, as Hamel said earlier, c

Probably the ones that are most risky to your business if they say something lik

But it’s a lot of one-time cost. Right now, forever, you can run this on your ap

What comes next after you’ve built your LLM judge? Well, we find that people jus

Okay, great segue to a debate that we got pulled into that was happening on X th

I think that works. There’s two things to that, right? One is they’re standing o

We’ll also say that coding agents are fundamentally very different than other AI

The other thing is, yeah, engineers have a dogfooding personality. There are ple

Dogfooding is a dangerous one, only because a lot of people will say they’re dog

Yeah, okay. What I’m hearing is you consider A/B tests as part of the suite of e

Just to add to the previous question a little bit, why is there this debate, A/B

If you just call it, “We’re just doing error analysis, doing data science to und

Yeah, they don’t correlate with math problem-solving, sorry to say.

The fact that your course on Maven is the number one highest grossing course in

It gets me every time. The Internet’s so inconsistent. My favorite thing was yes

Shoot, many humans are still great. I think that’s great news.

Those are the top two? Okay.

Oh, those are definitely… Then, I guess the third one I would add is, there’s

Sweet, so don’t be scared. Use LLMs as much as you can throughout the process.

Yeah. Let me actually share my screen, because I want to show something. To pigg

Amazing. A question I didn’t ask, but this is I think something people are think

Yeah, it’s really not that much time. I think people just get overwhelmed by how

Something I want to make sure we cover before we get to a very exciting lightnin

Yeah, I can talk about the syllabus a little bit, and then Hamel can talk about

Hamel Husain (01:36:20):

I have no idea. I just take one month at a time. I don’t know where we’re going

Yeah, maybe 30 seconds. Do you guys train it on the voice mode, by the way? That

Yeah, sign up for the course and then you’ll get a bunch of emails. Everything w

Bittersweet, bittersweet. Incredible. Okay. With that, we’ve reached our very ex

I like to recommend a fiction book because life is about more than evals. Recent

They’re down the street, him and Berkeley.

Super cool. Oh, man, nerds, I love it. Okay, next question. Favorite recent movi

Lenny Rachitsky (01:40:30):

I feel like everyone goes through that. Eventually in their life they decide, I

Lenny Rachitsky (01:40:58):

Worth it. Okay, next question. Do you have a favorite product you’ve recently di

Yeah, I really like Claude Code and I like it because I feel like the UX is outs

There we go. Okay, two more questions. Hamel, do you have a favorite life motto

I like that. For me, it’s to always try to think about the other side’s argument

Amazing. Final question. When I have two guests on, I always like to ask this qu

Yeah. My favorite thing about Hamel is his energy. I don’t know anybody who cons

Yeah, it’s pretty easy to find me. My website is Hamel.dev. I’ll give you the li

My pleasure. Bye everyone. Thank you so much for listening. If you found this va
