# Zigging vs. zagging: How HubSpot built a $30B company | Dharmesh Shah (co-foun
What’s cool about this is you don’t need to do this many, many times. For most p
People have been burned by evals in the past. People have done evals badly, so t
A term that you used in your posts that I love is this idea of a benevolent dict
Thank you for having us.
Sure. Evals is a way to systematically measure and improve an AI application, an
So just to make it very real, so imagining this real estate agent, maybe they’re he
Okay. I like what you said first, which is we had a very broad definition. Evals
Amazing. That’s so cool.
Yeah. Yeah, it’s really cool. And you see all of these different sort of feature
Hamel Husain (00:18:32):
Yeah, and you don’t have to do it for all of your data. You sample your data and
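As a rough illustration of the sampling step mentioned here, the sketch below draws a fixed-size random sample of traces for manual review. The trace structure, the `sample_traces` helper, and the default of 100 are assumptions for illustration (the number echoes the "at least 100" recommendation later in the conversation); none of this code appears in the episode.

```python
import random

def sample_traces(traces, k=100, seed=0):
    """Return up to k traces chosen uniformly at random, reproducibly."""
    rng = random.Random(seed)      # fixed seed so the same sample can be re-drawn
    k = min(k, len(traces))        # never ask for more traces than exist
    return rng.sample(traces, k)   # sampling without replacement

# Hypothetical logged traces: each is a dict with an id and the conversation text.
traces = [{"id": i, "text": f"conversation {i}"} for i in range(1000)]
to_review = sample_traces(traces, k=100)
print(len(to_review))
```

Fixing the seed is a design choice: it lets two reviewers look at the same sample, at the cost of needing a new seed when a fresh sample is wanted.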
Yeah, so what do you do with something like that?
This is more of, “Hey, we’re not handling this interaction correctly. This is mo
It’s amazing you’re catching that, too, here. Otherwise, you’d have no idea this
Yeah, it’s supposed to be chill. Just don’t overthink it. And there’s a way to d
One common question that we get from people at this stage is, “Okay, I understan
Do you think we’ll get to a place where an agent can do this, where it has that
Lenny Rachitsky (00:25:17):
Okay, maybe Hamel cover that, actually.
And so benevolent dictator is just a catchy term for the fact that when you’re d
Hamel Husain (00:27:42):
Yeah. It’s going to be okay. It’s not perfection. You’re just trying to make pro
Hamel Husain (00:29:14):
So, okay. So let’s say we do, and Shreya and I, we recommend doing at least 100
… in data analysis and qualitative analysis called theoretical saturation. So
Shreya Shankar (00:31:34):
Yeah. Okay. So you did 100 of these. Now you have all these notes. So this is wh
Just reviewing traces. At least there’s one job left for now. Great.
Yes. Creating axial codes, so what it does is-
Lenny Rachitsky (00:34:39):
Lenny Rachitsky (00:36:43):
“… do this.” I do think it’s a little bit hard, right? Part of this whole expe
Amazing. Okay. What’s funny about you guys doing this is I just want to go do th
Yeah. So I pulled up a video just to drive home Shreya’s point. We are not inven
… be really fun. Two, I love that my podcast episode that just came out today is in
Okay. So you can do this through anything, and the same thing works just fine in
And so basically, what you could do is you can categorize your traces into one o
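One way to picture the counting step being described: once each reviewed trace carries one or more category labels, a simple tally shows which failure modes are most common. This is a minimal sketch; the category names and the `count_failure_modes` helper are hypothetical, and the same tally could just as well be done with a pivot table in a spreadsheet.

```python
from collections import Counter

def count_failure_modes(labeled_traces):
    """Tally how many traces fall into each failure category, most common first."""
    counts = Counter()
    for trace in labeled_traces:
        # set() so a category repeated on one trace is only counted once
        counts.update(set(trace["categories"]))
    return counts.most_common()

# Hypothetical output of manual review: each trace tagged with zero or more categories.
labeled = [
    {"id": 1, "categories": ["human_handoff_issue"]},
    {"id": 2, "categories": ["human_handoff_issue", "formatting_error"]},
    {"id": 3, "categories": []},
    {"id": 4, "categories": ["formatting_error"]},
    {"id": 5, "categories": ["human_handoff_issue"]},
]

for category, n in count_failure_modes(labeled):
    print(category, n)
```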
Yeah. Or have it with 10 other words.
Yeah, okay. What are some of those other words that people often use that you th
Lenny Rachitsky (00:43:17):
It’s in the loop. Still space for us. Great.
Lenny Rachitsky (00:44:04):
Yeah. It’s absurd to feel like you wouldn’t know this is happening. Watching thi
Okay. So here’s sort of the big unveil. This is the magic moment right now. So w
So just to try to mirror back what you’re describing, you want to test what your
Absolutely. You nailed it.
And the goal here is just to have a suite of tests that run before you ship to p
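A pre-ship test suite of this kind could be sketched as follows, under stated assumptions: `run_assistant` is a stand-in for the real application call, and the specific check (no markdown in a plain-text reply) is a hypothetical example of a code-based eval, not a rule taken from the conversation.

```python
def run_assistant(message):
    """Stand-in for the real application call (hypothetical)."""
    return f"Thanks for reaching out about: {message}"

def check_no_markdown(reply):
    """Example code-based check: a plain-text channel should not receive markdown."""
    return not any(token in reply for token in ("**", "## "))

# Hypothetical fixed inputs, ideally drawn from failures found during trace review.
TEST_CASES = ["Is unit 4B available?", "Can I tour on Friday?"]

def run_suite():
    """Return the inputs whose replies fail the check; an empty list means all pass."""
    return [m for m in TEST_CASES if not check_no_markdown(run_assistant(m))]

if __name__ == "__main__":
    failures = run_suite()
    print("failures:", failures)
```

In practice a function like `run_suite` would be wired into whatever runs before a release, so a non-empty failure list blocks the ship.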
Lenny Rachitsky (00:52:04):
Awesome. Okay. Hamel’s got an example of an actual LLM as a judge eval here, so
It’s wild how much drama there is in the evals space. We’re going to get to that
You’re going through manually, you do that.
As a product manager or someone, even if you’re not doing this calculation yours
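The transcript does not spell out the calculation at this point. Assuming it refers to checking how often an LLM judge agrees with human labels, the sketch below computes agreement separately on traces humans marked as passing and on traces humans marked as failing. The function name, the boolean label format, and the example data are all assumptions.

```python
def agreement_rates(human, judge):
    """Compare judge verdicts to human labels (True = pass, False = fail).

    Returns (rate of agreement on human-pass traces,
             rate of agreement on human-fail traces).
    A rate is None when there are no traces of that kind.
    """
    pairs = list(zip(human, judge))
    passes = [j for h, j in pairs if h]        # judge verdicts where human said pass
    fails = [j for h, j in pairs if not h]     # judge verdicts where human said fail
    pass_rate = sum(passes) / len(passes) if passes else None
    fail_rate = sum(not j for j in fails) / len(fails) if fails else None
    return pass_rate, fail_rate

# Hypothetical labels for six traces.
human = [True, True, True, False, False, True]
judge = [True, True, False, False, True, True]
print(agreement_rates(human, judge))
```

Reporting the two rates separately, rather than a single overall agreement number, avoids a judge looking accurate simply because failures are rare in the sample.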
Lenny Rachitsky (01:00:56):
That is interesting. Your advice is to not skip straight to evals and LLM as judge
This is one of the coolest research reports you can possibly read if you want to
That’s the best name for a researcher.
We did this super fun study when we were doing user studies with people who were
Yeah, okay, great. You still got to do product the same way, but now you have th
It’s not that many, because a lot of the failure modes, as Hamel said earlier, c
Probably the ones that are most risky to your business if they say something lik
But it’s a lot of one-time cost. Right now, forever, you can run this on your ap
What comes next after you’ve built your LLM judge? Well, we find that people jus
Okay, great segue to a debate that we got pulled into that was happening on X th
I think that works. There’s two things to that, right? One is they’re standing o
We’ll also say that coding agents are fundamentally very different than other AI
The other thing is, yeah, engineers have a dogfooding personality. There are ple
Dogfooding is a dangerous one, only because a lot of people will say they’re dog
Yeah, okay. What I’m hearing is you consider A/B tests as part of the suite of e
Just to add to the previous question a little bit, why is there this debate, A/B
If you just call it, “We’re just doing error analysis, doing data science to und
Yeah, they don’t correlate with math problem-solving, sorry to say.
The fact that your course on Maven is the number one highest-grossing course in
It gets me every time. The Internet’s so inconsistent. My favorite thing was yes
Shoot, many humans are still great. I think that’s great news.
Those are the top two? Okay.
Oh, those are definitely… Then, I guess the third one I would add is, there’s
Sweet, so don’t be scared. Use LLMs as much as you can throughout the process.
Yeah. Let me actually share my screen, because I want to show something. To pigg
Amazing. A question I didn’t ask, but this is I think something people are think
Yeah, it’s really not that much time. I think people just get overwhelmed by how
Something I want to make sure we cover before we get to a very exciting lightnin
Yeah, I can talk about the syllabus a little bit, and then Hamel can talk about
Hamel Husain (01:36:20):
I have no idea. I just take one month at a time. I don’t know where we’re going
Yeah, maybe 30 seconds. Do you guys train it on the voice mode, by the way? That
Yeah, sign up for the course and then you’ll get a bunch of emails. Everything w
Bittersweet, bittersweet. Incredible. Okay. With that, we’ve reached our very ex
I like to recommend a fiction book because life is about more than evals. Recent
They’re down the street, him and Berkeley.
Super cool. Oh, man, nerds, I love it. Okay, next question. Favorite recent movi
Lenny Rachitsky (01:40:30):
I feel like everyone goes through that. Eventually in their life they decide, I
Lenny Rachitsky (01:40:58):
Worth it. Okay, next question. Do you have a favorite product you’ve recently di
Yeah, I really like Claude Code and I like it because I feel like the UX is outs
There we go. Okay, two more questions. Hamel, do you have a favorite life motto
I like that. For me, it’s to always try to think about the other side’s argument
Amazing. Final question. When I have two guests on, I always like to ask this qu
Yeah. My favorite thing about Hamel is his energy. I don’t know anybody who cons
Yeah, it’s pretty easy to find me. My website is Hamel.dev. I’ll give you the li
My pleasure. Bye everyone. Thank you so much for listening. If you found this va
Key Concepts
- Anthropic
- Apple
- ChatGPT
- Claude
- Claude Code
- Codex
- Cursor
- Delphi
- Discord
- Dscout
- Evals
- Google
- Hamel Husain
- Julius
- Lenny Rachitsky
- Maven
- Mercury
- Nurture Boss
- OpenAI
- Phoenix
- Shreya Shankar
- Statsig