Google published a video on its blog yesterday showing a snippet of an internal search quality meeting. It is pretty amazing to watch, even though everyone in the room knows they are being recorded and it is only a small snippet of the meeting.
Here is the video:
Who are the main players at this table?
- Left - Ben Azose, Search Quality Analyst
- Middle - Amit Singhal, in charge of search
- Right - Matt Cutts
- Left - Scott Huffman, testing man
- Middle - Pandu Nayak (um, Panda update)
- Right - Paul Haahr, Ranking Lead
- Left - Matt Cutts
- Middle - Ben Gomes
- Right - Lars Hellsten, Engineer, Spell Correction Team
Here is the transcript:
0:02Singhal: Everyone, thank you for setting this up,
0:04and guys, thank you for putting up
0:05with all the inconvenience we are putting you through.
0:09It so happens that this meeting is the heart of what we do.
0:15What we approve, how we run Search.
0:18This is an experiment.
0:20We will see how the tape comes out.
0:23If I look bad, we will not put it out.
0:30If Gomes looks bad,
0:31we will put it on the front page.
0:37Huffman: All right. Spell-correcting long queries.
0:43Lars: So, to keep our latency low,
0:45spelling has always just corrected
0:4810 terms in long queries,
0:50and we decided to use the first 10 terms,
0:55which was sort of arbitrary,
0:57and so this is a change by Euro in Zurich,
0:59who decided that we could be
1:01a little bit more intelligent about this.
1:03And so we're going to pick the two words
1:06that we think are most likely to be misspelled in the query,
1:09and form intervals of five words around each,
1:12so we're still correcting only 10 words.
1:14And this is just a smarter way
1:18of deciding which words to correct.
1:20Gomes: So, your context is the five words
1:22rather than the whole 10 words.
1:24So, you're more likely to find a match.
1:25man: Well, in general, the context is only three words,
1:27because we use trigrams for this thing,
1:29but they correct five words at a time,
1:32rather than simply the first 10 words.
1:35man: So, if you take a look at the mean scores...
1:36man: This is huge.
1:38man: This is very, very positive.
1:39man: We send both fragments to spelling separately,
1:42or is it strung together?
1:43Lars: No, they're sent together.
1:45We have a way of marking which terms we correct,
1:47and which terms we won't correct.
1:48Cutts: But roughly what percentage
1:49of queries have more than 10 terms?
1:51Lars: Not a lot. So...
1:54man: But it is very annoying
1:56when your misspelling is towards the end of a long query,
2:00and you don't-- you don't get it.
2:01And it's so obviously wrong.
2:03Paul: We've seen these where it was pasted quotes
2:04and the last word is mangled.
2:06Singhal: Why would anything ever go wrong with this?
2:08man: Yeah. man: It does, because you try
2:11to correct something late in the query,
2:13and you'll see some examples
2:15where early in the query
2:16there's also a misspelling which you failed to correct.
2:18Singhal: Oh, so, because of your two-word selection,
2:21you end up picking--
2:22if there are more than two misspellings in a query...
2:24man: Or there's a very rare word that makes you believe
2:27that that's a potential misspelling.
2:29'Cause you don't know it's a misspelling.
2:30Gomes: Why wouldn't you apply the misspelling
2:32across the whole query?
2:33The same misspelling, you're saying,
2:35would get corrected in one place, because of context?
2:36man: No, no, no, it's a different
2:38misspelling at the beginning.
2:39The problem is we--if we could just correct the whole thing,
2:42but then you'd pay in cost.
2:44Right, latency and things, so they don't want to do that.
2:46man: It's mostly the latency, right?
2:48Like, why? I don't know, it seems a little like--
2:50We could do, you know, hundreds of--thousands of QPS,
2:53right, why can't we send--
2:55break the query up into multiple chunks,
2:56and then send them all through parallel so that--
2:59so we can correct the entire query, right?
3:00Lars: We could do that, but I think the traffic effect
3:03would just be a really small slice.
3:06But why not just do that, right?
3:08I mean, like, take overlapping five-word windows,
3:11and send runs of 10-word queries,
3:14as many as you can make out of a query,
3:16and send them all in parallel?
3:18man: Actually 'cause there's only a 0.1% change.
3:22Singhal: And, you know-- And by the way,
3:24in most cases, you'll be pretty much done.
3:26You will cover up to 15-word queries
3:30with just two.
3:32Paul: I think we should certainly launch this.
3:35I think [indistinct] gets points for a clever idea on it,
3:36but I think it is driving the same--
3:38the idea of splitting it.
3:41That's probably more infrastructure work.
3:43man: I'm sorry, I just want to jump back to this problem
3:45with the beginnings of the queries.
3:47So, these situations where--
3:50if you look at the second one in the second block there.
3:52"Int he book 'Julius Caesar,'" et cetera, et cetera, et cetera,
3:57we don't catch--we catch all sorts of misspellings
3:59about Caesar and differences,
4:01but we miss the fact that "int he" should be "in the."
4:05We have another query
4:09about sponsoring a child living in Tenerife,
4:11and we want to figure out
4:14whether "Tenerife" is misspelled,
4:15but we miss the fact that it's "cam" instead of "can."
4:20Gomes: By the way, are you doing this--
4:21but in the course of Suggest--
4:22So, the same thing will work with Suggest?
4:24When we have live-spelling Suggest?
4:27man: I'm sure if-- once you launch this,
4:29Suggest will do the same thing, right?
4:31Gomes: So, Suggest will be actually all from--
4:32man: This is all inside the Spell server,
4:34so there are no multiple calls being made.
4:36It's all embedded inside the Spell server.
4:39Singhal: So, on the sponsor, did we send the context,
4:41left and right?
4:43man: We did.
4:44man: And then why didn't we correct the context?
4:47Lars: Actually, this is sort of
4:48an issue with the current implementation.
4:50If there are--
4:52if there are two intervals that are close enough together,
4:55then we merge them into one,
4:57so what's actually happening is we're correcting, I think,
5:00from "I" to "credit."
5:03Paul: So, we just missed one.
5:05Lars: Yeah, so...
5:06Paul: Picked--We picked slightly the wrong window.
5:09Look, that's gonna happen with any of these.
5:11man: I mean, certainly, the original thing
5:13of picking the first 10 was missing a lot of words.
5:15man: That's right, that's right.
5:17Paul: The averages say this is clearly an improvement.
5:20Cutts: But if this is, like, .01% of queries,
5:23why not just correct--
5:24man: No, it's .1. Not .01, it's .1.
5:27Cutts: But how much, resource-wise--
5:30Paul: I think it's more just the--
5:32the infrastructure work on doing it.
5:34Because you now have to have the Spell servers call out
5:35to other Spell servers.
5:40man: It seems good.
5:42Gomes: I mean, to a large extent, you will be seeing
5:43those spell corrections happening in Suggest,
5:44because you're going to get that initial window...
5:47Paul: I think a lot of these are just pasted queries, though.
5:49Singhal: This is cut and pasted.
5:51These are cut and paste. No one's typing these.
5:54man: We're seeing a lot of people's--
5:56Paul: "Cam I sponsor--" man: "Cam I sponsor."
5:59Cutts: The Caesar one is--that's a kid just doing his homework.
6:02Paul: "Stein, S. et al amino acid analysis"
6:04is a pasted query.
6:08Paul: So, I mean--So... man: Not all of them.
6:10Paul: But not all of these.
6:11Cutts: Like "how long do you have to wait
6:13to wash your hair after a perm?"
6:14man: "Int he book" is almost certainly not.
6:15Singhal: It may hap--
6:17plenty of pastes do all kinds of funky things.
6:19man: And if you look at the wins,
6:20a lot of those are definitely typed queries.
6:23Anyway, I--Look, this is clearly a good change.
6:26man: Maybe a great change.
6:28Paul: Let's give a recommendation to the team
6:29to actually give up the 10-word limits.
6:31Singhal: No, but I want some follow-up
6:32on that recommendation.
6:35Singhal: So, how are we gonna get that follow-up?
6:37man: Your recommendation is issuing multiple...
6:40Singhal: Just do it. All of it.
6:41Right in that, by chunking.
6:42man: Yeah, I just think we should have some system
6:43that can handle 100-word queries, right?
6:46I don't know. Singhal: Yeah.
6:48man: We shouldn't die on, like, the hardest query.
6:50Paul: Right, but I think we end up doing that on the front end
6:52and not in--
6:53Singhal: No, but I don't--Paul.
6:54Paul: We don't care. Singhal: I'm sorry.
6:56You're defending something that--
6:57you know, the design is not perfect.
6:58Don't defend it.
7:01Paul: I think it's fine to do the recommendation,
7:03but I think this is a good step,
7:04and I think Euro gets points
7:05for getting us to look at that again.
7:07Singhal: No, that's fine, but, you know,
7:08I want to make sure that the team comes back,
7:10or we put some kind of exploding deadline
7:12that, you know, we won't do this
7:13if you don't do it right within, say, three months.
7:17Gomes: Your Spell server is being used
7:19for running text in other places too?
7:21man: Now they're being used in--
7:24for red underline also, isn't that right?
7:27Lars: We don't use--
7:28We use the same servers-- Yes and no.
7:30man: But a different set-up?
7:32Lars: For part of--
7:33one of the red underline clients is using them, yeah.
7:35man: So, that must be much longer chunks of text.
7:37man: No, no, but I think they break it up into smaller chunks.
7:40Singhal: Treat this as someone's typing an email.
7:42man: Email, right. That's what I was...
7:43man: And bring in all the red underlines.
7:49Singhal: Okay, we can launch this, but...
7:52man: I mean, remember,
7:53we may still have problems even there
7:55because of context and things, if you break things up.
7:58So, there'll always be--
8:01Gomes: Yeah, treat it as running text, right?
8:03Singhal: Okay. man: Okay.
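
To make the change Lars describes a bit more concrete, here is a minimal Python sketch of the idea: pick the two words that look most likely to be misspelled, form a five-word interval around each, and merge intervals that run into each other. This is not Google's code. The function name `pick_windows`, the `suspicion` scoring function, the toy scores, and the rule that intervals merge when they overlap or touch are all assumptions made for illustration; the meeting only says that intervals "close enough together" get merged.

```python
from typing import Callable, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) range of word positions


def pick_windows(
    words: List[str],
    suspicion: Callable[[str], float],
    picks: int = 2,
    width: int = 5,
) -> List[Interval]:
    """Return the word ranges that would be sent for spell correction.

    Hypothetical sketch: `suspicion` stands in for whatever signal says
    "this word is probably misspelled"; the real signal is not described.
    """
    n = len(words)
    # Short queries fit inside the correction budget, so cover everything.
    if n <= picks * width:
        return [(0, n)] if n else []

    # Rank positions by suspicion, highest first; ties go to the earlier word.
    ranked = sorted(range(n), key=lambda i: (-suspicion(words[i]), i))
    centers = sorted(ranked[:picks])

    # Center a fixed-width window on each pick, clamped to the query bounds.
    windows = []
    for c in centers:
        start = min(max(c - width // 2, 0), n - width)
        windows.append((start, start + width))

    # Assumption: windows that overlap or touch are merged into one.
    merged = [windows[0]]
    for start, end in windows[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end:
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged


# Made-up query and scores, loosely modeled on the "Int he book" example.
words = ("int he book julius ceasar by shakespeare "
         "who said beware the ides of march").split()
toy_scores = {"ceasar": 0.9, "ides": 0.8, "int": 0.6}


def suspicion(word: str) -> float:
    return toy_scores.get(word, 0.0)


print(pick_windows(words, suspicion))  # [(2, 7), (9, 14)]
```

With these toy scores the two windows cover "book ... shakespeare" and "beware ... march", so the "int he" at the start of the query falls outside both. That is the same kind of miss the group discusses: a rare word elsewhere in the query wins one of the two slots and an early misspelling goes uncorrected.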