Archive

Archive for April, 2007

Programming Puzzles for Screening Engineers

April 29th, 2007

Hiring great engineers is perhaps the toughest and most important challenge that any tech startup faces. You want smart, energetic, adaptable people with a broad range of skills and interests, the ability to finish what they start, and who fit into the culture you’re trying to establish or maintain. There’s no foolproof set of techniques for predicting whether or not a candidate will have the right characteristics in the right proportion once hired into your environment. Sometimes it’s hard to tell even after you’ve hired someone and have had a chance to observe their performance for many months!

Many tech startups (justifiably!) focus lots of energy on assessing smartness in the context of basic computer science and engineering. It is widely accepted in tech that job candidates are ‘treated’ to lots of brain teasers and problem solving ‘opportunities’ during interviews. Another mechanism that I’ve noticed recently is the use of programming puzzles as a pre-interview filter. Google tried this in 2004 with the Google Labs Aptitude Tests. Recently I’ve seen some buzz around the programming puzzles that Facebook is using to gather applications.

It’s hard to say how predictive these puzzles are of future success in startup engineers. Regardless, they can be fun and in and of themselves can perhaps attract applications from good people who otherwise wouldn’t apply to your company. I even got sucked into one of the Facebook puzzles, Prime Bits which has a trivial solution whose slowness may at first be hard to see, and at least one fast solution which might not at first be obvious. That, by my estimation, makes this a pretty good puzzle. I will also note that someone at Facebook must have a thing for combinatorial puzzles. :-)

Nerdness, The Tech Biz

More Fun With Bloom Filters

April 28th, 2007

In my previous Bloom filter post, I talked about using Bloom filters for spell checking in memory constrained environments. With respect to problems like spell checking, contemporary machines can’t exactly be described as memory constrained. One can easily load a comprehensive language dictionary into a few megs of memory and store it in an ordinary hash table, trie, or other structure for efficient lookups.

That said, even modern machines with multiple gigs of RAM can choke on the sizes of some datasets for which we’d like to support efficient lookup. Also, there are many instances where we need to exchange sets of values between networked machines. Bloom filters can help us conserve precious network bandwidth and reduce latency.

Two application areas where huge datasets must be efficiently queried and/or exchanged are networking and databases. Andrei Broder and Michael Mitzenmacher wrote an excellent survery of Bloom filter uses in networking. I’ll talk a little bit about database applications since the literature is a bit older and the only survey I can find is a paid download from the ACM Digital Library.

In databases, Bloom filters can be used to process distributed joins. A distributed join allows us to join relations (tables) that reside on different machines. To make this concrete, suppose that we have two tables PurchaseStats and PageViewStats that reside on different database servers, and we want to execute the following SQL to get a list of all page views that have an associated purchase.

SELECT views.TimeOfDay, views.PageURL, purchases.ItemName,
  purchases.Price
FROM PageViewStats as views, PurchaseStats as purchases
WHERE views.EventId = purchases.EventId; 

Using a Bloom filter, we can execute this query as follows:

  • Choose one of the tables, say PurchaseStats, and compute a Bloom filter of all EventIds in the table.
  • Send the Bloom filter F to the remote database server hosting PageViewStats
  • For each record in PageViewStats, add EventId, TimeOfDay, and PageURL to a temporary table T for every PageViewStats EventId in the Bloom filter F.
  • Send T back to the database server hosting PurchaseStats.
  • Perform an ordinary join on PurchaseStats and T.

The reason that we use a Bloom filter here is to minimize the amount of data we’re transmitting over the network in order to execute the join. Now, since we’re using a Bloom filter, the temporary table T might have more records in it than necessary due to false positives. However, the cost of sending those additional records almost always pales in comparison to the data transmission reduction we get from using the Bloom filter in the first place.

Obviously implementations of Bloom joins in actual distributed databases must deal with complexities that I haven’t mentioned here. For example, from which table do you compute the Bloom filter from? How do you handle multi-table joins? What happens when the join expression is more complicated than a single equality term? For answers to these questions, have a look at the excellent survey Join Processing in Relational Databases by Priti Mishra and Margaret Eich.

Coding

Interesting Googler Papers in WWW2007

April 27th, 2007

For folks working in search or doing research about search, one of the biggest conferences every year is WWW. The program is out now, and several Googlers have interesting papers among those accepted.

Tech

Two Quick Things

April 27th, 2007

First, the Computer Language Benchmarks Game is an interesting approach to resolving the old-as-computers question “which programming language is faster?” Many folks (self included) will tell you that the question is mostly nonsense. But, if you insist on asking it and then answering it with benchmarks, the CLBG is a great way to remove some of the experimental biases that commonly plague that approach.

The second thing, and perhaps even more interesting than the first, is how I came across the CLBG. The link was presented to me as a recommendation from Google search personalization due to some Python-related searches that I had done recently. I was dubious about how effective query-less searches would be. The degree to which this result matches my interests seems to indicate that I should have fewer doubts.

Nerdness, Tech

Jessica Livingston @ Google on “Founders At Work”

April 25th, 2007

Jessica Livingston is a partner at Y Combinator and author of “Founders At Work”, the latest hot read in the Silicon Valley. “Founders At Work” is a collection of interviews with founders of several high-profile, successful startups. Jessica was at Google on April 19th to give a talk about the book. She distilled much of the essence of the book in a 50 minute presentation. And, Google taped the talk and has made it available to the public (with Jessica’s consent of course) on YouTube. Definitely worth a watch.

The Tech Biz