Archive

Archive for the ‘Coding’ Category

iPhone First Impressions

July 1st, 2007

I think that the news media and blogosphere are doing an adequate job with iPhone coverage on the device’s first weekend of commercial availability. I doubt that I’m going to add substantially to what’s already been said.

I will say that I’m exceedingly pleased with the iPhone and have only three non-trivial gripes. One is AT&T’s pokey EDGE network. I hope that Apple is getting the message loud and clear that EDGE is unacceptably slow and that they’ll do something about this in the next generation of the device.

My second two gripes are software absences that I hope will get fixed soon. I wish that the iPhone had an RSS reader and a real chat client. For feeds, using Safari and reader.mac.com is better than nothing. But typing in the feed URL of every RSS feed you want to read isn’t great. My quick fix for this problem was a Python hack to convert OPML into an iPhone-ized blogroll. The code is available here. To use it, do the following:

  • Use the export OPML feature of your RSS reader to produce an XML file containing your subscriptions. We’ll assume this is called export.xml
  • Run ‘python iphone_blogroll.py export.html > blogroll.html’
  • Copy blogroll.html to some convenient place on the web reachable by your iPhone

The resulting blogroll should be properly formatted for the iPhone Safari browser and uses reader.mac.com for actual feed reading. My iPhone-ized blogroll is available for those who are curious about how it looks.

[NOTE: you'll need to download and unpack Juri Pakaste's python-opml library before my script will work.]

Coding, Nerdness

Locality Sensitive Hashing

May 8th, 2007

If you happened to have perused the paper Detecting Near-Duplicates for Web Crawling which I mentioned in an earlier post, you may be curious about exactly why the simhash algorithm actually works. The details of simhash are left for the reader to discover in a cited paper by Moses Charikar entitled “Similarity Estimation Techniques from Rounding Algorithms”. Unfortunately, Charikar doesn’t use the term simhash in his paper, and it might not be obvious to the reader how to connect the dots.

Connecting those dots isn’t necessarily easy. But, here are a few bits and pieces of information that might help.

In classical information retrieval, one often measures the similarity of two documents by converting those documents into so-called ‘feature vectors’ and then computing the angle in Euclidean space between those two vectors. If the documents are identical, the angle between their feature vectors will be 0. The more significant the differences are between the documents, the larger the angle between their feature vectors.

You can think of a feature vector as a point in a (potentially) high-dimensional space where each dimension corresponds to a unique feature value. To make things concrete, let’s suppose that our features are English words. The dimensions then would have labels like ‘quick’, ‘brown’, ‘fox’–rather than the more familiar x, y, and z–and the dimensionality of the vectors would be the number of words in the English language.

Given a vector representing a document in such a space, the ith component corresponding to word w could take on the following values:

  • 0 if word w isn’t present in the document,
  • 1 if the word w is present in the document at least once, or
  • k if the word w is present in the document k times, or we want it’s presence to represent k units of significance

Given two vectors, you can compute the angle between them in the usual way (arccos of their dot product.)

In Charikar’s paper, he’s interested in computing ’sketches’ of feature vectors such that similarity between two sketches indicates similarity between the two corresponding vectors. A sketch is simply a compact representation of an object, e.g., the hash of string. Given that the sketches are compact, they’re more practical to store and manipulate. In section 3, titled ‘Random Hyperplane Bashed Hash Functions for Vectors’, Charikar describes a way to sketch n-dimensional feature vectors into k-dimensional bit vectors (n >> k). You do this by choosing k random vectors r_i in the given feature vector space, and defining k one bit hash functions as follows:

  • h_i(v) = 1, if dot(r_i, v) >= 0
  • h_i(v) = 0, if dot(r_i, v) < 0

The k-bit sketch is then trivially arrived at by passing a vector v through each of the k hash functions and concatenating the resulting bits.

It’s actually pretty easy to show that using the sketching technique above that the probability that sketch(u) is equal to sketch(v) is 1 – theta / pi where theta is the angle between feature vectors u and v.

Now, the final dot left to be connected is answering the question: how is simhash related to sketch? The answer is that they are more or less the same. There’s no explicit proof of this equivalence in Charikar’s paper, and I’m not going to try to manufacture one right here given that it would make an already long and potentially confusing blog post that much longer and more confusing. I’ll endeavor to clean up this last detail in a future post. If you’re curious now, plunging one level deeper and reading the papers that Charikar cites in Section 3 will give you the information you need to construct a proof on your own.

Coding

More Fun With Bloom Filters

April 28th, 2007

In my previous Bloom filter post, I talked about using Bloom filters for spell checking in memory constrained environments. With respect to problems like spell checking, contemporary machines can’t exactly be described as memory constrained. One can easily load a comprehensive language dictionary into a few megs of memory and store it in an ordinary hash table, trie, or other structure for efficient lookups.

That said, even modern machines with multiple gigs of RAM can choke on the sizes of some datasets for which we’d like to support efficient lookup. Also, there are many instances where we need to exchange sets of values between networked machines. Bloom filters can help us conserve precious network bandwidth and reduce latency.

Two application areas where huge datasets must be efficiently queried and/or exchanged are networking and databases. Andrei Broder and Michael Mitzenmacher wrote an excellent survery of Bloom filter uses in networking. I’ll talk a little bit about database applications since the literature is a bit older and the only survey I can find is a paid download from the ACM Digital Library.

In databases, Bloom filters can be used to process distributed joins. A distributed join allows us to join relations (tables) that reside on different machines. To make this concrete, suppose that we have two tables PurchaseStats and PageViewStats that reside on different database servers, and we want to execute the following SQL to get a list of all page views that have an associated purchase.

SELECT views.TimeOfDay, views.PageURL, purchases.ItemName,
  purchases.Price
FROM PageViewStats as views, PurchaseStats as purchases
WHERE views.EventId = purchases.EventId; 

Using a Bloom filter, we can execute this query as follows:

  • Choose one of the tables, say PurchaseStats, and compute a Bloom filter of all EventIds in the table.
  • Send the Bloom filter F to the remote database server hosting PageViewStats
  • For each record in PageViewStats, add EventId, TimeOfDay, and PageURL to a temporary table T for every PageViewStats EventId in the Bloom filter F.
  • Send T back to the database server hosting PurchaseStats.
  • Perform an ordinary join on PurchaseStats and T.

The reason that we use a Bloom filter here is to minimize the amount of data we’re transmitting over the network in order to execute the join. Now, since we’re using a Bloom filter, the temporary table T might have more records in it than necessary due to false positives. However, the cost of sending those additional records almost always pales in comparison to the data transmission reduction we get from using the Bloom filter in the first place.

Obviously implementations of Bloom joins in actual distributed databases must deal with complexities that I haven’t mentioned here. For example, from which table do you compute the Bloom filter from? How do you handle multi-table joins? What happens when the join expression is more complicated than a single equality term? For answers to these questions, have a look at the excellent survey Join Processing in Relational Databases by Priti Mishra and Margaret Eich.

Coding

The Python Challenge

April 22nd, 2007

Whatever you do, please do not click on this link. I was at level 6 before I took fingers off the keyboard. I can guarantee you, not the right thing at all for improving your relationship with your spouse, girlfriend, boyfriend, etc., etc.

:-)

Coding

Bloom Filters

April 15th, 2007

A Bloom filter is a data structure for managing sets of values. Bloom filters provide O(1) lookups and insertion and, perhaps most importantly, provide an extremely compact representation of the set of values being stored. The trade-off for this compact representation is that the lookup operation can have false positives. In other words, lookup(x) may return true even when x isn’t in the set.

You might be wondering why we’d be willing or able to tolerate false positives in set lookups. There are actually lots of scenarios where this makes sense. For instance, the original application of Bloom filters–spell checking on limited-memory machines–remains a fine motivating example.

In spell checking, a Bloom filter is used to store a dictionary of correctly-spelled words. If lookup(word) returns false, the spell checker flags word as a misspelling. False positives in this application, e.g., lookup(‘notaword’) == True, results in some misspellings going unnoticed. That might seem to be a bad thing, but it’s all about balancing trade-offs. Bloom filters allow the spell checking application to load a comprehensive dictionary into a small amount of memory and makes spell checking fast enough that users can run the checker often. The small memory footprint can be achieved with a false positive rate that results in approximately 1 in 100 misspellings going undetected. Other trade-offs, like using a smaller dictionary or running the spell checker less frequently, might result in even higher error rates.

Bloom filters were invented by Burton Bloom in 1970 and described in his seminal paper Space/time Trade-offs in Hash Coding With Allowable Errors. Even though they’ve been around for 37 years now, are straighforward to implement, and have many, many practical uses, you typically don’t find Bloom filters described in data structures textbooks or taught in University undergraduate data structures courses. That’s something that should probably change IMO.

So, how do Bloom filters work? The concept is relatively simple, assuming that you’re already familiar with hashing and some simple probability. Recall that in traditional hashing you have a function h(val) that maps val onto an index in a table. Ideally, you want a function h() such that

  1. h(val1) == h(val2) when val1 == val2
  2. h(val1) != h(val2) when val1 != val2

In practice it’s vary hard to ensure condition 2 given fixed table sizes, so you have to relax the condition to something like “h(val1) is unlikely to equal h(val2) when val1 != val2″. This means that you occasionally get collisions where distinct values get the same hash value. There’s a vast literature out there about designing hash functions and sizing tables to minimize collisions, and describing how to handle collisions when they do occur. The salient point for this discussion is that these ‘traditional’ hashing techniques require that the complete value being hashed, or a unique proxy of that value, be stored in the hash table so that collisions can actually be detected.

This is where Bloom filters differ and the reason why they have false positives on lookup. When adding a value to a Bloom filter you compute k hashes which gives you k indices into an m-bit bit vector. The k bit vector entries for those indices are set to 1. To look up a value, you also compute k hashes to get k indices. If all k bit vector entries are 1 then you return true indicating that the value was found, otherwise, you return false. Here’s how you would do this in Python:

import BitVector

class BloomFilter:
  def __init__(self, m, k):
    """Instantiate an m-bit Bloom filter using k hash indices
    per value."""
    self.n = 0
    self.m = m
    self.k = k
    self.bv = BitVector.BitVector(size = self.m)

  def Insert(self, s):
    for i in self._HashIndices(s):
      self.bv[i] = 1
    self.n += 1

  def Lookup(self, s):
    for i in self._HashIndices(s):
      if self.bv[i] != 1:
        return False
    return True

[Note: this snippet uses Avinash Kak's BitVector module.]

Both m and k are user configurable values in this code, and we haven’t really said how to choose either. If we’re going to store n values in a Bloom filter, the probability of a false positive on any given lookup is given by pow(1 – exp(-kn/m)), k). In other words, the false positive rate increases as k and n get bigger, and as m gets smaller. If you have m and n, then a reasonable starting point for k is 0.7(m/n). The derivation for these formula are an exercise in simple probability. However, I’m going to leave it as an exercise given that this post is already running long.

Now, the next big question we have to answer in order to complete our implementation is how exactly to do the hashing. For a given population of n values, we want our k hash functions to distribute bits uniformly over the m indices in the bit vector. One way to do this is to choose k mutually independent hash functions. Alternately, if you have a hash function that produces a large, uniformly distributed set of bits for each value, you can chop the hash output into k buckets and use each bucket as a hash value.

When m is reasonably big (more than a few ten thousand bits) then you can get away with just having two independent hash functions using a technique discovered by Kirsch and Mitzenmacher. You can compute the k hash values as h_i(val) = (h1(val) + i * h2(val)) % m. Here’s our implementation.

def _HashIndices(self, s):
  indices = []
  for i in xrange(1, self.k + 1):
    indices.append((hash(s) + i * hashpjw(s)) % self.m)
  return indices

Notice that for h1() we’ve used Python’s built in hash function. For h2() I’ve used a simple but very effective string hashing function due to Peter Weinberger. The implementation of that function follows.

def hashpjw(s):
  val = 0
  for c in s:
    val = (val << 4) + ord(c)
    tmp = val & 0xf0000000
    if tmp != 0:
      val = val ^ (tmp >> 24)
      val = val ^ tmp
  return val

Now all that’s left to do is to measure our performance. I’ve written a little chunk of test code that loads approximately 90% of the words from GNU aspell’s English language dictionary into a Bloom filter. The remaining 10% of the words are used as a holdback to check the false positive rate of the filter. Here’s the code.

def TestFilter():
  import random

  # holdback will record words not added to the Bloom filter
  holdback = set()

  # Instantiate an ~1Mbit bloom filter with k=8
  bf = BloomFilter(1090177, 8 )

  # Open file with one English word per line
  f = open('data/words.dat')

  # Add each line to either holdback or Bloom filter
  for line in f:
    val = line.rstrip()
    if random.random() <= 0.10:
      # Add ~10% of values to holdback
      holdback.add(val)
    else:
      # Add ~90% of values to Bloom filter
      bf.Insert(val)
  f.close() 

  # Print information about current state of Bloom filter
  bf.PrintStats()

  # Count false positives -- # holdback items in the Bloom filter
  num_false_positives = 0
  for val in holdback:
    if bf.InFilter(val):
      num_false_positives += 1

  # Compute false positive rate and print
  rate = 100.0 * float(num_false_positives) / float(len(holdback))
  print "Actual false positive rate = %.2f%% (%d of %d)" % (rate,
      num_false_positives, len(holdback))

There’s one more function that I haven’t shown that prints some statistics about the current state of the Bloom filter. That code follows.

def PrintStats(self):
  k = float(self.k)
  m = float(self.m)
  n = float(self.n)
  p_fp = math.pow(1.0 - math.exp(-(k * n) / m), k) * 100.0
  compression_ratio = float(self.bits_in_inserted_values) / m
  print "Number of filter bits (m) : %d" % self.m
  print "Number of filter elements (n) : %d" % self.n
  print "Number of filter hashes (k) : %d" % self.k
  print "Predicted false positive rate = %.2f" % p_fp
  print "Compression ratio = %.2f" % compression_ratio

And finally, here’s the output of a single run of TestBloomFilter(). Notice that the Bloom filter is about 1Mbit, contains ~126k words and achieved a compression ratio of 7.97 (over just storing the bits of the original ~126k words). The predicted false positive rate (1.81%) was very close to the measured rate (1.87%). If we wanted to reduce this false positive rate further, holding the number of words in the dictionary constant, we could increase either m or k or both.

Number of filter bits (m) : 1090177
Number of filter elements (n) : 126733
Number of filter hashes (k) : 8
Predicted false positive rate = 1.81%
Compression ratio = 7.97
Actual false positive rate = 1.87% (265 of 14178)

Full code is available here. Don’t forget to grab BitVector module if you don’t already have it.

Next time, I’ll talk about some contemporary uses for Bloom filters.

Coding