If you’ve perused the paper Detecting Near-Duplicates for Web Crawling, which I mentioned in an earlier post, you may be curious about why the simhash algorithm actually works. The details of simhash are left for the reader to discover in a cited paper by Moses Charikar entitled “Similarity Estimation Techniques from Rounding Algorithms”. Unfortunately, Charikar doesn’t use the term simhash in his paper, and it might not be obvious how to connect the dots.
Connecting those dots isn’t necessarily easy. But, here are a few bits and pieces of information that might help.
In classical information retrieval, one often measures the similarity of two documents by converting those documents into so-called ‘feature vectors’ and then computing the angle between those two vectors in Euclidean space. If the documents are identical, the angle between their feature vectors will be 0. The more significant the differences between the documents, the larger the angle between their feature vectors.
You can think of a feature vector as a point in a (potentially) high-dimensional space where each dimension corresponds to a unique feature value. To make things concrete, let’s suppose that our features are English words. The dimensions then would have labels like ‘quick’, ‘brown’, ‘fox’–rather than the more familiar x, y, and z–and the dimensionality of the vectors would be the number of words in the English language.
Given a vector representing a document in such a space, the ith component corresponding to word w could take on the following values:
- 0 if word w isn’t present in the document,
- 1 if the word w is present in the document at least once, or
- k if the word w is present in the document k times, or if we want its presence to represent k units of significance.
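To make the count-based option concrete, here’s a minimal Python sketch. The lowercased whitespace tokenization is a simplifying assumption on my part; real feature extraction is usually more involved.

```python
from collections import Counter

def feature_vector(document):
    """Map each word to its count in the document (the third option above)."""
    return Counter(document.lower().split())

v = feature_vector("The quick brown fox jumps over the quick dog")
print(v["quick"])  # 2 -- present twice
print(v["fox"])    # 1 -- present once
print(v["cat"])    # 0 -- absent words implicitly map to 0
```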
Given two vectors, you can compute the angle between them in the usual way: take the arccos of their dot product divided by the product of their magnitudes (i.e., the arccos of their cosine similarity).
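Here’s a small sketch of that computation using the word-count vectors from the snippet above; the clamp guards against floating-point error nudging the cosine slightly outside [-1, 1].

```python
import math
from collections import Counter

def angle(u, v):
    """Angle in radians between two sparse feature vectors."""
    dot = sum(count * v[w] for w, count in u.items())  # absent words give 0
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    cos = dot / (norm_u * norm_v)
    return math.acos(max(-1.0, min(1.0, cos)))  # clamp against float error

a = Counter("the quick brown fox".split())
b = Counter("the quick brown dog".split())
print(angle(a, a))  # 0.0 -- identical documents
print(angle(a, b))  # ~0.72 radians; grows as the documents diverge
```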
In Charikar’s paper, he’s interested in computing ‘sketches’ of feature vectors such that similarity between two sketches indicates similarity between the two corresponding vectors. A sketch is simply a compact representation of an object, e.g., the hash of a string. Given that sketches are compact, they’re more practical to store and manipulate. In section 3, titled ‘Random Hyperplane Based Hash Functions for Vectors’, Charikar describes a way to sketch n-dimensional feature vectors into k-dimensional bit vectors (n >> k). You do this by choosing k random vectors r_i in the given feature vector space, and defining k one-bit hash functions as follows:
- h_i(v) = 1, if dot(r_i, v) >= 0
- h_i(v) = 0, if dot(r_i, v) < 0
The k-bit sketch is then trivially arrived at by passing a vector v through each of the k hash functions and concatenating the resulting bits.
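Here’s a minimal sketch of that scheme, assuming dense n-dimensional vectors for simplicity; make_sketch_function and its parameters are illustrative names of mine, not Charikar’s. Gaussian components give the hyperplanes uniformly random orientations.

```python
import random

def make_sketch_function(n, k, seed=0):
    """Return a function mapping n-dimensional vectors to k-bit sketches."""
    rng = random.Random(seed)
    # k random vectors r_i, each defining one hyperplane through the origin.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]

    def sketch(v):
        bits = 0
        for r in planes:
            d = sum(ri * vi for ri, vi in zip(r, v))  # dot(r_i, v)
            bits = (bits << 1) | (d >= 0)             # append h_i(v)
        return bits

    return sketch

sketch = make_sketch_function(n=4, k=64)
print(f"{sketch([1.0, 0.0, 2.0, 1.0]):064b}")  # one 64-bit sketch
```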
It’s actually pretty easy to show that, with the sketching technique above, the probability that any particular bit of sketch(u) matches the corresponding bit of sketch(v) is 1 - theta/pi, where theta is the angle between the feature vectors u and v. The intuition is that h_i(u) and h_i(v) disagree exactly when the random hyperplane defined by r_i separates u from v, which happens with probability theta/pi.
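If you want to convince yourself empirically, here’s a quick Monte Carlo check: draw many random hyperplanes and measure how often a single sketch bit agrees for two (arbitrarily chosen) vectors u and v.

```python
import math
import random

u = [1.0, 0.0, 2.0, 1.0]
v = [1.0, 1.0, 1.0, 0.0]

rng = random.Random(42)
trials = 100_000
agree = 0
for _ in range(trials):
    r = [rng.gauss(0.0, 1.0) for _ in range(len(u))]  # random hyperplane
    bu = sum(a * b for a, b in zip(r, u)) >= 0        # h_i(u)
    bv = sum(a * b for a, b in zip(r, v)) >= 0        # h_i(v)
    agree += bu == bv

cos = sum(a * b for a, b in zip(u, v)) / (
    math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
)
theta = math.acos(cos)
print(agree / trials)       # empirical agreement probability, ~0.75 here
print(1 - theta / math.pi)  # predicted 1 - theta/pi; the two should be close
```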
Now, the final dot left to connect is the question: how is simhash related to sketch? The answer is that they are more or less the same. There’s no explicit proof of this equivalence in Charikar’s paper, and I’m not going to try to manufacture one right here, given that it would make an already long and potentially confusing blog post that much longer and more confusing. I’ll endeavor to clean up this last detail in a future post. If you’re curious now, plunging one level deeper and reading the papers that Charikar cites in Section 3 will give you the information you need to construct a proof on your own.