Locality sensitive hashing (LSH) – Map-Reduce in Python

I’d try to explain LSH with help of python code and map-reduce technique.

It is said that There is a remarkable connection between minhashing and Jaccard similarity of the sets that are minhashed. [Chapter 3, 3.3.3 Mining of massive datasets]

Jaccard similarity

jaccard-index j = a intersection b / a union b

Where a and b are sets.
J = 0 if A and B are disjoint
J = 1 if A and B are identical

example,

>>> a = {'nike', 'running', 'shoe'}
>>> b = {'nike', 'black', 'running', 'shoe'}
>>> c = {'nike', 'blue', 'jacket'}
>>> float(len(a.intersection(b))) / len(a.union(b))
0.75 			# a and b are similar.				
>>> float(len(a.intersection(c))) / len(a.union(c))
0.2				# a and c are... meh..

Minhashing

Probability of collision is higher for similar sets.

Table 1: Matrix representation of sets

keyword x a b c
nike 1 1 1 1
running 2 1 1 0
shoe 3 1 1 0
black 4 0 1 0
blue 5 0 0 1
jacket 6 0 0 1

Table 2: Signature Matrix with hash values

Hash Function a b c
h1(x) = x + 1 mod 6 min(2,3,4) min(2,3,4,5) min(2,0,1)
h2(x) = 3x + 1 mod 6 min(4,1,4) min(4,1,4,1) min(4,4,1)

which becomes,

Table 3: Signature matrix with minhash values

Hash Function a b c
h1(x) = x + 1 mod 6 2 2 0
h2(x) = 3x + 1 mod 6 1 1 1

From Table 3 We can infer that set a and b are similar.
Similarity of a and b from Table 1 is 3/4 = 0.75
From signature matrix Table 3 similarity of a and b is 2/2 = 1

The fraction from signature matrix Table 3 is just an estimate of the true jaccard similarity. on a larger set the estimates will be close.

Map-Reduce

Mapper

sample_dict.txt will have word to id mapping.

  • for every line in input file
    • split text and convert to array of ids using the word to id mapping file.
    • for every id compute minimum hash value
    • split the array of min hash values into multiple equally sized chunks a.k.a, bands.
    •  assign id to bands and emit hash of band, band-id and doc-id

Reducer

  • group by band-hash and band-id to get list of similar doc-ids.

Mapper Code

# lsh_mapper.py
__author__ = 'raj'
import sys
from random import randrange

word_ids = dict()
num_hashes = 10
num_per_band = 2

# a_hash and b_hash cannot be generated on the fly if running in a distributed env. they should be same across all nodes 
a_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)]
b_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)]


def min_hash_fn(a, b, sig):
    hashes = [((a * x) + b) % len(word_ids) for x in sig]
    return min(hashes)


def get_min_hash_row(sig):
    hashes = [min_hash_fn(a, b, sig) for a, b in zip(a_hash, b_hash)]
    return hashes


def get_band(l, n):
    for i in xrange(0, len(l), n):
        yield frozenset(l[i:i+n])


for word, wid in map(lambda x: x.split(), open("sample_dict.txt").readlines()):
    word_ids[word] = int(wid)

for doc_id, doc in enumerate(sys.stdin):
    words = doc.strip().lower().split()

    signature = map(lambda x: word_ids.get(x), words)
    signature = filter(lambda x: x is not None, signature)

    min_hash_row = get_min_hash_row(signature)

    banded = get_band(min_hash_row, num_per_band)

    for band_id, band in enumerate(banded):
        print "%d\t%d\t%d" % (band_id, hash(band), doc_id)

Reducer Code

#lsh_reducre.py
__author__ = 'raj'

import sys

prev_band_id, prev_band_hash = None, None
cluster = []
cid = 0

for line in sys.stdin:
    band_id, band_hash, doc_id = line.strip().split("\t", 3)

    if prev_band_id is None and prev_band_hash is None:
        prev_band_id, prev_band_hash = band_id, band_hash

    if prev_band_id is band_id:
        if prev_band_hash == band_hash:
            cluster.append(doc_id)
        else:
            print cid, cluster
            cluster = [doc_id]
    else:
        print cid, cluster
        cluster = [doc_id]
        cid += 1
    prev_band_id, prev_band_hash = band_id, band_hash

In action

sample_input.txt

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Nike Airmax Running SHoe
Corduroy Shorts - Flat Front (For Men) BEIGE
Nokia Lumia 721
Corduroy Shorts - Flat Front (For Men) BROWN

sample_dict.txt

&	1
(for	2
-0	3
1-14	4
12-	5
14	6
2-piece	7
721	8
airmax	9
and	10
beige	11
blue	12
brown	13
corduroy	14
corduroys	15
denim	16
doll	17
dot	18
dress	19
fashion	20
flat	21
flower	22
front	23
inch	24
jumper	25
leggings	26
lumia	27
me	28
men)	29
nike	30
nokia	31
outfit	32
piece	33
pink	34
polka	35
running	36
shirt	37
shoe	38
shorts	39
slate	40
teal	41
top	42
white	43
with	44
you	45
-	46

Command

$ cat sample_input.txt | python lsh_mapper.py | sort | python lsh_reducer.py

Output

0 ['1', '2']
0 ['0']
0 ['5', '7']
0 ['6']
0 ['3']
0 ['4']
1 ['4']
1 ['6']
1 ['0']
1 ['2']
1 ['3', '5', '7']
1 ['1']
2 ['6']
2 ['4']
2 ['0', '1', '2']
2 ['3', '5', '7']
3 ['6']
3 ['3', '5', '7']
3 ['0', '1', '2']
3 ['4']
4 ['0', '1']
4 ['3']
4 ['5']
4 ['4']
4 ['2']
4 ['7']

resolved output

band 0
------
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 1
------
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 2
------
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 3
------
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

band 4
------
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt

code here

Advertisements

6 thoughts on “Locality sensitive hashing (LSH) – Map-Reduce in Python

  1. Hi Thanks for the post.

    Could you please explain the final output. By the example, I had been expecting two hash function in rows and eight document id’s in columns as final signature matrix with minhash values.
    But your final output has 26 rows with values like:
    0 [‘5’, ‘7’]

    Please help me to understand this.

    Thanks.

    1. Hi pappukjha, Thanks for your interest.
      The minhash array of every document is sliced into bands, in the example We have 5 bands.
      Which means at the end of the mapper (lsh_mapper.py ln:43-44),for every document, We’re emitting 5 rows and every row has band_id, band_hash, doc_id, and in the reducer we group records based on band_hash.

      you can read more about the banding technique here (section 3.4.1)

  2. Thanks for short but helpful code.

    In the second step of mapper flow, you said that “for every **id** compute minimum hash value”, does it mean “for every array of ids” or “for every doc”?

    1. Jkyu, Thanks for your comment and apologies for the delay in my response.
      We compute minhash for every document.
      Kindly refer the tables above, Each column (a, b, c) represents a document and ‘x’ represents id of a word.
      We are hashing each ‘id of a word’ in the ‘array’ (document) and selecting the minimum.
      Please let me know if you have any questions. Thanks.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s