Locality sensitive hashing (LSH) – Map-Reduce in Python

I'll try to explain LSH with the help of Python code and the map-reduce technique.

Mining of Massive Datasets notes that there is "a remarkable connection between minhashing and Jaccard similarity of the sets that are minhashed" [Chapter 3, Section 3.3.3].

Jaccard similarity

J(A, B) = |A ∩ B| / |A ∪ B|

where A and B are sets.
J = 0 if A and B are disjoint
J = 1 if A and B are identical

For example:

>>> a = {'nike', 'running', 'shoe'}
>>> b = {'nike', 'black', 'running', 'shoe'}
>>> c = {'nike', 'blue', 'jacket'}
>>> float(len(a.intersection(b))) / len(a.union(b))
0.75  # a and b are similar.
>>> float(len(a.intersection(c))) / len(a.union(c))
0.2   # a and c are... meh..

Minhashing

The probability that two sets receive the same minhash value equals their Jaccard similarity, so the probability of collision is higher for similar sets.

Table 1: Matrix representation of sets

keyword   x   a   b   c
nike      1   1   1   1
running   2   1   1   0
shoe      3   1   1   0
black     4   0   1   0
blue      5   0   0   1
jacket    6   0   0   1

Table 2: Signature Matrix with hash values

Hash Function            a            b              c
h1(x) = (x + 1) mod 6    min(2,3,4)   min(2,3,4,5)   min(2,0,1)
h2(x) = (3x + 1) mod 6   min(4,1,4)   min(4,1,4,1)   min(4,4,1)

which becomes,

Table 3: Signature matrix with minhash values

Hash Function            a   b   c
h1(x) = (x + 1) mod 6    2   2   0
h2(x) = (3x + 1) mod 6   1   1   1

From Table 3 we can infer that sets a and b are similar.
The similarity of a and b from Table 1 is 3/4 = 0.75.
From the signature matrix in Table 3, the similarity of a and b is 2/2 = 1.

The fraction obtained from the signature matrix (Table 3) is only an estimate of the true Jaccard similarity; with more hash functions (more rows in the signature matrix), the estimate gets closer to the true value.
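
To see the convergence, here is a minimal simulation sketch (not part of the pipeline; the file name and variable names are mine): it draws many random hash functions of the form (a*x + b) mod p and counts how often the minhashes of sets a and b agree.

# similarity_estimate.py -- a rough sketch, not part of the pipeline
from random import randrange

a_set = {1, 2, 3}        # ids for {'nike', 'running', 'shoe'}
b_set = {1, 2, 3, 4}     # ids for {'nike', 'black', 'running', 'shoe'}

p = 4294967311           # a prime larger than any word id
num_hashes = 1000

agree = 0
for _ in range(num_hashes):
    a, b = randrange(1, p), randrange(0, p)
    min_a = min((a * x + b) % p for x in a_set)
    min_b = min((a * x + b) % p for x in b_set)
    if min_a == min_b:
        agree += 1

# the fraction of agreeing minhashes estimates the Jaccard similarity
print("estimate: %.2f" % (float(agree) / num_hashes))   # close to 0.75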

Map-Reduce

Mapper

sample_dict.txt holds the word-to-id mapping.
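
The post assumes this file already exists; a hypothetical helper (build_dict.py is my name, and it assigns ids in sorted order, so it won't exactly reproduce the sample file below) might look like this:

# build_dict.py -- sketch: builds a word-to-id mapping from the corpus
import sys

words = set()
for line in sys.stdin:
    words.update(line.strip().lower().split())

for wid, word in enumerate(sorted(words), 1):   # ids start at 1
    print("%s\t%d" % (word, wid))

To use it: cat sample_input.txt | python build_dict.py > sample_dict.txt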

  • for every line in the input file:
    • split the text and convert it to an array of ids using the word-to-id mapping file
    • for each hash function, compute the minimum hash value over all ids (the minhash signature)
    • split the array of minhash values into multiple equally sized chunks, a.k.a. bands
    • assign an id to each band and emit the band id, the hash of the band, and the doc id

Reducer

  • group by band id and band hash to get lists of similar doc ids
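
Why banding works: if two signatures agree in a fraction s of their rows, they collide in at least one band with probability 1 - (1 - s^r)^b, where r is the number of rows per band and b is the number of bands [MMDS, Chapter 3]. A quick sketch for the parameters used below (num_hashes = 10, num_per_band = 2, so b = 5, r = 2):

# band_probability.py -- sketch of the LSH S-curve for these parameters
b, r = 5, 2
for s in (0.2, 0.4, 0.6, 0.8):
    prob = 1 - (1 - s ** r) ** b
    print("similarity %.1f -> candidate in some band with prob %.2f" % (s, prob))

Highly similar pairs are almost always caught in at least one band, while dissimilar pairs rarely are.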

Mapper Code

# lsh_mapper.py
__author__ = 'raj'
import sys
from random import randrange

word_ids = dict()
num_hashes = 10
num_per_band = 2

# a_hash and b_hash cannot be generated on the fly when running in a
# distributed environment; they must be identical across all nodes.
a_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)]
b_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)]


def min_hash_fn(a, b, sig):
    # apply one hash function h(x) = (a*x + b) mod N to every id in sig;
    # taking the minimum simulates a random permutation of the rows
    hashes = [((a * x) + b) % len(word_ids) for x in sig]
    return min(hashes)


def get_min_hash_row(sig):
    hashes = [min_hash_fn(a, b, sig) for a, b in zip(a_hash, b_hash)]
    return hashes


def get_band(l, n):
    # split the signature into bands of n values each
    for i in xrange(0, len(l), n):
        yield frozenset(l[i:i+n])


for word, wid in map(lambda x: x.split(), open("sample_dict.txt").readlines()):
    word_ids[word] = int(wid)

for doc_id, doc in enumerate(sys.stdin):
    words = doc.strip().lower().split()

    # 'signature' here is the list of word ids present in the doc
    signature = map(lambda x: word_ids.get(x), words)
    signature = filter(lambda x: x is not None, signature)

    if not signature:  # skip docs with no dictionary words (min() would fail)
        continue

    min_hash_row = get_min_hash_row(signature)

    banded = get_band(min_hash_row, num_per_band)

    for band_id, band in enumerate(banded):
        print "%d\t%d\t%d" % (band_id, hash(band), doc_id)

Reducer Code

# lsh_reducer.py
__author__ = 'raj'

import sys

prev_band_id, prev_band_hash = None, None
cluster = []
cid = 0

for line in sys.stdin:
    band_id, band_hash, doc_id = line.strip().split("\t", 2)

    if prev_band_id is None and prev_band_hash is None:
        prev_band_id, prev_band_hash = band_id, band_hash

    if prev_band_id == band_id:
        if prev_band_hash == band_hash:
            cluster.append(doc_id)
        else:
            print cid, cluster
            cluster = [doc_id]
    else:
        # new band: flush the current cluster and start a new cluster id
        print cid, cluster
        cluster = [doc_id]
        cid += 1
    prev_band_id, prev_band_hash = band_id, band_hash

if cluster:
    # flush the final cluster, which the loop never prints
    print cid, cluster

In action

sample_input.txt

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Nike Airmax Running SHoe
Corduroy Shorts - Flat Front (For Men) BEIGE
Nokia Lumia 721
Corduroy Shorts - Flat Front (For Men) BROWN

sample_dict.txt

&	1
(for	2
-0	3
1-14	4
12-	5
14	6
2-piece	7
721	8
airmax	9
and	10
beige	11
blue	12
brown	13
corduroy	14
corduroys	15
denim	16
doll	17
dot	18
dress	19
fashion	20
flat	21
flower	22
front	23
inch	24
jumper	25
leggings	26
lumia	27
me	28
men)	29
nike	30
nokia	31
outfit	32
piece	33
pink	34
polka	35
running	36
shirt	37
shoe	38
shorts	39
slate	40
teal	41
top	42
white	43
with	44
you	45
-	46

Command

$ cat sample_input.txt | python lsh_mapper.py | sort | python lsh_reducer.py

Output

0 ['1', '2']
0 ['0']
0 ['5', '7']
0 ['6']
0 ['3']
0 ['4']
1 ['4']
1 ['6']
1 ['0']
1 ['2']
1 ['3', '5', '7']
1 ['1']
2 ['6']
2 ['4']
2 ['0', '1', '2']
2 ['3', '5', '7']
3 ['6']
3 ['3', '5', '7']
3 ['0', '1', '2']
3 ['4']
4 ['0', '1']
4 ['3']
4 ['5']
4 ['4']
4 ['2']
4 ['7']
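
To turn these doc-id clusters back into titles, one can join against the input file; a minimal sketch (resolve.py is my name, not part of the original pipeline):

# resolve.py -- sketch: maps doc ids in the reducer output back to titles
import sys
from ast import literal_eval

titles = [line.strip() for line in open("sample_input.txt")]

for line in sys.stdin:
    cid, doc_ids = line.strip().split(" ", 1)
    print("band %s" % cid)
    for doc_id in literal_eval(doc_ids):   # doc_ids looks like "['3', '5', '7']"
        print("  %s" % titles[int(doc_id)])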

Resolved output

band 0
------
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 1
------
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 2
------
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 3
------
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

band 4
------
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt


Clustering Text – Map-Reduce in Python

Here I’m sharing a simple method to cluster text (product titles) based on key collision.

Dependencies

  • nltk (provides the English stop word list; run nltk.download('stopwords') once)
  • stemming (the Porter2 stemmer)
  • python-Levenshtein (edit distance)

My input file is a list of 20 product titles:

Converse All Star PC2 - Boys' Toddler
HI Nike Sport Girls Golf Dress
Brooks Nightlife Infiniti 1/2 Zip - Women's
HI Nike Solid Girls Golf Shorts
Nike Therma-FIT K.O. (MLB Rays)
adidas adiPURE IV TRX FG - Men's
Nike College All-Purpose Seasonal Graphic (Oklahoma) Womens T-Shirt
adidas Adipure 11PRO TRX FG - Women's
HI Nike Team (NFL Giants) BCA Womens T-Shirt
adidas Sprintstar 4 - Men's
HI Nike Attitude (NFL Titans) BCA Womens T-Shirt
HI Nike Polo Girls Golf Dress
Nike Therma-FIT K.O. (MLB Twins)
adidas Sprintstar 3 - Women's
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Purple/White
Converse All Star Ox - Girls' Toddler
HI Nike College All-Purpose Seasonal Graphic (Washington) Womens T-Shirt
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Red/White
Nike Therma-FIT K.O. (MLB Phillies)
Brooks Nightlife Infiniti 1/2 Zip Jacket - Mens

The idea is to split the data into meaningful clusters so that each cluster can be given as a small input to various systems (de-duplication or classification) instead of the entire dataset.

Below are the steps involved in generating a fingerprint, an alternate representation of the title that is used as the key:

  1. Remove special characters
  2. Remove numbers
  3. Remove stop words
  4. Stem each word
  5. Sort the words in alphabetical order

Below is the Python code that does it:

# fingerprint.py

import re
import string
import nltk
from stemming.porter2 import stem

class FingerPrint(object):
    def __init__(self):
        super(FingerPrint, self).__init__()
        # regex to remove special characters
        self.remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation))
        self.remove_num = re.compile('[\d]+')
        # cache the stop word list as a set for fast lookup
        self.stopwords = set(nltk.corpus.stopwords.words('english'))

    def fp_steps(self, text):
        title = text.strip().lower()
        title_splchar_removed = self.remove_spl_char_regex.sub(" ", title)
        title_number_removed = self.remove_num.sub("", title_splchar_removed)
        words = title_number_removed.split()
        filtered = [w for w in words if w not in self.stopwords]
        stemmed = [stem(w) for w in filtered]
        return sorted(stemmed)

    def fingerprint(self, text):
        return " ".join(self.fp_steps(text))
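
For example, assuming the nltk stopword corpus is installed, two differently worded titles collide on the same fingerprint (output is what I'd expect from the steps above; stems may vary slightly by stemmer version):

>>> from fingerprint import FingerPrint
>>> f = FingerPrint()
>>> f.fingerprint("HI Nike Solid Girls Golf Shorts")
'girl golf hi nike short solid'
>>> f.fingerprint("nike GIRLS golf short, solid; HI!")
'girl golf hi nike short solid'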

Now my mapper can emit key-value pairs where key = fingerprint and value = product title.

# map.py

import sys
from fingerprint import FingerPrint

f = FingerPrint()

for line in sys.stdin:
    try:
        print "%s\t%s" % (f.fingerprint(line), line.strip())
    except Exception as e:
        # report errors on stderr so they don't corrupt the mapper output
        print >> sys.stderr, e

Now I need to sort the output and group the titles based on a distance measure. I'm using Levenshtein distance, and below is the logic behind the reducer.

  1. The default distance threshold is 20; anything within 20 edits of the current cluster joins it.
  2. Add the first title to the cluster (if empty).
  3. If the distance between the current title's fingerprint and the fingerprint of the last element in the cluster is at most the threshold (20), add it to the cluster.
  4. If the distance is greater than the threshold (20), create a new cluster and continue.
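
For intuition, Levenshtein distance is the number of single-character edits (insertions, deletions, substitutions) needed to turn one string into another:

>>> from Levenshtein import distance
>>> distance("shoe", "shoes")
1
>>> distance("black", "blue")
3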

Following is the Python code that does it:

# reduce.py

import sys
from Levenshtein import distance

DISTANCE = 20
cluster = {}
cid = 0

for i, line in enumerate(sys.stdin):
    cols = line.strip().split("\t")
    if i == 0:
        cluster[cid] = [cols]
    else:
        last = cluster[cid][-1]
        # compare fingerprints (column 0) of the current and last title
        if distance(last[0], cols[0]) <= DISTANCE:
            cluster[cid].append(cols)
        else:
            cid += 1
            cluster[cid] = [cols]

# print clusters in order of cluster id (dict iteration order is arbitrary)
for k in sorted(cluster):
    print
    print "Cluster # ", k
    for entry in cluster[k]:
        print entry[1]

To run,

cat input.tsv | python map.py | sort -k1,1 | python reduce.py 

and my output :O

Cluster #  0
adidas adiPURE IV TRX FG - Men's
adidas Adipure 11PRO TRX FG - Women's
adidas Sprintstar 4 - Men's
adidas Sprintstar 3 - Women's

Cluster #  1
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Purple/White
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Red/White

Cluster #  2
HI Nike Attitude (NFL Titans) BCA Womens T-Shirt
HI Nike Team (NFL Giants) BCA Womens T-Shirt

Cluster #  3
Converse All Star PC2 - Boys' Toddler

Cluster #  4
Brooks Nightlife Infiniti 1/2 Zip Jacket - Mens
Brooks Nightlife Infiniti 1/2 Zip - Women's

Cluster #  5
HI Nike College All-Purpose Seasonal Graphic (Washington) Womens T-Shirt

Cluster #  6
Nike College All-Purpose Seasonal Graphic (Oklahoma) Womens T-Shirt

Cluster #  7
Converse All Star Ox - Girls' Toddler
HI Nike Polo Girls Golf Dress
HI Nike Sport Girls Golf Dress

Cluster #  8
Nike Therma-FIT K.O. (MLB Phillies)
Nike Therma-FIT K.O. (MLB Rays)
Nike Therma-FIT K.O. (MLB Twins)
HI Nike Solid Girls Golf Shorts


This method has some loopholes; I'll try to address these issues in my next post using bi-gram fingerprints. Please let me know your thoughts/feedback!