I’d try to explain LSH with help of python code and map-reduce technique.
It is said that There is a remarkable connection between minhashing and Jaccard similarity of the sets that are minhashed. [Chapter 3, 3.3.3 Mining of massive datasets]
Jaccard similarity
Where a and b are sets.
J = 0 if A and B are disjoint
J = 1 if A and B are identical
example,
>>> a = {'nike', 'running', 'shoe'} >>> b = {'nike', 'black', 'running', 'shoe'} >>> c = {'nike', 'blue', 'jacket'} >>> float(len(a.intersection(b))) / len(a.union(b)) 0.75 # a and b are similar. >>> float(len(a.intersection(c))) / len(a.union(c)) 0.2 # a and c are... meh..
Minhashing
Probability of collision is higher for similar sets.
Table 1: Matrix representation of sets
keyword | x | a | b | c |
---|---|---|---|---|
nike | 1 | 1 | 1 | 1 |
running | 2 | 1 | 1 | 0 |
shoe | 3 | 1 | 1 | 0 |
black | 4 | 0 | 1 | 0 |
blue | 5 | 0 | 0 | 1 |
jacket | 6 | 0 | 0 | 1 |
Table 2: Signature Matrix with hash values
Hash Function | a | b | c |
---|---|---|---|
h1(x) = x + 1 mod 6 | min(2,3,4) | min(2,3,4,5) | min(2,0,1) |
h2(x) = 3x + 1 mod 6 | min(4,1,4) | min(4,1,4,1) | min(4,4,1) |
which becomes,
Table 3: Signature matrix with minhash values
Hash Function | a | b | c |
---|---|---|---|
h1(x) = x + 1 mod 6 | 2 | 2 | 0 |
h2(x) = 3x + 1 mod 6 | 1 | 1 | 1 |
From Table 3 We can infer that set a and b are similar.
Similarity of a and b from Table 1 is 3/4 = 0.75
From signature matrix Table 3 similarity of a and b is 2/2 = 1
The fraction from signature matrix Table 3 is just an estimate of the true jaccard similarity. on a larger set the estimates will be close.
Map-Reduce
Mapper
sample_dict.txt will have word to id mapping.
- for every line in input file
- split text and convert to array of ids using the word to id mapping file.
- for every id compute minimum hash value
- split the array of min hash values into multiple equally sized chunks a.k.a, bands.
- assign id to bands and emit hash of band, band-id and doc-id
Reducer
- group by band-hash and band-id to get list of similar doc-ids.
Mapper Code
# lsh_mapper.py __author__ = 'raj' import sys from random import randrange word_ids = dict() num_hashes = 10 num_per_band = 2 # a_hash and b_hash cannot be generated on the fly if running in a distributed env. they should be same across all nodes a_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)] b_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)] def min_hash_fn(a, b, sig): hashes = [((a * x) + b) % len(word_ids) for x in sig] return min(hashes) def get_min_hash_row(sig): hashes = [min_hash_fn(a, b, sig) for a, b in zip(a_hash, b_hash)] return hashes def get_band(l, n): for i in xrange(0, len(l), n): yield frozenset(l[i:i+n]) for word, wid in map(lambda x: x.split(), open("sample_dict.txt").readlines()): word_ids[word] = int(wid) for doc_id, doc in enumerate(sys.stdin): words = doc.strip().lower().split() signature = map(lambda x: word_ids.get(x), words) signature = filter(lambda x: x is not None, signature) min_hash_row = get_min_hash_row(signature) banded = get_band(min_hash_row, num_per_band) for band_id, band in enumerate(banded): print "%d\t%d\t%d" % (band_id, hash(band), doc_id)
Reducer Code
#lsh_reducre.py __author__ = 'raj' import sys prev_band_id, prev_band_hash = None, None cluster = [] cid = 0 for line in sys.stdin: band_id, band_hash, doc_id = line.strip().split("\t", 3) if prev_band_id is None and prev_band_hash is None: prev_band_id, prev_band_hash = band_id, band_hash if prev_band_id is band_id: if prev_band_hash == band_hash: cluster.append(doc_id) else: print cid, cluster cluster = [doc_id] else: print cid, cluster cluster = [doc_id] cid += 1 prev_band_id, prev_band_hash = band_id, band_hash
In action
sample_input.txt
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink Corduroy Shorts - Flat Front (For Men) SLATE BLUE Nike Airmax Running SHoe Corduroy Shorts - Flat Front (For Men) BEIGE Nokia Lumia 721 Corduroy Shorts - Flat Front (For Men) BROWN
sample_dict.txt
& 1 (for 2 -0 3 1-14 4 12- 5 14 6 2-piece 7 721 8 airmax 9 and 10 beige 11 blue 12 brown 13 corduroy 14 corduroys 15 denim 16 doll 17 dot 18 dress 19 fashion 20 flat 21 flower 22 front 23 inch 24 jumper 25 leggings 26 lumia 27 me 28 men) 29 nike 30 nokia 31 outfit 32 piece 33 pink 34 polka 35 running 36 shirt 37 shoe 38 shorts 39 slate 40 teal 41 top 42 white 43 with 44 you 45 - 46
Command
$ cat sample_input.txt | python lsh_mapper.py | sort | python lsh_reducer.py
Output
0 ['1', '2'] 0 ['0'] 0 ['5', '7'] 0 ['6'] 0 ['3'] 0 ['4'] 1 ['4'] 1 ['6'] 1 ['0'] 1 ['2'] 1 ['3', '5', '7'] 1 ['1'] 2 ['6'] 2 ['4'] 2 ['0', '1', '2'] 2 ['3', '5', '7'] 3 ['6'] 3 ['3', '5', '7'] 3 ['0', '1', '2'] 3 ['4'] 4 ['0', '1'] 4 ['3'] 4 ['5'] 4 ['4'] 4 ['2'] 4 ['7']
resolved output
band 0 ------ You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink Corduroy Shorts - Flat Front (For Men) BEIGE Corduroy Shorts - Flat Front (For Men) BROWN band 1 ------ Corduroy Shorts - Flat Front (For Men) SLATE BLUE Corduroy Shorts - Flat Front (For Men) BEIGE Corduroy Shorts - Flat Front (For Men) BROWN band 2 ------ You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink Corduroy Shorts - Flat Front (For Men) SLATE BLUE Corduroy Shorts - Flat Front (For Men) BEIGE Corduroy Shorts - Flat Front (For Men) BROWN band 3 ------ Corduroy Shorts - Flat Front (For Men) SLATE BLUE Corduroy Shorts - Flat Front (For Men) BEIGE Corduroy Shorts - Flat Front (For Men) BROWN You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink band 4 ------ You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
code here
Hi Thanks for the post.
Could you please explain the final output. By the example, I had been expecting two hash function in rows and eight document id’s in columns as final signature matrix with minhash values.
But your final output has 26 rows with values like:
0 [‘5’, ‘7’]
Please help me to understand this.
Thanks.
Hi pappukjha, Thanks for your interest.
The minhash array of every document is sliced into bands, in the example We have 5 bands.
Which means at the end of the mapper (lsh_mapper.py ln:43-44),for every document, We’re emitting 5 rows and every row has band_id, band_hash, doc_id, and in the reducer we group records based on band_hash.
you can read more about the banding technique here (section 3.4.1)
“ValueError: min() arg is an empty sequence”
It gives an above error……….
Somenath Das, Apologies for the late reply. Im not able to reproduce this issue. can you send the data to rajesh.manikka@live.com?
Thanks for short but helpful code.
In the second step of mapper flow, you said that “for every **id** compute minimum hash value”, does it mean “for every array of ids” or “for every doc”?
Jkyu, Thanks for your comment and apologies for the delay in my response.
We compute minhash for every document.
Kindly refer the tables above, Each column (a, b, c) represents a document and ‘x’ represents id of a word.
We are hashing each ‘id of a word’ in the ‘array’ (document) and selecting the minimum.
Please let me know if you have any questions. Thanks.