Here I’m sharing a simple method to cluster text (product titles) based on key collision.
Dependencies
My Input file is a list of 20 product titles
Converse All Star PC2 - Boys' Toddler HI Nike Sport Girls Golf Dress Brooks Nightlife Infiniti 1/2 Zip - Women's HI Nike Solid Girls Golf Shorts Nike Therma-FIT K.O. (MLB Rays) adidas adiPURE IV TRX FG - Men's Nike College All-Purpose Seasonal Graphic (Oklahoma) Womens T-Shirt adidas Adipure 11PRO TRX FG - Women's HI Nike Team (NFL Giants) BCA Womens T-Shirt adidas Sprintstar 4 - Men's HI Nike Attitude (NFL Titans) BCA Womens T-Shirt HI Nike Polo Girls Golf Dress Nike Therma-FIT K.O. (MLB Twins) adidas Sprintstar 3 - Women's Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Purple/White Converse All Star Ox - Girls' Toddler HI Nike College All-Purpose Seasonal Graphic (Washington) Womens T-Shirt Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Red/White Nike Therma-FIT K.O. (MLB Phillies) Brooks Nightlife Infiniti 1/2 Zip Jacket - Mens
The idea is to split the data into a meaningful cluster so that it can be given as small input to various systems (de-duplication or classification) instead of the entire data itself.
Below are the steps involved in generating a fingerprint, an alternate representation of title (used as key)
- Remove special characters
- Remove numbers
- Remove stop words
- Stem each word
- Sort the words in alphabetical order
below is the python code that does it
# fingerprint.py import sys import re import string import itertools import nltk from stemming.porter2 import stem class FingerPrint(object): def __init__(self): super(FingerPrint, self).__init__() self.remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation)) # regex to remove special characters self.remove_num = re.compile('[\d]+') def fp_steps(self,text): title = text.strip().lower() title_splchar_removed = self.remove_spl_char_regex.sub(" ",title) title_number_removed = self.remove_num.sub("", title_splchar_removed) words = title_number_removed.split() filter_stop_words = [w for w in words if not w in nltk.corpus.stopwords.words('english')] stemed = [stem(w) for w in filter_stop_words] return sorted(stemed) def fingerprint(self,text): fp = " ".join(self.fp_steps(text)) return fp
Now, My Mapper can emit key value pairs where key = fingerprint and value = product title.
# map.py import sys import re import string from fingerprint import FingerPrint f = FingerPrint() for line in sys.stdin: try: print "%s\t%s" % (f.fingerprint(line),line.strip()) except Exception as e: print e pass
Now I need to sort the output and group them based on a distance measure. I’m using levenshtein distance, and below is my logic behind the reducer.
- Default Distance measure is 20, Anything less than 20 will be added to the current cluster.
- Add the first title to the cluster (if empty).
- If the distance between current title’s fingerprint and the fingerprint of the last element in the cluster is less than the default distance measure (20), then add it to the cluster.
- If the distance is greater than the default distance(20) then create a new cluster and continue.
Following is the python code that does it
# reduce.py import sys from Levenshtein import distance import json DISTANCE = 20 cluster = {} cid = 0 for i,line in enumerate(sys.stdin): cols = line.strip().split("\t") if i == 0: cluster[cid] = [] cluster[cid].append(cols) else: last = cluster[cid][-1] if distance(last[0],cols[0]) <= DISTANCE: cluster[cid].append(cols) else: cid+=1 cluster[cid] = [] cluster[cid].append(cols) for k,v in cluster.iteritems(): print print "Cluster # ",k for entry in v: print entry[1]
To run,
cat input.tsv | python map.py | sort -k1,1 | python reduce.py
and my output :O
Cluster # 0 adidas adiPURE IV TRX FG - Men's adidas Adipure 11PRO TRX FG - Women's adidas Sprintstar 4 - Men's adidas Sprintstar 3 - Women's Cluster # 1 Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Purple/White Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Red/White Cluster # 2 HI Nike Attitude (NFL Titans) BCA Womens T-Shirt HI Nike Team (NFL Giants) BCA Womens T-Shirt Cluster # 3 Converse All Star PC2 - Boys' Toddler Cluster # 4 Brooks Nightlife Infiniti 1/2 Zip Jacket - Mens Brooks Nightlife Infiniti 1/2 Zip - Women's Cluster # 5 HI Nike College All-Purpose Seasonal Graphic (Washington) Womens T-Shirt Cluster # 6 Nike College All-Purpose Seasonal Graphic (Oklahoma) Womens T-Shirt Cluster # 7 Converse All Star Ox - Girls' Toddler HI Nike Polo Girls Golf Dress HI Nike Sport Girls Golf Dress Cluster # 8 Nike Therma-FIT K.O. (MLB Phillies) Nike Therma-FIT K.O. (MLB Rays) Nike Therma-FIT K.O. (MLB Twins) HI Nike Solid Girls Golf Shorts
This method has some loop holes, Ill try to address these issues in my next post using bi-gram fingerprints. Please let me know your thoughts/feedback!
Great post!
thanks 🙂
No problem man. It looks like you’ve got a lot of great posts and I just wanted to let you know that you’re doing a great job 🙂
thanks for your kind support blenz3 🙂
No problem my man! Credit where credit is due!
I’ve been looking for a solution to this problem for a long time. This is simple, Python-only (no need to install a huge package), and it works. Great job!
Thank you ,:)
I have followed the instruction above and there are no errors, but I can not see the results. Is there any idea about that? Thank you
Thanks Agung. can you please share you input and output? If you have used the sample input, please make sure that you have included *sort* in between mapper and reducer.
The input is input.tsv and I have used *sort*. After I executed the command *cat input.tsv | python map.py | sort -k1,1 | python reduce.py*, it did not show the results. I confuse because there is no warning error. Thank you
Hi Rajesh! Please, how can I run your code in a Windows enviroment, under a cmd prompt? Tks!
@Ailton Cygwin has a set of cmd tools similar to Linux for Windows, pls try and let me know! Thanks! https://www.cygwin.com/
hey, I can’t see any output produced. Can you run the code and verify that it produces results?
thanks!
Hi Leon,
Apologies for not including the required dependencies to run the code and silently ignoring the exception thrown in map.py.
Please install the following dependencies and try again.
python-levenshtein
stemming
NLTK
corpora/stopwords
Thank you!
Simple and Pure Python Solution. Works clean.
Unable to get the desired output
cat input.tsv | python map.py | sort -k1,1 | python reduce.py
Cluster # 0
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
[“‘module’ object is not callable”]
Hi chaibapat, you are trying to call something thats not callable using ‘()’. can you share your reducer code if its different?
It is exactly same as the one you mentioned