Clustering Text – Map Reduce in Python

Here I’m sharing a simple method to cluster text (product titles) based on key collision.

Dependencies

My Input file is a list of 20 product titles

Converse All Star PC2 - Boys' Toddler
HI Nike Sport Girls Golf Dress
Brooks Nightlife Infiniti 1/2 Zip - Women's
HI Nike Solid Girls Golf Shorts
Nike Therma-FIT K.O. (MLB Rays)
adidas adiPURE IV TRX FG - Men's
Nike College All-Purpose Seasonal Graphic (Oklahoma) Womens T-Shirt
adidas Adipure 11PRO TRX FG - Women's
HI Nike Team (NFL Giants) BCA Womens T-Shirt
adidas Sprintstar 4 - Men's
HI Nike Attitude (NFL Titans) BCA Womens T-Shirt
HI Nike Polo Girls Golf Dress
Nike Therma-FIT K.O. (MLB Twins)
adidas Sprintstar 3 - Women's
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Purple/White
Converse All Star Ox - Girls' Toddler
HI Nike College All-Purpose Seasonal Graphic (Washington) Womens T-Shirt
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Red/White
Nike Therma-FIT K.O. (MLB Phillies)
Brooks Nightlife Infiniti 1/2 Zip Jacket - Mens

The idea is to split the data into a meaningful cluster so that it can be given as small input to various systems (de-duplication or classification) instead of the entire data itself.

Below are the steps involved in generating a fingerprint, an alternate representation of title (used as key)

  1. Remove special characters
  2. Remove numbers
  3. Remove stop words
  4. Stem each word
  5. Sort the words in alphabetical order

below is the python code that does it

# fingerprint.py

import sys
import re
import string
import itertools
import nltk
from stemming.porter2 import stem

class FingerPrint(object):
	def __init__(self):
		super(FingerPrint, self).__init__()
		self.remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation)) # regex to remove special characters
		self.remove_num = re.compile('[\d]+')

	def fp_steps(self,text):
		title = text.strip().lower()
		title_splchar_removed = self.remove_spl_char_regex.sub(" ",title)
		title_number_removed = self.remove_num.sub("", title_splchar_removed)
		words = title_number_removed.split()
		filter_stop_words = [w for w in words if not w in nltk.corpus.stopwords.words('english')]
		stemed = [stem(w) for w in filter_stop_words]
		return sorted(stemed)
	
	def fingerprint(self,text):
		fp = " ".join(self.fp_steps(text))
		return fp

Now, My Mapper can emit key value pairs where key = fingerprint and value = product title.

# map.py

import sys
import re
import string
from fingerprint import FingerPrint

f = FingerPrint()

for line in sys.stdin:
	try:
		print "%s\t%s" % (f.fingerprint(line),line.strip())
	except Exception as e:
		print e
		pass

Now I need to sort the output and group them based on a distance measure. I’m using levenshtein distance, and below is my logic behind the reducer.

  1. Default Distance measure is 20, Anything less than 20 will be added to the current cluster.
  2. Add the first title to the cluster (if empty).
  3. If the distance between current title’s fingerprint and the fingerprint of the last element in the cluster is less than the default distance measure (20), then add it to the cluster.
  4. If the distance is greater than the default distance(20) then create a new cluster and continue.

Following is the python code that does it

# reduce.py

import sys
from Levenshtein import distance
import json

DISTANCE = 20
cluster = {}
cid = 0

for i,line in enumerate(sys.stdin):
	cols = line.strip().split("\t")
	if i == 0:
		cluster[cid] = []
		cluster[cid].append(cols)
	else:
		last = cluster[cid][-1]
		if distance(last[0],cols[0]) <= DISTANCE:
			cluster[cid].append(cols)
		else:
			cid+=1
			cluster[cid] = []
			cluster[cid].append(cols)

for k,v in cluster.iteritems():
	print
	print "Cluster # ",k
	for entry in v:
		print entry[1]

To run,

cat input.tsv | python map.py | sort -k1,1 | python reduce.py 

and my output :O

Cluster #  0
adidas adiPURE IV TRX FG - Men's
adidas Adipure 11PRO TRX FG - Women's
adidas Sprintstar 4 - Men's
adidas Sprintstar 3 - Women's

Cluster #  1
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Purple/White
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Red/White

Cluster #  2
HI Nike Attitude (NFL Titans) BCA Womens T-Shirt
HI Nike Team (NFL Giants) BCA Womens T-Shirt

Cluster #  3
Converse All Star PC2 - Boys' Toddler

Cluster #  4
Brooks Nightlife Infiniti 1/2 Zip Jacket - Mens
Brooks Nightlife Infiniti 1/2 Zip - Women's

Cluster #  5
HI Nike College All-Purpose Seasonal Graphic (Washington) Womens T-Shirt

Cluster #  6
Nike College All-Purpose Seasonal Graphic (Oklahoma) Womens T-Shirt

Cluster #  7
Converse All Star Ox - Girls' Toddler
HI Nike Polo Girls Golf Dress
HI Nike Sport Girls Golf Dress

Cluster #  8
Nike Therma-FIT K.O. (MLB Phillies)
Nike Therma-FIT K.O. (MLB Rays)
Nike Therma-FIT K.O. (MLB Twins)
HI Nike Solid Girls Golf Shorts

Source Code

This method has some loop holes, Ill try to address these issues in my next post using bi-gram fingerprints. Please let me know your thoughts/feedback!

Advertisement

19 thoughts on “Clustering Text – Map Reduce in Python

      1. No problem man. It looks like you’ve got a lot of great posts and I just wanted to let you know that you’re doing a great job 🙂

  1. I have followed the instruction above and there are no errors, but I can not see the results. Is there any idea about that? Thank you

    1. Thanks Agung. can you please share you input and output? If you have used the sample input, please make sure that you have included *sort* in between mapper and reducer.

      1. The input is input.tsv and I have used *sort*. After I executed the command *cat input.tsv | python map.py | sort -k1,1 | python reduce.py*, it did not show the results. I confuse because there is no warning error. Thank you

  2. hey, I can’t see any output produced. Can you run the code and verify that it produces results?
    thanks!

  3. Unable to get the desired output

    cat input.tsv | python map.py | sort -k1,1 | python reduce.py

    Cluster # 0
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]
    [“‘module’ object is not callable”]

    1. Hi chaibapat, you are trying to call something thats not callable using ‘()’. can you share your reducer code if its different?

Leave a Reply to rajmak Cancel reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.