Text Classification – Classifying product titles using Convolutional Neural Network and Word2Vec embedding

Text classification helps us better understand and organize data. I’ve tried building a simple CNN classifier using Keras with TensorFlow as the backend to classify products available on eCommerce sites. The data for this experiment consists of product titles from three distinct categories on a popular eCommerce site. Reference: Tutorial

tl;dr

Python notebook and data

 Collecting Data

For this experiment I’ve collected product titles belonging to the following categories.

  • Women’s clothing
  • Cameras
  • Home appliances

Since these categories are distinct, with no overlap of contextual information, our model should make few classification errors and perform well. I’ve implemented two proven CNN architectures with Word2Vec embeddings.

Setup

We need the following libraries

  • Gensim
  • Keras
  • NLTK
  • Pandas
  • Numpy
  • Tensorflow

and

  • Conda to manage virtual environment
  • Pre-trained vectors trained on the Google News dataset (download, ~1.5 GB) for the Word2Vec embedding.
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from keras.layers import Flatten
from keras.layers import MaxPooling1D
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from nltk.corpus import stopwords
MAX_NB_WORDS = 200000
MAX_SEQUENCE_LENGTH = 30
EMBEDDING_DIM = 300
EMBEDDING_FILE = "../lib/GoogleNews-vectors-negative300.bin"
category_index = {"clothing":0, "camera":1, "home-appliances":2}
category_reverse_index = dict((y,x) for (x,y) in category_index.items())
STOPWORDS = set(stopwords.words("english"))

Loading Data

Download the data. It is important to make sure that the data doesn’t have any null/NaN values.

clothing = pd.read_csv("clothing.tsv", sep='\t')
cameras = pd.read_csv("cameras.tsv", sep='\t')
home_appliances = pd.read_csv("home.tsv", sep='\t')
datasets = [clothing, cameras, home_appliances]
print("Make sure there are no null values in the datasets")
for data in datasets:
    print("Has null values: ", data.isnull().values.any())
Make sure there are no null values in the datasets
Has null values:  False
Has null values:  False
Has null values:  False

Preprocessing

Stop words, i.e. words that occur frequently and are distracting, are removed first. Then we use classes provided by Keras to prepare the text so it can be used by neural network models.

def preprocess(text):
    text = text.strip().lower().split()
    text = filter(lambda word: word not in STOPWORDS, text)
    return " ".join(text)

for dataset in datasets:
    dataset['title'] = dataset['title'].apply(preprocess)

To prepare the vector (array of integers) representation of the text:

  • Combine titles from all three categories to obtain one list of texts.
  • Drop duplicates.
  • Initialize the tokenizer with num_words = MAX_NB_WORDS (200K), i.e. the tokenizer will perform a word count, sort words by number of occurrences in descending order and pick the top N words (200K in this case).
  • Use the tokenizer’s texts_to_sequences method to convert text to an array of integers.
  • The arrays obtained from the previous step might not be of uniform length; use the pad_sequences method to obtain arrays with length equal to MAX_SEQUENCE_LENGTH (30).
all_texts = clothing['title'] + cameras['title'] + home_appliances['title']
all_texts = all_texts.drop_duplicates(keep=False)
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(all_texts)
clothing_sequences = tokenizer.texts_to_sequences(clothing['title'])
electronics_sequences = tokenizer.texts_to_sequences(cameras['title'])
home_appliances_sequences = tokenizer.texts_to_sequences(home_appliances['title'])
clothing_data = pad_sequences(clothing_sequences, maxlen=MAX_SEQUENCE_LENGTH)
electronics_data = pad_sequences(electronics_sequences, maxlen=MAX_SEQUENCE_LENGTH)
home_appliances_data = pad_sequences(home_appliances_sequences, maxlen=MAX_SEQUENCE_LENGTH)

word_index has a unique integer ID assigned to each word in the data. For example

word_index = tokenizer.word_index
test_string = "sports action spy pen camera"
print("word\t\tid")
print("-" * 20)
for word in test_string.split():
    print("%s\t\t%s" % (word, word_index[word]))
word		id
--------------------
sports		16
action		13
spy		7
pen		55
camera		2

The tokenizer will replace words with unique integer id to get a vector representation of the title. Example:

test_sequence = tokenizer.texts_to_sequences(["sports action camera", "spy pen camera"])
padded_sequence = pad_sequences(test_sequence, maxlen=MAX_SEQUENCE_LENGTH)
print("Text to Vector", test_sequence)
print("Padded Vector", padded_sequence)
Text to Vector [[16, 13, 2], [7, 55, 2]]
Padded Vector [[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0 16 13  2]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  7 55  2]]

Product titles belonging to the three categories have been kept separate so far for the sake of understanding. To prepare the input layer, all three categories are combined together and shuffled as shown below.

The category (y-axis or label) is converted to the convnet’s understandable format (a one-hot vector) using the keras.utils method to_categorical. Example:

print("clothing: \t\t", to_categorical(category_index["clothing"], 3))
print("camera: \t\t", to_categorical(category_index["camera"], 3))
print("home appliances: \t", to_categorical(category_index["home-appliances"], 3))
clothing: 		 [[ 1.  0.  0.]]
camera: 		 [[ 0.  1.  0.]]
home appliances: 	 [[ 0.  0.  1.]]
print("clothing shape: ", clothing_data.shape)
print("electronics shape: ", electronics_data.shape)
print("home appliances shape: ", home_appliances_data.shape)
data = np.vstack((clothing_data, electronics_data, home_appliances_data))
category = pd.concat([clothing['category'], cameras['category'], home_appliances['category']]).values
category = to_categorical(category)
print("-"*10)
print("combined data shape: ", data.shape)
print("combined category/label shape: ", category.shape)
clothing shape:  (392721, 30)
electronics shape:  (1347, 30)
home appliances shape:  (11425, 30)
----------
combined data shape:  (405493, 30)
combined category/label shape:  (405493, 3)

We shuffle and split the data since the categories are stacked one after the other. nb_validation_samples is the index that separates the training and testing/validation sets. This step can be simplified with train_test_split from scikit-learn (a sketch follows the snippet below).

VALIDATION_SPLIT = 0.4
indices = np.arange(data.shape[0]) # get sequence of row index
np.random.shuffle(indices) # shuffle the row indexes
data = data[indices] # shuffle data/product-titles/x-axis
category = category[indices] # shuffle labels/category/y-axis
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = category[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = category[-nb_validation_samples:]
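
For comparison, a minimal sketch of the same shuffle-and-split using scikit-learn’s train_test_split (assuming scikit-learn is installed; the random_state value is arbitrary):

# Equivalent split with scikit-learn; train_test_split shuffles internally.
from sklearn.model_selection import train_test_split

x_train, x_val, y_train, y_val = train_test_split(
    data, category, test_size=VALIDATION_SPLIT, random_state=42)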

Word2Vec embedding

Word2Vec brings in semantic similarity information which can be leveraged by the convnets. This experiment uses pre-trained vectors from Google News. Another option is GloVe.

word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
print('Found %s word vectors of word2vec' % len(word2vec.vocab))
Found 3000000 word vectors of word2vec

The following examples should help illustrate the intent behind using a pre-trained word2vec model.

print("Odd word out:", word2vec.doesnt_match("banana apple grapes carrot".split()))
print("-"*10)
print("Cosine similarity between TV and HBO:", word2vec.similarity("tv", "hbo"))
print("-"*10)
print("Most similar words to Computers:", ", ".join(map(lambda x: x[0], word2vec.most_similar("computers"))))
print("-"*10)
Odd word out: carrot
----------
Cosine similarity between TV and HBO: 0.613064891522
----------
Most similar words to Computers: computer, laptops, PCs, laptop_computers, desktop_computers, Computers, laptop, notebook_computers, Dell_OptiPlex_desktop, automated_seismographs
----------

A Keras embedding layer can be obtained with Gensim Word2Vec’s word2vec.get_keras_embedding(train_embeddings=False) method or constructed as shown below. The null word embeddings count indicates the number of words not found in the pre-trained vectors (in this case, Google News); these could be words unique to brands in this context.

from keras.layers import Embedding
word_index = tokenizer.word_index
nb_words = min(MAX_NB_WORDS, len(word_index))+1
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    if word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
embedding_layer = Embedding(embedding_matrix.shape[0],  # or len(word_index) + 1
                            embedding_matrix.shape[1],  # or EMBEDDING_DIM
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
Null word embeddings: 1473
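
As mentioned above, Gensim can also build the frozen Keras layer in one call. A minimal sketch (with the caveat, stated here as an assumption worth verifying, that the resulting layer is indexed by word2vec’s own vocabulary rather than tokenizer.word_index, so input sequences would have to be built accordingly):

# Alternative: let Gensim construct the non-trainable Keras Embedding layer.
# Caveat (assumption): row indices follow word2vec's vocabulary, not tokenizer.word_index.
embedding_layer_alt = word2vec.get_keras_embedding(train_embeddings=False)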

Model

I recommend this (30 min) video about how convnets work to understand the layers. Below is a replication of two proven architectures. More can be found here.

from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation
model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(300, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(150, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(75, 3, padding='valid',activation='relu',strides=2))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(150,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='sigmoid'))
model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 30, 300)           817200    
_________________________________________________________________
dropout_9 (Dropout)          (None, 30, 300)           0         
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 14, 300)           270300    
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 6, 150)            135150    
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 2, 75)             33825     
_________________________________________________________________
flatten_3 (Flatten)          (None, 150)               0         
_________________________________________________________________
dropout_10 (Dropout)         (None, 150)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 150)               22650     
_________________________________________________________________
dropout_11 (Dropout)         (None, 150)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 3)                 453       
=================================================================
Total params: 1,279,578
Trainable params: 462,378
Non-trainable params: 817,200
_________________________________________________________________
model_1 = Sequential()
model_1.add(embedding_layer)
model_1.add(Conv1D(250,3,padding='valid',activation='relu',strides=1))
model_1.add(GlobalMaxPooling1D())
model_1.add(Dense(250))
model_1.add(Dropout(0.2))
model_1.add(Activation('relu'))
model_1.add(Dense(3))
model_1.add(Activation('sigmoid'))
model_1.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model_1.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 30, 300)           817200    
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 28, 250)           225250    
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 250)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 250)               62750     
_________________________________________________________________
dropout_12 (Dropout)         (None, 250)               0         
_________________________________________________________________
activation_5 (Activation)    (None, 250)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 3)                 753       
_________________________________________________________________
activation_6 (Activation)    (None, 3)                 0         
=================================================================
Total params: 1,105,953
Trainable params: 288,753
Non-trainable params: 817,200
_________________________________________________________________
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=128)
score = model.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Train on 243296 samples, validate on 162197 samples
Epoch 1/5
243296/243296 [==============================] - 22s 92us/step - loss: 0.1106 - acc: 0.9768 - val_loss: 0.1090 - val_acc: 0.9773
Epoch 2/5
243296/243296 [==============================] - 24s 97us/step - loss: 0.1102 - acc: 0.9770 - val_loss: 0.1091 - val_acc: 0.9775
Epoch 3/5
243296/243296 [==============================] - 21s 86us/step - loss: 0.1102 - acc: 0.9770 - val_loss: 0.1080 - val_acc: 0.9774
Epoch 4/5
243296/243296 [==============================] - 23s 93us/step - loss: 0.1096 - acc: 0.9772 - val_loss: 0.1088 - val_acc: 0.9776
Epoch 5/5
243296/243296 [==============================] - 24s 98us/step - loss: 0.1098 - acc: 0.9773 - val_loss: 0.1097 - val_acc: 0.9773
Test loss: 0.10969909843
Test accuracy: 0.977305375562
model_1.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=128)
score = model_1.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Train on 243296 samples, validate on 162197 samples
Epoch 1/5
243296/243296 [==============================] - 13s 52us/step - loss: 8.3458e-04 - acc: 0.9999 - val_loss: 9.0927e-04 - val_acc: 0.9999
Epoch 2/5
243296/243296 [==============================] - 12s 48us/step - loss: 7.2089e-04 - acc: 0.9999 - val_loss: 0.0011 - val_acc: 0.9999
Epoch 3/5
243296/243296 [==============================] - 12s 49us/step - loss: 7.2221e-04 - acc: 1.0000 - val_loss: 0.0012 - val_acc: 0.9999
Epoch 4/5
243296/243296 [==============================] - 12s 51us/step - loss: 7.1913e-04 - acc: 0.9999 - val_loss: 0.0010 - val_acc: 0.9999
Epoch 5/5
243296/243296 [==============================] - 12s 49us/step - loss: 6.7104e-04 - acc: 1.0000 - val_loss: 0.0011 - val_acc: 0.9999
Test loss: 0.00113550592472
Test accuracy: 0.999895189184

model_1 clearly outperforms model (about 99.99% vs 97.7% validation accuracy). Below is an example of how to use this model.

example_product = "Nikon Coolpix A10 Point and Shoot Camera (Black)"
example_product = preprocess(example_product)
example_sequence = tokenizer.texts_to_sequences([example_product])
example_padded_sequence = pad_sequences(example_sequence, maxlen=MAX_SEQUENCE_LENGTH)
print("-"*10)
print("Predicted category: ", category_reverse_index[model_1.predict_classes(example_padded_sequence, verbose=0)[0]])
print("-"*10)
probabilities = model_1.predict(example_padded_sequence, verbose=0)
probabilities = probabilities[0]
print("Clothing Probability: ",probabilities[category_index["clothing"]] )
print("Camera Probability: ",probabilities[category_index["camera"]] )
print("home appliances probability: ",probabilities[category_index["home-appliances"]] )
----------
Predicted category:  camera
----------
Clothing Probability:  5.12844e-21
Camera Probability:  0.505056
home appliances probability:  5.71945e-23

Conclusion

My observation is that with neural networks, the time taken for feature engineering is considerably reduced, and most of the effort goes into deciding the architecture of the network layers. The Word2Vec embedding contributes greatly to improving the accuracy of the model.

Productionizing a CRF model – Recipe Ingredients Tagger in Action.

A popular way to productionize a statistical model is to expose it as a REST API, so that it can be scaled horizontally and remains cost effective. In this post I’ll discuss the steps involved without going into implementation details.

In my previous post I discussed how to build a simple tagger using CRFSuite. The goal of the tagger is to convert unstructured data into structured data by tagging entities. I took ‘Food and Recipes’ as my domain and identified four important entities required to describe a recipe, plus a catch-all label:

  • QTY – Quantity, number of units required. Usually numbers.
  • UNIT – Such as teaspoon, pinch, bottles, cups etc.
  • NAME – Name of the ingredient, example: sugar, almond, chicken, milk etc.
  • COM – Comment about the ingredients. example: crushed, finely chopped, powdered etc.
  • OTHERS – Random text that can be ignored.

I’ve used the Flask framework for the microservice and Gunicorn for production deployment.

The input/output contract is simple: given a list of ingredients, the API should identify the entities and tag them.

Consider the following homemade mac and cheese recipe from allrecipes.com as an example.

image of a recipe
homemade mac and cheese ingredients

Our goal is to identify entities present in the text highlighted in yellow (i.e. list of ingredients).

The API accepts input in the following format.

[
"8 ounces uncooked elbow macaroni",
"2 cups shredded sharp Cheddar cheese",
"1/2 cup grated Parmesan cheese",
"3 cups milk",
"1/4 cup butter",
"2 1/2 tablespoons all-purpose flour",
"2 tablespoons butter",
"1/2 cup bread crumbs",
"1 pinch paprika"
]

It generates output as shown below: tokens and their respective tags.

{
  "tagged_tokens": [
    [
      {"tag": "QTY", "token": "8"},
      {"tag": "UNIT", "token": "ounces"},
      {"tag": "COM", "token": "uncooked"},
      {"tag": "NAME", "token": "elbow"},
      {"tag": "NAME", "token": "macaroni"}
    ],
    [
      {"tag": "QTY", "token": "2"},
      {"tag": "UNIT", "token": "cups"},
      {"tag": "COM", "token": "shredded"},
      {"tag": "COM", "token": "sharp"},
      {"tag": "NAME", "token": "Cheddar"},
      {"tag": "NAME", "token": "cheese"}
    ],
    [
      {"tag": "QTY", "token": "1/2"},
      {"tag": "UNIT", "token": "cup"},
      {"tag": "COM", "token": "grated"},
      {"tag": "NAME", "token": "Parmesan"},
      {"tag": "NAME", "token": "cheese"}
    ],
    [
      {"tag": "QTY", "token": "3"},
      {"tag": "UNIT", "token": "cups"},
      {"tag": "NAME", "token": "milk"}
    ],
    [
      {"tag": "QTY", "token": "1/4"},
      {"tag": "UNIT", "token": "cup"},
      {"tag": "NAME", "token": "butter"}
    ],
    [
      {"tag": "QTY", "token": "2"},
      {"tag": "QTY", "token": "1/2"},
      {"tag": "UNIT", "token": "tablespoons"},
      {"tag": "NAME", "token": "all-purpose"},
      {"tag": "NAME", "token": "flour"}
    ],
    [
      {"tag": "QTY", "token": "2"},
      {"tag": "UNIT", "token": "tablespoons"},
      {"tag": "NAME", "token": "butter"}
    ],
    [
      {"tag": "QTY", "token": "1/2"},
      {"tag": "UNIT", "token": "cup"},
      {"tag": "NAME", "token": "bread"},
      {"tag": "NAME", "token": "crumbs"}
    ],
    [
      {"tag": "QTY", "token": "1"},
      {"tag": "UNIT", "token": "pinch"},
      {"tag": "NAME", "token": "paprika"}
    ]
  ]
}

A simple visualization to understand the output better.

image of Color coded entities
Color coded ingredient entities

CRFSuite is written in C++; we can leverage CRFSuite’s C++ API by using the SWIG wrapper for Python.
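
If building the SWIG wrapper is inconvenient, the python-crfsuite package exposes a similar Tagger API. Below is a minimal sketch of loading a trained model with python-crfsuite instead of the SWIG wrapper (the model filename is an assumption):

# Load a trained CRFSuite model with python-crfsuite (an alternative to the SWIG wrapper).
import pycrfsuite

tagger = pycrfsuite.Tagger()
tagger.open("recipes.model")  # filename assumed; produced by `crfsuite learn -m recipes.model ...`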

The following snippet explains the various steps involved in transforming the incoming data into model-understandable features, and how the output is interpreted at the end.

# app (the Flask instance), tagger, pre_feature, feature_extractor and
# to_crfsuite are created elsewhere in the service during startup.
from flask import request, jsonify, abort
import nltk

@app.route("/tag", methods=['GET', 'POST'])
def tag():
    content = request.get_json(silent=True)
    if len(content) > 50:  # reject overly large payloads
        return abort(400)
    tokens = map(nltk.word_tokenize, content)
    tagged_tokens = map(nltk.pos_tag, tokens)
    for_feature = pre_feature(tagged_tokens)
    with_feature = map(feature_extractor, for_feature)
    flattened_with_feature = [item for sublist in with_feature for item in sublist]
    xseq = to_crfsuite(flattened_with_feature)
    yseq = tagger.tag(xseq)
    tags = []
    for y in yseq:
        tags.append(y)
    tags = list(reversed(tags))
    result = []
    for feature in with_feature:
        tagged_token = []
        for token in feature:
            tagged_token.append({"token": token['w'], "tag": tags.pop()})
        result.append(tagged_token)
    return jsonify(tagged_tokens=result)
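
To exercise the endpoint locally, a client call might look like the sketch below (the /tag route comes from the snippet above; the host, port and example payload are assumptions):

# Hypothetical client call against a locally running instance of the service.
import requests

ingredients = ["8 ounces uncooked elbow macaroni", "1 pinch paprika"]
response = requests.post("http://localhost:8000/tag", json=ingredients)
print(response.json())  # e.g. {"tagged_tokens": [[{"tag": "QTY", "token": "8"}, ...], ...]}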

Once the Flask app is ready, deploying with Gunicorn is simple.

Since CRF is a statistical model, it requires the modeler to understand the relationships between variables, and hence around 90% of the time goes into preparing data for training and testing. In other words, it is time consuming. These models can be used as a stepping stone towards unsupervised learning algorithms, search relevance, recommendations, shopping cart and buy button use cases, etc.

You can try the API with different inputs at Mashape (registration required).

Structuring text – Sequence tagging using Conditional Random Field (CRF). Tagging recipe ingredient phrases.

Building a food graph is an interesting problem.
Such graphs can be used to mine similar recipes, analyse relationships between cuisines and food cultures, etc.

This blog post from NYTimes about “Extracting Structured Data From Recipes Using Conditional Random Fields” could be an initial step towards building such graphs.

In an attempt to implement the idea shared in the blog post mentioned above, I’ve used CRFSuite to build a model that tags entities in ingredient lists.
CRFSuite installation instructions here.

Note: For the impatient, please check out the TL;DR section at the end of the post.

Three steps to reach the goal:

  1. Understanding data.
  2. Preparing data.
  3. Building model.

Step 1: Understanding data.

The basic assumption is to use the following 5 entities to tag ingredients of a recipe.

  1. Quantity (QTY)
  2. Unit (UNIT)
  3. Comment (COM)
  4. Name (NAME)
  5. Others (OTHERS)

For example,

Ingredient | Quantity | Unit | Comment | Name | Others
2 tablespoons of soya sauce | 2 | tablespoons | NA | soya, sauce | of
Onions sliced and fried brown 3 medium | 3 | NA | sliced, brown, fried | onions | and
3 Finely chopped Green Chillies | 3 | NA | finely, chopped, green | chillies | NA

Similarly, most of the ingredients found in recipes can be tagged with these 5 labels.

Step 2: Preparing data.

Preparing data involves the following steps

  1. Collecting data
  2. POS tagging
  3. Labeling tokens
  4. Chunking

A simple script to politely scrape data from any recipe site will do the job. Check out Scrapy.

I’ve collected data in the following format.

{
  "url": "http://allrecipes.co.in/recipe/12227/pakal-fish-curry.aspx",
  "ingredients": [
    "7-8 pakal fish",
    "1 teaspoon turmeric powder",
    "as needed salt",
    "2 tablespoon mustard oil",
    "a pinch black cumin seeds/powder",
    "2 tablespoon onion, sliced",
    "1/2 teaspoon ginger paste",
    "1/2 teaspoon garlic paste",
    "2-3 green chilies, chopped",
    "2 tablespoon white mustard paste",
    "water as needed",
    "as needed sugar",
    "1 tablespoon coriander leaves, chopped",
    "3-4 green chilies, whole"
  ]
}

The actual input file is a JSON Lines file.

A three-column tab-separated file is required for chunking.

  • Column 1 – Token
  • Column 2 – POS tag
  • Column 3 – Label (done manually)

Each token in an ingredient list gets a line in the TSV file, and a blank line separates ingredients.
The following script generates data in the required format, taking the JSON Lines file mentioned above as input.

import sys
import nltk
import json

for line in sys.stdin:
    data = json.loads(line)
    for ingredient in data['ingredients']:
        tokens = nltk.word_tokenize(ingredient.strip())
        tagged_tokens = nltk.pos_tag(tokens)
        for token, pos in tagged_tokens:
            try:
                print "%s\t%s\tXXX" % (token.encode('utf8'), pos)
            except Exception as e:
                print e
                print "Error writing token:", token
        print
$ cat recipes.jl | python crf_input_generator.py > token_pos.tsv

Note that XXX is just a placeholder, which will be replaced by the actual label (i.e. one of QTY, UNIT, COM, NAME, OTHERS).
I’ve manually labeled each token with the help of OpenRefine; skip this step if you are tagging with a model that is already available.
In the end the file should look similar to the table shown below.

token pos label
7-8 JJ QTY
pakal NN NAME
fish NN NAME
1 CD QTY
teaspoon NN UNIT
turmeric JJ NAME
powder NN NAME
as IN OTHER
needed VBN OTHER
salt NN NAME
2 CD QTY
tablespoon NN UNIT
mustard NN NAME
oil NN NAME
... ... ...

The next task is chunking, and it is explained well here.
The same POS and token-position features discussed in the tutorial are used as features in this experiment as well, so we can generate the chunks using the util script provided in the CRFSuite repository.

$ cat token_pos_tagged.tsv | python ~/workspace/crfsuite/example/chunking.py -s $'\t' > chunk.txt 

After chunking, the final output file should look similar to this.

QTY w[0]=7-8 w[1]=pakal w[2]=fish w[0]|w[1]=7-8|pakal pos[0]=JJ pos[1]=NN pos[2]=NN pos[0]|pos[1]=JJ|NN pos[1]|pos[2]=NN|NN pos[0]|pos[1]|pos[2]=JJ|NN|NN __BOS__
NAME w[-1]=7-8 w[0]=pakal w[1]=fish w[-1]|w[0]=7-8|pakal w[0]|w[1]=pakal|fish pos[-1]=JJ pos[0]=NN pos[1]=NN pos[-1]|pos[0]=JJ|NN pos[0]|pos[1]=NN|NN pos[-1]|pos[0]|pos[1]=JJ|NN|NN
NAME w[-2]=7-8 w[-1]=pakal w[0]=fish w[-1]|w[0]=pakal|fish pos[-2]=JJ pos[-1]=NN pos[0]=NN pos[-2]|pos[-1]=JJ|NN pos[-1]|pos[0]=NN|NN pos[-2]|pos[-1]|pos[0]=JJ|NN|NN __EOS__
QTY w[0]=1 w[1]=teaspoon w[2]=turmeric w[0]|w[1]=1|teaspoon pos[0]=CD pos[1]=NN pos[2]=JJ pos[0]|pos[1]=CD|NN pos[1]|pos[2]=NN|JJ pos[0]|pos[1]|pos[2]=CD|NN|JJ __BOS__
UNIT w[-1]=1 w[0]=teaspoon w[1]=turmeric w[2]=powder w[-1]|w[0]=1|teaspoon w[0]|w[1]=teaspoon|turmeric pos[-1]=CD pos[0]=NN pos[1]=JJ pos[2]=NN pos[-1]|pos[0]=CD|NN pos[0]|pos[1]=NN|JJ pos[1]|pos[2]=JJ|NN pos[-1]|pos[0]|pos[1]=CD|NN|JJ pos[0]|pos[1]|pos[2]=NN|JJ|NN
NAME w[-2]=1 w[-1]=teaspoon w[0]=turmeric w[1]=powder w[-1]|w[0]=teaspoon|turmeric w[0]|w[1]=turmeric|powder pos[-2]=CD pos[-1]=NN pos[0]=JJ pos[1]=NN pos[-2]|pos[-1]=CD|NN pos[-1]|pos[0]=NN|JJ pos[0]|pos[1]=JJ|NN pos[-2]|pos[-1]|pos[0]=CD|NN|JJ pos[-1]|pos[0]|pos[1]=NN|JJ|NN
NAME w[-2]=teaspoon w[-1]=turmeric w[0]=powder w[-1]|w[0]=turmeric|powder pos[-2]=NN pos[-1]=JJ pos[0]=NN pos[-2]|pos[-1]=NN|JJ pos[-1]|pos[0]=JJ|NN pos[-2]|pos[-1]|pos[0]=NN|JJ|NN __EOS__
OTHER w[0]=as w[1]=needed w[2]=salt w[0]|w[1]=as|needed pos[0]=IN pos[1]=VBN pos[2]=NN pos[0]|pos[1]=IN|VBN pos[1]|pos[2]=VBN|NN pos[0]|pos[1]|pos[2]=IN|VBN|NN __BOS__
OTHER w[-1]=as w[0]=needed w[1]=salt w[-1]|w[0]=as|needed w[0]|w[1]=needed|salt pos[-1]=IN pos[0]=VBN pos[1]=NN pos[-1]|pos[0]=IN|VBN pos[0]|pos[1]=VBN|NN pos[-1]|pos[0]|pos[1]=IN|VBN|NN
NAME w[-2]=as w[-1]=needed w[0]=salt w[-1]|w[0]=needed|salt pos[-2]=IN pos[-1]=VBN pos[0]=NN pos[-2]|pos[-1]=IN|VBN pos[-1]|pos[0]=VBN|NN pos[-2]|pos[-1]|pos[0]=IN|VBN|NN __EOS__

Step 3: Building model

To train

$ crfsuite learn -m <model_name> <chunk_file>

To test

$ crfsuite tag -qt -m <model_name> <chunk_file>

To tag

$ crfsuite tag -m <model_name> <chunk_file>

TL;DR

I’ve collected 2000 recipes out of which 60% is used for training and 40% is used for testing.

Each ingredient is tokenized, POS tagged and manually labeled (hardest part).
Following are the input, intermediate and output files.

  • recipes.jl – a JSON lines file containing 2000 recipes. Input file
  • token_pos.tsv – Intermediate TSV file with the token and its POS (the column with XXX is a placeholder for the next step).
  • token_pos_tagged.tsv – TSV file with token, pos and label columns, after tagging 3rd column manually.
  • train.txt – 60% of input, chunked, for training
  • test.txt – 40% of input, chunked, for testing
  • recipe.model – model output
$ cat recipes.jl | python crf_input_generator.py > token_pos.tsv

Intermediate step: Manually label tokens and generate token_pos_tagged.tsv

$ cat token_pos_tagged.tsv | python ~/workspace/crfsuite/example/chunking.py > chunk.txt

Intermediate step: split chunk.txt in 60/40 ratio to get train.txt and test.txt respectively

Training

$ crfsuite learn -m recipes.model train.txt

Testing

$ crfsuite tag -qt -m recipes.model test.txt

Performance by label (#match, #model, #ref) (precision, recall, F1):
    QTY: (7307, 7334, 7338) (0.9963, 0.9958, 0.9960)
    UNIT: (3944, 4169, 4091) (0.9460, 0.9641, 0.9550)
    COM: (5014, 5281, 5505) (0.9494, 0.9108, 0.9297)
    NAME: (11943, 12760, 12221) (0.9360, 0.9773, 0.9562)
    OTHER: (6984, 7094, 7483) (0.9845, 0.9333, 0.9582)
Macro-average precision, recall, F1: (0.962451, 0.956244, 0.959025)
Item accuracy: 35192 / 36638 (0.9605)
Instance accuracy: 6740 / 7854 (0.8582)
Elapsed time: 0.328684 [sec] (23895.3 [instance/sec])

Note: the -qt option will work only with labeled data.

Precision 96%
Recall 95%
F1 Measure 95%

Read more about precision, recall and F1 measure here
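
As a quick sanity check, the QTY row above can be reproduced directly from its (#match, #model, #ref) counts:

# Precision = match/model, recall = match/ref, F1 = harmonic mean of the two.
match, model, ref = 7307, 7334, 7338
precision = match / float(model)                    # ~0.9963
recall = match / float(ref)                         # ~0.9958
f1 = 2 * precision * recall / (precision + recall)  # ~0.9960
print("%.4f %.4f %.4f" % (precision, recall, f1))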

To tag ingredients that the model has never seen before, follow Step 2 and run the following command

Tagging

$ crfsuite tag -m recipes.model test.txt

code and data here

Setting up a Python development environment with buildout

Attn: Check out Conda before trying this.

Buildout is a Python-based build system for creating, assembling and deploying applications from multiple parts, some of which may be non-Python-based. It lets you create a buildout configuration and reproduce the same software later.
buildout.org

I’ve documented the steps required to create a simple buildout based project.

  1. Start by creating a project directory and initialising a virtual environment inside it.

    $ mkdir word_count_buildout && cd word_count_buildout
    $ virtualenv --no-site-packages .env
    $ source .env/bin/activate
    
  2. Fetch bootstrap-buildout.py, a common script required to create the necessary directories and eggs (setuptools, etc.).

     
    $ wget https://bootstrap.pypa.io/bootstrap-buildout.py
    
  3. Create a buildout configuration file

    $ vi buildout.cfg
    

    copy & paste the following snippet

    [buildout]
    develop = .
    parts = job job-test scripts zipeggs
    # index = http://mypypicloud.example.com:6543/pypi/
    
    [job]
    recipe = zc.recipe.egg
    interpreter = python
    eggs = wordcount-job
    
    [job-test]
    recipe = pbp.recipe.noserunner
    eggs = pbp.recipe.noserunner
    working-directory = ${buildout:directory}
    
    [scripts]
    recipe = zc.recipe.egg:scripts
    eggs = dumbo
    
    [zipeggs]
    recipe = zipeggs:zipeggs
    target = dist 
    source = eggs
    

    basic config file structure explained here.

    • ln:4 index – if using a private PyPI, uncomment and replace the URL.
    • ln:21 recipe = zipeggs:zipeggs – a buildout recipe to zip all flattened/unzipped eggs. Flattened/unzipped eggs are convenient while developing (for debugging purposes) and they load faster, but Dumbo requires zipped eggs to be passed via the -libegg param; the zipeggs recipe can generate zipped eggs under the target directory (dist). More details at the @tamizhgeek repo.
  4. Execute the bootstrap file

    $ python bootstrap-buildout.py
    
    Downloading https://pypi.python.org/packages/source/s/setuptools/setuptools-18.1.zip
    Extracting in /tmp/tmpZ7ki33
    Now working in /tmp/tmpZ7ki33/setuptools-18.1
    Building a Setuptools egg in /tmp/bootstrap-q4Xfc5
    /tmp/bootstrap-q4Xfc5/setuptools-18.1-py2.7.egg
    Creating directory '/home/raj/workspace/word_count/eggs'.
    Creating directory '/home/raj/workspace/word_count/bin'.
    Creating directory '/home/raj/workspace/word_count/parts'.
    Creating directory '/home/raj/workspace/word_count/develop-eggs'.
    Generated script '/home/raj/workspace/word_count/bin/buildout'.
    

    The script has created a few directories and a buildout script inside the bin directory. Read more about the directory structure here.

  5. Create setup.py.

    $ vi setup.py
    
    from setuptools import setup, find_packages
    import os
    version = os.environ.get("PIPELINE_LABEL", "1.0")
    setup(
        name="wordcount-job",
        version=version,
        packages=find_packages(),
        zip_safe=True,
        install_requires=[
            'dumbo'
        ]
    )
    

    Read more about setuptools here.
    Any changes to setup.py and buildout.cfg should be followed by executing ./bin/buildout

  6. Create a python module for a simple word-count dumbo job.

    $ mkdir wordcount-job && touch wordcount-job/__init__.py
    $ vi wordcount-job/wordcount.py
    

    copy & paste the following code.

    def mapper(key, value):
        for word in value.split():
            yield word, 1
    
    def reducer(key, values):
        yield key, sum(values)
    
    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)
    
  7. Finally, run the buildout script to fetch artifacts from a private or the central PyPI

    $ ./bin/buildout
    

    All the dependencies (and their dependencies) mentioned in setup.py are collected under the eggs/ directory. Zipped eggs are available under the dist/ directory, and executable scripts with dependencies wired in are generated under the bin directory. Try viewing the contents of bin/dumbo.

  8. To run the job

    $ ./bin/dumbo start wordcount-job/wordcount.py -input /tmp/input -output /tmp/output 
    

    wordcount.py can access all the dependencies (under eggs/ directory) mentioned in setup.py.

  9. To run tests.

    $ ./bin/job-test
    

    Read more about the pbp.recipe.noserunner recipe and nose.

  10. To build egg

    $ ./bin/buildout setup . bdist_egg
    
  11. Private PyPI repos can also be used to distribute Python eggs. To publish an egg to a private PyPI, create a config for pypicloud under the home directory.

    $ cd $HOME && vi .pypirc
    

    copy & paste the following

    [distutils]
    index-servers = my-pypi
    [my-pypi]
    repository: http://mypypicloud.example.com:6543/pypi/
    username: username
    password: password
    

    Under project working directory

    $ ./bin/buildout setup . bdist_egg upload -r my-pypi
    
  12. Source code
  13. Beer to bud @azhaguselvan for support and stuff.

Locality sensitive hashing (LSH) – Map-Reduce in Python

I’ll try to explain LSH with the help of Python code and the map-reduce technique.

It is said that “there is a remarkable connection between minhashing and Jaccard similarity of the sets that are minhashed.” [Chapter 3, 3.3.3, Mining of Massive Datasets]

Jaccard similarity

J(A, B) = |A ∩ B| / |A ∪ B|

where A and B are sets:
J = 0 if A and B are disjoint
J = 1 if A and B are identical

For example:

>>> a = {'nike', 'running', 'shoe'}
>>> b = {'nike', 'black', 'running', 'shoe'}
>>> c = {'nike', 'blue', 'jacket'}
>>> float(len(a.intersection(b))) / len(a.union(b))
0.75 			# a and b are similar.				
>>> float(len(a.intersection(c))) / len(a.union(c))
0.2				# a and c are... meh..

Minhashing

Probability of collision is higher for similar sets.

Table 1: Matrix representation of sets

keyword | x | a | b | c
nike    | 1 | 1 | 1 | 1
running | 2 | 1 | 1 | 0
shoe    | 3 | 1 | 1 | 0
black   | 4 | 0 | 1 | 0
blue    | 5 | 0 | 0 | 1
jacket  | 6 | 0 | 0 | 1

Table 2: Signature Matrix with hash values

Hash function          | a          | b            | c
h1(x) = (x + 1) mod 6  | min(2,3,4) | min(2,3,4,5) | min(2,0,1)
h2(x) = (3x + 1) mod 6 | min(4,1,4) | min(4,1,4,1) | min(4,4,1)

which becomes,

Table 3: Signature matrix with minhash values

Hash function          | a | b | c
h1(x) = (x + 1) mod 6  | 2 | 2 | 0
h2(x) = (3x + 1) mod 6 | 1 | 1 | 1

From Table 3 we can infer that sets a and b are similar.
The similarity of a and b from Table 1 is 3/4 = 0.75.
From the signature matrix in Table 3, the similarity of a and b is 2/2 = 1.

The fraction from the signature matrix (Table 3) is just an estimate of the true Jaccard similarity; with more hash functions, the estimates get closer to the true values.
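
A few lines of Python reproduce the signature matrix above, using the ids from Table 1 and the two hash functions from Table 2:

# Recompute Table 3: the minhash of each set under h1 and h2.
ids = {'nike': 1, 'running': 2, 'shoe': 3, 'black': 4, 'blue': 5, 'jacket': 6}
sets = {'a': {'nike', 'running', 'shoe'},
        'b': {'nike', 'black', 'running', 'shoe'},
        'c': {'nike', 'blue', 'jacket'}}
h1 = lambda x: (x + 1) % 6
h2 = lambda x: (3 * x + 1) % 6
for name in sorted(sets):
    xs = [ids[w] for w in sets[name]]
    print("%s: h1=%d h2=%d" % (name, min(map(h1, xs)), min(map(h2, xs))))
# a: h1=2 h2=1
# b: h1=2 h2=1
# c: h1=0 h2=1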

Map-Reduce

Mapper

sample_dict.txt holds the word-to-id mapping.

  • for every line in the input file:
    • split the text and convert it to an array of ids using the word-to-id mapping file.
    • compute the minimum hash value under each hash function.
    • split the array of min-hash values into multiple equally sized chunks, a.k.a. bands.
    • assign an id to each band and emit the band hash, band-id and doc-id.

Reducer

  • group by band-hash and band-id to get lists of similar doc-ids.

Mapper Code

# lsh_mapper.py
__author__ = 'raj'
import sys
from random import randrange

word_ids = dict()
num_hashes = 10
num_per_band = 2

# a_hash and b_hash cannot be generated on the fly if running in a distributed env; they should be the same across all nodes
a_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)]
b_hash = [randrange(sys.maxint) for _ in xrange(0, num_hashes)]


def min_hash_fn(a, b, sig):
    hashes = [((a * x) + b) % len(word_ids) for x in sig]
    return min(hashes)


def get_min_hash_row(sig):
    hashes = [min_hash_fn(a, b, sig) for a, b in zip(a_hash, b_hash)]
    return hashes


def get_band(l, n):
    for i in xrange(0, len(l), n):
        yield frozenset(l[i:i+n])


for word, wid in map(lambda x: x.split(), open("sample_dict.txt").readlines()):
    word_ids[word] = int(wid)

for doc_id, doc in enumerate(sys.stdin):
    words = doc.strip().lower().split()

    signature = map(lambda x: word_ids.get(x), words)
    signature = filter(lambda x: x is not None, signature)

    min_hash_row = get_min_hash_row(signature)

    banded = get_band(min_hash_row, num_per_band)

    for band_id, band in enumerate(banded):
        print "%d\t%d\t%d" % (band_id, hash(band), doc_id)

Reducer Code

# lsh_reducer.py
__author__ = 'raj'

import sys

prev_band_id, prev_band_hash = None, None
cluster = []
cid = 0

for line in sys.stdin:
    band_id, band_hash, doc_id = line.strip().split("\t", 3)

    if prev_band_id is None and prev_band_hash is None:
        prev_band_id, prev_band_hash = band_id, band_hash

    if prev_band_id == band_id:  # compare values, not identity
        if prev_band_hash == band_hash:
            cluster.append(doc_id)
        else:
            print cid, cluster
            cluster = [doc_id]
    else:
        print cid, cluster
        cluster = [doc_id]
        cid += 1
    prev_band_id, prev_band_hash = band_id, band_hash

In action

sample_input.txt

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Nike Airmax Running SHoe
Corduroy Shorts - Flat Front (For Men) BEIGE
Nokia Lumia 721
Corduroy Shorts - Flat Front (For Men) BROWN

sample_dict.txt

&	1
(for	2
-0	3
1-14	4
12-	5
14	6
2-piece	7
721	8
airmax	9
and	10
beige	11
blue	12
brown	13
corduroy	14
corduroys	15
denim	16
doll	17
dot	18
dress	19
fashion	20
flat	21
flower	22
front	23
inch	24
jumper	25
leggings	26
lumia	27
me	28
men)	29
nike	30
nokia	31
outfit	32
piece	33
pink	34
polka	35
running	36
shirt	37
shoe	38
shorts	39
slate	40
teal	41
top	42
white	43
with	44
you	45
-	46

Command

$ cat sample_input.txt | python lsh_mapper.py | sort | python lsh_reducer.py

Output

0 ['1', '2']
0 ['0']
0 ['5', '7']
0 ['6']
0 ['3']
0 ['4']
1 ['4']
1 ['6']
1 ['0']
1 ['2']
1 ['3', '5', '7']
1 ['1']
2 ['6']
2 ['4']
2 ['0', '1', '2']
2 ['3', '5', '7']
3 ['6']
3 ['3', '5', '7']
3 ['0', '1', '2']
3 ['4']
4 ['0', '1']
4 ['3']
4 ['5']
4 ['4']
4 ['2']
4 ['7']

resolved output

band 0
------
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 1
------
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 2
------
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 3
------
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

band 4
------
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt

code here

Clustering Text – Map Reduce in Python

Here I’m sharing a simple method to cluster text (product titles) based on key collision.

Dependencies

The scripts below use NLTK, the stemming package (Porter2 stemmer) and python-Levenshtein (see the imports in the code).

My input file is a list of 20 product titles:

Converse All Star PC2 - Boys' Toddler
HI Nike Sport Girls Golf Dress
Brooks Nightlife Infiniti 1/2 Zip - Women's
HI Nike Solid Girls Golf Shorts
Nike Therma-FIT K.O. (MLB Rays)
adidas adiPURE IV TRX FG - Men's
Nike College All-Purpose Seasonal Graphic (Oklahoma) Womens T-Shirt
adidas Adipure 11PRO TRX FG - Women's
HI Nike Team (NFL Giants) BCA Womens T-Shirt
adidas Sprintstar 4 - Men's
HI Nike Attitude (NFL Titans) BCA Womens T-Shirt
HI Nike Polo Girls Golf Dress
Nike Therma-FIT K.O. (MLB Twins)
adidas Sprintstar 3 - Women's
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Purple/White
Converse All Star Ox - Girls' Toddler
HI Nike College All-Purpose Seasonal Graphic (Washington) Womens T-Shirt
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Red/White
Nike Therma-FIT K.O. (MLB Phillies)
Brooks Nightlife Infiniti 1/2 Zip Jacket - Mens

The idea is to split the data into meaningful clusters so that they can be given as smaller inputs to various systems (de-duplication or classification) instead of the entire data set.

Below are the steps involved in generating a fingerprint, an alternate representation of the title (used as the key):

  1. Remove special characters
  2. Remove numbers
  3. Remove stop words
  4. Stem each word
  5. Sort the words in alphabetical order

Below is the Python code that does it:

# fingerprint.py

import sys
import re
import string
import itertools
import nltk
from stemming.porter2 import stem

class FingerPrint(object):
	def __init__(self):
		super(FingerPrint, self).__init__()
		self.remove_spl_char_regex = re.compile('[%s]' % re.escape(string.punctuation)) # regex to remove special characters
		self.remove_num = re.compile('[\d]+')

	def fp_steps(self,text):
		title = text.strip().lower()
		title_splchar_removed = self.remove_spl_char_regex.sub(" ",title)
		title_number_removed = self.remove_num.sub("", title_splchar_removed)
		words = title_number_removed.split()
		filter_stop_words = [w for w in words if not w in nltk.corpus.stopwords.words('english')]
		stemed = [stem(w) for w in filter_stop_words]
		return sorted(stemed)
	
	def fingerprint(self,text):
		fp = " ".join(self.fp_steps(text))
		return fp

Now my mapper can emit key-value pairs where key = fingerprint and value = product title.

# map.py

import sys
import re
import string
from fingerprint import FingerPrint

f = FingerPrint()

for line in sys.stdin:
	try:
		print "%s\t%s" % (f.fingerprint(line),line.strip())
	except Exception as e:
		print e
		pass

Next, I need to sort the output and group the titles based on a distance measure. I’m using Levenshtein distance, and below is the logic behind the reducer.

  1. The default distance threshold is 20; anything within 20 will be added to the current cluster.
  2. Add the first title to the cluster (if empty).
  3. If the distance between the current title’s fingerprint and the fingerprint of the last element in the cluster is at most the threshold (20), add it to the cluster.
  4. If the distance is greater than the threshold (20), create a new cluster and continue.

Following is the Python code that does it:

# reduce.py

import sys
from Levenshtein import distance
import json

DISTANCE = 20
cluster = {}
cid = 0

for i,line in enumerate(sys.stdin):
	cols = line.strip().split("\t")
	if i == 0:
		cluster[cid] = []
		cluster[cid].append(cols)
	else:
		last = cluster[cid][-1]
		if distance(last[0],cols[0]) <= DISTANCE:
			cluster[cid].append(cols)
		else:
			cid+=1
			cluster[cid] = []
			cluster[cid].append(cols)

for k,v in cluster.iteritems():
	print
	print "Cluster # ",k
	for entry in v:
		print entry[1]

To run,

cat input.tsv | python map.py | sort -k1,1 | python reduce.py 

and my output :O

Cluster #  0
adidas adiPURE IV TRX FG - Men's
adidas Adipure 11PRO TRX FG - Women's
adidas Sprintstar 4 - Men's
adidas Sprintstar 3 - Women's

Cluster #  1
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Purple/White
Under Armour Performance Team Polo - Mens - For All Sports - Clothing - Red/White

Cluster #  2
HI Nike Attitude (NFL Titans) BCA Womens T-Shirt
HI Nike Team (NFL Giants) BCA Womens T-Shirt

Cluster #  3
Converse All Star PC2 - Boys' Toddler

Cluster #  4
Brooks Nightlife Infiniti 1/2 Zip Jacket - Mens
Brooks Nightlife Infiniti 1/2 Zip - Women's

Cluster #  5
HI Nike College All-Purpose Seasonal Graphic (Washington) Womens T-Shirt

Cluster #  6
Nike College All-Purpose Seasonal Graphic (Oklahoma) Womens T-Shirt

Cluster #  7
Converse All Star Ox - Girls' Toddler
HI Nike Polo Girls Golf Dress
HI Nike Sport Girls Golf Dress

Cluster #  8
Nike Therma-FIT K.O. (MLB Phillies)
Nike Therma-FIT K.O. (MLB Rays)
Nike Therma-FIT K.O. (MLB Twins)
HI Nike Solid Girls Golf Shorts

Source Code

This method has some loopholes; I’ll try to address these issues in my next post using bi-gram fingerprints. Please let me know your thoughts/feedback!