Productionizing a CRF model, Recipe Ingredients Tagger in Action.

A popular way to productionize a statistical model is to expose it as a REST API, so that it can be scaled horizontally and remains cost-effective. In this post I’ll discuss the steps involved without going deep into implementation details.

In my previous post I discussed how to build a simple tagger using CRFSuite. The goal of the tagger is to convert unstructured data into structured data by tagging entities. I took ‘Food and Recipes’ as my domain and identified four important entities required to describe a recipe, plus a catch-all label for everything else.

  • QTY – Quantity, the number of units required. Usually a number.
  • UNIT – Unit of measure, e.g. teaspoon, pinch, bottle, cup.
  • NAME – Name of the ingredient, e.g. sugar, almond, chicken, milk.
  • COM – Comment about the ingredient, e.g. crushed, finely chopped, powdered.
  • OTHERS – Any other text, which can be ignored.

I used the Flask framework to build the microservice and Gunicorn for production deployment.

The input/output contract is simple: given a list of ingredients, the API should identify the entities and tag them.

Consider the following homemade mac and cheese recipe from allrecipes.com as an example.

[Image: homemade mac and cheese ingredients]

Our goal is to identify entities present in the text highlighted in yellow (i.e. list of ingredients).

The API accepts input in the following format.

[
"8 ounces uncooked elbow macaroni",
"2 cups shredded sharp Cheddar cheese",
"1/2 cup grated Parmesan cheese",
"3 cups milk",
"1/4 cup butter",
"2 1/2 tablespoons all-purpose flour",
"2 tablespoons butter",
"1/2 cup bread crumbs",
"1 pinch paprika"
]

It generates the output shown below: the tokens of each ingredient line with their predicted tags.

{
"tagged_tokens": [
[
{
"tag": "QTY",
"token": "8"
},
{
"tag": "UNIT",
"token": "ounces"
},
{
"tag": "COM",
"token": "uncooked"
},
{
"tag": "NAME",
"token": "elbow"
},
{
"tag": "NAME",
"token": "macaroni"
}
],
[
{
"tag": "QTY",
"token": "2"
},
{
"tag": "UNIT",
"token": "cups"
},
{
"tag": "COM",
"token": "shredded"
},
{
"tag": "COM",
"token": "sharp"
},
{
"tag": "NAME",
"token": "Cheddar"
},
{
"tag": "NAME",
"token": "cheese"
}
],
[
{
"tag": "QTY",
"token": "1/2"
},
{
"tag": "UNIT",
"token": "cup"
},
{
"tag": "COM",
"token": "grated"
},
{
"tag": "NAME",
"token": "Parmesan"
},
{
"tag": "NAME",
"token": "cheese"
}
],
[
{
"tag": "QTY",
"token": "3"
},
{
"tag": "UNIT",
"token": "cups"
},
{
"tag": "NAME",
"token": "milk"
}
],
[
{
"tag": "QTY",
"token": "1/4"
},
{
"tag": "UNIT",
"token": "cup"
},
{
"tag": "NAME",
"token": "butter"
}
],
[
{
"tag": "QTY",
"token": "2"
},
{
"tag": "QTY",
"token": "1/2"
},
{
"tag": "UNIT",
"token": "tablespoons"
},
{
"tag": "NAME",
"token": "all-purpose"
},
{
"tag": "NAME",
"token": "flour"
}
],
[
{
"tag": "QTY",
"token": "2"
},
{
"tag": "UNIT",
"token": "tablespoons"
},
{
"tag": "NAME",
"token": "butter"
}
],
[
{
"tag": "QTY",
"token": "1/2"
},
{
"tag": "UNIT",
"token": "cup"
},
{
"tag": "NAME",
"token": "bread"
},
{
"tag": "NAME",
"token": "crumbs"
}
],
[
{
"tag": "QTY",
"token": "1"
},
{
"tag": "UNIT",
"token": "pinch"
},
{
"tag": "NAME",
"token": "paprika"
}
]
]
}

A simple visualization to understand the output better.

[Image: color-coded ingredient entities]
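
Since the goal is structured data, a consumer of the API will probably want one record per ingredient line rather than a flat list of tokens. Here is a minimal sketch of that post-processing step. The `to_records` helper is my own illustration and not part of the API; it only assumes the `tagged_tokens` response shape shown above.

```python
from collections import defaultdict


def to_records(response):
    """Fold the API response into one {tag: text} dict per ingredient line."""
    records = []
    for line in response["tagged_tokens"]:
        grouped = defaultdict(list)
        for item in line:
            grouped[item["tag"]].append(item["token"])
        # Join multi-token entities, e.g. NAME: ["all-purpose", "flour"].
        records.append({tag: " ".join(tokens) for tag, tokens in grouped.items()})
    return records


sample = {"tagged_tokens": [[
    {"tag": "QTY", "token": "2"},
    {"tag": "QTY", "token": "1/2"},
    {"tag": "UNIT", "token": "tablespoons"},
    {"tag": "NAME", "token": "all-purpose"},
    {"tag": "NAME", "token": "flour"},
]]}

print(to_records(sample))
# [{'QTY': '2 1/2', 'UNIT': 'tablespoons', 'NAME': 'all-purpose flour'}]
```

Joining tokens with a space is a simplification. It works for this recipe, but it would merge two separate comments on the same line into one string.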

CRFSuite is written in C++. We can leverage its C++ API from Python through the SWIG wrapper.
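
For readers who have not used the wrapper, here is a rough sketch of what loading a model and building an input sequence can look like with the SWIG-generated `crfsuite` module. The helper names, the model path and the attribute strings are placeholders of mine, and the exact features depend on how the model was trained, so treat this as an outline rather than the code behind the API.

```python
def load_tagger(model_path):
    """Open a trained CRFSuite model file and return a tagger for it."""
    import crfsuite  # SWIG-generated module built from the CRFSuite sources

    tagger = crfsuite.Tagger()
    tagger.open(model_path)
    return tagger


def to_item_sequence(token_attributes):
    """Convert one list of attribute strings per token into an ItemSequence."""
    import crfsuite

    xseq = crfsuite.ItemSequence()
    for attributes in token_attributes:
        item = crfsuite.Item()
        for name in attributes:
            # Attribute(name) uses the default weight of 1.0.
            item.append(crfsuite.Attribute(name))
        xseq.append(item)
    return xseq


# Hypothetical usage, assuming a model file named "recipe.crfsuite":
#   tagger = load_tagger("recipe.crfsuite")
#   labels = tagger.tag(to_item_sequence([["w=3"], ["w=cups"], ["w=milk"]]))
```

The import sits inside the functions so the module can be loaded on a machine where the wrapper is not installed. In a real service the model would be opened once at startup, not per request.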

The following snippet shows the steps involved in transforming the incoming data into features the model understands, and how the output is interpreted at the end.

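from flask import Flask, abort, jsonify, request
import nltk

app = Flask(__name__)

# pre_feature, feature_extractor, to_crfsuite and the loaded CRFSuite
# `tagger` are defined elsewhere in the application (not shown here).


@app.route("/tag", methods=['GET', 'POST'])
def tag():
    content = request.get_json(silent=True)
    # Reject missing or invalid JSON and requests with more than 50 lines.
    if not isinstance(content, list) or len(content) > 50:
        abort(400)
    # Tokenize and POS-tag each ingredient line.
    tokens = [nltk.word_tokenize(line) for line in content]
    tagged_tokens = [nltk.pos_tag(line_tokens) for line_tokens in tokens]
    # Build per-token features; a list (not a lazy map) so it can be reused below.
    for_feature = pre_feature(tagged_tokens)
    with_feature = [feature_extractor(line) for line in for_feature]
    # All lines are flattened and tagged as a single sequence.
    flattened_with_feature = [item for sublist in with_feature for item in sublist]
    xseq = to_crfsuite(flattened_with_feature)
    yseq = tagger.tag(xseq)
    # Walk the predicted tags in order and regroup them per ingredient line.
    tags = iter(yseq)
    result = []
    for feature in with_feature:
        result.append([{"token": token['w'], "tag": next(tags)} for token in feature])
    return jsonify(tagged_tokens=result)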
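
To exercise the endpoint, a client only needs to POST the ingredient list as JSON. Below is a small standard-library sketch. The `http://localhost:8000` base URL and the helper names are assumptions for illustration. Note the `Content-Type` header: without it, Flask’s `get_json(silent=True)` returns `None` instead of the parsed list.

```python
import json
import urllib.request


def build_tag_request(ingredients, base_url="http://localhost:8000"):
    """Build a POST request for the /tag endpoint (base_url is a placeholder)."""
    body = json.dumps(ingredients).encode("utf-8")
    return urllib.request.Request(
        base_url + "/tag",
        data=body,
        # Flask only parses the body as JSON when this header is set.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tag_ingredients(ingredients, base_url="http://localhost:8000"):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_tag_request(ingredients, base_url)) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_tag_request(["3 cups milk", "1 pinch paprika"])
print(req.get_method(), req.full_url)
# POST http://localhost:8000/tag
```

Calling `tag_ingredients([...])` against a running instance should return a dictionary with a `tagged_tokens` key, in the shape shown earlier.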

Once the Flask app is ready, deploying it with Gunicorn is simple.
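
As an illustration, a Gunicorn configuration file is plain Python. The sketch below is not my exact production setup; the port, the timeout and the worker formula are assumptions you should tune for your own hardware.

```python
# gunicorn.conf.py -- a minimal sketch, values are illustrative
import multiprocessing

# Address and port to listen on (assumed; put a reverse proxy in front in production).
bind = "0.0.0.0:8000"

# Commonly suggested starting point: (2 x CPU cores) + 1 worker processes.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker may stay silent before it is restarted.
timeout = 30
```

Assuming the Flask object is called `app` and lives in a module named `app.py`, the service would be started with `gunicorn -c gunicorn.conf.py app:app`. Each worker is a separate process with its own tagger, so memory use is likely to grow with the worker count.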

Since a CRF is a statistical model, the modeler has to understand the relationships between variables and ends up spending about 90% of the time preparing data for training and testing. In other words, it’s time-consuming. These models can be used as a stepping stone towards unsupervised learning algorithms and use cases such as search relevance, recommendations, shopping carts and buy buttons.

You can try the API with different inputs at Mashape (registration required).

