Building a food graph is an interesting problem.
Such graphs can be used to mine similar recipes, analyse the relationships between cuisines and food cultures, and so on.
This blog post from NYTimes about “Extracting Structured Data From Recipes Using Conditional Random Fields” could be an initial step towards building such graphs.
In an attempt to implement the idea from that blog post, I’ve used CRFSuite to build a model that tags entities in a recipe's ingredient list.
CRFSuite installation instructions are here.
Note: For the impatient, please check out the TL;DR section at the end of the post.
There are 3 steps to reach the goal.
- Understanding data.
- Preparing data.
- Building model.
Step 1: Understanding data.
The basic assumption is that the following 5 entities are enough to tag the ingredients of a recipe.
- Quantity (QTY)
- Unit (UNIT)
- Comment (COM)
- Name (NAME)
- Other (OTHER)
For example,
Ingredient | Quantity | Unit | Comment | Name | Others |
---|---|---|---|---|---|
2 tablespoons of soya sauce | 2 | tablespoons | NA | soya, sauce | of |
Onions sliced and fried brown 3 medium | 3 | NA | sliced, brown, fried | onions | and |
3 Finely chopped Green Chillies | 3 | NA | finely, chopped, green | chillies | NA |
Similarly, most of the ingredients shared in recipes can be tagged with these 5 labels.
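To make the labeling scheme concrete, here is a minimal sketch that groups the hand-labeled tokens of one ingredient by label. The function name `group_by_label` is hypothetical, and the labels use the OTHER spelling that appears in the labeled data later in the post.

```python
from collections import defaultdict

LABELS = ("QTY", "UNIT", "COM", "NAME", "OTHER")


def group_by_label(tagged_tokens):
    """Collect the tokens of one ingredient under their labels."""
    grouped = defaultdict(list)
    for token, label in tagged_tokens:
        if label not in LABELS:
            raise ValueError("unknown label: %r" % label)
        grouped[label].append(token)
    return dict(grouped)


# Hand-labeled tokens for "2 tablespoons of soya sauce" from the table.
example = [("2", "QTY"), ("tablespoons", "UNIT"), ("of", "OTHER"),
           ("soya", "NAME"), ("sauce", "NAME")]
print(group_by_label(example))
```

Labels with no tokens (COM in this example) are simply absent from the result, which corresponds to the NA cells in the table.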
Step 2: Preparing data.
Preparing data involves the following steps:
- Collecting data
- POS tagging
- Labeling tokens
- Chunking
A simple script that politely scrapes data from any recipe site will do the job. Check out Scrapy.
I’ve collected data in the following format.
{
  "url": "http://allrecipes.co.in/recipe/12227/pakal-fish-curry.aspx",
  "ingredients": [
    "7-8 pakal fish",
    "1 teaspoon turmeric powder",
    "as needed salt",
    "2 tablespoon mustard oil",
    "a pinch black cumin seeds/powder",
    "2 tablespoon onion, sliced",
    "1/2 teaspoon ginger paste",
    "1/2 teaspoon garlic paste",
    "2-3 green chilies, chopped",
    "2 tablespoon white mustard paste",
    "water as needed",
    "as needed sugar",
    "1 tablespoon coriander leaves, chopped",
    "3-4 green chilies, whole"
  ]
}
The actual input file is a JSON Lines file.
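A JSON Lines file holds one JSON object per line, so it can be read one recipe at a time. The sketch below assumes each line has the `url` and `ingredients` keys shown above; `read_recipes` is a hypothetical helper, and an in-memory stream with a shortened ingredient list stands in for the real file.

```python
import io
import json


def read_recipes(lines):
    """Yield one recipe dict per non-empty line of a JSON Lines stream."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# In-memory stand-in for the input file, with a shortened ingredient list.
sample = io.StringIO(
    '{"url": "http://allrecipes.co.in/recipe/12227/pakal-fish-curry.aspx", '
    '"ingredients": ["7-8 pakal fish", "1 teaspoon turmeric powder"]}\n'
)
recipes = list(read_recipes(sample))
for recipe in recipes:
    print(recipe["url"], len(recipe["ingredients"]))
```

Passing an open file object instead of the `StringIO` stream works the same way, since both are iterated line by line.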
A three-column, tab-separated file is required for chunking.
- Column 1 – Token
- Column 2 – POS tag
- Column 3 – Label (done manually)
Each token in an ingredient gets its own line in the TSV file, and a blank line separates one ingredient from the next.
The following script generates data in the required format, taking the JSON Lines file mentioned above as input.
import json
import sys

import nltk

# Each input line is one recipe encoded as a JSON object.
for line in sys.stdin:
    data = json.loads(line)
    for ingredient in data['ingredients']:
        tokens = nltk.word_tokenize(ingredient.strip())
        tagged_tokens = nltk.pos_tag(tokens)
        for token, pos in tagged_tokens:
            try:
                # XXX is a placeholder for the manually assigned label.
                print("%s\t%s\tXXX" % (token, pos))
            except Exception as e:
                # Report problems on stderr so the TSV stays clean.
                print("Error writing token:", repr(token), e, file=sys.stderr)
        # A blank line separates one ingredient from the next.
        print()
$ cat recipes.jl | python crf_input_generator.py > token_pos.tsv
Note that XXX is just a placeholder, which will be replaced by the actual label (i.e. one of QTY, UNIT, COM, NAME, OTHER).
I’ve manually labeled each token with the help of OpenRefine. Skip this step if you are tagging with a model that is already available.
In the end, the file should look similar to the table shown below.
token | pos | label |
---|---|---|
7-8 | JJ | QTY |
pakal | NN | NAME |
fish | NN | NAME |
1 | CD | QTY |
teaspoon | NN | UNIT |
turmeric | JJ | NAME |
powder | NN | NAME |
as | IN | OTHER |
needed | VBN | OTHER |
salt | NN | NAME |
2 | CD | QTY |
tablespoon | NN | UNIT |
mustard | NN | NAME |
oil | NN | NAME |
... | ... | ... |
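Manual labeling is error-prone, so a quick sanity check of the labeled file can catch leftover placeholders and broken rows before chunking. This is only a sketch: `find_label_errors` is a hypothetical helper, and the allowed label set assumes the OTHER spelling used in the table above.

```python
ALLOWED_LABELS = {"QTY", "UNIT", "COM", "NAME", "OTHER"}


def find_label_errors(lines):
    """Return (line_number, message) pairs for rows that look wrong."""
    errors = []
    for number, line in enumerate(lines, start=1):
        row = line.rstrip("\n")
        if not row:
            continue  # blank line: boundary between two ingredients
        columns = row.split("\t")
        if len(columns) != 3:
            errors.append((number, "expected 3 columns, got %d" % len(columns)))
        elif columns[2] not in ALLOWED_LABELS:
            errors.append((number, "unexpected label %r" % columns[2]))
    return errors


# Two deliberately broken rows: a leftover XXX and a missing column.
sample = ["7-8\tJJ\tQTY\n", "pakal\tNN\tNAME\n", "fish\tNN\tXXX\n", "\n",
          "1\tCD\tQTY\n", "teaspoon\tNN\n"]
print(find_label_errors(sample))
```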
The next task is chunking, which is explained well here.
This experiment uses the same POS and token position features discussed in the tutorial, so the utility script provided in the CRFSuite repository can generate the chunks.
$ cat token_pos_tagged.tsv | python ~/workspace/crfsuite/example/chunking.py -s $'\t' > chunk.txt
After chunking the final output file should look similar to this.
QTY w[0]=7-8 w[1]=pakal w[2]=fish w[0]|w[1]=7-8|pakal pos[0]=JJ pos[1]=NN pos[2]=NN pos[0]|pos[1]=JJ|NN pos[1]|pos[2]=NN|NN pos[0]|pos[1]|pos[2]=JJ|NN|NN __BOS__
NAME w[-1]=7-8 w[0]=pakal w[1]=fish w[-1]|w[0]=7-8|pakal w[0]|w[1]=pakal|fish pos[-1]=JJ pos[0]=NN pos[1]=NN pos[-1]|pos[0]=JJ|NN pos[0]|pos[1]=NN|NN pos[-1]|pos[0]|pos[1]=JJ|NN|NN
NAME w[-2]=7-8 w[-1]=pakal w[0]=fish w[-1]|w[0]=pakal|fish pos[-2]=JJ pos[-1]=NN pos[0]=NN pos[-2]|pos[-1]=JJ|NN pos[-1]|pos[0]=NN|NN pos[-2]|pos[-1]|pos[0]=JJ|NN|NN __EOS__
QTY w[0]=1 w[1]=teaspoon w[2]=turmeric w[0]|w[1]=1|teaspoon pos[0]=CD pos[1]=NN pos[2]=JJ pos[0]|pos[1]=CD|NN pos[1]|pos[2]=NN|JJ pos[0]|pos[1]|pos[2]=CD|NN|JJ __BOS__
UNIT w[-1]=1 w[0]=teaspoon w[1]=turmeric w[2]=powder w[-1]|w[0]=1|teaspoon w[0]|w[1]=teaspoon|turmeric pos[-1]=CD pos[0]=NN pos[1]=JJ pos[2]=NN pos[-1]|pos[0]=CD|NN pos[0]|pos[1]=NN|JJ pos[1]|pos[2]=JJ|NN pos[-1]|pos[0]|pos[1]=CD|NN|JJ pos[0]|pos[1]|pos[2]=NN|JJ|NN
NAME w[-2]=1 w[-1]=teaspoon w[0]=turmeric w[1]=powder w[-1]|w[0]=teaspoon|turmeric w[0]|w[1]=turmeric|powder pos[-2]=CD pos[-1]=NN pos[0]=JJ pos[1]=NN pos[-2]|pos[-1]=CD|NN pos[-1]|pos[0]=NN|JJ pos[0]|pos[1]=JJ|NN pos[-2]|pos[-1]|pos[0]=CD|NN|JJ pos[-1]|pos[0]|pos[1]=NN|JJ|NN
NAME w[-2]=teaspoon w[-1]=turmeric w[0]=powder w[-1]|w[0]=turmeric|powder pos[-2]=NN pos[-1]=JJ pos[0]=NN pos[-2]|pos[-1]=NN|JJ pos[-1]|pos[0]=JJ|NN pos[-2]|pos[-1]|pos[0]=NN|JJ|NN __EOS__
OTHER w[0]=as w[1]=needed w[2]=salt w[0]|w[1]=as|needed pos[0]=IN pos[1]=VBN pos[2]=NN pos[0]|pos[1]=IN|VBN pos[1]|pos[2]=VBN|NN pos[0]|pos[1]|pos[2]=IN|VBN|NN __BOS__
OTHER w[-1]=as w[0]=needed w[1]=salt w[-1]|w[0]=as|needed w[0]|w[1]=needed|salt pos[-1]=IN pos[0]=VBN pos[1]=NN pos[-1]|pos[0]=IN|VBN pos[0]|pos[1]=VBN|NN pos[-1]|pos[0]|pos[1]=IN|VBN|NN
NAME w[-2]=as w[-1]=needed w[0]=salt w[-1]|w[0]=needed|salt pos[-2]=IN pos[-1]=VBN pos[0]=NN pos[-2]|pos[-1]=IN|VBN pos[-1]|pos[0]=VBN|NN pos[-2]|pos[-1]|pos[0]=IN|VBN|NN __EOS__
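To see where these features come from, the sketch below rebuilds them from a window of two tokens on either side of the current one. It is not the chunking.py script itself: the template list is inferred from the output shown here, feature values are not escaped, and the names `TEMPLATES` and `features` are mine.

```python
# Each template is a list of (field, offset) pairs relative to the current token.
TEMPLATES = [
    [("w", -2)], [("w", -1)], [("w", 0)], [("w", 1)], [("w", 2)],
    [("w", -1), ("w", 0)], [("w", 0), ("w", 1)],
    [("pos", -2)], [("pos", -1)], [("pos", 0)], [("pos", 1)], [("pos", 2)],
    [("pos", -2), ("pos", -1)], [("pos", -1), ("pos", 0)],
    [("pos", 0), ("pos", 1)], [("pos", 1), ("pos", 2)],
    [("pos", -2), ("pos", -1), ("pos", 0)],
    [("pos", -1), ("pos", 0), ("pos", 1)],
    [("pos", 0), ("pos", 1), ("pos", 2)],
]


def features(sequence, t):
    """Feature strings for token t of a list of {'w': ..., 'pos': ...} dicts."""
    result = []
    for template in TEMPLATES:
        # Skip templates that reach past either end of the ingredient.
        if any(not 0 <= t + offset < len(sequence) for _, offset in template):
            continue
        name = "|".join("%s[%d]" % (field, offset) for field, offset in template)
        value = "|".join(sequence[t + offset][field] for field, offset in template)
        result.append("%s=%s" % (name, value))
    if t == 0:
        result.append("__BOS__")  # first token of the ingredient
    if t == len(sequence) - 1:
        result.append("__EOS__")  # last token of the ingredient
    return result


ingredient = [{"w": "7-8", "pos": "JJ"}, {"w": "pakal", "pos": "NN"},
              {"w": "fish", "pos": "NN"}]
print(" ".join(features(ingredient, 0)))
```

For the first token of "7-8 pakal fish" this prints the same feature list as the first QTY line of the chunked output.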
Step 3: Building model
To train
$ crfsuite learn -m <model_name> <chunk_file>
To test
$ crfsuite tag -qt -m <model_name> <chunk_file>
To tag
$ crfsuite tag -m <model_name> <chunk_file>
TL;DR
I’ve collected 2000 recipes, of which 60% are used for training and 40% for testing.
Each ingredient is tokenized, POS tagged and manually labeled (hardest part).
Following are the input, intermediate and output files.
- recipes.jl – Input file: a JSON Lines file containing 2000 recipes.
- token_pos.tsv – Intermediate TSV file with each token and its POS tag. (The XXX column is a placeholder for the next step.)
- token_pos_tagged.tsv – TSV file with token, POS and label columns, after labeling the 3rd column manually.
- train.txt – 60% of input, chunked, for training
- test.txt – 40% of input, chunked, for testing
- recipes.model – model output
$ cat recipes.jl | python crf_input_generator.py > token_pos.tsv
Intermediate step: Manually label tokens and generate token_pos_tagged.tsv
$ cat token_pos_tagged.tsv | python ~/workspace/crfsuite/example/chunking.py -s $'\t' > chunk.txt
Intermediate step: split chunk.txt in 60/40 ratio to get train.txt and test.txt respectively
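One way to do this split is to cut on whole ingredients (the blank-line separated blocks in chunk.txt) rather than on individual lines. The sketch below assumes that layout; shuffling with a fixed seed is my own choice, and `split_sequences` is a hypothetical helper.

```python
import random


def join_blocks(blocks):
    """Write blocks back out, each followed by a blank line."""
    return "".join(block + "\n\n" for block in blocks)


def split_sequences(text, train_fraction=0.6, seed=0):
    """Split blank-line separated sequences into train and test text."""
    sequences = [block for block in text.strip().split("\n\n") if block.strip()]
    random.Random(seed).shuffle(sequences)  # assumption: random split
    cut = round(len(sequences) * train_fraction)
    return join_blocks(sequences[:cut]), join_blocks(sequences[cut:])


# Five tiny one-token sequences stand in for the contents of chunk.txt.
chunks = "".join("QTY w[0]=%d pos[0]=CD __BOS__ __EOS__\n\n" % n for n in range(5))
train_text, test_text = split_sequences(chunks)
print(train_text.count("__BOS__"), test_text.count("__BOS__"))
```

Writing `train_text` and `test_text` to train.txt and test.txt gives the two files used in the commands that follow.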
Training
$ crfsuite learn -m recipes.model train.txt
Testing
$ crfsuite tag -qt -m recipes.model test.txt
Performance by label (#match, #model, #ref) (precision, recall, F1):
    QTY: (7307, 7334, 7338) (0.9963, 0.9958, 0.9960)
    UNIT: (3944, 4169, 4091) (0.9460, 0.9641, 0.9550)
    COM: (5014, 5281, 5505) (0.9494, 0.9108, 0.9297)
    NAME: (11943, 12760, 12221) (0.9360, 0.9773, 0.9562)
    OTHER: (6984, 7094, 7483) (0.9845, 0.9333, 0.9582)
Macro-average precision, recall, F1: (0.962451, 0.956244, 0.959025)
Item accuracy: 35192 / 36638 (0.9605)
Instance accuracy: 6740 / 7854 (0.8582)
Elapsed time: 0.328684 [sec] (23895.3 [instance/sec])
Note: The -qt option works only with labeled data.
Precision | 96.2% |
Recall | 95.6% |
F1 Measure | 95.9% |
Read more about precision, recall and F1 measure here
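The per-label and macro-average figures follow directly from the (#match, #model, #ref) counts in the test output. As a worked check, this sketch recomputes them; the counts are copied from the output, while the function and variable names are mine.

```python
# (#match, #model, #ref) per label, copied from the test output.
COUNTS = {
    "QTY": (7307, 7334, 7338),
    "UNIT": (3944, 4169, 4091),
    "COM": (5014, 5281, 5505),
    "NAME": (11943, 12760, 12221),
    "OTHER": (6984, 7094, 7483),
}


def scores(match, model, ref):
    """Precision, recall and F1 for one label."""
    precision = match / model  # correct predictions / all predictions
    recall = match / ref       # correct predictions / all reference labels
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


per_label = {label: scores(*counts) for label, counts in COUNTS.items()}
# Macro average: the unweighted mean of each score over the 5 labels.
macro = [sum(column) / len(per_label) for column in zip(*per_label.values())]
item_accuracy = (sum(c[0] for c in COUNTS.values())
                 / sum(c[2] for c in COUNTS.values()))
print("macro P/R/F1: %.6f %.6f %.6f" % tuple(macro))
print("item accuracy: %.4f" % item_accuracy)
```

Item accuracy is the share of tokens labeled correctly, 35192 / 36638, while instance accuracy counts an ingredient as correct only when every one of its tokens is.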
To tag ingredients that the model has never seen before, follow Step 2 (skipping the manual labeling) and run the following command:
Tagging
$ crfsuite tag -m recipes.model test.txt
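The tagger's output still has to be joined back to the tokens to get structured ingredients. The sketch below assumes that `crfsuite tag` prints one predicted label per line with a blank line after each sequence, in the same order as the token TSV; `merge_predictions` is a hypothetical helper, and the labels in the example are the hand labels from the post rather than real model output.

```python
def merge_predictions(tokens_text, labels_text):
    """Pair each token with its predicted label, one dict per ingredient."""
    ingredients = []
    token_blocks = tokens_text.strip().split("\n\n")
    label_blocks = labels_text.strip().split("\n\n")
    for token_block, label_block in zip(token_blocks, label_blocks):
        grouped = {}
        rows = token_block.splitlines()
        labels = label_block.splitlines()
        for row, label in zip(rows, labels):
            token = row.split("\t")[0]  # column 1 of the TSV is the token
            grouped.setdefault(label, []).append(token)
        ingredients.append(grouped)
    return ingredients


# Token TSV for "2 tablespoon mustard oil", with XXX in the label column.
tokens_text = "2\tCD\tXXX\ntablespoon\tNN\tXXX\nmustard\tNN\tXXX\noil\tNN\tXXX\n\n"
# Stand-in for the tagger output (assumed format: one label per line).
labels_text = "QTY\nUNIT\nNAME\nNAME\n\n"
print(merge_predictions(tokens_text, labels_text))
```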
Code and data are available here.