Structuring text – Sequence tagging using Conditional Random Field (CRF). Tagging recipe ingredient phrases.

Building a food graph is an interesting problem.
Such graphs can be used to mine similar recipes and to analyse the relationships between cuisines and food cultures.

The NYTimes blog post “Extracting Structured Data From Recipes Using Conditional Random Fields” describes what could be a first step towards building such graphs.

To implement the idea shared in that post, I’ve used CRFSuite to build a model that tags entities in ingredient lists.
CRFSuite installation instructions are here.

Note: If you are impatient, check out the TL;DR section at the end of the post.

There are 3 steps to reach the goal.

  1. Understanding data.
  2. Preparing data.
  3. Building model.

Step 1: Understanding data.

The basic assumption is that every token in a recipe's ingredient phrase can be tagged with one of the following 5 entities.

  1. Quantity (QTY)
  2. Unit (UNIT)
  3. Comment (COM)
  4. Name (NAME)
  5. Others (OTHERS)

For example,

Ingredient | Quantity | Unit | Comment | Name | Others
2 tablespoons of soya sauce | 2 | tablespoons | NA | soya, sauce | of
Onions sliced and fried brown 3 medium | 3 | NA | sliced, brown, fried | onions | and
3 Finely chopped Green Chillies | 3 | NA | finely, chopped, green | chillies | NA

Similarly, most ingredients listed in recipes can be tagged with these 5 labels.
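To make the labelling scheme concrete, here is a minimal sketch of how one tagged ingredient phrase could be held in Python. The `LABELS` tuple and the `unknown_labels` helper are hypothetical names used only for illustration (they are not part of CRFSuite); the tokens and labels are taken from the first example above.

```python
# The five labels used in this post.
LABELS = ("QTY", "UNIT", "COM", "NAME", "OTHERS")

# One ingredient phrase as (token, label) pairs.
tagged = [
    ("2", "QTY"),
    ("tablespoons", "UNIT"),
    ("of", "OTHERS"),
    ("soya", "NAME"),
    ("sauce", "NAME"),
]

def unknown_labels(pairs, allowed=LABELS):
    """Return the labels in `pairs` that are not in the allowed set."""
    return sorted({label for _, label in pairs if label not in allowed})

# Render the phrase in a compact token/LABEL form for eyeballing.
rendered = " ".join("%s/%s" % (token, label) for token, label in tagged)
print(rendered)
print(unknown_labels(tagged))
```

A check like this can catch typos (for example `OTHER` instead of `OTHERS`) before the manually labelled data reaches the training step.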

Step 2: Preparing data.

Preparing data involves the following steps:

  1. Collecting data
  2. POS tagging
  3. Labeling tokens
  4. Chunking

A simple script that politely scrapes data from any recipe site will do the job. Check out Scrapy.

I’ve collected data in the following format.

The actual input file is a JSON Lines file.
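A JSON Lines file holds one JSON object per line, so it can be read one record at a time. The sketch below shows one way to do that; the `ingredients` field name and the in-memory sample are assumptions made for illustration, so the real recipes.jl may be laid out differently.

```python
import io
import json

def read_json_lines(stream):
    """Yield one parsed record per non-empty line of a JSON Lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)

# Stand-in for open("recipes.jl"); the "ingredients" field name is assumed.
sample = io.StringIO(
    '{"ingredients": ["2 tablespoons of soya sauce"]}\n'
    '{"ingredients": ["3 Finely chopped Green Chillies"]}\n'
)

recipes = list(read_json_lines(sample))
for recipe in recipes:
    for ingredient in recipe["ingredients"]:
        print(ingredient)
```

Because each line is parsed independently, the whole file never has to be loaded into memory at once.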

A three-column, tab-separated file is required for chunking.

  • Column 1 – Token
  • Column 2 – POS tag
  • Column 3 – Label (done manually)

Each token in an ingredient list gets its own line in the TSV file, and a blank line separates ingredients.
The following script generates data in the required format, taking the JSON Lines file mentioned above as input.

$ cat recipes.jl | python crf_input_generator.py > token_pos.tsv
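The actual crf_input_generator.py is not reproduced here. As a rough sketch of the kind of output it produces, the snippet below emits one `token<TAB>POS<TAB>XXX` row per token and a blank line between ingredients. The function names are hypothetical, and `toy_pos_tag` is a deliberately crude placeholder; a real tagger such as NLTK's `nltk.pos_tag` could presumably be passed in instead.

```python
def toy_pos_tag(tokens):
    """Placeholder tagger (assumption): digits become CD, everything else NN."""
    return [(tok, "CD" if tok.isdigit() else "NN") for tok in tokens]

def ingredient_to_rows(ingredient, tagger=toy_pos_tag):
    """Return 'token<TAB>POS<TAB>XXX' rows for one ingredient phrase."""
    tokens = ingredient.split()  # naive whitespace tokenisation
    return ["\t".join((tok, pos, "XXX")) for tok, pos in tagger(tokens)]

def ingredients_to_tsv(ingredients, tagger=toy_pos_tag):
    """Join the rows of every ingredient, separated by one blank line."""
    blocks = ["\n".join(ingredient_to_rows(ing, tagger)) for ing in ingredients]
    return "\n\n".join(blocks) + "\n"

tsv = ingredients_to_tsv([
    "2 tablespoons of soya sauce",
    "3 Finely chopped Green Chillies",
])
print(tsv, end="")
```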

Note that XXX is just a placeholder, which will be replaced by the actual label (i.e. one of QTY, UNIT, COM, NAME, OTHERS).
I’ve manually labeled each token with the help of OpenRefine. Skip this step if you are tagging with a model that is already available.
In the end, the file should look similar to the table shown below.

The next task is chunking, which is explained well here.
The same POS and token position features discussed in the tutorial are used as features in this experiment as well, so we can generate chunks using the util script provided in the CRFSuite repository.

$ cat token_pos_tagged.tsv | python ~/workspace/crfsuite/example/chunking.py -s $'\t' > chunk.txt 
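To give a feel for what this step does, here is a simplified sketch of turning labelled rows into feature lines, with the label first and tab-separated attributes after it. It is not the actual chunking.py, which uses a richer set of feature templates; the feature names, the one-token window and the POS tags in the sample are assumptions chosen for illustration.

```python
def token_features(seq, i):
    """Window features for position i of a [(token, pos, label), ...] sequence."""
    feats = []
    for offset in (-1, 0, 1):
        j = i + offset
        if 0 <= j < len(seq):
            token, pos, _ = seq[j]
            feats.append("w[%d]=%s" % (offset, token))
            feats.append("pos[%d]=%s" % (offset, pos))
    if i == 0:
        feats.append("__BOS__")  # marks the beginning of the sequence
    if i == len(seq) - 1:
        feats.append("__EOS__")  # marks the end of the sequence
    return feats

def to_feature_lines(seq):
    """One 'LABEL<TAB>feature<TAB>feature...' line per token."""
    return ["\t".join([label] + token_features(seq, i))
            for i, (_, _, label) in enumerate(seq)]

# POS tags below are assumed, not produced by a real tagger.
sequence = [
    ("2", "CD", "QTY"),
    ("tablespoons", "NNS", "UNIT"),
    ("of", "IN", "OTHERS"),
    ("soya", "NN", "NAME"),
    ("sauce", "NN", "NAME"),
]

lines = to_feature_lines(sequence)
print("\n".join(lines))
```

Each token sees its own word and POS tag plus those of its immediate neighbours, which is how position information reaches the model.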

After chunking the final output file should look similar to this.

Step 3: Building model

To train

$ crfsuite learn -m <model_name> <chunk_file>

To test

$ crfsuite tag -qt -m <model_name> <chunk_file>

To tag

$ crfsuite tag -m <model_name> <chunk_file>

TL;DR

I’ve collected 2000 recipes, of which 60% are used for training and 40% for testing.

Each ingredient is tokenized, POS tagged and manually labeled (the hardest part).
The following are the input, intermediate and output files.

  • recipes.jl – JSON Lines file containing 2000 recipes (input file).
  • token_pos.tsv – intermediate TSV file with each token and its POS tag (the XXX column is a placeholder for the next step).
  • token_pos_tagged.tsv – TSV file with token, POS and label columns, after manually labeling the 3rd column.
  • train.txt – 60% of the input, chunked, for training.
  • test.txt – 40% of the input, chunked, for testing.
  • recipes.model – model output.

$ cat recipes.jl | python crf_input_generator.py > token_pos.tsv

Intermediate step: Manually label tokens and generate token_pos_tagged.tsv

$ cat token_pos_tagged.tsv | python ~/workspace/crfsuite/example/chunking.py -s $'\t' > chunk.txt

Intermediate step: split chunk.txt in a 60/40 ratio to get train.txt and test.txt respectively
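The split has to happen on whole sequences (ingredients), not on individual lines, because a blank line separates one sequence from the next. A minimal sketch, assuming that layout; the function names are hypothetical, and the sequences are split in file order (shuffle them first if the recipes are grouped by source).

```python
def read_sequences(lines):
    """Group lines into sequences; a blank line ends the current sequence."""
    sequences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line:
            current.append(line)
        elif current:
            sequences.append(current)
            current = []
    if current:
        sequences.append(current)
    return sequences

def split_sequences(sequences, train_percent=60):
    """Split in file order: first train_percent% for training, rest for testing."""
    cut = len(sequences) * train_percent // 100
    return sequences[:cut], sequences[cut:]

def write_sequences(sequences, path):
    """Write sequences back out, each followed by a blank line."""
    with open(path, "w") as out:
        for seq in sequences:
            out.write("\n".join(seq) + "\n\n")

# Tiny in-memory stand-in for chunk.txt: five one-line sequences.
demo_lines = ["QTY\tw[0]=%d\n" % n if n else "\n" for n in (1, 0, 2, 0, 3, 0, 4, 0, 5, 0)]
train, test = split_sequences(read_sequences(demo_lines))
print(len(train), len(test))

# With real data (not run here):
# with open("chunk.txt") as f:
#     train, test = split_sequences(read_sequences(f))
# write_sequences(train, "train.txt")
# write_sequences(test, "test.txt")
```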

Training

$ crfsuite learn -m recipes.model train.txt

Testing

$ crfsuite tag -qt -m recipes.model test.txt

Performance by label (#match, #model, #ref) (precision, recall, F1):
    QTY: (7307, 7334, 7338) (0.9963, 0.9958, 0.9960)
    UNIT: (3944, 4169, 4091) (0.9460, 0.9641, 0.9550)
    COM: (5014, 5281, 5505) (0.9494, 0.9108, 0.9297)
    NAME: (11943, 12760, 12221) (0.9360, 0.9773, 0.9562)
    OTHER: (6984, 7094, 7483) (0.9845, 0.9333, 0.9582)
Macro-average precision, recall, F1: (0.962451, 0.956244, 0.959025)
Item accuracy: 35192 / 36638 (0.9605)
Instance accuracy: 6740 / 7854 (0.8582)
Elapsed time: 0.328684 [sec] (23895.3 [instance/sec])

Note: the -qt option works only with labeled data.

Macro-average precision: 96.2%
Macro-average recall: 95.6%
Macro-average F1 measure: 95.9%
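The per-label figures in the report follow directly from the three counts: precision is #match / #model, recall is #match / #ref, and F1 is their harmonic mean. The macro-average is the plain mean over the five labels. The sketch below recomputes them from the counts printed above; the variable and function names are my own.

```python
# (#match, #model, #ref) per label, copied from the crfsuite report above.
counts = {
    "QTY": (7307, 7334, 7338),
    "UNIT": (3944, 4169, 4091),
    "COM": (5014, 5281, 5505),
    "NAME": (11943, 12760, 12221),
    "OTHER": (6984, 7094, 7483),
}

def prf(match, model, ref):
    """Precision, recall and F1 from match/model/reference counts."""
    precision = match / model
    recall = match / ref
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

scores = {label: prf(*c) for label, c in counts.items()}

# Macro-average: unweighted mean of each metric over all labels.
macro = tuple(sum(s[i] for s in scores.values()) / len(scores) for i in range(3))

for label, (p, r, f) in scores.items():
    print("%s: (%.4f, %.4f, %.4f)" % (label, p, r, f))
print("Macro-average: (%.6f, %.6f, %.6f)" % macro)
```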

Read more about precision, recall and F1 measure here

To tag ingredients that the model has never seen before, follow Step 2 (the manual labeling can be skipped) and run the following command.

Tagging

$ crfsuite tag -m recipes.model test.txt
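The tagger's predictions still have to be lined up with the original tokens to get structured records. The sketch below assumes the predictions come back as one label per line with a blank line between sequences, in the same order as the tokens; check this against your own crfsuite output. The helper names are hypothetical, and the labels in the sample are hand-written for illustration, not real model output.

```python
from collections import defaultdict

def parse_blocks(text):
    """Split blank-line separated text into lists of non-empty lines."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line.strip())
        elif current:
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks

def to_structured(tokens, labels):
    """Group the tokens of one ingredient by predicted label."""
    grouped = defaultdict(list)
    for token, label in zip(tokens, labels):
        grouped[label].append(token)
    return {label: " ".join(words) for label, words in grouped.items()}

# Tokens in the same order they were fed to the tagger.
tokens_text = "2\ntablespoons\nof\nsoya\nsauce\n\n3\nFinely\nchopped\nGreen\nChillies\n"
# Hand-written labels standing in for the tagger's output (assumed format).
predicted_text = "QTY\nUNIT\nOTHERS\nNAME\nNAME\n\nQTY\nCOM\nCOM\nCOM\nNAME\n"

records = [
    to_structured(tokens, labels)
    for tokens, labels in zip(parse_blocks(tokens_text), parse_blocks(predicted_text))
]
for record in records:
    print(record)
```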

Code and data are here.


11 thoughts on “Structuring text – Sequence tagging using Conditional Random Field (CRF). Tagging recipe ingredient phrases.”

  1. I have JSON data exported by MongoDB in the following format:
    [{"_id":{ "$oid": "56c1da43848a5512712c9bfd" },"mac":"C4:E9:84:18:7A:C2","angle":null,"power":-45,"time":"2016-02-15T14:01:39.231Z" }},{"_id":{ "$oid": "56c1da43848a5512712c9bff" },"mac":"54:27:1E:FD:7F:47","angle":null,"power":-91,"time":"2016-02-15T14:01:39.970Z" }},{"_id":{ "$oid": "56c1da44848a5512712c9c01" },"mac":"74:29:AF:61:D0:7F","angle":null,"power":-84,"time":"2016-02-15T14:01:40.592Z" }}]

    I want to extract the “power” data instead of “ingredients” as in your case.
    crf_input_generator.py gives me an error when I run the command to get token_pos.tsv.
    The error is caused by my JSON format. I cannot change my JSON data because there is a huge amount of it.
    Can you help me?

    1. Hi Hasnain,
      Thanks for checking out my post. The Python script accepts a JSON Lines file, so you have to tweak the script a little to work with your input.
      For example,

      import json

      # Load the whole JSON array into memory at once.
      with open("input.json") as inp:
          data = json.load(inp)

      for record in data:
          print(record)
          # do stuff with record["power"]

      I suggest using the JSON Lines format; mongoexport exports data in JSON Lines format by default.
      Loading a fully-formed JSON document is not memory efficient.

      Hope it helps. Thanks.

      1. Thanks for your reply Rajmak 🙂
        One more thing I need to know:
        You mentioned at the end: “To tag ingredients that the model has never seen before, follow Step 2 and run the following command
        $ crfsuite tag -m recipes.model test.txt”

        If I have a model file available,
        should I POS tag, label (manually) and chunk the never-seen data first, or just tokenize the new data and run this command?
        $ crfsuite tag -m recipes.model newdata.txt

      2. Hi Hasnain,
        You can skip the labelling part and use the model to do it for you. POS tagging and chunking are enough.

  2. Hi Rajmak,

    Stumbled on your post a few days back. I followed your instructions to implement the model. Really helpful.

    I would like to get your ideas about the recipe matching problem. Let’s say we have a model that identifies the features in an ingredients document. How would you approach matching similar recipes?

    Regards,
    Prabhakar

  3. This is one of the clearest tutorials on CRF I have come across. I am not a practitioner, but I am trying to grasp the concept of using this in a real-life setting. How do you transform the output into something structured? For example:
    QUANTITY:7-8
    INGREDIENT:PAKAL FISH

    I am working on an e-commerce problem of trying to extract attribute pairs from product descriptions, a problem really close to what you are trying to solve here. Any advice on how to transform the learned output into structured data?

    1. Thank you for your interest Emma. I manually tagged each and every token that has high frequency in the dataset to generate training data.
      Please check the TL;DR section of my post for the summary of steps involved.
      For your problem, you might have to generate ‘product category’ specific models to avoid overlap (different models for apparel, electronics, etc.). You might also need to spend some time on feature engineering to discover patterns in descriptions and convert them into features. (My example has only one feature, which is POS.)
