Productionizing a CRF Model: Recipe Ingredients Tagger in Action

A popular way to productionize a statistical model is to expose it as a REST API, so that it can be scaled horizontally and run cost-effectively. In this post I’ll discuss the steps involved at a high level, without going deep into implementation details.

In my previous post I discussed how to build a simple tagger using CRFSuite. The goal of the tagger is to convert unstructured text into structured data by tagging entities. I took ‘Food and Recipes’ as my domain and identified 4 important entities that are required to describe a recipe, plus a catch-all label for everything else.

  • QTY – Quantity, number of units required. Usually numbers.
  • UNIT – Such as teaspoon, pinch, bottles, cups etc.
  • NAME – Name of the ingredient, example: sugar, almond, chicken, milk etc.
  • COM – Comment about the ingredient, example: crushed, finely chopped, powdered etc.
  • OTHERS – Random text that can be ignored.

I’ve used the Flask framework to build the microservice and Gunicorn for production deployment.

The input/output contract is simple: given a list of ingredients, the API should identify the entities and tag them.

Consider the following homemade mac and cheese recipe from allrecipes.com as an example.

[Figure: homemade mac and cheese ingredients]

Our goal is to identify entities present in the text highlighted in yellow (i.e. list of ingredients).

The API accepts input in the following format.

It generates output as shown below: tokens and their respective labels.
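The exact request and response payloads are not reproduced here, so the following is only a minimal sketch of what such a contract could look like. The field names `ingredients`, `results`, `token` and `label`, as well as the helpers `parse_request` and `build_response`, are assumptions made for illustration and are not the API's actual format.

```python
from typing import Dict, List

# The five labels listed earlier in this post.
LABELS = {"QTY", "UNIT", "NAME", "COM", "OTHERS"}


def parse_request(payload: dict) -> List[str]:
    """Validate a request body and return the cleaned ingredient lines.

    Assumed (hypothetical) shape: {"ingredients": ["2 cups milk", ...]}
    """
    lines = payload.get("ingredients") if isinstance(payload, dict) else None
    if not isinstance(lines, list) or not lines:
        raise ValueError("expected a non-empty list under 'ingredients'")
    if not all(isinstance(line, str) and line.strip() for line in lines):
        raise ValueError("every ingredient must be a non-empty string")
    return [line.strip() for line in lines]


def build_response(tokens_per_line: List[List[str]],
                   labels_per_line: List[List[str]]) -> Dict[str, list]:
    """Pair each token with its label, one inner list per ingredient line.

    Assumed (hypothetical) shape:
    {"results": [[{"token": "2", "label": "QTY"}, ...], ...]}
    """
    results = []
    for tokens, labels in zip(tokens_per_line, labels_per_line):
        if len(tokens) != len(labels):
            raise ValueError("token and label counts differ")
        unknown = set(labels) - LABELS
        if unknown:
            raise ValueError("unknown labels: " + ", ".join(sorted(unknown)))
        results.append(
            [{"token": tok, "label": lab} for tok, lab in zip(tokens, labels)]
        )
    return {"results": results}
```

In a Flask view, `parse_request` would typically receive the result of `request.get_json()`, and the dictionary from `build_response` would be returned through `jsonify`.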

A simple visualization helps to understand the output better.

[Figure: color-coded ingredient entities]
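A color-coded view like this can be produced from the tagged output with very little code. The sketch below is one possible way to do it; the `to_html` helper and the colors in `COLORS` are arbitrary choices for illustration, not the ones used in the figure.

```python
from html import escape
from typing import Iterable, Tuple

# Hypothetical palette: one background color per label.
COLORS = {
    "QTY": "#ffd54f",
    "UNIT": "#81d4fa",
    "NAME": "#a5d6a7",
    "COM": "#f8bbd0",
    "OTHERS": "#e0e0e0",
}


def to_html(tagged: Iterable[Tuple[str, str]]) -> str:
    """Render (token, label) pairs as color-coded HTML spans."""
    spans = []
    for token, label in tagged:
        # Unknown labels fall back to the neutral OTHERS color.
        color = COLORS.get(label, COLORS["OTHERS"])
        spans.append(
            '<span style="background:{}" title="{}">{}</span>'.format(
                color, escape(label), escape(token)
            )
        )
    return " ".join(spans)
```

Tokens and labels are escaped before being embedded, so ingredient text containing characters such as `<` or `&` does not break the markup.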

CRFSuite exposes a C++ API, and we can use that API from Python through its SWIG wrapper.

The following snippet walks through the steps involved in transforming the incoming data into features the model understands, and how the output is interpreted at the end.
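As a rough illustration of those steps, here is a minimal sketch that tokenizes an ingredient line, turns each token into string features, and pairs the predicted labels back with the tokens. The function names (`tokenize`, `token_features`, `tag_line`) and the feature set are assumptions for illustration, not the features the actual model was trained on. The tagger is passed in as any object with a `tag(xseq)` method that returns one label per token; wiring in the real CRFSuite tagger is left out.

```python
import re
from typing import List, Tuple

# Numbers (including simple fractions and decimals), words, or single symbols.
NUMBER = r"\d+(?:[./]\d+)?"
TOKEN_RE = re.compile(NUMBER + r"|[A-Za-z]+|[^\sA-Za-z\d]")


def tokenize(line: str) -> List[str]:
    """Split an ingredient line into tokens."""
    return TOKEN_RE.findall(line)


def token_features(tokens: List[str], i: int) -> List[str]:
    """Build illustrative string features for the token at position i."""
    tok = tokens[i]
    feats = [
        "bias",
        "lower=" + tok.lower(),
        "is_numeric=" + str(bool(re.fullmatch(NUMBER, tok))),
        "is_title=" + str(tok.istitle()),
    ]
    # Context features: neighbouring tokens, or sequence boundary markers.
    if i > 0:
        feats.append("prev=" + tokens[i - 1].lower())
    else:
        feats.append("BOS")
    if i < len(tokens) - 1:
        feats.append("next=" + tokens[i + 1].lower())
    else:
        feats.append("EOS")
    return feats


def tag_line(line: str, tagger) -> List[Tuple[str, str]]:
    """Tokenize, featurize, tag, and return (token, label) pairs."""
    tokens = tokenize(line)
    if not tokens:
        return []
    xseq = [token_features(tokens, i) for i in range(len(tokens))]
    labels = list(tagger.tag(xseq))  # assumed interface: one label per token
    if len(labels) != len(tokens):
        raise ValueError("tagger returned a different number of labels")
    return list(zip(tokens, labels))
```

Whatever features are used at serving time have to match the ones used during training; otherwise the model sees attributes it has no weights for.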

Once the Flask app is ready, deploying it with Gunicorn is simple.
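For example, Gunicorn can read its settings from a Python configuration file. The sketch below uses real Gunicorn settings (`bind`, `workers`, `timeout`, `accesslog`), but the values are placeholder assumptions and should be tuned for the actual service.

```python
# gunicorn.conf.py -- a minimal sketch; all values are assumptions.
import multiprocessing

# Address and port to listen on (placeholder).
bind = "127.0.0.1:8000"

# A common starting point: two workers per core, plus one.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds before an unresponsive worker is restarted.
timeout = 30

# "-" sends the access log to stdout.
accesslog = "-"
```

Assuming the Flask application object is named `app` and lives in `app.py` (both hypothetical names), the server can then be started with `gunicorn -c gunicorn.conf.py app:app`. Each worker is a separate process and would typically load its own copy of the model, so memory use grows with the worker count.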

Since a CRF is a statistical model, it requires the modeler to understand the relationships between variables, and as a result 90% of the time is spent preparing data for training and testing. In other words, it’s time consuming. These models can be used as a stepping stone towards building unsupervised learning algorithms, as well as search relevance, recommendation, shopping cart and buy button use cases.

You can try the API with different inputs at Mashape (registration required).