A popular way to productionize a statistical model is to expose it as a REST API, so that it can be scaled horizontally and run cost-effectively. In this post I’ll discuss the steps involved, without implementation details.
In my previous post I discussed how to build a simple tagger using CRFSuite. The goal of the tagger is to convert unstructured data into structured data by tagging entities. I took ‘Food and Recipes’ as my domain and identified four important entities that are required to describe a recipe.
QTY – Quantity, number of units required. Usually numbers.
UNIT – Unit of measure, such as teaspoon, pinch, bottle, cup etc.
NAME – Name of the ingredient, for example sugar, almond, chicken, milk etc.
COM – Comment about the ingredient, for example crushed, finely chopped, powdered etc.
Since CRF is a statistical model, it requires the modeler to understand the relation between variables, so roughly 90% of the time goes into preparing data for training and testing. In other words, it’s time consuming. These models can serve as a stepping stone towards unsupervised learning algorithms and towards use cases such as search relevance, recommendation, shopping cart and buy button.
A three-column, tab-separated file is required for chunking.
Column 1 – Token
Column 2 – POS tag
Column 3 – Label (done manually)
Each token in an ingredient list gets a line in the TSV file, and a blank line is left to separate ingredients.
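To make the layout concrete, here is a minimal sketch of a reader that groups such a file back into one sequence per ingredient. The function name `read_sequences` and the sample rows (including their POS tags) are made up for illustration and are not taken from the actual data set.

```python
from typing import Iterable, List, Tuple

Row = Tuple[str, str, str]  # (token, POS tag, label)


def read_sequences(lines: Iterable[str]) -> List[List[Row]]:
    """Group three-column TSV lines into one sequence per ingredient."""
    sequences: List[List[Row]] = []
    current: List[Row] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current ingredient.
            if current:
                sequences.append(current)
                current = []
            continue
        token, pos, label = line.split("\t")
        current.append((token, pos, label))
    if current:
        # The last ingredient may not be followed by a blank line.
        sequences.append(current)
    return sequences


# Hypothetical sample: two ingredients separated by a blank line.
sample = (
    "2\tCD\tQTY\n"
    "cups\tNNS\tUNIT\n"
    "sugar\tNN\tNAME\n"
    "\n"
    "1\tCD\tQTY\n"
    "pinch\tNN\tUNIT\n"
    "salt\tNN\tNAME\n"
)
sequences = read_sequences(sample.splitlines())
print(len(sequences))
```

A reader like this is also a cheap sanity check: a row with a missing column fails loudly at the `split` instead of silently producing a malformed training file.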
The following script generates data in the required format, taking the JSON lines file mentioned above as input.
Note that XXX is just a placeholder, which will be replaced by the actual label (i.e. one of QTY, UNIT, COM, NAME, OTHERS).
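As a rough illustration of this step, a sketch along the following lines would produce the three columns with XXX in the label position. It is not the original script: the `ingredients` field name, the whitespace tokenization and the `placeholder_tagger` function are all assumptions made so that the example runs on its own.

```python
import json
from typing import Callable, Iterable, Iterator, List, Tuple

# A tagger takes a list of tokens and returns (token, POS tag) pairs.
Tagger = Callable[[List[str]], List[Tuple[str, str]]]


def placeholder_tagger(tokens: List[str]) -> List[Tuple[str, str]]:
    """Stand-in POS tagger (assumption): digits become CD, the rest NN."""
    return [(t, "CD" if t.isdigit() else "NN") for t in tokens]


def to_tsv_lines(
    json_lines: Iterable[str],
    tagger: Tagger = placeholder_tagger,
    field: str = "ingredients",  # hypothetical field name
) -> Iterator[str]:
    """Yield 'token<TAB>POS<TAB>XXX' lines, with a blank line per ingredient."""
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue
        record = json.loads(raw)
        for ingredient in record.get(field, []):
            tokens = ingredient.split()  # naive whitespace tokenization
            if not tokens:
                continue
            for token, pos in tagger(tokens):
                yield f"{token}\t{pos}\tXXX"
            yield ""  # blank line separates ingredients


# Hypothetical input: one JSON object per line.
demo = ['{"ingredients": ["2 cups sugar", "1 pinch salt"]}']
lines = list(to_tsv_lines(demo))
print("\n".join(lines))
```

In practice a real tagger would be passed in instead of the stand-in, for example NLTK's `pos_tag`, which takes a token list and returns (token, tag) pairs.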
I manually labeled each token with the help of OpenRefine. Skip this step if you are tagging with a model that is already available.
In the end, the file should look similar to the table shown below.
The next task is chunking, and it is explained well here.
The same POS and token-position features discussed in the tutorial are used as features in this experiment as well, so the util script provided in the CRFSuite repository can be used to generate the chunks.
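To give an idea of what that step produces, here is a simplified sketch of positional token and POS features written in the tab-separated layout CRFsuite reads for training (label first, then attributes). It is not the repository script itself: the window size, the helper names and the sample sequence are assumptions, and the real script emits a richer feature set.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (token, POS tag, label)


def escape(value: str) -> str:
    # CRFsuite reads ':' inside an attribute as the start of a weight,
    # so it is replaced here with a harmless marker.
    return value.replace(":", "__COLON__")


def item_features(seq: List[Row], i: int) -> List[str]:
    """Token and POS attributes in a window of two positions around item i."""
    feats: List[str] = []
    for offset in (-2, -1, 0, 1, 2):
        j = i + offset
        if 0 <= j < len(seq):
            feats.append(f"w[{offset}]={escape(seq[j][0])}")
            feats.append(f"pos[{offset}]={escape(seq[j][1])}")
    if i == 0:
        feats.append("__BOS__")  # beginning-of-sequence marker
    if i == len(seq) - 1:
        feats.append("__EOS__")  # end-of-sequence marker
    return feats


def to_crfsuite_lines(seq: List[Row]) -> List[str]:
    """One line per token: label, then tab-separated attributes."""
    return ["\t".join([seq[i][2]] + item_features(seq, i)) for i in range(len(seq))]


# Hypothetical labeled ingredient.
seq = [("2", "CD", "QTY"), ("cups", "NNS", "UNIT"), ("sugar", "NN", "NAME")]
lines = to_crfsuite_lines(seq)
print("\n".join(lines))
```

When writing several ingredients to one file, a blank line between sequences keeps them separate, just as in the TSV input.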