Text Classification – Classifying product titles using Convolutional Neural Network and Word2Vec embedding

Text classification helps us better understand and organize data. I built a simple CNN classifier using Keras, with TensorFlow as the backend, to classify products available on eCommerce sites. The data for this experiment are product titles from three distinct categories on a popular eCommerce site. Reference: Tutorial

tl;dr

Python notebook and data

Collecting Data

For this experiment I’ve collected product titles belonging to the following categories.

  • Women’s clothing
  • Cameras
  • Home appliances

Since these categories are distinct, meaning they have no overlap of contextual information, our model should make few classification errors and perform well. I've tried to implement 2 proven CNN architectures with Word2Vec embedding.

Setup

We need the following libraries

  • Gensim
  • Keras
  • NLTK
  • Pandas
  • Numpy
  • Tensorflow

and

  • Conda to manage virtual environment
  • Pre-trained vectors trained on the Google News dataset (1.5 GB download) for the Word2Vec embedding.

Loading Data

Download the data. It is important to make sure that the data doesn’t have any null/NaN values.

Make sure there are no null values in the datasets
Has null values:  False
Has null values:  False
Has null values:  False

Preprocessing

Stop words (frequently occurring words that distract more than they inform) are removed first. Then we use classes provided by Keras to prepare the text so that neural network models can use it.
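As a rough illustration of the stop-word step, here is a minimal sketch in plain Python. The STOP_WORDS set and the remove_stop_words helper are made up for this example; the experiment itself relies on NLTK, whose English stop-word list is much longer.

```python
# Hypothetical, tiny stop-word list for illustration only;
# NLTK's English list (used in the experiment) is much longer.
STOP_WORDS = {"for", "with", "and", "the", "a", "of"}


def remove_stop_words(title, stop_words=STOP_WORDS):
    """Lowercase a title and drop every token found in stop_words."""
    tokens = title.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)


print(remove_stop_words("Sports Action Camera with Remote for the Bike"))
# sports action camera remote bike
```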

To prepare the vector (array of integers) representation of the text:

  • Combine titles from all three categories to obtain a list of texts.
  • Drop duplicates.
  • Initialize the tokenizer with num_words = MAX_NB_WORDS (200K), i.e. the tokenizer counts the words, sorts them by number of occurrences in descending order and keeps the top N words, 200K in this case.
  • Use the tokenizer’s texts_to_sequences method to convert each text to an array of integers.
  • The arrays obtained from the previous step might not be of uniform length; use the pad_sequences method to obtain arrays of length MAX_SEQUENCE_LENGTH (30).

word_index assigns a unique integer ID to each word in the data. For example:

word		id
--------------------
sports		16
action		13
spy		7
pen		55
camera		2

The tokenizer replaces each word with its unique integer ID to get a vector representation of the title. Example:

Text to Vector [[16, 13, 2], [7, 55, 2]]
Padded Vector [[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0 16 13  2]
 [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  7 55  2]]
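The two steps above can be mimicked in plain Python to see what happens to a title. This is only a sketch of the behaviour, not the Keras implementation: the two helper functions are simplified stand-ins, and the titles "sports action camera" and "spy pen camera" are inferred from the IDs in the example.

```python
# word_index copied from the example table above.
word_index = {"sports": 16, "action": 13, "spy": 7, "pen": 55, "camera": 2}


def texts_to_sequences(texts, word_index):
    """Replace each known word with its integer ID; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]


def pad_sequences(sequences, maxlen):
    """Left-pad with zeros and keep the last maxlen IDs, like Keras' defaults."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]
        padded.append([0] * (maxlen - len(seq)) + seq)
    return padded


MAX_SEQUENCE_LENGTH = 30
sequences = texts_to_sequences(["sports action camera", "spy pen camera"], word_index)
padded = pad_sequences(sequences, MAX_SEQUENCE_LENGTH)

print(sequences)       # [[16, 13, 2], [7, 55, 2]]
print(padded[0][-5:])  # [0, 0, 16, 13, 2]
```

Zeros go in front because Keras pads (and truncates) at the start of a sequence by default, which matches the padded vectors shown above.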

Product titles belonging to the three categories have been kept separate so far for the sake of understanding. To prepare the input layer, all three categories are combined and shuffled as shown below.

The category (y-axis or label) is converted to a format the convnet understands (a one-hot vector) using the keras.utils method to_categorical. Example:

clothing: 		 [[ 1.  0.  0.]]
camera: 		 [[ 0.  1.  0.]]
home appliances: 	 [[ 0.  0.  1.]]
clothing shape:  (392721, 30)
electronics shape:  (1347, 30)
home appliances shape:  (11425, 30)
----------
combined data shape:  (405493, 30)
combined category/label shape:  (405493, 3)
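For intuition, one-hot encoding can be written in a few lines of plain Python. The to_one_hot helper below is a hypothetical stand-in for to_categorical, and the class order (0 = clothing, 1 = camera, 2 = home appliances) is assumed from the example output above.

```python
def to_one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector


# Class order assumed from the example output above.
CATEGORIES = ["clothing", "camera", "home appliances"]

for index, name in enumerate(CATEGORIES):
    print(name, to_one_hot(index, len(CATEGORIES)))
```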

The data is shuffled and split, since the categories are stacked one after the other. nb_validation_samples is the index that separates the training and testing/validation sets. This step can be simplified with train_test_split from scikit-learn.
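A minimal sketch of the shuffle-and-split step in plain Python follows. The function name, the seed and the toy data are assumptions made for illustration. The 0.4 validation fraction is also an inference: it is not stated in the post, but it reproduces the 243296/162197 split that appears in the training logs.

```python
import random


def shuffle_and_split(data, labels, validation_split, seed=0):
    """Shuffle data and labels together, then cut off the tail for validation."""
    indices = list(range(len(data)))
    random.Random(seed).shuffle(indices)
    data = [data[i] for i in indices]
    labels = [labels[i] for i in indices]

    nb_validation_samples = int(validation_split * len(data))
    split_at = len(data) - nb_validation_samples
    return data[:split_at], labels[:split_at], data[split_at:], labels[split_at:]


# Toy data: ten "titles" whose label is derived from the value itself.
titles = list(range(10))
categories = [t % 3 for t in titles]
x_train, y_train, x_val, y_val = shuffle_and_split(titles, categories, 0.4)
print(len(x_train), len(x_val))  # 6 4

# The same fraction applied to the real data set size:
print(int(0.4 * 405493))         # 162197
```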

word2vec embedding

Word2Vec brings in semantic similarity information that the convnets can leverage. This experiment uses pre-trained vectors from Google News. Another option is GloVe.

Found 3000000 word vectors of word2vec

The following examples should help illustrate the intent behind using pre-trained word2vec vectors.

Odd word out: carrot
----------
Cosine similarity between TV and HBO: 0.613064891522
----------
Most similar words to Computers: computer, laptops, PCs, laptop_computers, desktop_computers, Computers, laptop, notebook_computers, Dell_OptiPlex_desktop, automated_seismographs
----------
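The similarity score in the output above is a cosine similarity between two word vectors. As a sketch of what that measures, here it is in plain Python on made-up 3-dimensional vectors (the real Google News vectors have 300 dimensions, and the values below are not taken from them).

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, for illustration only.
tv = [1.0, 2.0, 0.0]
hbo = [2.0, 1.0, 0.0]
print(round(cosine_similarity(tv, hbo), 3))  # 0.8
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0.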

A Keras embedding layer can be obtained with Gensim Word2Vec’s word2vec.get_keras_embedding(train_embeddings=False) method or constructed as shown below. The null word embeddings count is the number of words not found in our pre-trained vectors (in this case, Google News). In this context, these are possibly brand-specific words.

Null word embeddings: 1473
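One way to build the embedding matrix by hand is sketched below in plain Python. Everything in it is a stand-in: the pretrained dictionary, the 3-dimensional vectors and the brand name "acmecam" are invented for the example, and the real notebook may count null rows differently. Note that this count includes row 0, which is reserved for padding.

```python
EMBEDDING_DIM = 3  # the experiment uses 300-dimensional Google News vectors

# Hypothetical stand-in for the pre-trained vectors (word -> vector).
pretrained = {
    "camera": [0.1, 0.2, 0.3],
    "sports": [0.4, 0.5, 0.6],
}
# "acmecam" is a made-up brand name that has no pre-trained vector.
word_index = {"camera": 1, "sports": 2, "acmecam": 3}


def build_embedding_matrix(word_index, pretrained, dim):
    """Row i holds the vector of the word with ID i; row 0 stays zero (padding)."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = list(vector)
    return matrix


matrix = build_embedding_matrix(word_index, pretrained, EMBEDDING_DIM)
null_rows = sum(1 for row in matrix if not any(row))
print("Null word embeddings:", null_rows)  # 2 (padding row + "acmecam")
```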

Model

I recommend this (30 min) video about how convnets work to understand the layers. Below are replications of 2 proven architectures. More can be found here.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 30, 300)           817200    
_________________________________________________________________
dropout_9 (Dropout)          (None, 30, 300)           0         
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 14, 300)           270300    
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 6, 150)            135150    
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 2, 75)             33825     
_________________________________________________________________
flatten_3 (Flatten)          (None, 150)               0         
_________________________________________________________________
dropout_10 (Dropout)         (None, 150)               0         
_________________________________________________________________
dense_9 (Dense)              (None, 150)               22650     
_________________________________________________________________
dropout_11 (Dropout)         (None, 150)               0         
_________________________________________________________________
dense_10 (Dense)             (None, 3)                 453       
=================================================================
Total params: 1,279,578
Trainable params: 462,378
Non-trainable params: 817,200
_________________________________________________________________
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_2 (Embedding)      (None, 30, 300)           817200    
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 28, 250)           225250    
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 250)               0         
_________________________________________________________________
dense_11 (Dense)             (None, 250)               62750     
_________________________________________________________________
dropout_12 (Dropout)         (None, 250)               0         
_________________________________________________________________
activation_5 (Activation)    (None, 250)               0         
_________________________________________________________________
dense_12 (Dense)             (None, 3)                 753       
_________________________________________________________________
activation_6 (Activation)    (None, 3)                 0         
=================================================================
Total params: 1,105,953
Trainable params: 288,753
Non-trainable params: 817,200
_________________________________________________________________
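The output shapes and parameter counts in the two summaries can be checked with a little arithmetic. The sketch below assumes 'valid' padding, and a kernel size of 3 with a stride of 2 for the first model (stride 1 for the second); these settings are inferred from the numbers in the summaries, not stated in the post.

```python
def conv1d_output_length(input_length, kernel_size, stride=1):
    """Output length of a 1-D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1


def conv1d_params(in_channels, filters, kernel_size):
    """One weight per (kernel position, input channel, filter) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


def dense_params(inputs, units):
    """One weight per (input, unit) pair plus one bias per unit."""
    return inputs * units + units


# First model: three Conv1D layers, kernel size 3 and stride 2 (inferred).
length, channels = 30, 300
for filters in (300, 150, 75):
    params = conv1d_params(channels, filters, 3)
    length = conv1d_output_length(length, 3, stride=2)
    print((length, filters), params)
    channels = filters
# (14, 300) 270300
# (6, 150) 135150
# (2, 75) 33825

# Second model: one Conv1D layer, kernel size 3 and stride 1 (inferred).
print(conv1d_output_length(30, 3), conv1d_params(300, 250, 3))  # 28 225250
print(dense_params(250, 250), dense_params(250, 3))             # 62750 753
```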
Train on 243296 samples, validate on 162197 samples
Epoch 1/5
243296/243296 [==============================] - 22s 92us/step - loss: 0.1106 - acc: 0.9768 - val_loss: 0.1090 - val_acc: 0.9773
Epoch 2/5
243296/243296 [==============================] - 24s 97us/step - loss: 0.1102 - acc: 0.9770 - val_loss: 0.1091 - val_acc: 0.9775
Epoch 3/5
243296/243296 [==============================] - 21s 86us/step - loss: 0.1102 - acc: 0.9770 - val_loss: 0.1080 - val_acc: 0.9774
Epoch 4/5
243296/243296 [==============================] - 23s 93us/step - loss: 0.1096 - acc: 0.9772 - val_loss: 0.1088 - val_acc: 0.9776
Epoch 5/5
243296/243296 [==============================] - 24s 98us/step - loss: 0.1098 - acc: 0.9773 - val_loss: 0.1097 - val_acc: 0.9773
Test loss: 0.10969909843
Test accuracy: 0.977305375562
Train on 243296 samples, validate on 162197 samples
Epoch 1/5
243296/243296 [==============================] - 13s 52us/step - loss: 8.3458e-04 - acc: 0.9999 - val_loss: 9.0927e-04 - val_acc: 0.9999
Epoch 2/5
243296/243296 [==============================] - 12s 48us/step - loss: 7.2089e-04 - acc: 0.9999 - val_loss: 0.0011 - val_acc: 0.9999
Epoch 3/5
243296/243296 [==============================] - 12s 49us/step - loss: 7.2221e-04 - acc: 1.0000 - val_loss: 0.0012 - val_acc: 0.9999
Epoch 4/5
243296/243296 [==============================] - 12s 51us/step - loss: 7.1913e-04 - acc: 0.9999 - val_loss: 0.0010 - val_acc: 0.9999
Epoch 5/5
243296/243296 [==============================] - 12s 49us/step - loss: 6.7104e-04 - acc: 1.0000 - val_loss: 0.0011 - val_acc: 0.9999
Test loss: 0.00113550592472
Test accuracy: 0.999895189184

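model_1 is the better of the two: in the logs above, one run reaches a test accuracy of about 0.9999 while the other plateaus near 0.9773. Below is an example of how to use this model for prediction.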

----------
Predicted category:  camera
----------
Clothing Probability:  5.12844e-21
Camera Probability:  0.505056
home appliances probability:  5.71945e-23
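Turning the model's output into a category name is a matter of picking the index with the highest probability. A minimal sketch, using the three probabilities printed above and a hypothetical predicted_category helper (the class order is assumed from the one-hot example):

```python
# Class order assumed from the one-hot example earlier in the post.
categories = ["clothing", "camera", "home appliances"]
# Probabilities copied from the prediction output above.
probabilities = [5.12844e-21, 0.505056, 5.71945e-23]


def predicted_category(probabilities, categories):
    """Return the category whose probability is the largest."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return categories[best]


print("Predicted category: ", predicted_category(probabilities, categories))
```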

Conclusion

My observation is that with neural networks, the time taken for feature engineering is considerably reduced, and researchers spend most of their time deciding on the architecture of the network layers. Word2Vec embedding contributes greatly to the accuracy of the model.


Locality sensitive hashing (LSH) – Map-Reduce in Python

I’ll try to explain LSH with the help of Python code and the map-reduce technique.

There is a remarkable connection between minhashing and the Jaccard similarity of the sets that are minhashed. [Chapter 3, 3.3.3, Mining of Massive Datasets]

Jaccard similarity

The Jaccard index J of two sets A and B is the size of their intersection divided by the size of their union:

J(A, B) = |A ∩ B| / |A ∪ B|

J = 0 if A and B are disjoint
J = 1 if A and B are identical

Example:

>>> a = {'nike', 'running', 'shoe'}
>>> b = {'nike', 'black', 'running', 'shoe'}
>>> c = {'nike', 'blue', 'jacket'}
>>> float(len(a.intersection(b))) / len(a.union(b))  # a and b are similar.
0.75
>>> float(len(a.intersection(c))) / len(a.union(c))  # a and c are... meh..
0.2
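The same calculation can be wrapped in a small reusable function. This is a sketch; the jaccard name and the choice to treat two empty sets as identical are assumptions made here, not part of the original example.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets count as identical
    return float(len(a & b)) / len(union)


a = {'nike', 'running', 'shoe'}
b = {'nike', 'black', 'running', 'shoe'}
c = {'nike', 'blue', 'jacket'}
print(jaccard(a, b))  # 0.75
print(jaccard(a, c))  # 0.2
```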

Minhashing

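Minhashing compresses each set into a short signature of minimum hash values. The probability that two sets collide, i.e. get the same minhash value, is higher for similar sets.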

Table 1: Matrix representation of sets

keyword   x   a   b   c
-----------------------
nike      1   1   1   1
running   2   1   1   0
shoe      3   1   1   0
black     4   0   1   0
blue      5   0   0   1
jacket    6   0   0   1

Table 2: Signature Matrix with hash values

Hash function            a            b              c
------------------------------------------------------------------
h1(x) = (x + 1) mod 6    min(2,3,4)   min(2,3,4,5)   min(2,0,1)
h2(x) = (3x + 1) mod 6   min(4,1,4)   min(4,1,4,1)   min(4,4,1)

which becomes,

Table 3: Signature matrix with minhash values

Hash function            a   b   c
----------------------------------
h1(x) = (x + 1) mod 6    2   2   0
h2(x) = (3x + 1) mod 6   1   1   1

From Table 3 we can infer that sets a and b are similar.
The true similarity of a and b, from Table 1, is 3/4 = 0.75.
The similarity of a and b estimated from the signature matrix in Table 3 is 2/2 = 1.

The fraction from the signature matrix is only an estimate of the true Jaccard similarity. With just two hash functions the estimate is coarse; it gets closer to the true value as more hash functions are used.
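Tables 1 to 3 can be reproduced with a few lines of Python. This sketch uses the row numbers and the two hash functions from the tables; the helper names minhash_signature and estimated_similarity are made up for this example.

```python
# Row numbers (x) from Table 1.
rows = {"nike": 1, "running": 2, "shoe": 3, "black": 4, "blue": 5, "jacket": 6}
sets = {
    "a": {"nike", "running", "shoe"},
    "b": {"nike", "black", "running", "shoe"},
    "c": {"nike", "blue", "jacket"},
}
# h1 and h2 from Table 2.
hash_functions = [
    lambda x: (x + 1) % 6,
    lambda x: (3 * x + 1) % 6,
]


def minhash_signature(words):
    """For each hash function, keep the smallest hash over the set's rows."""
    ids = [rows[w] for w in words]
    return [min(h(x) for x in ids) for h in hash_functions]


def estimated_similarity(sig1, sig2):
    """Fraction of signature positions on which the two signatures agree."""
    matches = sum(1 for s, t in zip(sig1, sig2) if s == t)
    return float(matches) / len(sig1)


signatures = {name: minhash_signature(words) for name, words in sets.items()}
print(signatures["a"], signatures["b"], signatures["c"])       # [2, 1] [2, 1] [0, 1]
print(estimated_similarity(signatures["a"], signatures["b"]))  # 1.0
print(estimated_similarity(signatures["a"], signatures["c"]))  # 0.5
```

With only two hash functions the estimate for a and c (0.5) is far from the true value of 0.2, which is why real signatures use many more.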

Map-Reduce

Mapper

sample_dict.txt holds the word-to-id mapping.

  • For every line in the input file:
    • Split the text and convert it to an array of ids using the word-to-id mapping file.
    • For every hash function, compute the minimum hash value over those ids.
    • Split the array of minhash values into equally sized chunks, a.k.a. bands.
    • Assign an id to each band and emit the band-id, the hash of the band and the doc-id.

Reducer

  • group by band-hash and band-id to get list of similar doc-ids.
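Before looking at the streaming scripts, it may help to see the mapper and reducer logic collapsed into one in-memory function. This is a sketch under assumptions: the signatures are made up, the function names are hypothetical, and bands are kept as tuples and used directly as dictionary keys instead of being hashed.

```python
from collections import defaultdict


def bands(signature, rows_per_band):
    """Split a minhash signature into consecutive, equally sized tuples."""
    for start in range(0, len(signature), rows_per_band):
        yield tuple(signature[start:start + rows_per_band])


def candidate_groups(signatures, rows_per_band):
    """Group doc-ids that share the same values in at least one band."""
    buckets = defaultdict(list)
    for doc_id, signature in enumerate(signatures):
        for band_id, band in enumerate(bands(signature, rows_per_band)):
            buckets[(band_id, band)].append(doc_id)  # "mapper": emit key and doc-id
    # "reducer": keep only the keys shared by more than one document
    return {key: docs for key, docs in buckets.items() if len(docs) > 1}


# Made-up signatures with four minhash values each (two bands of two).
signatures = [
    [2, 1, 7, 4],
    [2, 1, 9, 3],
    [5, 0, 9, 3],
]
print(candidate_groups(signatures, rows_per_band=2))
# {(0, (2, 1)): [0, 1], (1, (9, 3)): [1, 2]}
```

Documents 0 and 1 agree on band 0, documents 1 and 2 agree on band 1; documents 0 and 2 never share a band, so they are not compared at all.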

Mapper Code

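# lsh_mapper.py
from __future__ import print_function  # print() then behaves the same on Python 2 and 3

__author__ = 'raj'

import random
import sys

word_ids = dict()
num_hashes = 10
num_per_band = 2

# a_hash and b_hash must be identical on every node in a distributed env,
# so seed the generator with a fixed (arbitrary) value, or load the
# coefficients from a shared file, instead of drawing fresh ones per process.
random.seed(42)
a_hash = [random.randrange(sys.maxsize) for _ in range(num_hashes)]
b_hash = [random.randrange(sys.maxsize) for _ in range(num_hashes)]


def min_hash_fn(a, b, sig):
    # minimum of (a*x + b) mod N over all word ids of the document
    hashes = [((a * x) + b) % len(word_ids) for x in sig]
    return min(hashes)


def get_min_hash_row(sig):
    # one minhash value per (a, b) hash function
    return [min_hash_fn(a, b, sig) for a, b in zip(a_hash, b_hash)]


def get_band(l, n):
    # tuples keep order and duplicates; with a frozenset the bands
    # (1, 2) and (2, 1) would collide
    for i in range(0, len(l), n):
        yield tuple(l[i:i+n])


# load the word-to-id mapping, one "word id" pair per line
with open("sample_dict.txt") as dict_file:
    for line in dict_file:
        word, wid = line.split()
        word_ids[word] = int(wid)

for doc_id, doc in enumerate(sys.stdin):
    words = doc.strip().lower().split()

    # ids of known words only; a list, since it is iterated once per hash function
    signature = [word_ids[w] for w in words if w in word_ids]
    if not signature:
        continue  # no known words: min() over an empty list would fail

    min_hash_row = get_min_hash_row(signature)

    banded = get_band(min_hash_row, num_per_band)

    for band_id, band in enumerate(banded):
        print("%d\t%d\t%d" % (band_id, hash(band), doc_id))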

Reducer Code

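# lsh_reducer.py
from __future__ import print_function  # print() then behaves the same on Python 2 and 3

__author__ = 'raj'

import sys

prev_band_id, prev_band_hash = None, None
cluster = []
cid = 0

for line in sys.stdin:
    band_id, band_hash, doc_id = line.strip().split("\t")

    if prev_band_id is None and prev_band_hash is None:
        prev_band_id, prev_band_hash = band_id, band_hash

    # compare by value (==); "is" tests object identity and is unreliable for strings
    if prev_band_id == band_id:
        if prev_band_hash == band_hash:
            cluster.append(doc_id)
        else:
            # same band, new hash: the previous cluster is complete
            print(cid, cluster)
            cluster = [doc_id]
    else:
        # new band: flush the cluster and move to the next cluster id
        print(cid, cluster)
        cluster = [doc_id]
        cid += 1
    prev_band_id, prev_band_hash = band_id, band_hash

# flush the last cluster, which the loop itself never prints
if cluster:
    print(cid, cluster)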

In action

sample_input.txt

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Nike Airmax Running SHoe
Corduroy Shorts - Flat Front (For Men) BEIGE
Nokia Lumia 721
Corduroy Shorts - Flat Front (For Men) BROWN

sample_dict.txt

&	1
(for	2
-0	3
1-14	4
12-	5
14	6
2-piece	7
721	8
airmax	9
and	10
beige	11
blue	12
brown	13
corduroy	14
corduroys	15
denim	16
doll	17
dot	18
dress	19
fashion	20
flat	21
flower	22
front	23
inch	24
jumper	25
leggings	26
lumia	27
me	28
men)	29
nike	30
nokia	31
outfit	32
piece	33
pink	34
polka	35
running	36
shirt	37
shoe	38
shorts	39
slate	40
teal	41
top	42
white	43
with	44
you	45
-	46

Command

$ cat sample_input.txt | python lsh_mapper.py | sort | python lsh_reducer.py

Output

0 ['1', '2']
0 ['0']
0 ['5', '7']
0 ['6']
0 ['3']
0 ['4']
1 ['4']
1 ['6']
1 ['0']
1 ['2']
1 ['3', '5', '7']
1 ['1']
2 ['6']
2 ['4']
2 ['0', '1', '2']
2 ['3', '5', '7']
3 ['6']
3 ['3', '5', '7']
3 ['0', '1', '2']
3 ['4']
4 ['0', '1']
4 ['3']
4 ['5']
4 ['4']
4 ['2']
4 ['7']

Resolved output (doc-ids mapped back to titles; single-document clusters omitted)

band 0
------
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 1
------
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 2
------
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

band 3
------
Corduroy Shorts - Flat Front (For Men) SLATE BLUE
Corduroy Shorts - Flat Front (For Men) BEIGE
Corduroy Shorts - Flat Front (For Men) BROWN

You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt
You & Me 1-14 inch Doll Piece Fashion Outfit - Flower Dress and Leggings pink

band 4
------
You & Me 1-14 inch Doll Piece Outfit - Teal Corduroys with Top white
You & Me 12- 14 inch 2-Piece Doll Fashion Outfit - Polka Dot Denim Dress Jumper with White Shirt

code here