Text classification helps us better understand and organize data. I’ve built a simple CNN classifier using Keras with TensorFlow as the backend to classify products available on eCommerce sites. The data for this experiment are product titles from three distinct categories on a popular eCommerce site. Reference: Tutorial
tl;dr
Python notebook and data
Collecting Data
For this experiment I’ve collected product titles belonging to the following categories.
- Women’s clothing
- Cameras
- Home appliances
Since these categories are distinct, meaning they have little overlap in contextual information, our model should make few classification errors. I’ve implemented two proven CNN architectures with Word2Vec embeddings.
Setup
We need the following libraries
- Gensim
- Keras
- NLTK
- Pandas
- Numpy
- Tensorflow
and
- Conda to manage virtual environment
- Pre-trained vectors trained on the Google News dataset (1.5GB download) for the Word2Vec embedding.
import numpy as np
import pandas as pd
from gensim.models import KeyedVectors
from keras.layers import Flatten
from keras.layers import MaxPooling1D
from keras.models import Model
from keras.preprocessing.sequence import pad_sequences
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from nltk.corpus import stopwords

MAX_NB_WORDS = 200000
MAX_SEQUENCE_LENGTH = 30
EMBEDDING_DIM = 300
EMBEDDING_FILE = "../lib/GoogleNews-vectors-negative300.bin"
category_index = {"clothing":0, "camera":1, "home-appliances":2}
category_reverse_index = dict((y,x) for (x,y) in category_index.items())
STOPWORDS = set(stopwords.words("english"))
Loading Data
Download the data. It is important to make sure that the data doesn’t have any null/NaN values.
clothing = pd.read_csv("clothing.tsv", sep='\t')
cameras = pd.read_csv("cameras.tsv", sep='\t')
home_appliances = pd.read_csv("home.tsv", sep='\t')
datasets = [clothing, cameras, home_appliances]
print("Make sure there are no null values in the datasets")
for data in datasets:
    print("Has null values: ", data.isnull().values.any())
Make sure there are no null values in the datasets
Has null values: False
Has null values: False
Has null values: False
Preprocessing
Stop words, i.e. words that occur frequently and are distracting, are removed first. Then we use classes provided by Keras to prepare the text so it can be used by neural network models.
def preprocess(text):
    # lowercase, split on whitespace and drop stop words
    text = text.strip().lower().split()
    text = filter(lambda word: word not in STOPWORDS, text)
    return " ".join(text)

for dataset in datasets:
    dataset['title'] = dataset['title'].apply(preprocess)
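To see what this preprocessing step does to a single title, here is a minimal sketch. It assumes a small hand-picked stop word set (`TOY_STOPWORDS`, a hypothetical stand-in) instead of the NLTK list, so that it runs without downloading the corpus; `preprocess_title` is likewise an illustrative name.

```python
# Hypothetical, tiny stop word set; the post uses NLTK's English list instead.
TOY_STOPWORDS = {"and", "the", "for", "with", "of"}

def preprocess_title(text, stopwords=TOY_STOPWORDS):
    """Lowercase a title, split on whitespace and drop stop words."""
    words = text.strip().lower().split()
    return " ".join(word for word in words if word not in stopwords)

cleaned = preprocess_title("Nikon Coolpix A10 Point and Shoot Camera (Black)")
print(cleaned)  # nikon coolpix a10 point shoot camera (black)
```

Punctuation such as the parentheses survives this step; the Keras Tokenizer strips it later through its default character filters.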
To prepare the vector (array of integers) representation of the text:
- Combine titles from all three categories to obtain a list of texts.
- Drop duplicates.
- Initialize the tokenizer with num_words = MAX_NB_WORDS (200K), i.e. the tokenizer performs a word count, sorts the words by number of occurrences in descending order and picks the top N words, 200K in this case.
- Use the tokenizer’s texts_to_sequences method to convert text to arrays of integers.
- The arrays obtained from the previous step might not be of uniform length; use the pad_sequences method to obtain arrays with length equal to MAX_SEQUENCE_LENGTH (30).
# Stack the three title columns into one Series; `+` would instead join the strings element-wise by row index
all_texts = pd.concat([clothing['title'], cameras['title'], home_appliances['title']], ignore_index=True)
all_texts = all_texts.drop_duplicates(keep=False)
tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
tokenizer.fit_on_texts(all_texts)
clothing_sequences = tokenizer.texts_to_sequences(clothing['title'])
electronics_sequences = tokenizer.texts_to_sequences(cameras['title'])
home_appliances_sequences = tokenizer.texts_to_sequences(home_appliances['title'])
clothing_data = pad_sequences(clothing_sequences, maxlen=MAX_SEQUENCE_LENGTH)
electronics_data = pad_sequences(electronics_sequences, maxlen=MAX_SEQUENCE_LENGTH)
home_appliances_data = pad_sequences(home_appliances_sequences, maxlen=MAX_SEQUENCE_LENGTH)
A word_index has a unique integer ID assigned to each word in the data. For example:
word_index = tokenizer.word_index
test_string = "sports action spy pen camera"
print("word\t\tid")
print("-" * 20)
for word in test_string.split():
    print("%s\t\t%s" % (word, word_index[word]))
word		id
--------------------
sports		16
action		13
spy		7
pen		55
camera		2
The tokenizer replaces each word with its unique integer ID to get a vector representation of the title. Example:
test_sequence = tokenizer.texts_to_sequences(["sports action camera", "spy pen camera"])
padded_sequence = pad_sequences(test_sequence, maxlen=MAX_SEQUENCE_LENGTH)
print("Text to Vector", test_sequence)
print("Padded Vector", padded_sequence)
Text to Vector [[16, 13, 2], [7, 55, 2]]
Padded Vector [[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16 13 2]
 [ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 55 2]]
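The word counting, ID assignment and left-padding shown above can be mimicked in a few lines of plain Python. This is only a rough sketch of the idea, not the Keras implementation; the helper names (`build_word_index`, `to_sequence`, `pad_left`) and the three toy titles are assumptions made for illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets ID 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    # sorted() is stable, so words with equal counts keep their first-seen order
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index):
    """Map each known word to its ID and silently skip unknown words."""
    return [word_index[word] for word in text.split() if word in word_index]

def pad_left(sequence, maxlen):
    """Left-pad with zeros, keeping only the last maxlen IDs if the sequence is too long."""
    trimmed = sequence[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

titles = ["spy pen camera", "sports action camera", "camera bag"]
toy_index = build_word_index(titles)
padded = [pad_left(to_sequence(title, toy_index), maxlen=5) for title in titles]
print(toy_index)
print(padded)
```

Because "camera" occurs in every toy title it receives ID 1, which mirrors why frequent words get small IDs in the real word index.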
Product titles belonging to all three categories are kept separate so far for the sake of understanding. To prepare the input layer, all three categories are combined together and shuffled as shown below.
The category (the label, or y-axis) is converted to a format the convnet understands (a one-hot vector) using the keras.utils function to_categorical. Example:
print("clothing: \t\t", to_categorical(category_index["clothing"], 3))
print("camera: \t\t", to_categorical(category_index["camera"], 3))
print("home appliances: \t", to_categorical(category_index["home-appliances"], 3))
clothing: [[ 1. 0. 0.]]
camera: [[ 0. 1. 0.]]
home appliances: [[ 0. 0. 1.]]
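One-hot encoding itself is simple enough to write by hand. The sketch below is an illustration in plain Python and not what to_categorical does internally; `one_hot` and `toy_category_index` are hypothetical names.

```python
def one_hot(label, num_classes):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError("label must be in [0, num_classes)")
    return [1.0 if position == label else 0.0 for position in range(num_classes)]

toy_category_index = {"clothing": 0, "camera": 1, "home-appliances": 2}
encoded = {name: one_hot(idx, 3) for name, idx in toy_category_index.items()}
print(encoded)
```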
print("clothing shape: ", clothing_data.shape)
print("electronics shape: ", electronics_data.shape)
print("home appliances shape: ", home_appliances_data.shape)
data = np.vstack((clothing_data, electronics_data, home_appliances_data))
category = pd.concat([clothing['category'], cameras['category'], home_appliances['category']]).values
category = to_categorical(category)
print("-"*10)
print("combined data shape: ", data.shape)
print("combined category/label shape: ", category.shape)
clothing shape: (392721, 30)
electronics shape: (1347, 30)
home appliances shape: (11425, 30)
----------
combined data shape: (405493, 30)
combined category/label shape: (405493, 3)
The data is shuffled and split, since the categories are stacked one after the other. nb_validation_samples is the index that separates the training and validation sets. This step can be simplified with train_test_split from scikit-learn.
VALIDATION_SPLIT = 0.4
indices = np.arange(data.shape[0])  # get sequence of row index
np.random.shuffle(indices)          # shuffle the row indexes
data = data[indices]                # shuffle data/product-titles/x-axis
category = category[indices]        # shuffle labels/category/y-axis
nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
x_train = data[:-nb_validation_samples]
y_train = category[:-nb_validation_samples]
x_val = data[-nb_validation_samples:]
y_val = category[-nb_validation_samples:]
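The slicing arithmetic behind this split can be checked with a small standalone sketch. It works on row indices only, uses the standard library instead of NumPy, and fixes a seed (an assumption, the post does not seed its shuffle); `shuffle_split` is a hypothetical helper name.

```python
import random

def shuffle_split(num_rows, validation_split, seed=0):
    """Shuffle row indices and slice off the last chunk for validation."""
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_val = int(validation_split * num_rows)
    if num_val == 0:
        # indices[:-0] would be empty, so handle this case explicitly
        return indices, []
    return indices[:-num_val], indices[-num_val:]

# 405493 is the combined number of titles reported above
train_idx, val_idx = shuffle_split(405493, 0.4)
print(len(train_idx), len(val_idx))  # 243296 162197
```

With 405493 rows and a 0.4 split this gives 243296 training rows and 162197 validation rows, the same counts Keras reports at the start of training.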
Word2Vec Embedding
Word2Vec brings in semantic similarity information that the convnets can leverage. This experiment uses pre-trained vectors from Google News. Another option is GloVe.
word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE, binary=True)
print('Found %s word vectors of word2vec' % len(word2vec.vocab))
Found 3000000 word vectors of word2vec
The following examples should help explain the intent behind using a pre-trained word2vec model.
print("Odd word out:", word2vec.doesnt_match("banana apple grapes carrot".split()))
print("-"*10)
print("Cosine similarity between TV and HBO:", word2vec.similarity("tv", "hbo"))
print("-"*10)
print("Most similar words to Computers:", ", ".join(map(lambda x: x[0], word2vec.most_similar("computers"))))
print("-"*10)
Odd word out: carrot
----------
Cosine similarity between TV and HBO: 0.613064891522
----------
Most similar words to Computers: computer, laptops, PCs, laptop_computers, desktop_computers, Computers, laptop, notebook_computers, Dell_OptiPlex_desktop, automated_seismographs
----------
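The similarity score above is a cosine similarity between two word vectors. As a rough illustration of the formula, here is a sketch on made-up 3-dimensional vectors; the values in `toy_vectors` are invented for the example and are not taken from the Google News model, whose vectors have 300 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vectors: two camera-like words and one clothing word
toy_vectors = {
    "camera": [0.9, 0.1, 0.0],
    "dslr": [0.8, 0.2, 0.1],
    "saree": [0.0, 0.1, 0.9],
}
sim_close = cosine_similarity(toy_vectors["camera"], toy_vectors["dslr"])
sim_far = cosine_similarity(toy_vectors["camera"], toy_vectors["saree"])
print(round(sim_close, 3), round(sim_far, 3))
```

Vectors that point in a similar direction score close to 1, while unrelated ones score close to 0, which is the property the convnet can exploit.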
A Keras embedding layer can be obtained with Gensim Word2Vec’s word2vec.get_keras_embedding(train_embeddings=False) method or constructed as shown below. The null word embeddings count indicates the number of words not found in our pre-trained vectors (in this case Google News). In this context these could be brand-specific words.
from keras.layers import Embedding

word_index = tokenizer.word_index
nb_words = min(MAX_NB_WORDS, len(word_index))+1
embedding_matrix = np.zeros((nb_words, EMBEDDING_DIM))
for word, i in word_index.items():
    # skip IDs beyond the matrix (possible when the vocabulary exceeds MAX_NB_WORDS)
    if i < nb_words and word in word2vec.vocab:
        embedding_matrix[i] = word2vec.word_vec(word)
print('Null word embeddings: %d' % np.sum(np.sum(embedding_matrix, axis=1) == 0))
embedding_layer = Embedding(embedding_matrix.shape[0],  # or len(word_index) + 1
                            embedding_matrix.shape[1],  # or EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)
Null word embeddings: 1473
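To make the construction of the embedding matrix more concrete, here is a toy version with nested lists instead of NumPy arrays. The word index, the 2-dimensional "pre-trained" vectors and the helper name `build_embedding_matrix` are all assumptions chosen for illustration.

```python
def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with ID i; row 0 stays zero for padding."""
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
    return matrix

toy_word_index = {"camera": 1, "spy": 2, "coolpix": 3}
# "coolpix" is left out on purpose to play the role of an unknown brand name
toy_pretrained = {"camera": [0.9, 0.1], "spy": [0.2, 0.7]}

matrix = build_embedding_matrix(toy_word_index, toy_pretrained, dim=2)
null_rows = sum(1 for row in matrix if all(value == 0.0 for value in row))
print(len(matrix), null_rows)  # 4 2
```

Counting all-zero rows this way also counts the padding row at index 0, so the number of words that are really missing is one less than the printed value.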
Model
I recommend this (30 min) video about how convnets work to understand the layers. Below are replications of two proven architectures. More can be found here
from keras.models import Sequential
from keras.layers import Conv1D, GlobalMaxPooling1D, Flatten
from keras.layers import Dense, Input, LSTM, Embedding, Dropout, Activation

model = Sequential()
model.add(embedding_layer)
model.add(Dropout(0.2))
model.add(Conv1D(300, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(150, 3, padding='valid',activation='relu',strides=2))
model.add(Conv1D(75, 3, padding='valid',activation='relu',strides=2))
model.add(Flatten())
model.add(Dropout(0.2))
model.add(Dense(150,activation='sigmoid'))
model.add(Dropout(0.2))
model.add(Dense(3,activation='sigmoid'))
model.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 30, 300)           817200
_________________________________________________________________
dropout_9 (Dropout)          (None, 30, 300)           0
_________________________________________________________________
conv1d_9 (Conv1D)            (None, 14, 300)           270300
_________________________________________________________________
conv1d_10 (Conv1D)           (None, 6, 150)            135150
_________________________________________________________________
conv1d_11 (Conv1D)           (None, 2, 75)             33825
_________________________________________________________________
flatten_3 (Flatten)          (None, 150)               0
_________________________________________________________________
dropout_10 (Dropout)         (None, 150)               0
_________________________________________________________________
dense_9 (Dense)              (None, 150)               22650
_________________________________________________________________
dropout_11 (Dropout)         (None, 150)               0
_________________________________________________________________
dense_10 (Dense)             (None, 3)                 453
=================================================================
Total params: 1,279,578
Trainable params: 462,378
Non-trainable params: 817,200
_________________________________________________________________
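The output shapes and parameter counts in this summary follow from simple arithmetic. The sketch below recomputes them for the three strided convolutions and the two dense layers, assuming the standard formulas for 'valid' padding; the helper names are hypothetical.

```python
def conv1d_output_length(input_length, kernel_size, stride):
    """Output length of a 1D convolution with 'valid' padding."""
    return (input_length - kernel_size) // stride + 1

def conv1d_params(kernel_size, in_channels, filters):
    """Weights (kernel_size * in_channels * filters) plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input/unit pair plus one bias per unit."""
    return inputs * units + units

length, channels = 30, 300  # sequence length and embedding dimension
trainable = 0
for filters in (300, 150, 75):
    trainable += conv1d_params(3, channels, filters)
    length = conv1d_output_length(length, kernel_size=3, stride=2)
    channels = filters
    print(length, channels)

flattened = length * channels
trainable += dense_params(flattened, 150) + dense_params(150, 3)
print(flattened, trainable)  # 150 462378
```

The sequence length shrinks from 30 to 14, 6 and finally 2 time steps, so Flatten hands 2 * 75 = 150 values to the first dense layer.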
model_1 = Sequential()
model_1.add(embedding_layer)
model_1.add(Conv1D(250,3,padding='valid',activation='relu',strides=1))
model_1.add(GlobalMaxPooling1D())
model_1.add(Dense(250))
model_1.add(Dropout(0.2))
model_1.add(Activation('relu'))
model_1.add(Dense(3))
model_1.add(Activation('sigmoid'))
model_1.compile(loss='categorical_crossentropy',optimizer='rmsprop',metrics=['acc'])
model_1.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_2 (Embedding)      (None, 30, 300)           817200
_________________________________________________________________
conv1d_12 (Conv1D)           (None, 28, 250)           225250
_________________________________________________________________
global_max_pooling1d_3 (Glob (None, 250)               0
_________________________________________________________________
dense_11 (Dense)             (None, 250)               62750
_________________________________________________________________
dropout_12 (Dropout)         (None, 250)               0
_________________________________________________________________
activation_5 (Activation)    (None, 250)               0
_________________________________________________________________
dense_12 (Dense)             (None, 3)                 753
_________________________________________________________________
activation_6 (Activation)    (None, 3)                 0
=================================================================
Total params: 1,105,953
Trainable params: 288,753
Non-trainable params: 817,200
_________________________________________________________________
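In this second model the pooling layer reduces the (28, 250) feature map to 250 values by keeping, for every filter, its strongest response over all positions in the title. A minimal sketch of that operation on a hypothetical 4-step, 3-channel feature map:

```python
def global_max_pool_1d(feature_map):
    """feature_map is a list of time steps, each a list of channel activations."""
    return [max(channel) for channel in zip(*feature_map)]

# 4 time steps, 3 channels (made-up activations)
toy_feature_map = [
    [0.1, 0.0, 0.3],
    [0.7, 0.2, 0.0],
    [0.4, 0.9, 0.1],
    [0.0, 0.5, 0.2],
]
pooled = global_max_pool_1d(toy_feature_map)
print(pooled)  # [0.7, 0.9, 0.3]
```

The position of the strongest response is discarded, which suits product titles where a telling word such as "camera" can appear anywhere.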
model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=128)
score = model.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Train on 243296 samples, validate on 162197 samples
Epoch 1/5
243296/243296 [==============================] - 22s 92us/step - loss: 0.1106 - acc: 0.9768 - val_loss: 0.1090 - val_acc: 0.9773
Epoch 2/5
243296/243296 [==============================] - 24s 97us/step - loss: 0.1102 - acc: 0.9770 - val_loss: 0.1091 - val_acc: 0.9775
Epoch 3/5
243296/243296 [==============================] - 21s 86us/step - loss: 0.1102 - acc: 0.9770 - val_loss: 0.1080 - val_acc: 0.9774
Epoch 4/5
243296/243296 [==============================] - 23s 93us/step - loss: 0.1096 - acc: 0.9772 - val_loss: 0.1088 - val_acc: 0.9776
Epoch 5/5
243296/243296 [==============================] - 24s 98us/step - loss: 0.1098 - acc: 0.9773 - val_loss: 0.1097 - val_acc: 0.9773
Test loss: 0.10969909843
Test accuracy: 0.977305375562
model_1.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=128)
score = model_1.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
Train on 243296 samples, validate on 162197 samples
Epoch 1/5
243296/243296 [==============================] - 13s 52us/step - loss: 8.3458e-04 - acc: 0.9999 - val_loss: 9.0927e-04 - val_acc: 0.9999
Epoch 2/5
243296/243296 [==============================] - 12s 48us/step - loss: 7.2089e-04 - acc: 0.9999 - val_loss: 0.0011 - val_acc: 0.9999
Epoch 3/5
243296/243296 [==============================] - 12s 49us/step - loss: 7.2221e-04 - acc: 1.0000 - val_loss: 0.0012 - val_acc: 0.9999
Epoch 4/5
243296/243296 [==============================] - 12s 51us/step - loss: 7.1913e-04 - acc: 0.9999 - val_loss: 0.0010 - val_acc: 0.9999
Epoch 5/5
243296/243296 [==============================] - 12s 49us/step - loss: 6.7104e-04 - acc: 1.0000 - val_loss: 0.0011 - val_acc: 0.9999
Test loss: 0.00113550592472
Test accuracy: 0.999895189184
model_1 is better than the other: it reaches a validation accuracy of about 99.99%, compared with about 97.73% for the first model. Below is an example of how to use this model.
example_product = "Nikon Coolpix A10 Point and Shoot Camera (Black)"
example_product = preprocess(example_product)
example_sequence = tokenizer.texts_to_sequences([example_product])
example_padded_sequence = pad_sequences(example_sequence, maxlen=MAX_SEQUENCE_LENGTH)
probabilities = model_1.predict(example_padded_sequence, verbose=0)[0]
# np.argmax stands in for predict_classes, which newer TensorFlow releases no longer provide
predicted_class = int(np.argmax(probabilities))
print("-"*10)
print("Predicted category: ", category_reverse_index[predicted_class])
print("-"*10)
print("Clothing Probability: ", probabilities[category_index["clothing"]])
print("Camera Probability: ", probabilities[category_index["camera"]])
print("home appliances probability: ", probabilities[category_index["home-appliances"]])
----------
Predicted category: camera
----------
Clothing Probability: 5.12844e-21
Camera Probability: 0.505056
home appliances probability: 5.71945e-23
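The three scores above do not sum to 1 because the final layer uses a sigmoid activation, which squashes each output independently. The sketch below contrasts this with softmax, a common alternative for single-label problems that normalises the raw scores into a distribution. The logits are made-up numbers, not values read from the trained model, and using softmax would be a change to the architecture described in this post.

```python
import math

def sigmoid(logits):
    """Squash every raw score independently into (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def softmax(logits):
    """Turn raw scores into a distribution that sums to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for clothing, camera, home appliances
toy_logits = [-40.0, 0.02, -50.0]
sigmoid_scores = sigmoid(toy_logits)
softmax_scores = softmax(toy_logits)
print(sum(sigmoid_scores))
print(sum(softmax_scores))
```

Both activations pick the same winning class here; only the interpretation of the numbers as probabilities differs.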
Conclusion
My observation is that with neural networks, the time taken for feature engineering is considerably reduced, and researchers spend most of their time deciding on the architecture of the neural network layers. Word2Vec embedding greatly contributes to improving the accuracy of the model.
Hi Rajmak! Thanks for sharing your knowledge.
You’ve got excellent coding skills! I particularly love your script for ‘null word embeddings’! Awesome!
A few comments: for Word2Vec, we can safely ignore ‘stop words’. The optimizer ‘adam’ often does a better job than ‘rmsprop’. Last but not least, the dataset in general is small-ish, in particular for ‘electronics’ and ‘home appliances’. I love DL, but with a small dataset, a logistic regression might perform equally well.
Hi Franco, Thanks for your interest in my blog. I greatly appreciate your kind words and thoughtful comments that helped me improve.
Beginner here!
nb_words = min(MAX_NB_WORDS, len(word_index))+1
Can you explain the +1 here?
Hi Dhruv, Thanks for your interest in my blog.
The zeroth index of the input dimension (for the Embedding class of Keras) is reserved for masking/padding/no-data, and the tokenizer’s word index starts at 1, so the embedding matrix needs len(word_index) + 1 rows. A word in a sequence that is not in the word index (dictionary) is dropped by texts_to_sequences unless the tokenizer was created with an oov_token.
https://keras.io/layers/embeddings/
Hi Rajesh Manikka,
Very meaningful explanation.
I have followed your steps but got “Test loss: nan” after executing the code below:
model_1.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=128)
score = model_1.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
What did I miss?
Please Help.
Hi Ghanshyam,
Appreciate your feedback and thanks for your interest in my blog.
nan, or “Not a number”, could also mean the value is infinity. Please check your data; finding outliers and normalizing the data could help.
Hello Rajesh Manikka,
Thanks for your response.
I have taken the same code and data provided in your example.
For “model_1”, the first epoch gives a proper value.
Train on 243296 samples, validate on 162197 samples
Epoch 1/5
243296/243296 [==============================] - 13s 52us/step - loss: 8.3458e-04 - acc: 0.9999 - val_loss: 9.0927e-04 - val_acc: 0.9999
But Epoch 2/5 gives loss: nan, and so do all the remaining epochs.
Thanks. Please guide.
Hey Rajesh,
great Blog.
One question remains in the back of my mind. How does the Embedding Layer know which word is meant by the given index? So where is the connection between the tokenizer and the embedding layer?
Thank you Sebastian!
Kindly check this gist for the relation between tokenizer and embedding layer.
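As a rough illustration of that relation (a sketch with made-up values, not the contents of the linked gist): the tokenizer turns each word into an integer ID, and the embedding layer simply uses that ID as a row number in its weight matrix. The names `embed`, `toy_word_index` and `toy_matrix` are hypothetical.

```python
def embed(sequence, embedding_matrix):
    """Replace every word ID in a padded sequence by the matching matrix row."""
    return [embedding_matrix[word_id] for word_id in sequence]

toy_word_index = {"camera": 1, "spy": 2}
toy_matrix = [
    [0.0, 0.0],  # row 0: padding
    [0.9, 0.1],  # row 1: "camera"
    [0.2, 0.7],  # row 2: "spy"
]
padded_ids = [0, 0, toy_word_index["spy"], toy_word_index["camera"]]
embedded = embed(padded_ids, toy_matrix)
print(embedded)
```

The link therefore exists only through the shared numbering: the matrix must be filled using the same word index that produced the sequences.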