Text Classification – Classifying product titles using Convolutional Neural Network and Word2Vec embedding

2018-05-26T01:21:24+05:30

Hi Rajmak! Thanks for sharing your knowledge.

You've got excellent coding skills! I particularly love your script for ‘null word embeddings’. Awesome!

A few comments: for Word2Vec, we can safely ignore ‘stop words’. The ‘adam’ optimizer often does a better job than ‘rmsprop’. Last but not least, the dataset in general is small-ish, in particular for ‘electronics’ and ‘home appliances’. I love DL, but with a small dataset, a logistic regression might perform equally well.

2018-05-28T13:56:30+05:30

Hi Franco, Thanks for your interest in my blog. I greatly appreciate your kind words and thoughtful comments that helped me improve.

2018-07-17T19:27:17+05:30

Beginner here!

nb_words = min(MAX_NB_WORDS, len(word_index))+1

Can you explain the +1 here?

2018-07-22T14:35:22+05:30

Hi Dhruv, Thanks for your interest in my blog.

Index 0 of the input dimension (for the Keras Embedding class) is reserved for masking/padding/no-data, and the tokenizer's word_index starts at 1. The +1 makes room for that extra zeroth row.
If a sequence contains a word that is not in the word index (dictionary), the tokenizer skips it, and the shortened sequence is then padded with the zeroth index.
https://keras.io/layers/embeddings/
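
To make the arithmetic concrete, here is a minimal sketch in plain Python. The toy `word_index`, the small `MAX_SEQUENCE_LENGTH`, and the helper `to_padded_ids` are assumptions for illustration only; the helper imitates what `texts_to_sequences` followed by `pad_sequences` does with default settings, it is not the Keras code itself.

```python
# Hypothetical toy vocabulary; the Keras Tokenizer numbers words from 1, never 0.
word_index = {"camera": 1, "sports": 2, "action": 3, "pen": 4}

MAX_NB_WORDS = 200000
MAX_SEQUENCE_LENGTH = 6  # shortened for the example


def to_padded_ids(title, word_index, maxlen):
    """Map known words to their ids, skip unknown ones, left-pad with 0."""
    ids = [word_index[w] for w in title.split() if w in word_index]
    ids = ids[-maxlen:]  # keep the last maxlen ids, like pad_sequences' default
    return [0] * (maxlen - len(ids)) + ids


# "drone" is not in the vocabulary, so it is skipped.
padded = to_padded_ids("sports action drone camera", word_index, MAX_SEQUENCE_LENGTH)

# Ids run from 0 (padding) up to len(word_index), so the Embedding layer
# needs len(word_index) + 1 rows to cover every id it can receive.
nb_words = min(MAX_NB_WORDS, len(word_index)) + 1

print(padded)
print(nb_words)
```

Without the +1, the largest word id would point one row past the end of the embedding matrix.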

2019-05-13T16:58:56+05:30

Hi Rajesh Manikka,

Very meaningful explanation.
I have followed your steps but got “Test loss: nan” after executing the code below:

model_1.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5, batch_size=128)
score = model_1.evaluate(x_val, y_val, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

What did I miss?
Please help.

2019-05-14T16:32:54+05:30

Hi Ghanshyam,
I appreciate your feedback, and thanks for your interest in my blog.
nan, or “Not a number”, can also result from a value blowing up to infinity. Please check your data; finding outliers and normalizing the data could help.
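
As a first step in checking the data, a small sketch like the one below can point at entries that are NaN or infinite. The helper `find_non_finite` and the toy `features` list are made up for illustration; with NumPy arrays such as `x_train`, `np.isfinite(x_train).all()` performs the same check in one call. This only rules out bad input values and may not be the cause of a nan loss.

```python
import math


def find_non_finite(rows):
    """Return (row, column) positions whose value is NaN or +/-infinity."""
    return [
        (r, c)
        for r, row in enumerate(rows)
        for c, value in enumerate(row)
        if not math.isfinite(value)
    ]


# Hypothetical toy data standing in for the real feature or label arrays.
features = [[1.0, 2.0], [float("nan"), 4.0], [5.0, float("inf")]]
bad = find_non_finite(features)
print("non-finite entries at:", bad)
```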

2019-05-14T18:20:07+05:30

Hello Rajesh Manikka,
Thanks for your response.
I have used the same code and data provided in your example.

For “model_1”, the first epoch gives proper values:
Train on 243296 samples, validate on 162197 samples
Epoch 1/5
243296/243296 [==============================] - 13s 52us/step - loss: 8.3458e-04 - acc: 0.9999 - val_loss: 9.0927e-04 - val_acc: 0.9999

But Epoch 2/5 gives loss: nan, and so do all the remaining epochs.

Thanks. Please guide.

2019-08-17T14:41:30+05:30

Hey Rajesh,

Great blog.
One question remains in the back of my mind. How does the Embedding Layer know which word is meant by the given index? So where is the connection between the tokenizer and the embedding layer?

2019-08-27T20:17:37+05:30

Thank you Sebastian!
Kindly check this gist for the relation between tokenizer and embedding layer.

	import numpy as np
	import pandas as pd
	from gensim.models import KeyedVectors
	from keras.layers import Flatten
	from keras.layers import MaxPooling1D
	from keras.models import Model
	from keras.preprocessing.sequence import pad_sequences
	from keras.preprocessing.text import Tokenizer
	from keras.utils import to_categorical
	from nltk.corpus import stopwords

	MAX_NB_WORDS = 200000
	MAX_SEQUENCE_LENGTH = 30
	EMBEDDING_DIM = 300

	EMBEDDING_FILE = "../lib/GoogleNews-vectors-negative300.bin"
	category_index = {"clothing":0, "camera":1, "home-appliances":2}
	category_reverse_index = dict((y,x) for (x,y) in category_index.items())
	STOPWORDS = set(stopwords.words("english"))

	clothing = pd.read_csv("clothing.tsv", sep='\t')
	cameras = pd.read_csv("cameras.tsv", sep='\t')
	home_appliances = pd.read_csv("home.tsv", sep='\t')

	datasets = [clothing, cameras, home_appliances]

	print("Make sure there are no null values in the datasets")
	for data in datasets:
	    print("Has null values: ", data.isnull().values.any())

	def preprocess(text):
	    # Lowercase, split on whitespace, and drop English stop words.
	    text = text.strip().lower().split()
	    text = filter(lambda word: word not in STOPWORDS, text)
	    return " ".join(text)

	for dataset in datasets:
	    dataset['title'] = dataset['title'].apply(preprocess)

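	# Stack the three title columns into one Series; `+` would join the strings row by row instead.
	all_texts = pd.concat([clothing['title'], cameras['title'], home_appliances['title']])
	# keep=False drops every title that occurs more than once.
	all_texts = all_texts.drop_duplicates(keep=False)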

	tokenizer = Tokenizer(num_words=MAX_NB_WORDS)
	tokenizer.fit_on_texts(all_texts)

	clothing_sequences = tokenizer.texts_to_sequences(clothing['title'])
	electronics_sequences = tokenizer.texts_to_sequences(cameras['title'])
	home_appliances_sequences = tokenizer.texts_to_sequences(home_appliances['title'])

	clothing_data = pad_sequences(clothing_sequences, maxlen=MAX_SEQUENCE_LENGTH)
	electronics_data = pad_sequences(electronics_sequences, maxlen=MAX_SEQUENCE_LENGTH)
	home_appliances_data = pad_sequences(home_appliances_sequences, maxlen=MAX_SEQUENCE_LENGTH)

	word_index = tokenizer.word_index
	test_string = "sports action spy pen camera"
	print("word\t\tid")
	print("-" * 20)
	for word in test_string.split():
	    print("%s\t\t%s" % (word, word_index[word]))

	test_sequence = tokenizer.texts_to_sequences(["sports action camera", "spy pen camera"])
	padded_sequence = pad_sequences(test_sequence, maxlen=MAX_SEQUENCE_LENGTH)
	print("Text to Vector", test_sequence)
	print("Padded Vector", padded_sequence)

	print("clothing: \t\t", to_categorical(category_index["clothing"], 3))
	print("camera: \t\t", to_categorical(category_index["camera"], 3))
	print("home appliances: \t", to_categorical(category_index["home-appliances"], 3))

	print("clothing shape: ", clothing_data.shape)
	print("electronics shape: ", electronics_data.shape)
	print("home appliances shape: ", home_appliances_data.shape)

	data = np.vstack((clothing_data, electronics_data, home_appliances_data))
	category = pd.concat([clothing['category'], cameras['category'], home_appliances['category']]).values
	category = to_categorical(category)
	print("-"*10)
	print("combined data shape: ", data.shape)
	print("combined category/label shape: ", category.shape)

	VALIDATION_SPLIT = 0.4
	indices = np.arange(data.shape[0]) # get sequence of row index
	np.random.shuffle(indices) # shuffle the row indexes
	data = data[indices] # shuffle data/product-titles/x-axis
	category = category[indices] # shuffle labels/category/y-axis
	nb_validation_samples = int(VALIDATION_SPLIT * data.shape[0])
	x_train = data[:-nb_validation_samples]
	y_train = category[:-nb_validation_samples]
	x_val = data[-nb_validation_samples:]
	y_val = category[-nb_validation_samples:]
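
The code above stops at the train/validation split. The step that ties the tokenizer to the Embedding layer can be sketched roughly as follows: the tokenizer assigns each word an integer id, and row i of the embedding matrix holds the vector for the word with id i. The toy `word_index`, the two-dimensional `pretrained` vectors, and the helper `build_embedding_matrix` are assumptions for illustration; in the post the ids would come from `tokenizer.word_index` and the vectors from the word2vec file, and the post's own construction may differ.

```python
# Hypothetical stand-ins for tokenizer.word_index and the pretrained word2vec vectors.
word_index = {"camera": 1, "sports": 2, "pen": 3}
pretrained = {"camera": [0.1, 0.2], "sports": [0.3, 0.4]}  # "pen" has no vector
EMBEDDING_DIM = 2  # 300 for the GoogleNews vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word whose tokenizer id is i."""
    # One extra row because id 0 is reserved for padding.
    matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = list(vectors[word])
        # Words without a pretrained vector keep an all-zero row.
    return matrix


embedding_matrix = build_embedding_matrix(word_index, pretrained, EMBEDDING_DIM)

# An Embedding layer initialised with this matrix looks rows up by id.
padded_title = [0, 2, 1]  # padding, "sports", "camera"
looked_up = [embedding_matrix[i] for i in padded_title]
print(looked_up)
```

So the Embedding layer never sees the words themselves; the only link is that both sides agree on the same integer ids.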

Published by rajmak
