Edit: As pointed out in the comments, my initial claim that this beats the winning solution turned out to be false. The prize was judged on a dataset drawn from a later time period than the training set.

If you are familiar with the Netflix prize challenge, you may remember the final solution that won the $1M prize. It was a blend of solutions from a few winning teams and probably was not the most elegant of solutions. There were some reports that Netflix never put the final solution into production.

Here is a much simpler neural network based solution ~~that beats the top result on a validation set carved from the original dataset, and should not take more than 3 hours to run and a few minutes to code~~.

Code for the model as implemented in Keras:

from keras.models import Sequential
from keras.layers import Embedding, Merge, Flatten, Dense, Activation

movie_count = 17771
user_count = 2649430

model_left = Sequential()
model_left.add(Embedding(movie_count, 60, input_length=1))

model_right = Sequential()
model_right.add(Embedding(user_count, 20, input_length=1))

model = Sequential()
model.add(Merge([model_left, model_right], mode='concat'))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adadelta')

In short, there are two embeddings: a 60-dimensional embedding for movies and a 20-dimensional embedding for users. The embeddings are concatenated to form a single 80-dimensional input vector.

The Embedding layer in Keras provides a mapping from an integer to a vector of specified length initialized randomly. For instance this code

model_left.add(Embedding(movie_count, 60, input_length=1))

initializes a matrix of dimensions movie_count x 60 (17771 x 60) randomly. Similarly for users, a matrix of dimensions 2649430 x 20 is initialized. During the training phase, the vector for each user and movie is updated so as to reduce the error. Ultimately, at the end of training, users with similar interests should move closer together, and similar movies should move closer together, in their respective embedding spaces.
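As a toy illustration of the lookup-table view described above (miniature sizes, plain numpy rather than Keras, purely for intuition):

```python
import numpy as np

# Hypothetical miniature example: an Embedding layer is just a trainable
# lookup table, initialized randomly. Here movie_count=5 and dim=3 instead
# of the post's 17771 x 60.
rng = np.random.default_rng(0)
movie_embeddings = rng.normal(scale=0.05, size=(5, 3))  # 5 movies, 3 dims each

movie_id = 2
vector = movie_embeddings[movie_id]  # the "embedding" of movie 2

print(vector.shape)  # (3,)
```

During training, backpropagation updates only the rows of this table that were looked up in the batch.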

The network learns a mapping between this embedding vector and the rating. The model is fit as:

model.fit([tr[:,0].reshape((L,1)), tr[:,1].reshape((L,1))],
          tr[:,2].reshape((L,1)),
          batch_size=24000,
          nb_epoch=42,
          validation_data=([ts[:,0].reshape((M,1)), ts[:,1].reshape((M,1))],
                           ts[:,2].reshape((M,1))))

In the above code, tr is the training data: triples of movie_id, user_id, rating. L is 90432456 and M is 10048051, roughly a 90%/10% split between training and validation data. Another thing to note is that the data set is randomly shuffled before the split, so training is done on a random subset.
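The shuffle-then-split step can be sketched on toy data (real sizes are noted in the comments; the data values here are made up):

```python
import numpy as np

# Sketch of the shuffle-then-split described above, on toy data.
# Real sizes: L = 90432456 training rows, M = 10048051 validation rows.
data = np.array([[m, u, 3] for m in range(4) for u in range(5)])  # 20 toy triples
rng = np.random.default_rng(42)
rng.shuffle(data)                 # shuffle rows in place

split = int(0.9 * len(data))      # ~90/10 split, as in the post
tr, ts = data[:split], data[split:]
print(tr.shape, ts.shape)
```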

After about 40 epochs you should see a validation error around 0.7006, or a root mean squared error of 0.837. ~~This is around 2% better than the million dollar prize winning error rate of 0.8553.~~

Epoch 39/64 90432456/90432456 [==============================] - 1183s - loss: 0.6475 - val_loss: 0.7006

This model does not use the rating date. The results can probably be improved by using all available data.

It's easy to get 0.85x scores on a random subset.

Try to validate your model on the original NetflixPrize probe set and report the RMSE again 🙂

It will be 0.9xx I guess.

I’d be a bit skeptical here as well. Deep layered models were tested during that contest against the RBMs with, as I recall, very little to no success. Also, the probe set and the contest test set used a fixed number of predictions per customer, which is different from a random selection from the entire training set; the latter skews towards customers with more ratings, and therefore lower model error.

Thanks Michael, Aron. You are probably right. I am currently running the training on all of the training set (minus the probe) and validating on the probe set. I will post the results as soon as I have them.

Yes, please do post. I’m curious to see the results as well.

Michael was right: on the exact same model, the validation error rate is 0.9357. I will update the post. I am also tuning the model and adding the time factor. I will post back if I can get the score below 0.9.

Could you add a link to the dataset?

Here is a link to the dataset http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a

Curious if you have the code to process the data-set into tr, somewhere?

Also, as requested by Akash, here is the shell script I used (on OS X):

for file in `ls`; do m=`echo $file | sed -e 's/mv_//' | sed -e 's/.txt//'`; cat $file | sed -e "s/^/${m},/"; done > ../ds

cd ..

grep -v ":" ds > ds_clean

cat ds_clean | gshuf > shuffed_ds

head -n 90432456 shuffed_ds > tr

tail -n 10048051 shuffed_ds > ts
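For readers not on a Mac, here is a hypothetical Python equivalent of the per-movie parsing step. It assumes the Netflix per-movie file format (first line is the movie id followed by a colon, then user,rating,date rows); the function name is my own invention:

```python
import io

# Turn one Netflix per-movie file (first line "movie_id:", then
# "user,rating,date" rows) into (movie, user, rating) triples.
def parse_movie_file(f):
    movie_id = int(f.readline().rstrip().rstrip(':'))
    triples = []
    for line in f:
        user, rating, _date = line.rstrip().split(',')  # date is unused
        triples.append((movie_id, int(user), int(rating)))
    return triples

# Tiny in-memory stand-in for a real mv_0002391.txt file:
sample = io.StringIO("2391:\n1447366,3,2004-08-09\n2045025,5,2005-10-11\n")
print(parse_movie_file(sample))
```

Running this over all mv_*.txt files and concatenating the results would produce the same triples as the shell pipeline above.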

Thank you for the additional information! I have not seen this before – is it accurate to say that what is happening is not unlike word2vec, where a one-hot encoding for users stacked with movies is being transferred to a continuous vector space….BUT we are also at the same time learning the relationship between these vectors and the ratings? So when a test set is scored, their user/movie IDs are mapped to the vector space and then a rating is predicted?

Yes, that is correct.
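The scoring path described above can be sketched in plain numpy. Sizes are miniatures of the post's 60- and 20-dimensional embeddings, and the weights are random, so this shows only the data flow, not a trained model:

```python
import numpy as np

# Toy sketch of prediction: look up the two embeddings, concatenate,
# and run the result through small dense layers.
rng = np.random.default_rng(1)
movie_emb = rng.normal(size=(10, 6))   # 10 movies x 6 dims
user_emb = rng.normal(size=(20, 2))    # 20 users x 2 dims
W1 = rng.normal(size=(8, 4)); b1 = np.zeros(4)   # hidden layer
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def predict(movie_id, user_id):
    x = np.concatenate([movie_emb[movie_id], user_emb[user_id]])  # 8-dim input
    h = sigmoid(x @ W1 + b1)
    return (h @ W2 + b2)[0]  # predicted rating (unbounded, untrained)

print(predict(3, 7))
```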

Would you mind sharing the parameters you used to achieve these results? I would like to try and replicate them myself and play around with the model a bit.

Hi Ben,

I do not have any parameters other than the ones shown; everything else is a Keras default. I guess that by tuning the parameters a better solution can be obtained on the probe set as well.

I can't reproduce since I am not on a Mac (the shell script), but what is tr – is it a matrix? What is the structure being passed to fit for X and y?

tr is just a 90432456 x 3 matrix: the first column is the movie-id (as provided by Netflix), the second column is the user-id (again as provided by Netflix) and the third column is the rating. It is generated from the tr file, which looks like:

0002391,1447366,3,2004-08-09

0006287,2045025,5,2005-10-11

0016922,2018881,4,2005-03-01

0012317,2236860,5,2005-10-23

0002152,20408,4,2005-05-20

The 4th column is not used in the above model.
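One way to load such lines into that matrix while dropping the date column, sketched with numpy on an in-memory sample (the real file is of course far larger):

```python
import io
import numpy as np

# Read comma-separated movie,user,rating,date lines into an N x 3 integer
# matrix, keeping only the first three columns (the date is unused).
text = "0002391,1447366,3,2004-08-09\n0006287,2045025,5,2005-10-11\n"
tr = np.genfromtxt(io.StringIO(text), delimiter=',', usecols=(0, 1, 2),
                   dtype=np.int64)
print(tr)
```

With the real file you would pass the path instead of the StringIO object, though a streaming parser would be gentler on memory at 90M rows.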

Do you happen to know what Flatten() does? I see you need to use it, but I am not sure why, or what it does.

I guess in this example the output of the merged layer is (batch_size x 1 x 80) and Flatten makes it (batch_size x 80). I am not entirely sure though.
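That reshaping can be illustrated with numpy (sizes assumed from the 60 + 20 concatenation in the post):

```python
import numpy as np

# Flatten collapses all non-batch axes into one, e.g. the merged
# embedding output (batch, 1, 80) becomes (batch, 80).
merged = np.zeros((4, 1, 80))               # toy batch of 4
flat = merged.reshape(merged.shape[0], -1)  # same effect as Flatten()
print(flat.shape)  # (4, 80)
```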

I see that you use a sigmoid activation function. Do you normalize the ratings?

No, I do not normalize the ratings; that would be a good thing to try.
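A minimal sketch of what such normalization could look like (hypothetical, not part of the original model): map the 1–5 star ratings onto [0, 1] so they match a final sigmoid output, then invert at prediction time.

```python
import numpy as np

# Map 1..5 stars to 0..1 and back; useful if the output layer were a sigmoid.
ratings = np.array([1, 3, 5], dtype=float)
normalized = (ratings - 1.0) / 4.0   # 1..5 -> 0..1
restored = normalized * 4.0 + 1.0    # 0..1 -> back to 1..5
print(normalized, restored)
```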

Cool example!

model.fit([tr[:,0].reshape((L,1)), tr[:,1].reshape((L,1))]

Here you are passing in a LIST for X? And the list has two elements, each of which is a matrix, and both are shape 90432456 x 1. Is that a correct understanding of how Keras processes the two inputs to the concatenated embedding matrix?

Yes. I am passing in a list where each element corresponds to a model in the merged model. You could theoretically add other features (like date features or movie features) to the merged model and make it richer and hopefully better.

Do you happen to have any example of that – can you add “normal” features (not embeddings) into the model that has a merged embedding layers like this?