Deep learning solution for the Netflix Prize

Edit: As pointed out in the comments, my initial claim that this beats the winning solution turned out to be false. The prize was judged on a dataset drawn from a later time period than the training set.

If you are familiar with the Netflix Prize challenge, you will remember the final solution that won the $1M prize. It was a mix of solutions from a few winning teams and was probably not the most elegant of solutions. There were some reports that Netflix did not use the final solution.

Here is a much simpler neural-network-based solution that beats the top result on a validation set carved from the original dataset. It should not take more than 3 hours to run and a few minutes to code.

Code for the model as implemented in Keras:

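# Imports assume the legacy Keras API that still ships the Merge layer.
from keras.models import Sequential
from keras.layers.core import Activation, Dense, Flatten, Merge
from keras.layers.embeddings import Embedding

movie_count = 17771
user_count = 2649430

# Movie branch: one 60-dimensional vector per movie id.
model_left = Sequential()
model_left.add(Embedding(movie_count, 60, input_length=1))

# User branch: one 20-dimensional vector per user id.
model_right = Sequential()
model_right.add(Embedding(user_count, 20, input_length=1))

# Concatenate both embeddings, then three sigmoid layers and a linear output.
model = Sequential()
model.add(Merge([model_left, model_right], mode='concat'))
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(64))
model.add(Activation('sigmoid'))
model.add(Dense(1))
model.compile(loss='mean_squared_error', optimizer='adadelta')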

In short, there are 2 embeddings: a 60-dimensional embedding for each movie and a 20-dimensional embedding for each user. The embeddings are concatenated to form a single 80-dimensional input vector.
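To get a feel for where the model's capacity sits, here is a rough parameter count worked out from the layer sizes in the code above. It is a back-of-the-envelope sketch that assumes each Dense layer carries a bias term (the Keras default); the helper name dense_params is made up for illustration.

```python
# Sizes taken from the model definition in the post.
movie_count, movie_dim = 17771, 60
user_count, user_dim = 2649430, 20


def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer (assumes a bias term)."""
    return n_in * n_out + n_out


# Each embedding is a plain lookup table: one row per id, no bias.
embedding_params = movie_count * movie_dim + user_count * user_dim

# 80 -> 64 -> 64 -> 64 -> 1, matching the Dense layers in the post.
layer_sizes = [movie_dim + user_dim, 64, 64, 64, 1]
dense_total = sum(dense_params(a, b)
                  for a, b in zip(layer_sizes, layer_sizes[1:]))

print("embedding parameters:", embedding_params)
print("dense parameters:", dense_total)
print("total:", embedding_params + dense_total)
```

Almost all of the parameters live in the user embedding table; the dense layers on top are tiny by comparison.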

The Embedding layer in Keras provides a mapping from an integer to a vector of a specified length, initialized randomly. For instance, this code

model_left.add(Embedding(movie_count, 60, input_length=1))

 

initializes a matrix of dimensions movie_count x 60 (17771 x 60) randomly. Similarly, for users a matrix of dimensions 2649430 x 20 is initialized. During training, the vectors for each user and movie are updated so as to reduce the error. By the end of training, users with similar interests should have moved closer together, and so should similar movies, in their respective embedding spaces.
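The lookup itself is nothing more than row indexing. The sketch below mimics it in plain Python with tiny stand-in tables; it is not how Keras implements the layer, and the names make_embedding and network_input, as well as the initialization range, are assumptions made for illustration.

```python
import random


def make_embedding(num_ids, dim, rng):
    """Return a num_ids x dim table of small random values, one row per id."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(num_ids)]


rng = random.Random(0)
# Tiny stand-in id counts; the post uses 17771 movies and 2649430 users.
movie_table = make_embedding(10, 60, rng)
user_table = make_embedding(25, 20, rng)


def network_input(movie_id, user_id):
    """Look up both rows and concatenate them into one 80-dimensional vector."""
    return movie_table[movie_id] + user_table[user_id]


x = network_input(3, 7)
print(len(x))
```

During training, only the rows that were looked up for a given batch receive gradient updates, which is how each movie and user ends up with its own learned vector.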

The network learns a mapping between this embedding vector and the rating.  The model is fit as:

model.fit([tr[:,0].reshape((L,1)), tr[:,1].reshape((L,1))],
          tr[:,2].reshape((L,1)),
          batch_size=24000, nb_epoch=42,
          validation_data=([ts[:,0].reshape((M,1)), ts[:,1].reshape((M,1))],
                           ts[:,2].reshape((M,1))))

In the above code, tr is the training data, made up of triples of movie_id, user_id, rating, and ts is the validation data in the same format. L is 90432456 and M is 10048051, roughly a 90%/10% split between training and validation data. Note that the dataset is randomly shuffled before the split, so training is done on a random subset.
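The shuffle-then-split step can be sketched as follows. This is a minimal illustration on synthetic triples, not the code used for the post; the function name shuffle_split, the seed, and the toy ratings are all assumptions.

```python
import random


def shuffle_split(ratings, train_percent=90, seed=0):
    """Shuffle a copy of the ratings, then cut it into train and validation."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Synthetic (movie_id, user_id, rating) triples standing in for the real data.
ratings = [(movie_id, user_id, 1 + (movie_id + user_id) % 5)
           for movie_id in range(10) for user_id in range(10)]

tr, ts = shuffle_split(ratings)
L, M = len(tr), len(ts)
print(L, M)

# The sizes quoted in the post are consistent with a 90% cut rounded down.
total = 90432456 + 10048051
print(total * 90 // 100)
```

Because the split is random rather than chronological, the validation ratings come from the same time period as the training ratings.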

After about 40 epochs you should see a validation error (mean squared error) of around 0.7006, which corresponds to a root mean squared error of 0.837. This is around 2% better than the million-dollar prize-winning RMSE of 0.8553.

Epoch 39/64
90432456/90432456 [==============================] - 1183s - loss: 0.6475 - val_loss: 0.7006
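The conversion from the logged val_loss to RMSE, and the size of the gap to the winning score, is simple arithmetic. The snippet below only uses the two numbers quoted in this post; keep in mind that they were measured on different datasets.

```python
import math

val_mse = 0.7006        # val_loss from the training log (mean squared error)
winning_rmse = 0.8553   # prize-winning RMSE as quoted in the post

val_rmse = math.sqrt(val_mse)
relative_gain = (winning_rmse - val_rmse) / winning_rmse

print("validation RMSE: %.3f" % val_rmse)
print("relative difference: %.1f%%" % (100 * relative_gain))
```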
 

This model does not take the rating date into account. The results can probably be improved by using all of the available data.

Posted in Uncategorized | 22 Comments

Controlling robots using Python

I started taking a course on robotics by Prof. Peter Corke and ordered this robot on eBay for the project. Prof. Corke has an excellent MATLAB toolbox for controlling and visualizing robots. However, the Python version is incomplete and some of the tools did not work out of the box. I am now writing a simplified version of the robot control code. The idea is to provide the basics of the MATLAB robot toolbox plus some more practical stuff. Ultimately I want to integrate this with Caffe to build an automated robot.

 


Installing Caffe on Yosemite

Step 1:

Follow the instructions for installing Caffe on OS X 10.9 at http://caffe.berkeleyvision.org/installation.html

Make sure you uninstall all the required packages in brew, modify the formulae, and then reinstall the packages.

Step 2:

Edit the Makefile: look for ifneq ($(findstring 10.9, $(shell sw_vers -productVersion)),) and replace 10.9 with 10.10.
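If you would rather script this edit than do it by hand, something along these lines should work. It is only a sketch: the helper name patch_version_check is made up, and the sample text is a stand-in for the real Makefile, so review the result before overwriting anything.

```python
OLD_CHECK = "$(findstring 10.9, $(shell sw_vers -productVersion))"
NEW_CHECK = "$(findstring 10.10, $(shell sw_vers -productVersion))"


def patch_version_check(makefile_text):
    """Swap the 10.9 version test for 10.10 and leave everything else alone."""
    return makefile_text.replace(OLD_CHECK, NEW_CHECK)


# Stand-in for the relevant Makefile lines (the body is elided on purpose).
sample = (
    "ifneq ($(findstring 10.9, $(shell sw_vers -productVersion)),)\n"
    "\t# ... body unchanged ...\n"
    "endif\n"
)
patched = patch_version_check(sample)
print(patched)
```

To apply it to the real file, read the Makefile into a string, pass it through the function, and write the result back after keeping a backup copy.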

Step 3:

Fix BLAS_INCLUDE and the framework in LDFLAGS:

        else ifeq ($(OSX), 1)
                # OS X packages atlas as the vecLib framework
                # BLAS_INCLUDE ?= /System/Library/Frameworks/vecLib.framework/Versions/Current/Headers/
                BLAS_INCLUDE ?= /System/Library/Frameworks/Accelerate.framework/Versions/Current/Frameworks/vecLib.framework/Versions/Current/Headers
                LIBRARIES += cblas
                # LDFLAGS += -framework vecLib
                LDFLAGS += -framework Accelerate
        endif

Step 4:

Run make; it should compile fine.


libgpuarray installation issues

Here are some issues I faced installing Theano and libgpuarray, and the solutions for them:

  1. Installation fails because gpuarray cannot be found
    1. Use: python setup.py build_ext -L /usr/local/lib -I /usr/local/include and then python setup.py install
  2. Error importing pygpu
    1. If CUDA is installed on the system, you need to add the library path to LD_LIBRARY_PATH, like: export LD_LIBRARY_PATH=/usr/local/lib:$LD_LIBRARY_PATH
  3. The example convolution_mlp.py does not run on CUDA
    1. Try running the job as root
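For the LD_LIBRARY_PATH fix, a small helper can build the export line without adding the directory twice. This is a convenience sketch only; the function name prepend_lib_dir is hypothetical, and the default directory is simply the one used in the export command above.

```python
import os


def prepend_lib_dir(current_value, lib_dir="/usr/local/lib"):
    """Return an LD_LIBRARY_PATH value that contains lib_dir exactly once."""
    entries = [p for p in current_value.split(":") if p]
    if lib_dir in entries:
        return ":".join(entries)
    return ":".join([lib_dir] + entries)


# Print the line to paste into the shell (or a shell profile).
current = os.environ.get("LD_LIBRARY_PATH", "")
print("export LD_LIBRARY_PATH=" + prepend_lib_dir(current))
```

The variable has to be set in the shell before Python starts, since the dynamic loader reads it at process startup; that is why the sketch prints an export line instead of modifying os.environ.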

LevelDB vs. Berkeley DB

I had a requirement to index a good chunk of data (around 250M key-value pairs) and wanted to try out both Berkeley DB and LevelDB. Here are some metrics from running on an Amazon m1.xlarge machine.

  1. Time to build the index / db:
    • Berkeley DB: 1 hr 52 min
    • LevelDB: 33 min
  2. Time to look up 6719350 keys from the db (random lookup) on the second run:
    • Berkeley DB (cold): 25m 2s
    • Berkeley DB (hot): 1m 41s
    • LevelDB (cold): 3m 16s
    • LevelDB (hot): 2m 18s
  3. DB size:
    • Berkeley DB: 9.5G
    • LevelDB: 3.5G
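To make the lookup numbers easier to compare, the timings above can be converted into lookups per second. The snippet uses only the figures listed in this post; the variable and function names are arbitrary.

```python
KEYS = 6719350  # number of keys looked up in the benchmark


def seconds(minutes, secs=0):
    """Convert a minutes/seconds pair into seconds."""
    return minutes * 60 + secs


lookup_times = {
    "Berkeley DB (cold)": seconds(25, 2),
    "Berkeley DB (hot)": seconds(1, 41),
    "LevelDB (cold)": seconds(3, 16),
    "LevelDB (hot)": seconds(2, 18),
}

throughput = {name: KEYS / float(t) for name, t in lookup_times.items()}
for name in sorted(throughput):
    print("%-20s %8.0f lookups/s" % (name, throughput[name]))

# Index build time: 1 hr 52 min for Berkeley DB against 33 min for LevelDB.
build_ratio = seconds(112) / float(seconds(33))
print("Berkeley DB build time is %.1fx LevelDB's" % build_ratio)
```

Seen this way, LevelDB is far ahead on a cold cache, while Berkeley DB is somewhat faster once the data is hot.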

How to create a set of indicator (boolean / one-hot) variables from a categorical (factor) variable in R

Here is an example that expands a categorical variable (a factor in R) into indicator variables.

 data = cbind(data,model.matrix( ~ 0 + user_state, data))

Here user_state is a variable containing 51 values (1 for each state in the US). After the operation, the data variable contains 51 additional indicator variables, 1 for each state.
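For readers who do not use R, here is a rough Python analogue of what the model.matrix call produces: one 0/1 column per distinct level, named by prefixing the level with the variable name. It is an illustrative sketch with a made-up function name (one_hot) and made-up sample values, not a drop-in replacement for the R code.

```python
def one_hot(values, prefix):
    """Return (column_names, rows) with one 0/1 indicator column per level."""
    levels = sorted(set(values))
    columns = [prefix + level for level in levels]
    rows = [[1 if value == level else 0 for level in levels]
            for value in values]
    return columns, rows


# Made-up sample; the post's user_state variable has 51 levels.
columns, rows = one_hot(["CA", "NY", "CA", "TX"], "user_state")
print(columns)
for row in rows:
    print(row)
```

Each row has exactly one 1, in the column for that row's state.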


s3cmd sync fails with Problem: OSError: [Errno 22] Invalid argument

I had s3cmd sync failing with Error 22, but s3cmd get was working fine. On closer examination of the code, it turns out that get uses "ab" mode for writing the destination file, whereas sync uses "wb" mode (as get assumes that we are creating a new file). To get around this problem you will have to edit s3cmd and change the write mode to "ab", and you will also have to add a line that deletes the file before opening it in "ab" mode.
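The shape of that change is roughly the following. This is not the actual s3cmd source, just a standalone sketch of the "delete, then open in ab mode" pattern; the function name open_for_fresh_append and the file name are made up.

```python
import os
import tempfile


def open_for_fresh_append(path):
    """Delete any existing file at path, then open it in "ab" mode."""
    if os.path.exists(path):
        os.remove(path)
    return open(path, "ab")


# Usage: a stale file is discarded first, so only the new bytes remain.
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "download.bin")
with open(target, "wb") as f:
    f.write(b"stale contents")
with open_for_fresh_append(target) as f:
    f.write(b"new contents")
with open(target, "rb") as f:
    result = f.read()
print(result)
```

Removing the file first keeps the behaviour of "wb" (start from an empty file) while using the "ab" mode that did not trigger the error.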
