Nothing in this post is new research; this is all relatively basic machine learning. But it is fun.
This is the first day of my vacation. Time for some machine learning! Today, I will create embeddings. I will use the relatively big dataset from the Netflix Prize competition to create embeddings for movies.
We develop a way to automatically find the American counterpart of a Japanese movie!
After some preprocessing of the Netflix data, I ended up with 100,000,000 rows of data looking like this:
array([[ 14550, 108683,      4],
       [ 10583, 222881,      2],
       [ 16278,   3416,      5],
       ...,
       [  8131, 144114,      3],
       [  5862, 477329,      5],
       [ 10928, 225987,      3]])
The first column is the movie ID, the second column is the user ID, and the last one is a rating 1–5 that the user assigned the movie. I know the titles of the movies, but have not made them available to the algorithm.
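As a minimal sketch (not the actual preprocessing code), this is how an array in that layout could be split into network inputs and targets. The variable names are assumptions, and the sketch assumes the raw IDs are used directly as embedding indices:

```python
import numpy as np

# A few rows in the (movie ID, user ID, rating) layout shown above.
ratings = np.array([[14550, 108683, 4],
                    [10583, 222881, 2],
                    [16278, 3416, 5]])

movie_ids = ratings[:, 0]                  # first column: movie ID
user_ids = ratings[:, 1]                   # second column: user ID
targets = ratings[:, 2].astype("float32")  # third column: rating 1-5

# Assumption: IDs index embedding rows directly, so each embedding
# table needs at least (largest ID + 1) rows.
num_movies = int(movie_ids.max()) + 1
num_users = int(user_ids.max()) + 1
```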
I used this data to train the following neural network:
The network takes the movie and user IDs as input, and returns the rating.
After training the neural network, we have a 40-dimensional embedding of every movie in the dataset. These vectors are hard to visualize directly, but we can plot the first two principal components after applying PCA. Here are the 200 movies with the most ratings:
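For readers who want to reproduce this kind of plot, here is a rough sketch of a PCA projection using only NumPy. The helper name `pca_2d` is hypothetical, and the random matrix is only a stand-in for the real embedding matrix:

```python
import numpy as np

def pca_2d(points):
    """Project the rows of `points` onto their first two principal components."""
    centered = points - points.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Random stand-in for 200 movies with 40-dimensional embeddings.
rng = np.random.default_rng(0)
movie_points = rng.normal(size=(200, 40))
coords = pca_2d(movie_points)  # shape (200, 2), ready for a scatter plot
```

Each row of `coords` can then be drawn as one labelled point in a scatter plot.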
Some things make sense here. All three Lord of the Rings movies are really close to each other and so are the Kill Bill movies. Remember, the network did not have access to the names of the movies.
But perhaps the network has learnt some more interesting structure of the movies? Let’s define the following function:
def find_american_counterpart(japanese_name, n=10):
    # Use The Ring/Ringu as the canonical example
    from_japan_to_usa = (movie_points[movie_id["The Ring"], :]
                         - movie_points[movie_id["Ringu"], :])
    print_closest(movie_points[movie_id[japanese_name], :] + from_japan_to_usa, n)
This function takes the name of a Japanese movie and moves from its point in the 40-dimensional space in exactly the same direction and distance as the step from Ringu to The Ring.
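The `print_closest` helper is not shown in this post. A minimal sketch of what such a nearest-neighbour lookup might look like follows, assuming Euclidean distance (the post does not say which metric is used). The function name `closest_titles` and the toy 2-D coordinates are made up for illustration:

```python
import numpy as np

def closest_titles(point, movie_points, titles, n=10):
    """Return the n titles whose embeddings are nearest to `point` (Euclidean)."""
    distances = np.linalg.norm(movie_points - point, axis=1)
    nearest = np.argsort(distances)[:n]
    return [titles[i] for i in nearest]

# Toy 2-D "embeddings", invented for illustration only.
titles = ["Ringu", "The Ring", "Yojimbo", "A Fistful of Dollars"]
movie_points = np.array([[0.0, 0.0], [1.0, 0.5], [4.0, 3.0], [5.1, 3.4]])
movie_id = {title: i for i, title in enumerate(titles)}

# Same idea as find_american_counterpart: add the Ringu -> The Ring offset.
offset = movie_points[movie_id["The Ring"]] - movie_points[movie_id["Ringu"]]
query = movie_points[movie_id["Yojimbo"]] + offset
print(closest_titles(query, movie_points, titles, n=1))
```

With these invented coordinates, the shifted point lands next to the toy "A Fistful of Dollars" point, which mirrors the vector arithmetic described above.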
Let’s try it on an old Kurosawa movie:
>>> find_american_counterpart("Yojimbo", 1)
A Fistful of Dollars
Bingo! 😀 I am a bit surprised that this worked! As a sanity check, it is good to print the movies that are closest to Yojimbo itself, to confirm that the function is not just returning Yojimbo's nearest neighbour.
>>> print_closest(movie_points[movie_id["Yojimbo"], :])
Yojimbo
Sanjuro
Throne of Blood
Hidden Fortress
The Third Man
The Big Sleep
Modern Times
Rashomon
Black Adder II
Ran
These are the real movie points and the real vector, but a lot of information is lost when going from 40 dimensions down to only two.
Let’s try the same vector with The Grudge:
>>> find_american_counterpart("Ju-on: The Grudge")
The Ring
Frailty
Identity
Saw
The Legend of Sleepy Hollow
Minority Report
Stir of Echoes
Fallen
The Grudge
Terminator 3: Rise of the Machines
This did not work as well (9th from the top). But at least The Grudge is not particularly close to Ju-on: The Grudge, which means the counterpart function did some useful work.
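One way to make this comparison less eyeball-based is to compute the rank of the expected counterpart before and after applying the offset. The sketch below does that on invented 2-D points; the helper `rank_of`, the coordinates, and the filler title are all assumptions, and Euclidean distance is assumed as the metric:

```python
import numpy as np

def rank_of(target_title, point, movie_points, titles):
    """1-based position of `target_title` when movies are sorted by distance to `point`."""
    distances = np.linalg.norm(movie_points - point, axis=1)
    order = np.argsort(distances)
    return int(np.where(order == titles.index(target_title))[0][0]) + 1

# Invented 2-D points, for illustration only.
titles = ["Ringu", "The Ring", "Ju-on: The Grudge", "The Grudge", "Some other movie"]
movie_points = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 3.0], [2.2, 3.5], [0.5, 3.2]])

japanese = movie_points[titles.index("Ju-on: The Grudge")]
offset = movie_points[titles.index("The Ring")] - movie_points[titles.index("Ringu")]

rank_before = rank_of("The Grudge", japanese, movie_points, titles)
rank_after = rank_of("The Grudge", japanese + offset, movie_points, titles)
```

If the rank after the shift is clearly better than the rank before it, the offset vector is contributing something beyond plain proximity.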
For the interested, here is the Keras code generating the neural network:
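from keras.layers import Input, Embedding, Concatenate, Flatten, Dense
from keras.models import Model

# num_movies and num_users are assumed to be defined from the preprocessed data.
movie = Input(shape=(1,), dtype='int32', name='Movie')
user = Input(shape=(1,), dtype='int32', name='User')

# One 40-dimensional embedding per movie and per user.
movie_emb = Embedding(num_movies, 40, name='MovieEmbedding')(movie)
user_emb = Embedding(num_users, 40, name='UserEmbedding')(user)

# Concatenate both embeddings and feed them through three dense layers.
merged = Concatenate()([movie_emb, user_emb])
x = Flatten()(merged)
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(64, activation='relu')(x)
output = Dense(1)(x)  # predicted rating

model = Model(inputs=[movie, user], outputs=output)
model.compile(loss='mean_squared_error', optimizer='adadelta')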