In the last decade the Internet has come to dominate how we consume information. News, entertainment, and even education are often a click away. If you know what you want, a few typed words can lead you to the webpage you seek. But what if your search is less concrete? What if you are looking for inspiring, new, undiscovered content to consume? You could certainly ask a friend, or you could ask a personal recommendation engine.
Personal recommendation engines suggest content that they believe you may like. Just like your friend would. They figure this out based on what you, or similar-minded people, have liked so far. Just like a friend would. The more you interact with them, the more successful their suggestions become. Just like a friend would. The big advantage these recommendation engines have over your friends is how much data they have access to, and their ability to crunch this Big Data to provide diverse recommendations. Using this principle, systems like Pandora, Flipboard, and Netflix try to tailor music, news, and movies to the preferences of the user.
So personal recommendation engines can be useful, but can they be critical to how users consume information on the web? Helping users discover new and interesting content is critical for user retention. In addition, a good understanding of your audience demographic has been at the core of business intelligence for a while now. Learning what your users need makes for a good product and a compelling service, because a good product does not demand that the user fit the model, but instead fits the user.
Netflix provides a good case study for understanding the significance, role, and performance of recommendation engines. As a movie rental company, Netflix’s main goal is to connect its subscribers with movies that they enjoy. To achieve this effectively, they need to infer what the subscribers may enjoy, and do this with a high level of accuracy. What their system learns not only impacts what they recommend to the user on the website, but also what content they license from studios and movie makers, and perhaps, more recently, what original programming they invest in. Success in learning these user preferences, and tapping into the pulse of their audience, can yield great results for the company, like getting over half a million viewers to binge-watch ‘House of Cards’ on the day of its premiere.
It was with these motivations that Netflix introduced the Netflix Prize, back in 2006, to improve their homegrown recommendation algorithm called Cinematch. They decided to crowdsource the quest by turning it into a million dollar competition. Designing a machine learning system, or any general algorithm, involves many working parts, and going through them one by one leads to the solution.
The goal here is to predict a rating on a 5-point scale, given a movie and a user combination. In order to arrive at this prediction, the solution can use information given to it beforehand: 100 million ratings from over 480 thousand randomly chosen, anonymous customers on nearly 18 thousand movie titles. This huge dataset of <user, movie, rating> tuples is known as the training set. Inspection of this dataset not only provides clues about how to design the algorithms, but also helps tune any parameters that factor into the prediction and decision making of the algorithm.
For the case of predicting movie ratings for users, here are some design considerations:
- Is a user’s rating based on what else (s)he has rated and (dis)likes? Or does it depend on how similar-minded users rated this movie? Or both?
- How exactly are two users considered similar-minded?
- For that matter, when are two movies considered similar? Is it genre, the cast, the director, the year it was made, or the country it hails from?
- When two users rate a movie 2 out of 5, do they dislike it the same amount?
Each recommendation engine, like any machine learning algorithm, has to inspect the real world domain to identify important aspects of the problem at hand. Figuring out which of these truly matter, and how to tune the system to include them in the prediction, is a process called “training”, and is done using the training set.
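To make two of the questions above concrete (when are two users similar-minded, and does a 2 out of 5 mean the same thing to everyone?), here is a minimal sketch in Python. It is only one of many possible answers, not the method used by Cinematch or by any Netflix Prize team. The toy ratings and the function names `mean_centered` and `similarity` are invented for illustration.

```python
from math import sqrt

# Toy ratings, invented for illustration: user -> {movie: rating on a 1-5 scale}
ratings = {
    "alice": {"A": 5, "B": 4, "C": 1},
    "bob":   {"A": 3, "B": 2, "C": 1, "D": 2},
    "carol": {"A": 1, "B": 2, "C": 5},
}

def mean_centered(user_ratings):
    """Subtract the user's own average rating from each of their ratings.

    After centering, a 2 from a harsh rater (whose average is 2) becomes 0,
    while a 2 from a generous rater (whose average is 4) becomes -2.
    """
    mean = sum(user_ratings.values()) / len(user_ratings)
    return {movie: r - mean for movie, r in user_ratings.items()}

def similarity(u, v):
    """Cosine similarity of mean-centered ratings over movies both users rated."""
    cu, cv = mean_centered(ratings[u]), mean_centered(ratings[v])
    common = set(cu) & set(cv)
    if not common:
        return 0.0  # no shared movies, so no evidence either way
    dot = sum(cu[m] * cv[m] for m in common)
    norm_u = sqrt(sum(cu[m] ** 2 for m in common))
    norm_v = sqrt(sum(cv[m] ** 2 for m in common))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a user who rates everything the same carries no signal
    return dot / (norm_u * norm_v)

print(similarity("alice", "bob"))    # positive: similar tastes, different scales
print(similarity("alice", "carol"))  # negative: opposite tastes
```

Here alice and bob come out as similar even though bob rates everything lower, because each rating is compared against that user's own average. Whether this is the right notion of similarity is exactly the kind of design decision that training and evaluation have to settle.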
Once an algorithm is designed and tuned, how does one decide how good it is? To do this, the solution must be evaluated, and its accuracy measured.
- The first thing that is required is a performance metric, which measures how well the algorithm does at achieving its goal. This can be done in many ways using information criteria and prediction errors. The Netflix Prize made use of one of the most common metrics, the root mean squared error (RMSE): the square root of the average squared difference between the predicted and the actual ratings.
- The training set is just a glimpse into the kind of data the algorithm has to predict for. For example, the Netflix algorithm must be able to predict ratings for all future subscribers, and for all movies released henceforth. This means we want the prediction error to be low for users and movies beyond the ones featured in the training set. Hence, measuring the error on the training data is not sufficient. To truly quantify the performance of the algorithm, it must be tested on a dataset different from the training set, one that represents the diversity of the real world problem. The Netflix Prize evaluated submitted algorithms on two such sets, called the quiz set and the test set.
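The two points above, an error metric and a held-out dataset, can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the ratings are invented, the random split is my own simplification and not how Netflix constructed its quiz and test sets, and the "algorithm" is just a baseline that predicts the average training rating for everything.

```python
import random
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return sqrt(sum(squared_errors) / len(squared_errors))

def split(ratings, held_out_fraction=0.2, seed=0):
    """Shuffle the ratings and set a fraction aside for evaluation only."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, round(len(shuffled) * held_out_fraction))
    return shuffled[:-n_held_out], shuffled[-n_held_out:]

# Invented <user, movie, rating> tuples, for illustration only.
ratings = [
    ("u1", "m1", 4), ("u1", "m2", 5), ("u2", "m1", 2), ("u2", "m3", 3),
    ("u3", "m2", 4), ("u3", "m3", 1), ("u4", "m1", 5), ("u4", "m2", 3),
    ("u5", "m3", 2), ("u5", "m1", 4),
]

train, held_out = split(ratings)

# Baseline "model": always predict the mean rating seen during training.
global_mean = sum(r for _, _, r in train) / len(train)
predictions = [global_mean for _ in held_out]

print("held-out RMSE:", rmse(predictions, [r for _, _, r in held_out]))
```

The important detail is that `global_mean` is computed from `train` alone, while the error is measured on `held_out`. Any real algorithm would replace the baseline, but the evaluation loop would keep this shape.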
From the outset, the competition was set up to last multiple years (up to 2011), because the task was considered to be an uphill one. While the ultimate goal was a 10% improvement over their existing 2006 algorithm’s performance, a rolling system also rewarded submissions which improved over the previous year’s best algorithm. The competition lasted three years, and received thousands of team registrations and algorithm submissions.
Finally, on September 21, 2009, the Grand Prize was awarded to team BellKor’s Pragmatic Chaos. Their solution involved insights and improvements in various aspects of collaborative filtering.
Low prediction error is the most direct measure of the success of a machine learning algorithm. But for the accuracy to pay off, there are other considerations that must be carefully designed.
- Scalability: Space and time complexity of inference algorithms are critical in real-time systems like recommendation systems. A great suggestion is less useful if it takes several minutes or hours to compute. Similarly, if the performance suffers as more customers subscribe to Netflix, the user experience diminishes. Scalable design is an integral part of good recommendation engines.
- Exploration versus exploitation: One sure way of pleasing a user is to stay within their comfort zone. While this approach may provide accurate prediction of user preferences, it risks boring the user. Recommendation systems have to strike a balance between suggesting the more obvious choices (exploiting what is known to work), and introducing the user to new areas of content (exploring the unknown). Walking this fine line is no easy task, and is an area of research, in and of itself.
- Privacy: Personal information can allow recommendation engines to provide customized and personalized help to users. But it is also a treasure trove of information for malicious entities. To protect their audience from such anti-social forces, these systems have to abide by the highest levels of privacy and use the best security measures to protect and anonymize user information.
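The exploration versus exploitation trade-off mentioned above has a well-known simple strategy called epsilon-greedy: most of the time recommend the title with the highest predicted rating, and occasionally recommend something at random. The sketch below is illustrative only; I am not claiming Netflix uses it, and the function name `recommend`, the `epsilon` parameter, and the sample titles are all hypothetical.

```python
import random

def recommend(predicted_ratings, epsilon=0.1, rng=random):
    """Pick one title from a dict of title -> predicted rating.

    With probability epsilon, explore: pick uniformly at random among all
    titles (which may happen to include the top-rated one).
    Otherwise, exploit: pick the title with the highest predicted rating.
    """
    if not predicted_ratings:
        raise ValueError("nothing to recommend")
    if rng.random() < epsilon:
        return rng.choice(list(predicted_ratings))
    return max(predicted_ratings, key=predicted_ratings.get)

# Hypothetical predicted ratings for one user.
predictions = {"Comfort Zone Comedy": 4.6, "Obscure Documentary": 3.1, "Foreign Thriller": 3.8}

print(recommend(predictions, epsilon=0.0))  # always the safe, top-rated choice
print(recommend(predictions, epsilon=0.3))  # sometimes something unexpected
```

A larger `epsilon` means more surprises and more opportunities to learn about the user's wider tastes, at the cost of more misses. Choosing that value, or replacing the uniform random pick with something smarter, is part of the research problem described above.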
While the Netflix Prize focused on movies, it is representative of the problem structure, design considerations, and performance metrics of other recommendation engines, and of machine learning algorithms at large. Good solutions involve breaking the problem down into its working parts, and building it back up into a solution that scales. I will be investigating other such problems in machine learning, computer vision, and robotics along these lines in my future posts.