STRIDE project — Marina Cheng

This project applies recommender systems to education. By analyzing students’ test scores, the goal is to help schools match students with the teachers best suited to their academic needs. Recommender systems are software tools that predict user preferences. For example, Netflix uses one to predict which movies each user might like, based on that user’s ratings of previous movies.

Cinematch was Netflix’s original movie recommendation system. In October 2006, Netflix launched the Netflix Prize competition, offering a $1 million prize for a more effective recommendation system, and provided competitors with anonymized rating data. To qualify for the prize, a team had to predict actual user ratings at least 10% more accurately than Cinematch (www.netflixprize.com/rules). BellKor’s Pragmatic Chaos won the competition in September 2009. Their solution used matrix factorization models (Koren, Bell, Volinsky).

The BellKor’s Pragmatic Chaos team (the winners of the Netflix Prize competition): File:BellKor PragmaticChaos.png

source: http://www2.research.att.com/~volinsky/netflix/bpc.html, Getty Images

The leading contestants generally used collaborative filtering and matrix factorization. Collaborative filtering matches users who like similar movies and recommends new movies based on that match. Matrix factorization characterizes both movies and viewers by vectors of factors inferred from movie rating patterns. High correspondence between movie and viewer factors leads to a recommendation.
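
As a minimal sketch of that last idea, assume each movie and each viewer is described by a two-factor vector. All names and numbers below are invented for illustration and are not taken from Netflix data. The correspondence between a viewer and a movie can then be scored as the dot product of their vectors:

```python
import numpy as np

# Hypothetical two-factor vectors; the values are invented for illustration.
movie_factors = {
    "Movie A": np.array([0.9, -0.2]),
    "Movie B": np.array([-0.7, 0.8]),
    "Movie C": np.array([0.1, 0.1]),
}
viewer = np.array([1.0, -0.5])


def affinity(viewer_vec, movie_vec):
    """Dot product of viewer and movie factors: higher means a better match."""
    return float(np.dot(viewer_vec, movie_vec))


def recommend(viewer_vec, movies):
    """Return movie names sorted from best match to worst match."""
    return sorted(movies,
                  key=lambda name: affinity(viewer_vec, movies[name]),
                  reverse=True)


ranking = recommend(viewer, movie_factors)
```

The movie at the front of `ranking` is the one whose factors line up best with the viewer’s, so it would be recommended first.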

The matrix factorization graph with the five movies is a fictional example of the first two factors from a matrix decomposition of movie rating data. Each movie is placed according to its two-factor vector. Increasing the dimensionality of the vectors generally improves the accuracy of the predictions, provided the model is regularized so that it does not overfit the known ratings.

Figure: the matrix factorization graph (movie images source: imdb.com)

We use PyRSVD, a regularized singular value decomposition solver written in Python, and modified it to accept Campus School test score data instead of Netflix’s movie rating data.
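
To give a feel for what a regularized SVD solver does, here is a minimal sketch of the general technique: factor vectors fitted by stochastic gradient descent with an L2 penalty. This is not PyRSVD’s code or API. The function name `train_rsvd`, its parameters, and the toy observations are all assumptions made for illustration.

```python
import numpy as np


def train_rsvd(triples, n_users, n_items, n_factors=2, lr=0.01, reg=0.02,
               n_epochs=500, seed=0):
    """Fit user and item factors by stochastic gradient descent.

    triples: list of (user_index, item_index, value) observations.
    Returns (P, Q) so that the prediction for (u, i) is P[u].dot(Q[i]).
    """
    rng = np.random.RandomState(seed)
    P = 0.1 * rng.randn(n_users, n_factors)  # user (or student) factors
    Q = 0.1 * rng.randn(n_items, n_factors)  # item (or test) factors
    for _ in range(n_epochs):
        for u, i, r in triples:
            err = r - P[u].dot(Q[i])  # error on this one observation
            pu = P[u].copy()          # keep the old value for Q's update
            # Gradient step; the "reg" terms are the regularization penalty.
            P[u] += lr * (err * Q[i] - reg * pu)
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q


# Toy observations, invented for illustration: (user, item, value).
observations = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0),
                (1, 1, 1.0), (2, 0, 1.0), (2, 1, 5.0)]
P, Q = train_rsvd(observations, n_users=3, n_items=2)
predicted = P.dot(Q.T)  # predicted value for every (user, item) pair
```

Swapping movie ratings for test scores only changes what the triples mean: the same update rule applies to any table of (row, column, value) observations.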

We used data from the Campus School, a lab school for Smith’s education department. We looked at anonymized test scores for 313 students across 3 grade levels and 5 consecutive years, and analyzed the students’ total scores.

A picture of the Campus School: File:CampusSchool.png source: http://www.smith.edu/sccs/

Netflix used the root-mean-square error (RMSE) to measure the accuracy of predicted ratings, so we do the same for the Campus School data. When we run the Campus School data through the PyRSVD program, we get the error analysis shown in the Campus School graph. The RMSE is the square root of the average squared difference between the predicted and actual values, so lower values mean more accurate predictions. Train err is the error on the training data, for which the program already knows the desired output. Probe err is the error on the masked data, for which the program does not know the desired output.
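
As a small worked example of the two error measures, the sketch below computes RMSE separately over training entries and masked (probe) entries. The scores and the mask are invented for illustration and are not Campus School data.

```python
import numpy as np


def rmse(predicted, actual):
    """Square root of the mean squared difference between two arrays."""
    predicted = np.asarray(predicted, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((predicted - actual) ** 2)))


# Invented numbers for illustration only.
actual = np.array([80.0, 90.0, 70.0, 60.0, 85.0])
predicted = np.array([82.0, 87.0, 70.0, 66.0, 80.0])

# True marks entries that were masked (hidden) during training.
probe_mask = np.array([False, False, False, True, True])

train_err = rmse(predicted[~probe_mask], actual[~probe_mask])
probe_err = rmse(predicted[probe_mask], actual[probe_mask])
```

Here the training differences are 2, -3, and 0, so `train_err` is the square root of 13/3 (about 2.08). The probe differences are 6 and -5, so `probe_err` is the square root of 30.5 (about 5.52). A probe error larger than the training error is the usual pattern, since the model never saw the masked values.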

Figure: Campus School graph

Figure: PyRSVD Netflix graph

The root-mean-square error of the students’ predicted total scores is about 14.47. For test scores, an error this large is too high to be useful, which suggests that accurate predictions would require test scores from more students over a longer span of years.

Figure: RMSE calculation

Currently, we are trying to determine how much data is needed for accurate predictions and how much data can be masked.

Poster that I made about this research: File:Mcheng55 collaborations 040314.pdf


Important notes to run the program:

- I used PyRSVD-0.2.5

- Python 2.7, NumPy, Matplotlib, and Cython must be installed

- Test PyRSVD using the tutorial.py file that is included in the package

- I used the MovieLens data set of 1 million ratings from 6000 users on 4000 movies, from: http://grouplens.org/datasets/movielens/
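
For reference, the ratings file in the MovieLens 1 million data set stores one rating per line as UserID::MovieID::Rating::Timestamp. The sketch below shows one way such lines could be parsed into (user, movie, rating) triples. It is not PyRSVD’s own loader, and the function name `parse_ratings` is an assumption made for illustration.

```python
def parse_ratings(lines):
    """Parse 'UserID::MovieID::Rating::Timestamp' lines into triples."""
    triples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        user, movie, rating, _timestamp = line.split("::")
        triples.append((int(user), int(movie), float(rating)))
    return triples


# Sample lines in the MovieLens 1M format.
sample = ["1::1193::5::978300760", "1::661::3::978302109", ""]
ratings = parse_ratings(sample)
```

To read a real file, an open file object can be passed in place of `sample`, since iterating over a file yields its lines.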


Website essential for running PyRSVD:

PyRSVD, a regularized singular value decomposition solver for collaborative filtering written in Python: https://code.google.com/p/pyrsvd/


Sites that I referred to/used:

BellKor’s Pragmatic Chaos’ website (the winning team of the Netflix Prize Competition): http://www2.research.att.com/~volinsky/netflix/bpc.html

Netflix Prize Rules: http://www.netflixprize.com/rules

“Matrix Factorization Techniques for Recommender Systems” by Koren, Bell, and Volinsky (members of BellKor’s Pragmatic Chaos) https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf

Smith College Campus School website: http://www.smith.edu/sccs/

“Introduction to Recommender Systems Handbook” by Ricci, Rokach, and Shapira http://www.inf.unibz.it/~ricci/papers/intro-rec-sys-handbook.pdf

A quick intro about the Netflix Prize – Wikipedia http://en.wikipedia.org/wiki/Netflix_Prize

“Beyond Recommender Systems: Helping People Help Each Other” by Terveen and Hill http://files.grouplens.org/papers/rec-sys-overview.pdf


Other papers:

“Content-boosted Matrix Factorization for Recommender Systems: Experiments with Recipe Recommendation” by Forbes and Zhu

“Matrix Factorization and Neighbor Based Algorithms for the Netflix Prize Problem” by Takács, Pilászy, Németh, and Tikk

“Matrix factorization for the Netflix Prize” (2012)

“The BellKor 2008 Solution to the Netflix Prize” by Bell, Koren, and Volinsky

“The BigChaos Solution to the Netflix Prize 2008” by Töscher and Jahrer


Future work:

- See if PyRSVD can recreate synthetic data

- Find out how much Netflix (and Campus School) data is needed to have accurate enough predictions

- Find out how much Netflix (and Campus School) data can be masked

- Conduct tests with more data from other schools
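
As a starting point for the synthetic data and masking experiments listed above, the sketch below builds a table with a known low-rank structure and hides a chosen fraction of it as a probe set. The function names, table sizes, and the 20% probe fraction are all assumptions made for illustration. Feeding the result into PyRSVD is left out because it depends on that package’s own input format.

```python
import numpy as np


def make_synthetic(n_users, n_items, n_factors, seed=0):
    """Build a table that is exactly the product of two factor matrices."""
    rng = np.random.RandomState(seed)
    P = rng.randn(n_users, n_factors)
    Q = rng.randn(n_items, n_factors)
    return P.dot(Q.T)


def split_mask(shape, probe_fraction, seed=0):
    """Boolean mask: True marks entries to hide from training."""
    rng = np.random.RandomState(seed)
    return rng.rand(*shape) < probe_fraction


# Hypothetical sizes chosen for illustration.
data = make_synthetic(n_users=50, n_items=20, n_factors=3)
probe_mask = split_mask(data.shape, probe_fraction=0.2)

train_values = data[~probe_mask]  # visible to the solver
probe_values = data[probe_mask]   # held back to measure probe error
```

Because the true number of factors is known here, a solver that works well should reach a low probe error when given at least that many factors. Repeating the experiment with larger values of `probe_fraction` would show how much data can be masked before the predictions degrade.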


Go to: 2014/2015 - STRIDE project - Marina Cheng