Note: Any FAQs or clarifications relevant to the assignment will be posted here. This post will be continually updated (with newer updates at the bottom of the page), so make sure to check on it regularly—click the “Watch” button in the top-right corner to receive notifications about this thread. If you have a question which is not already addressed on this page, create a new thread to post your question on Ed.
Over the last few weeks, you’ve learned about the graph data structure, which we use to represent network data, i.e., data consisting of various entities and the relationships between them. As you might imagine, this makes graphs one of the most useful and flexible data structures in computer science! In this assignment, you’ll explore one use of graphs to model a large dataset of book reviews, and then perform some analysis and computations on this graph to create a book recommendation application.
This assignment is divided into two parts. In Part 1, you’ll familiarize yourself with this assignment’s datasets, represent the data using graphs, and use these graphs to generate book recommendations. Then in Part 2, you’ll learn about an augmented form of graphs that allows us to store a richer set of review data, and use it to make recommendations and predictions of how users might rate books they have not read yet.
Some pieces of advice for this assignment:
Read the honour_code.txt file included in the starter files on MarkUs (more below), which gives an overview of common mistakes students make with regard to academic integrity.

To obtain the starter files for this assignment:

- Download the starter files from MarkUs into your csc111/assignments folder.
- Extract them to obtain an a3 folder that contains the assignment’s starter files. This should look similar to what you had for previous assignments.

This assignment uses the following additional Python libraries:

- networkx
- numpy
- scipy
- kaleido (optional, used to save Plotly graphs as image files)

To install these libraries, you can follow the same steps as manually installing any Python library from PyCharm:
This assignment contains a mixture of both written and programming
questions. All of your written work should be completed in the
a3.tex starter file, using the LaTeX typesetting language.
We recommend using the online platform Overleaf for LaTeX. Overleaf also
provides many tutorials
to help you get started with LaTeX. See the bottom of this page for Submission instructions.
With Amazon being as big as it is now, it’s easy to forget that it started as a humble online book seller. In this assignment, we revisit Amazon’s initial focus by looking at a dataset containing tens of thousands of Amazon user reviews of books.
There are two main entities for our domain, users and books. Books and users can be related by a review. Even though Amazon reviews have lots of data associated with them, we just focus on the numerical score given by the user, which is a positive integer rating from 1 to 5, inclusive.
Our dataset for this assignment consists of two different types of CSV files.
The first type is a CSV file representing book data. Each row has two entries: a unique identifier and the book’s title. Each book title is also unique (i.e., no two rows have the same title).
The second type is a CSV file representing book review data. Each row has three entries: a unique identifier of the user who made the review, a unique identifier of the book that was reviewed (this is the same id as in the book CSV file described in the previous point), and an integer representing the score that the user gave the book.
Each user has reviewed each book at most once (i.e., no two rows have the same user ID and book ID).
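To make these two formats concrete, here is a minimal sketch of reading each file type with Python’s csv module. The helper names (read_books, read_reviews) are our own and are not part of the starter code:

```python
import csv


def read_books(book_names_file: str) -> dict[str, str]:
    """Return a mapping from book id to book title.

    Each row of the book data file has two entries: a unique book id
    and the (also unique) book title.
    """
    with open(book_names_file) as f:
        return {row[0]: row[1] for row in csv.reader(f)}


def read_reviews(reviews_file: str) -> list[tuple[str, str, int]]:
    """Return a list of (user id, book id, score) triples.

    Each row of the review data file has three entries: the reviewing
    user's id, the reviewed book's id, and an integer score from 1 to 5.
    """
    with open(reviews_file) as f:
        return [(row[0], row[1], int(row[2])) for row in csv.reader(f)]
```

Note that csv.reader (rather than naive string splitting) correctly handles book titles that contain commas, since those appear quoted in the CSV files.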
In the starter files, we’ve given you a few different sample dataset files in these formats. As usual, we recommend starting with the small ones (and even making up your own) rather than jumping straight to the largest dataset, which is the hardest to debug!
Our first goal for this assignment is to get you comfortable working with the dataset and representing it as a graph in Python.
In a3_part1.py, we’ve provided a modified version of
the Graph and _Vertex classes from in lecture.
First review these two classes, and then complete the function
load_review_graph contained at the bottom of the file,
according to its docstring.
Note that this function may be fairly complex, and you’re encouraged to use helper functions to simplify your implementation.1
Also, your function implementation must have a running
time of \(\Theta(m + n)\), where \(m\) is the number of lines in the
reviews_file file and \(n\) is the number of lines in the
book_names_file file. If your implementation has a spread
of running times, its worst-case running time must be \(\Theta(m + n)\).
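For intuition about the required running time, here is a sketch (not the starter implementation) of a two-pass loading algorithm: one pass over the book names file (\(n\) rows) and one pass over the reviews file (\(m\) rows), with constant work per row, giving \(\Theta(m + n)\) overall. The _SketchGraph class below is a stand-in we made up so the sketch runs on its own; in your solution, use the Graph class from a3_part1.py and check its actual method signatures:

```python
import csv


class _SketchGraph:
    """Minimal stand-in for the starter code's Graph class, used only so
    this sketch is self-contained. Do not use this in your submission."""

    def __init__(self) -> None:
        self.vertices: dict = {}       # item -> kind ('user' or 'book')
        self.edges: set = set()        # each edge stored as a frozenset of two items

    def add_vertex(self, item, kind: str) -> None:
        self.vertices[item] = kind

    def add_edge(self, item1, item2) -> None:
        self.edges.add(frozenset({item1, item2}))


def load_review_graph_sketch(reviews_file: str, book_names_file: str) -> _SketchGraph:
    """Sketch of a Theta(m + n) loading algorithm."""
    graph = _SketchGraph()
    # Pass 1 (n rows): map each book id to its title and add a vertex per book.
    id_to_title = {}
    with open(book_names_file) as f:
        for book_id, title in csv.reader(f):
            id_to_title[book_id] = title
            graph.add_vertex(title, 'book')
    # Pass 2 (m rows): add a user vertex (re-adding is harmless here)
    # and a review edge. Part 1's graph ignores the score entirely.
    with open(reviews_file) as f:
        for user_id, book_id, _score in csv.reader(f):
            graph.add_vertex(user_id, 'user')
            graph.add_edge(user_id, id_to_title[book_id])
    return graph
```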
After you’ve finished this part, you can try visualizing some of the
datasets using the visualize_graph function from
a3_visualization.py. For example:
>>> my_graph = load_review_graph('data/reviews_large.csv', 'data/book_names.csv')
>>> from a3_visualization import visualize_graph
>>> visualize_graph(my_graph)
Tip: you can click and drag a region of the graph visualization to zoom in on that region.
In a3.tex, analyze the running time of your
load_review_graph implementation, in terms of the \(m\) and \(n\) variables described in Question 1. To
do so, you need to analyze the running time of all helper functions you
write, as well as any Graph/_Vertex methods
you call (e.g., Graph.__init__ and
Graph.add_vertex).
Python running time notes:

- The open and csv.reader functions both take constant time. (It’s the looping through each line of the file that takes non-constant time.)

One of the most fundamental questions we have when given entities in a graph like this is how similar two vertices are, which intuitively means how similar their sets of neighbours are. Formally, let \(G = (V, E)\) be a graph and \(v_1, v_2 \in V\). We can define the similarity score between \(v_1\) and \(v_2\) as follows:
\[\begin{align*} sim(v_1, v_2) = \begin{cases} 0, & \text{if $d(v_1) = 0$ or $d(v_2) = 0$ } \\ \frac{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ AND $v_2$} \} \big|}{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ OR $v_2$} \} \big|}, & \text{otherwise} \end{cases} \end{align*}\]
Implement the method _Vertex.similarity_score, which
computes the similarity score between two vertices. Hint: you
can implement this formula in terms of set operations.
Then, implement Graph.similarity_score, which should
call that method.
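Since each vertex typically stores its neighbours as a set, the formula above translates almost directly into Python’s set operations. A minimal sketch, using plain sets in place of the starter’s _Vertex objects:

```python
def similarity_score(neighbours1: set, neighbours2: set) -> float:
    """Return the similarity score given two vertices' neighbour sets.

    Mirrors the formula above: |intersection| / |union|, with a score
    of 0 when either vertex has degree 0 (i.e., an empty neighbour set).
    """
    if not neighbours1 or not neighbours2:
        return 0.0
    return len(neighbours1 & neighbours2) / len(neighbours1 | neighbours2)
```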
Using this similarity score, we can now define a simple
recommendation system for our dataset. Your final task in this
section is to implement the method Graph.recommend_books,
which takes a book title and returns a list of recommendations by
choosing the books that are most similar to the given book.
Note: to test this function, don’t just use the large dataset! Use a small dataset and manually calculate similarity scores between different books, and then make sure that the recommendations are returned in the correct order.
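To illustrate the ordering logic involved, here is a sketch of the “pick the top scores” step, operating on a plain dict of precomputed similarity scores rather than on the graph itself. The tie-breaking rule shown (alphabetical by title) is purely an illustrative assumption; follow whatever order the starter docstring actually specifies:

```python
def top_recommendations(scores: dict[str, float], limit: int) -> list[str]:
    """Return up to `limit` book titles with the highest scores,
    in descending order of similarity score.

    Books with a score of 0 are excluded. Ties are broken alphabetically
    by title here, only as an example convention.
    """
    # Pair each candidate as (score, title) so we can sort by score first.
    candidates = [(score, title) for title, score in scores.items() if score > 0]
    # Sort by descending score, then ascending title for ties.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _score, title in candidates[:limit]]
```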
After you’ve completed this part, try picking a book from
book_names.csv and seeing what books
Graph.recommend_books suggests. Hopefully you’ll see
some recommendations that make intuitive sense, although we don’t have a
formal definition of what a “good recommendation” really means!
>>> my_graph = load_review_graph('data/reviews_full.csv', 'data/book_names.csv')
>>> title = "Harry Potter and the Sorcerer's Stone (Book 1)"
>>> import pprint
>>> pprint.pprint(my_graph.recommend_books(title, 10))
['The Casual Vacancy',
'Harry Potter and the Chamber of Secrets',
'Harry Potter and the Prisoner of Azkaban',
'Harry Potter and the Chamber of Secrets, Book 2',
'Harry Potter and the Deathly Hallows, Book 7',
'Harry Potter And The Goblet Of Fire',
'Harry Potter And The Order Of The Phoenix',
'Harry Potter and the Half-Blood Prince (Book 6)',
"The Cuckoo's Calling (Cormoran Strike)",
'Fellowship of the Ring (Lord of the Rings Part 1)']

The main limitation of our graph representation in Part 1 is that it
didn’t store the actual review scores. For our recommendation algorithm
Graph.recommend_books, this meant that the recommended
books just had to share similar sets of reviewers, but those reviewers
might have given very different scores. In Part 2, we study how we can
augment graphs so that every edge in the graph stores
additional information about the relationship between two
vertices.
Definition. A weighted graph is a tuple of two sets \(G = (V, E)\), where:

- \(V\) is a set of vertices, and
- \(E\) is a set of edges, where each edge is a set of two distinct vertices together with a number called that edge’s weight.2
We also define the weight function of a graph, \(w: V \times V \to \mathbb{R}^{\geq 0}\), as follows:
\[ w(v_1, v_2) = \begin{cases} \text{the weight of the edge between $v_1$ and $v_2$,} & \text{if $v_1$ and $v_2$ are adjacent} \\ 0, & \text{if $v_1$ and $v_2$ are not adjacent} \end{cases} \]
We call the previous graphs we’ve studied unweighted graphs. We can also view unweighted graphs as a special case of weighted graphs where all edge weights equal 1. All of our previous graph terminology (e.g., adjacent, path, connected) applies to weighted graphs as well.
Now let us use a weighted graph to represent our book review dataset. Each vertex represents a user or a book, and an edge between a user and book represents a review, same as before. But now, each edge’s weight equals the score given for the review.
In a3_part2_recommendations.py, review the
WeightedGraph and _WeightedVertex classes. As
in Part 1, some methods are provided for you, and others you’ll need to
implement throughout this part.
Before implementing any of those methods, complete the function
load_weighted_review_graph at the bottom of the file, which
takes in the same datasets as before, but now returns a weighted graph
according to the description provided above.
Next, we incorporate these weights into our recommendation
system. There are two different similarity score methods in the
_WeightedVertex class for you to implement, according to
the following formulas:
_WeightedVertex.similarity_score_unweighted: this is
the exact same as in Part 1.
\[\begin{align*} sim(v_1, v_2) = \begin{cases} 0, & \text{if $d(v_1) = 0$ or $d(v_2) = 0$ } \\ \frac{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ AND $v_2$} \} \big|}{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ OR $v_2$} \} \big|}, & \text{otherwise} \end{cases} \end{align*}\]
_WeightedVertex.similarity_score_strict: same as
Part 1, except the numerator only counts common neighbours that have the
same weight on the corresponding edges. When comparing two books, this
means we only count the common users that gave the books the exact same
review score.
\[\begin{align*} sim(v_1, v_2) = \begin{cases} 0, & \text{if $d(v_1) = 0$ or $d(v_2) = 0$ } \\ \frac{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ AND $v_2$, and $w(u, v_1) = w(u, v_2)$} \} \big|}{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ OR $v_2$} \} \big|}, & \text{otherwise} \end{cases} \end{align*}\]
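If edge weights are stored in a dict mapping each neighbour to its weight, the strict score differs from the unweighted one only in its numerator. Here is a hedged sketch operating directly on two such weight dicts (this is not the starter’s method signature):

```python
def similarity_score_strict(n1: dict, n2: dict) -> float:
    """Return the strict similarity score given two vertices' weight dicts.

    n1 and n2 map each neighbouring vertex to the weight of the
    connecting edge. The numerator counts only common neighbours whose
    edge weights agree; the denominator is the size of the union of the
    two neighbourhoods, exactly as in the unweighted score.
    """
    if not n1 or not n2:
        return 0.0
    # Common neighbours u with w(u, v1) == w(u, v2).
    matching = sum(1 for u in n1 if u in n2 and n1[u] == n2[u])
    return matching / len(n1.keys() | n2.keys())
```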
Your tasks:

- Implement the two _WeightedVertex similarity score methods.
- Implement WeightedGraph.get_similarity_score; note that there’s an additional parameter that allows the user to specify which of the two similarity scores to use.
- Implement WeightedGraph.recommend_books to make book recommendations, again providing a choice of which similarity score to use. You’ll notice that these recommendations are quite similar to the ones from Part 1, at least for "Harry Potter and the Sorcerer's Stone (Book 1)". This is normal, so keep going!

So far, our recommendations have been based entirely on book similarities. But think about how you might recommend a book or TV show to a friend: you don’t just look for books that are similar, you try to take into account your friend’s interests, likes, and dislikes. So the next step beyond making book recommendations is making predictions about what review score a user would give a particular book.
Open a3_part2_predictions.py and review the
ReviewScorePredictor abstract class and example subclasses
that we’ve implemented for you. Your task is to implement the remaining
subclass according to its specification.
When we studied recommendation algorithms in Part 1 and the beginning of this part, we left it up to our intuition to evaluate the quality of these recommendations. Does the unweighted vs. strict similarity score give “better” recommendations? You probably found it unsatisfying that we didn’t have quantitative ways of evaluating these algorithms.3 We can ask the same question of the review score prediction algorithms introduced in the previous question. Given a predicted score for a user and book, how do we know how good it is?
It turns out that this problem is easier to evaluate computationally.
We follow a common methodology used in machine learning: we take our
real dataset and divide it into two parts: a training dataset
and a test dataset. In our context, the training
dataset is simply the CSV files you’ve been using so far to create
your Graph/WeightedGraph instances. But you
are now ready to use our test dataset, which you can find at
data/test-data/reviews_test.csv.
Your final task in this part is in
a3_part2_predictions.py: implement the function
evaluate_predictor, which takes in a predictor instance and
runs it on a test dataset.
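As a sketch of what such an evaluation might compute, the function below runs a prediction function over every row of a test CSV and reports two simple metrics: the number of exactly correct predictions, and the average absolute error. For simplicity, it takes any callable (user_id, book_id) -> score rather than a ReviewScorePredictor instance as the real starter function does:

```python
import csv


def evaluate_predictor_sketch(predict, reviews_test_file: str) -> tuple[int, float]:
    """Return (number of exactly correct predictions, average absolute error)
    for the given prediction function on a test reviews CSV.

    Each row of the file is (user id, book id, actual score).
    """
    num_correct = 0
    total_error = 0
    num_rows = 0
    with open(reviews_test_file) as f:
        for user_id, book_id, score_str in csv.reader(f):
            actual = int(score_str)
            predicted = predict(user_id, book_id)
            num_correct += (predicted == actual)
            total_error += abs(predicted - actual)
            num_rows += 1
    return num_correct, total_error / num_rows
```

For example, the always-five baseline discussed below could be evaluated with evaluate_predictor_sketch(lambda user, book: 5, 'data/test-data/reviews_test.csv').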
After you complete this part, try running
evaluate_predictor on each of the subclasses found in
a3_part2_predictions.py, and with different
score_types for SimilarUserPredictor. You
should find something interesting: all of the prediction
algorithms get fewer reviews correct than simply always
predicting a 5! (5-star reviews are by far the most common review score
on Amazon.) But our weighted strict similarity score predictor has one
advantage: it has a lower average error than always predicting a score
of 5.
So does that make SimilarUserPredictor “better” than
FiveStarPredictor? Well, it depends on what we value more,
and what other metrics we might want to take into account. But at least
we now have quantitative measures that we can compare!
Please proofread and test your work carefully before your final submission!
Log in to MarkUs.
Go to Assignment 3, and the “Submissions” tab.
Submit the following files: a3.tex,
a3.pdf, a3_part1.py,
a3_part2_predictions.py,
a3_part2_recommendations.py, and
honour_code.txt.
Please note that MarkUs is picky with filenames, and so your filenames must match these exactly, including using lowercase letters.
Note: for your Python code files, please make sure they run (in the Python console) before submitting them! Code submissions that do not run will receive a grade of zero for that part, which we do not want to happen to any of you, so please check your work carefully.
Refresh the page, and then download each file to make sure you submitted the right version.
Remember, you can submit your files multiple times before the due date. So you can aim to submit your work early, and if you find an error or a place to improve before the due date, you can still make your changes and resubmit your work.
After you’ve submitted your work, please give yourself a well-deserved pat on the back and go take a rest or do something fun or eat some fruit!
You may use work that you previously completed in Tutorial 7. ↩︎
In general, edge weights can be arbitrary real numbers, or more complex data types. For this assignment, our edge weights will represent review scores, and so will be positive integers between 1 and 5.↩︎
Note that evaluating recommendation quality is different from evaluating the correctness of the recommendation algorithm itself. All of the recommendation algorithms you’ve implemented can (and should!) be tested thoroughly so that they return the correct results on any given graph. What we’re talking about in this question is evaluating how “good” those recommendations are, assuming our algorithms all work properly. ↩︎