Note: Any FAQs or clarifications relevant to the assignment will be posted here. This post will be continually updated (with newer updates at the bottom of the page), so make sure to check on it regularly—click the “Watch” button in the top-right corner to receive notifications about this thread. If you have a question which is not already addressed on this page, create a new thread to post your question on Ed.
Over the last few weeks, you’ve learned about the graph data structure, which we use to represent network data, i.e., data consisting of various entities and the relationships between them. As you might imagine, this makes graphs one of the most useful and flexible data structures in computer science! In this assignment, you’ll explore one use of graphs to model a large dataset of book reviews, and then perform some analysis and computations on this graph to create a book recommendation application.
This assignment is divided into two parts. In Part 1, you’ll familiarize yourself with this assignment’s datasets, represent the data using graphs, and use these graphs to generate book recommendations. Then in Part 2, you’ll learn about an augmented form of graphs that allows us to store a richer set of review data, and use it to make recommendations and predictions of how users might rate books they have not read yet.
Some pieces of advice for this assignment:
Read the honour_code.txt file included in the starter files on MarkUs (more below), which gives an overview of common mistakes students make with regard to academic integrity.

To obtain the starter files for this assignment:

- Download the starter files from MarkUs into your csc111/assignments folder.
- Extract them to obtain an a3 folder that contains the assignment’s starter files. This should look similar to what you had for previous assignments.

This assignment uses the following additional Python libraries:

- networkx
- numpy
- scipy
- kaleido (optional, used to save Plotly graphs as image files)

To install these libraries, you can follow the same steps as manually installing any Python library from PyCharm:
This assignment contains a mixture of both written and programming
questions. All of your written work should be completed in the
a3.tex starter file, using the LaTeX typesetting language.
We recommend using the online platform Overleaf for LaTeX. Overleaf also
provides many tutorials
to help you get started with LaTeX. See the bottom of this page for Submission instructions.
With Amazon being as big as it is now, it’s easy to forget that it started as a humble online book seller. In this assignment, we revisit Amazon’s initial focus by looking at a dataset containing tens of thousands of Amazon user reviews of books.
There are two main entities for our domain, users and books. Books and users can be related by a review. Even though Amazon reviews have lots of data associated with them, we just focus on the numerical score given by the user, which is a positive integer rating from 1 to 5, inclusive.
Our dataset for this assignment consists of two different types of CSV files.
The first type is a CSV file representing book data. Each row has two entries: a unique identifier and the book’s title. Each book title is also unique (i.e., no two rows have the same title).
The second type is a CSV file representing book review data. Each row has three entries: a unique identifier of the user who made the review, a unique identifier of the book that was reviewed (this is the same id as in the book CSV file described in the previous point), and an integer representing the score that the user gave the book.
Each user has reviewed each book at most once (i.e., no two rows have the same user ID and book ID).
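To make these two formats concrete, here is a minimal sketch of reading each file type with Python’s csv module. The helper names (read_books, read_reviews) are our own and are not part of the starter code:

```python
import csv


def read_books(book_names_file: str) -> dict[str, str]:
    """Return a mapping from book id to book title.

    Each row of the book data file has two entries: a unique book id
    and the (also unique) book title.
    """
    with open(book_names_file) as f:
        return {row[0]: row[1] for row in csv.reader(f)}


def read_reviews(reviews_file: str) -> list[tuple[str, str, int]]:
    """Return a list of (user id, book id, score) triples.

    Each row of the review data file has three entries: the reviewing
    user's id, the reviewed book's id, and an integer score from 1 to 5.
    """
    with open(reviews_file) as f:
        return [(row[0], row[1], int(row[2])) for row in csv.reader(f)]
```

Note that csv.reader (rather than naive string splitting) correctly handles book titles that contain commas, since those appear quoted in the CSV files.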
In the starter files, we’ve given you a few different sample dataset files in these formats. As usual, we recommend starting with the small ones (and even making up your own) rather than jumping straight to the largest dataset, which is the hardest to debug!
Our first goal for this assignment is to get you comfortable working with the dataset and representing it as a graph in Python.
In a3_part1.py, we’ve provided a modified version of
the Graph and _Vertex classes from in lecture.
First review these two classes, and then complete the function
load_review_graph contained at the bottom of the file,
according to its docstring.
Note that this function may be fairly complex, and you’re encouraged to use helper functions to simplify your implementation.1
Also, your function implementation must have a running
time of \(\Theta(m + n)\), where \(m\) is the number of lines in the
reviews_file file and \(n\) is the number of lines in the
book_names_file file. If your implementation has a spread
of running times, its worst-case running time must be \(\Theta(m + n)\).
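For intuition about the required running time, here is a sketch (not the starter implementation) of a two-pass loading algorithm: one pass over the book names file (\(n\) rows) and one pass over the reviews file (\(m\) rows), with constant work per row, giving \(\Theta(m + n)\) overall. The _SketchGraph class below is a stand-in we made up so the sketch runs on its own; in your solution, use the Graph class from a3_part1.py and check its actual method signatures:

```python
import csv


class _SketchGraph:
    """Minimal stand-in for the starter code's Graph class, used only so
    this sketch is self-contained. Do not use this in your submission."""

    def __init__(self) -> None:
        self.vertices: dict = {}       # item -> kind ('user' or 'book')
        self.edges: set = set()        # each edge stored as a frozenset of two items

    def add_vertex(self, item, kind: str) -> None:
        self.vertices[item] = kind

    def add_edge(self, item1, item2) -> None:
        self.edges.add(frozenset({item1, item2}))


def load_review_graph_sketch(reviews_file: str, book_names_file: str) -> _SketchGraph:
    """Sketch of a Theta(m + n) loading algorithm."""
    graph = _SketchGraph()
    # Pass 1 (n rows): map each book id to its title and add a vertex per book.
    id_to_title = {}
    with open(book_names_file) as f:
        for book_id, title in csv.reader(f):
            id_to_title[book_id] = title
            graph.add_vertex(title, 'book')
    # Pass 2 (m rows): add a user vertex (re-adding is harmless here)
    # and a review edge. Part 1's graph ignores the score entirely.
    with open(reviews_file) as f:
        for user_id, book_id, _score in csv.reader(f):
            graph.add_vertex(user_id, 'user')
            graph.add_edge(user_id, id_to_title[book_id])
    return graph
```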
After you’ve finished this part, you can try visualizing some of the
datasets using the visualize_graph function from
a3_visualization.py. For example:
>>> my_graph = load_review_graph('data/reviews_large.csv', 'data/book_names.csv')
>>> from a3_visualization import visualize_graph
>>> visualize_graph(my_graph)
Tip: you can click and drag a region of the graph visualization to zoom in on that region.
In a3.tex, analyze the running time of your
load_review_graph implementation, in terms of the \(m\) and \(n\) variables described in Question 1. To
do so, you need to analyze the running time of all helper functions you
write, as well as any Graph/_Vertex methods
you call (e.g., Graph.__init__ and
Graph.add_vertex).
Python running time notes:

- The open and csv.reader functions both take constant time. (It’s the looping through each line of the file that takes non-constant time.)

One of the most fundamental questions we have when given entities in a graph like this is how similar two vertices are, which intuitively means how similar their sets of neighbours are. Formally, let \(G = (V, E)\) be a graph and \(v_1, v_2 \in V\). We can define the similarity score between \(v_1\) and \(v_2\) as follows:
\[\begin{align*} sim(v_1, v_2) = \begin{cases} 0, & \text{if $d(v_1) = 0$ or $d(v_2) = 0$ } \\ \frac{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ AND $v_2$} \} \big|}{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ OR $v_2$} \} \big|}, & \text{otherwise} \end{cases} \end{align*}\]
Implement the method _Vertex.similarity_score, which
computes the similarity score between two vertices. Hint: you
can implement this formula in terms of set operations.
Then, implement Graph.similarity_score, which should
call that method.
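Since each vertex typically stores its neighbours as a set, the formula above translates almost directly into Python’s set operations. A minimal sketch, using plain sets in place of the starter’s _Vertex objects:

```python
def similarity_score(neighbours1: set, neighbours2: set) -> float:
    """Return the similarity score given two vertices' neighbour sets.

    Mirrors the formula above: |intersection| / |union|, with a score
    of 0 when either vertex has degree 0 (i.e., an empty neighbour set).
    """
    if not neighbours1 or not neighbours2:
        return 0.0
    return len(neighbours1 & neighbours2) / len(neighbours1 | neighbours2)
```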
Using this similarity score, we can now define a simple
recommendation system for our dataset. Your final task in this
section is to implement the method Graph.recommend_books,
which takes a book title and returns a list of recommendations by
choosing the books that are most similar to the given book.
Note: to test this function, don’t just use the large dataset! Use a small dataset and manually calculate similarity scores between different books, and then make sure that the recommendations are returned in the correct order.
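To illustrate the ordering logic involved, here is a sketch of the “pick the top scores” step, operating on a plain dict of precomputed similarity scores rather than on the graph itself. The tie-breaking rule shown (alphabetical by title) is purely an illustrative assumption; follow whatever order the starter docstring actually specifies:

```python
def top_recommendations(scores: dict[str, float], limit: int) -> list[str]:
    """Return up to `limit` book titles with the highest scores,
    in descending order of similarity score.

    Books with a score of 0 are excluded. Ties are broken alphabetically
    by title here, only as an example convention.
    """
    # Pair each candidate as (score, title) so we can sort by score first.
    candidates = [(score, title) for title, score in scores.items() if score > 0]
    # Sort by descending score, then ascending title for ties.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _score, title in candidates[:limit]]
```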
After you’ve completed this part, try picking a book from
book_names.csv and seeing what books
Graph.recommend_books suggests. Hopefully you’ll see
some recommendations that make intuitive sense, although we don’t have a
formal definition of what a “good recommendation” really means!
>>> my_graph = load_review_graph('data/reviews_full.csv', 'data/book_names.csv')
>>> title = "Harry Potter and the Sorcerer's Stone (Book 1)"
>>> import pprint
>>> pprint.pprint(my_graph.recommend_books(title, 10))
['The Casual Vacancy',
'Harry Potter and the Chamber of Secrets',
'Harry Potter and the Prisoner of Azkaban',
'Harry Potter and the Chamber of Secrets, Book 2',
'Harry Potter and the Deathly Hallows, Book 7',
'Harry Potter And The Goblet Of Fire',
'Harry Potter And The Order Of The Phoenix',
'Harry Potter and the Half-Blood Prince (Book 6)',
"The Cuckoo's Calling (Cormoran Strike)",
'Fellowship of the Ring (Lord of the Rings Part 1)']

The main limitation of our graph representation in Part 1 is that it
didn’t store the actual review scores. For our recommendation algorithm
Graph.recommend_books, this meant that the recommended
books just had to share similar sets of reviewers, but those reviewers
might have given very different scores. In Part 2, we study how we can
augment graphs so that every edge in the graph stores
additional information about the relationship between two
vertices.
Definition. A weighted graph is a tuple of two sets \(G = (V, E)\), where:

- \(V\) is a set of vertices, and
- \(E\) is a set of edges, where each edge is a set of two distinct vertices together with a number called that edge’s weight.2
We also define the weight function of a graph, \(w: V \times V \to \mathbb{R}^{\geq 0}\), as follows:
\[ w(v_1, v_2) = \begin{cases} \text{the weight of the edge between $v_1$ and $v_2$,} & \text{if $v_1$ and $v_2$ are adjacent} \\ 0, & \text{if $v_1$ and $v_2$ are not adjacent} \end{cases} \]
We call the previous graphs we’ve studied unweighted graphs. We can also view unweighted graphs as a special case of weighted graphs where all edge weights equal 1. All of our previous graph terminology (e.g., adjacent, path, connected) applies to weighted graphs as well.
Now let us use a weighted graph to represent our book review dataset. Each vertex represents a user or a book, and an edge between a user and book represents a review, same as before. But now, each edge’s weight equals the score given for the review.
In a3_part2_recommendations.py, review the
WeightedGraph and _WeightedVertex classes. As
in Part 1, some methods are provided for you, and others you’ll need to
implement throughout this part.
Before implementing any of those methods, complete the function
load_weighted_review_graph at the bottom of the file, which
takes in the same datasets as before, but now returns a weighted graph
according to the description provided above.
Next, we incorporate these weights into our recommendation
system. There are two different similarity score methods in the
_WeightedVertex class for you to implement, according to
the following formulas:
_WeightedVertex.similarity_score_unweighted: this is
the exact same as in Part 1.
\[\begin{align*} sim(v_1, v_2) = \begin{cases} 0, & \text{if $d(v_1) = 0$ or $d(v_2) = 0$ } \\ \frac{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ AND $v_2$} \} \big|}{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ OR $v_2$} \} \big|}, & \text{otherwise} \end{cases} \end{align*}\]
_WeightedVertex.similarity_score_strict: same as
Part 1, except the numerator only counts common neighbours that have the
same weight on the corresponding edges. When comparing two books, this
means we only count the common users that gave the books the exact same
review score.
\[\begin{align*} sim(v_1, v_2) = \begin{cases} 0, & \text{if $d(v_1) = 0$ or $d(v_2) = 0$ } \\ \frac{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ AND $v_2$, and $w(u, v_1) = w(u, v_2)$} \} \big|}{\big| \{u ~\mid~ u \in V \text{ and $u$ is adjacent to $v_1$ OR $v_2$} \} \big|}, & \text{otherwise} \end{cases} \end{align*}\]
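If edge weights are stored in a dict mapping each neighbour to its weight, the strict score differs from the unweighted one only in its numerator. Here is a hedged sketch operating directly on two such weight dicts (this is not the starter’s method signature):

```python
def similarity_score_strict(n1: dict, n2: dict) -> float:
    """Return the strict similarity score given two vertices' weight dicts.

    n1 and n2 map each neighbouring vertex to the weight of the
    connecting edge. The numerator counts only common neighbours whose
    edge weights agree; the denominator is the size of the union of the
    two neighbourhoods, exactly as in the unweighted score.
    """
    if not n1 or not n2:
        return 0.0
    # Common neighbours u with w(u, v1) == w(u, v2).
    matching = sum(1 for u in n1 if u in n2 and n1[u] == n2[u])
    return matching / len(n1.keys() | n2.keys())
```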
Your tasks:

- Implement the two _WeightedVertex similarity score methods.
- Implement WeightedGraph.get_similarity_score; note that there’s an additional parameter that allows the user to specify which of the two similarity scores to use.
- Implement WeightedGraph.recommend_books to make book recommendations, again providing a choice of which similarity score to use. You’ll notice that these recommendations are quite similar to the ones from Part 1, at least for "Harry Potter and the Sorcerer's Stone (Book 1)". This is normal, so keep going!

So far, our recommendations have been based entirely on book similarities. But think about how you might recommend a book or TV show to a friend: you don’t just look for books that are similar, you try to take into account your friend’s interests, likes, and dislikes. So the next step beyond making book recommendations is making predictions about what review score a user would give a particular book.
Open a3_part2_predictions.py and review the
ReviewScorePredictor abstract class and example subclasses
that we’ve implemented for you. Your task is to implement the remaining
subclass according to its specification.
When we studied recommendation algorithms in Part 1 and the beginning of this part, we left it up to our intuition to evaluate the quality of these recommendations. Does the unweighted vs. strict similarity score give “better” recommendations? You probably found it unsatisfying that we didn’t have quantitative ways of evaluating these algorithms.3 We can ask the same question of the review score prediction algorithms introduced in the previous question. Given a predicted score for a user and book, how do we know how good it is?
It turns out that this problem is easier to evaluate computationally.
We follow a common methodology used in machine learning: we take our
real dataset and divide it into two parts: a training dataset
and a test dataset. In our context, the training
dataset is simply the CSV files you’ve been using so far to create
your Graph/WeightedGraph instances. But you
are now ready to use our test dataset, which you can find at
data/test-data/reviews_test.csv.
Your final task in this part is in
a3_part2_predictions.py: implement the function
evaluate_predictor, which takes in a predictor instance and
runs it on a test dataset.
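As a sketch of what such an evaluation might compute, the function below runs a prediction function over every row of a test CSV and reports two simple metrics: the number of exactly correct predictions, and the average absolute error. For simplicity, it takes any callable (user_id, book_id) -> score rather than a ReviewScorePredictor instance as the real starter function does:

```python
import csv


def evaluate_predictor_sketch(predict, reviews_test_file: str) -> tuple[int, float]:
    """Return (number of exactly correct predictions, average absolute error)
    for the given prediction function on a test reviews CSV.

    Each row of the file is (user id, book id, actual score).
    """
    num_correct = 0
    total_error = 0
    num_rows = 0
    with open(reviews_test_file) as f:
        for user_id, book_id, score_str in csv.reader(f):
            actual = int(score_str)
            predicted = predict(user_id, book_id)
            num_correct += (predicted == actual)
            total_error += abs(predicted - actual)
            num_rows += 1
    return num_correct, total_error / num_rows
```

For example, the always-five baseline discussed below could be evaluated with evaluate_predictor_sketch(lambda user, book: 5, 'data/test-data/reviews_test.csv').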
After you complete this part, try running
evaluate_predictor on each of the subclasses found in
a3_part2_predictions.py, and with different
score_types for SimilarUserPredictor. You
should find something interesting: all of the prediction
algorithms get fewer reviews correct than simply always
predicting a 5! (5-star reviews are by far the most common review score
on Amazon.) But our weighted strict similarity score predictor has one
advantage: it has a lower average error than always predicting a score
of 5.
So does that make SimilarUserPredictor “better” than
FiveStarPredictor? Well, it depends on what we value more,
and what other metrics we might want to take into account. But at least
we now have quantitative measures that we can compare!
Please proofread and test your work carefully before your final submission!
Log in to MarkUs.
Go to Assignment 3, and the “Submissions” tab.
Submit the following files: a3.tex,
a3.pdf, a3_part1.py,
a3_part2_predictions.py,
a3_part2_recommendations.py, and
honour_code.txt.
Please note that MarkUs is picky with filenames, and so your filenames must match these exactly, including using lowercase letters.
Note: for your Python code files, please make sure they run (in the Python console) before submitting them! Code submissions that do not run will receive a grade of zero for that part, which we do not want to happen to any of you, so please check your work carefully.
Refresh the page, and then download each file to make sure you submitted the right version.
Remember, you can submit your files multiple times before the due date. So you can aim to submit your work early, and if you find an error or a place to improve before the due date, you can still make your changes and resubmit your work.
After you’ve submitted your work, please give yourself a well-deserved pat on the back and go take a rest or do something fun or eat some fruit!
You may use work that you previously completed in Tutorial 7. ↩︎
In general, edge weights can be arbitrary real numbers, or more complex data types. For this assignment, our edge weights will represent review scores, and so will be positive integers between 1 and 5.↩︎
Note that evaluating recommendation quality is different from evaluating the correctness of the recommendation algorithm itself. All of the recommendation algorithms you’ve implemented can (and should!) be tested thoroughly so that they return the correct results on any given graph. What we’re talking about in this question is evaluating how “good” those recommendations are, assuming our algorithms all work properly. ↩︎