Assignment 2: TTC

Updates

Due: Wednesday December 5th before 11:00pm

Partnerships:

You may work alone or in a group of two. Declare your partnership on MarkUs BEFORE you begin coding.
Once a partnership is set up, I will not change groupings (barring extraordinary circumstances), so make sure it is someone you want to work with.

It is often best to work with your partner in the same room, rather than splitting up the assignment and letting each person work on their own. This is because when working alone, each person doesn't know what's happening on other parts of the assignment, and that can lead to issues and bugs when putting the code together (and, a much longer time to finish it!). Try working together in the same room and using the driver/navigator method we use in lab. Previous experience has shown that this will lead to finishing the assignemnt more quickly, since both people will be communicating their thoughts as the code is written and you can help each other out when needed.

Advice:

Start early! This assignment will take a considerable amount of time to complete. Under no circumstances should you leave this to the last few days before it is due.

It is recommended that you print out this handout and highlight key points. Make sure you understand what each part of the assignment is asking you to do, and ask questions if you are stuck on something. Don't forget that we have a discussion board!

Remember that the Teaching Labs in Bahen are open 24/7 for students taking CSC courses and you can work on the assignment in any of the lab rooms (if they're not taken up by a class).

Before you start: A warning about academic offenses

The CS Department has software that is used to compare similaraties between different submissions, and checks for similaties to code found online, and submissions of similar assignments from previous terms and courses.

As stated in the course information sheet, you must hand in your own work. You should only show and discuss your code with your partner (if you choose to have one), the CSC120 TAs, and the instructor. Do not submit code that is not yours or that you found somewhere. Do not show your solution to other students in the course, and do not post your code on the discussion board.

Please see the syllabus for information about academic offenses - they are taken very seriously by the University.

Background: TTC

The Toronto Transit Commision (TTC) is the agency responsible for the operation of Toronto's public transportation system, including services like buses, subways, streetcars.

Like any large organization, the TTC has needs to store and work with a lot of data, specifically on the routes that its vehicles go on from day to day. They want to have information on when a certain bus stopped at a specific bus stop, and have a way of figuring out the time it took for an entire trip along a route. This information is vital to consider when making decisions on the future of transport in the City of Toronto.

In this assignment, you will be working with real data provided by the TTC, which gives detailed information about the trips of its vehicles on the many routes offered in Toronto.
As you can imagine, these data sets are particularly large, considering the amount of trips taken by TTC vehicles each day. For the purposes of this assignment, the data has been reduced somewhat so you can focus on what you need to complete the assignment.

The Assignment

Your tasks

Part 1: Working with a SQL Database of TTC data

You will be writing functions that work with a SQL database that contains different data about the TTC's operations.
The main goal is to write a function that prints details about a specific TTC trip, given a trip ID (a number that identifies the trip). You will write all of the helper functions necessary to run this printing function, so that all you have to do in the printing function is call your helper functions properly.

Files to Download

Please download the A2 Part 1 Files and extract the zip archive. A description of each of the files that we have provided is given in the paragraphs below:

Starter code: ttc_trips.py

The ttc_trips.py file contains a constant representing the name of the ttc database file, and a helper function that can help run database queries. You must not modify the provided helper functions.

The ttc_trips.py file also contains function headers and docstrings for the A2 data functions to which you are required to add function bodies. For each function, read the header and docstring to learn what task the function performs.

At the bottom of ttc_trips.py is an if __name__ == '__main__' block, with a while loop that acts as the main progam. You should not change this part of the code. You just need to write the functions above it to ensure they work properly.

Database file: ttc.db

The ttc.db file contains the TTC Trips SQLite database, with tables that are described below. You must not modify this file. If you think you may have accidentaly changed the database, you can always download a fresh copy here.

Checker for Part 1: checker_a2_part1.py

We have provided a checker program (checker_a2_part1.py) that tests two things:

The checker program does not test the correctness of your functions, so you must do that yourself.
More details on the checker are found later in this handout.

Note that this checker is only for Part 1. There will be another one for Part 2.

Definitions of TTC data

These definitions correspond to the contents of the tables of the ttc.db database file.

Term Description Example
Stop A stop is a point along the route. Every stop has a stop ID, a stop code (which is what passengers see), and a stop name. The stop with stop ID 262 is named DANFORTH RD AT KENNEDY RD with stop code 662.
Route A route is collection of stops that a vehicle will go to. A route has a route ID, a route number (which is the number a passenger sees), and a route name. The route with route ID 53424 has route number 10 and name VAN HORNE.
Trip A trip represents an actual ride along a specific route - many trips can go on the same route. A trip has a trip ID, and a route ID to specify which route it goes on. Each trip has different stops along the route, each with a stop time that indicates when that stop was reached during the trip. The trip with trip ID 35902622 is named
EAST - 10 VAN HORNE TOWARDS VICTORIA PARK
goes along the route with route ID 53424.
Stop Time Every stop time is a time at which a vehicle on a specific trip stops at a specific time. Each stop time has trip ID that it is assocaited with, a stop ID to indicate the stop, and a stop_time, which indicates when the vehicle arrived at the stop. The stop sequence specifies which stop number along the route this is (the database records happen to be in order of stop sequence for each trip, which can help you). On the trip with trip ID 35902600, the third stop has stop ID 3807, and has a stop time of 6:15:54.
Vehicle Every route is serviced by a particular vehicle, which is identified by a vehicle number. The route with route_id 53639 is serviced by vehicle number 20.

Viewing the ttc.db database

To look at the contents of the TTC database, open ttc.db using the DB Browser for SQLite, a visual database browser. Have a look at the tables and their respective schemas (their column names and data types).
Be careful not to accidentally change some data in the browser - you can always redownload the database from this page if you think you may have changed something.
You will notice that there are some similar column names between the tables, which can be used to your advantage when writing your SQL queries.

The Main Program

The goal of your program is to make output that looks like this (with use input in large bold). This is the correct output for the given trip ID:

        Please enter a trip ID or q to quit: 35902613
        Show stop details? y/n: y
        Trip details for trip 35902613
        Route: 10 VAN HORNE, on Vehicle #4
        Total Number of Stops: 23

        Stop 1: DON MILLS STN. at 16:30:00
        Stop 2: DON MILLS STATION at 16:30:01
        Stop 3: DON MILLS RD AT LEITH HILL RD NORTH SIDE at 16:30:54
        Stop 4: DON MILLS RD AT FAIRVIEW MALL DR NORTH SIDE at 16:32:21
        Stop 5: DON MILLS RD AT GODSTONE RD at 16:33:41
        Stop 6: DON MILLS RD AT DEERFORD RD at 16:34:51
        Stop 7: DON MILLS RD. AT VAN HORNE AVE. at 16:36:00
        Stop 8: VAN HORNE AVE AT DON MILLS RD EAST SIDE at 16:36:08
        Stop 9: VAN HORNE AVE AT HOBART DR at 16:36:40
        Stop 10: VAN HORNE AVE AT HOUSTON CRES (WEST) at 16:37:09
        Stop 11: VAN HORNE AVE AT KINGSLAKE RD at 16:37:40
        Stop 12: VAN HORNE AVE AT EDMONTON RD EAST SIDE at 16:38:58
        Stop 13: VAN HORNE AVE AT LUCIFER DR at 16:39:18
        Stop 14: VAN HORNE AVE AT BICKERTON CRES at 16:40:06
        Stop 15: VAN HORNE AVE AT CLIPPER RD at 16:40:39
        Stop 16: VAN HORNE AT BICKERTON CR. at 16:41:00
        Stop 17: VAN HORNE AVE AT VICTORIA PARK AVE at 16:41:04
        Stop 18: VICTORIA PARK AVE AT PLEASANT VIEW DR at 16:42:23
        Stop 19: 140 PLEASANT VIEW DR at 16:43:03
        Stop 20: PLEASANT VIEW DR AT BRIAN DR at 16:43:41
        Stop 21: BRIAN AVE AT MARGARET AVE at 16:44:26
        Stop 22: BRIAN AVE AT VAN HORNE AVE at 16:44:50
        Stop 23: BRIAN DR. AT VAN HORNE AVE. at 16:45:00

        Trip start time: 16:30:00
        Trip end time: 16:45:00 

        Please enter a trip ID or q to quit: q
        >>>
    
If argument printDetailedStops is FALSE for TripDetails, the Stop 1:..Stop 2:.., etc. part should not be shown:
        Please enter a trip ID or q to quit: 35902613
        Show stop details? y/n: n
        Trip details for trip 35902613
        Route: 10 VAN HORNE, on Vehicle #4
        Total Number of Stops: 23
        Trip start time: 16:30:00
        Trip end time: 16:45:00 

        Please enter a trip ID or q to quit: q
        >>>
    
Requirements

In order to get the above output, you will have to complete the functions in ttc_trips.py, including the function trip_details(), which will print the output above. They are summarized in the table below.

Getting Data

All of the data helper functions (the ones that start with 'get'), are the functions that are supposed to be used to query the TTC database. Therefore, you will be creating and executing SQL queries in these functions, and only these functions.
NOTE: You may not run any SQL queries or call any database functions inside of trip_deatils(). You will lose lots of marks if you do this! You must find a way to use all of your data helper functions in trip_deatils() to produce the correct output.
You can use the run_query() function like we did in lecture to make it easier to run queries without having to set up all of the connection code for every function.

Other requirements: You must format your strings in trip_details() using str.format() as we will see in lecture.
For example,

    >>> a = 3
    >>> print('The value of a is {}.'.format(a))
    The value of a is 3.
   

Data Function explanations

All x_id variables correspond to the id columns in the respective data files.
Function name(arguments)
Description
stop_name_for_stop_id(stop_id)

This function returns the name of a specific stop.
get_stops_for_trip(trip_id)

This function returns a list of tuples with all of the stop times data for a specific trip (i.e., all of the columns from the stop times table for that particular trip)
get_route_number_for_route(route_id)

This function returns the route number for a specific route.
get_route_name_for_route(route_id)

This function returns the route name for a specific route.
get_route_id_for_trip(trip_id)

This function returns the route ID for the route of a particular trip.
get_vehicle_number_for_trip(trip_id)

This function returns the vehicle number of the vehicle that serviced a particular trip.

Part 2: Subway Delays

The TTC subway system often encoutners delays. Trains can be stuck at a station for some amount of time due to various reasons, and can only leave the station once they've been given an "all clear" sign by TTC operators.
In order to figure out how to best deal with delays, it's good to have some data to analyze about them, so that stations that have more delays can be dealt with in future planning.

The TTC provides data about these delays for the previous year in a SQL database. You will have access to the first half of the previous year's data (from September 2017 to February 2018). Your job is to write a Python script that creates some useful graphs using this data.

Files to Download

Please download the A2 Part 2 Files and extract the zip archive.
You should put Part 2 in a separate folder than Part 1 so you don't confuse the two.
A description of each of the files that we have provided is given in the paragraphs below:

Starter code: subway_delays.py

The subway_delays.py file contains a constant representing the name of the subway delays database file, the month names as they appear in the data, and a helper function that can help run database queries. You must not modify the provided helper function and constants.

At the bottom of subway_delays.py is an if __name__ == '__main__' block, where you will add the code that will make the graphs (by calling functions that you write yourself).

Database file: delays.db

The delays.db file contains the Subway Delays SQL database, with tables that are described below. You must not modify this file. If you think you may have accidentaly changed the database, you can always download a fresh copy here.

Checker for Part 2: checker_a2_part2.py

As for Part 1, we have provided a checker program (checker_a2_part2.py) for Part 2. Run this one separately from the Part 1 checker (and in a different folder).
This checker will not check your parameter and return types, since it is up to you what your functions will take in as input and proudce as output. However, the TAs will take a look at your type contracts and what your functions are doing.

The Subway Delays Database

Use the SQLite Browser to open the delays.db database. In it, there is only one table, delays, that contains rows of data about the delays that have occured over the six months. Each row contains the data for one observed delay, including:

Requirements

In Part 2, there is very little starter code. It is up to you to write all of the functions that will produce the required graphs, and the proper code in the if __name__ == '__main__' block so that, when run, the code produces the plots and saves them to the specifed files.
You will write plotting functions that use the matplotlib plotting library, and you will write well-designed and documented helper functions, that can make your functions smaller, and reduce repeated code.
Your main program doesn't need to ask for or validate user input, it just has to produce the required graphs when run. More details on those below.

What graphs should my code produce?
The graphs that your code should produce and save to files include:

For each month, you should create the following graphs, and save them as png image files to the specified file names. You will have 6 graphs for each (for the 6 months) for each of the three points below, and you should present these in a single plot with 2 rows of 3 graphs each. You should have a total of three plots each with six graphs.
  1. A bar plot where the x-axis is the day of the week (i.e. Monday) and the y-axis is the total number of minutes that trains were delayed for that day of the week.
    Filename: day_delays.png
  2. A bar plot where the x-axis is the five stations with the longest overall delays, and the y-axis is the total minutes delayed for each of those stations. If there is a tie, just choose any five that are the top five longest delayed stations.
    Filename: week_top_five.png
  3. A histogram (similar to a bar chart, but for continuous values) of the number of delays (not time) that occur during each hour of each day, for all days combined for that month. (hours on x-axis, number of delays on y-axis) Hours start from 0 and should increment by 1 on the x-axis of the chart all the way up to 23.
    Filename: hour_delays.png
    Note: Times are in the format hh:mm in the database. Although there are more complex SQL queries that can separate out the hour and minute, you may just select the time column in your program and do the string separation in Python as you've done before.
After you have made the three plots with the six graphs for each month, you should make the following individual bar plots that present data about the entire 6-month period:
  1. A bar plot where the x axis is the month and the y axis is the average number of minutes that trains were delayed for that month. Filename: month_avg_minutes.png
  2. A bar plot where the x axis is the month and the y axis is the total number of delays for that month. Filename: month_delays.png
  3. A bar plot of the five stops with the longest overall delays for the entire 6 months, with total minutes delayed indicated. If there is a tie, just choose any five with the longest delays. Filename: month_top_five.png
How should I go about making my graphs?

You will have to make SQL queries to get the data you need from the database. Most of the plots you will be making are bar plots, which deal with the counts, sums, and averages of certain columns. Since the data you need involves working with numbers, you should be using aggreagate SQL functions and ordering in your SQL.

Note: You must use SQL and aggregate functions to get the proper values, and not use Python to do this (except for the histogram, which we've allowed you to aggregate by time in Python). Not doing so will result in losing a lot of marks!

Once you have this data, you can then feed the appropriate x and y axis lists to functions of the matplotlib plotting library, which is stored in the plt variable in the starter code.
The plt.bar() function can create the appropriate bar plots. The histogram can also be plotted this way.

Graph presentation: Note that your graphs should have appropriate title and axis labels to indicate what is being presented. That is, we should know what each bar in your plot is referring to, and what the axes represent overall. You may use what we've done in lab 8, and you can search through online resources about matplotlib to see which other ways you could make the plots look better, by adding different colours and other elements. We will take note of your plot presentation when marking your assignment, and there will be grades allocated for it.

When subway_delays.py is run by the TAs, your script should save the plots as .png files, with the file names indicated above.
It does not need to show the plots - you should not run plt.show().

Code readability and style.
The TAs will look at code readability mostly for Part 2. You should consider:

Remember to follow the Python style guidelines, and make your functions as short as you can! Your main block should also not be very long at all.

Marking

These are the aspects of your work that will be marked for A2:

What to Hand In

The very last thing you do before submitting should be to run the checker programs for both parts one last time. Otherwise, you could make a small error in your final changes before submitting that causes your code to receive zero for correctness.

Submit the following files on MarkUs in the 'Assignment 2' submissions section by following the instructions on the course website. Remember that spelling of filenames, including case, counts: your file must be named exactly as above.

Part 1: ttc_trips.py

Part 2: subway_delays.py