- Goal: determine how accurate Next Bus predictions are
- Step 1: Create a timeseries
- Step 2: Flag arrivals and departures
- Step 2: Filter
- Step 3: Calculate error in Next Bus predictions
- Step 4: Write out data to csv
In the previous post we extracted Next Bus predictions from the WMATA API.
If that’s too much or uninteresting to you, start here. Our extracted data looks something like this:
Goal: determine how accurate Next Bus predictions are
I did this analysis in R, but it could easily be done in Python or many other languages. It could even be done in JavaScript. However, doing some data magic in R first will improve the performance of the d3 visualizations. It also allows for much more rapid prototyping, exploration and analysis.
Step 1: Create a timeseries
WMATA conveniently provides us with a TripID for each unique bus trip. The only problem is they’re not actually unique.
The simplest fix I found was to create a unique identifier for bus trips by concatenating TripID and VehicleID.
There are certainly more complex ways that use a time dimension to determine unique trips, but this worked for me.
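As a minimal sketch of that concatenation (not necessarily the exact code used for this analysis), the snippet below builds a combined key in base R. The toy rows are made up for illustration, and the column name `trip` and the `_` separator are assumptions.

```r
# Toy data, made up for illustration: TripID 1001 shows up on two vehicles.
df <- data.frame(
  TripID    = c("1001", "1001", "1001", "1002"),
  VehicleID = c("7201", "7201", "7315", "7315"),
  Minutes   = c(5, 4, 12, 3),
  stringsAsFactors = FALSE
)

# Concatenate TripID and VehicleID into one key per bus trip.
# "trip" is a hypothetical column name; the separator keeps keys readable.
df$trip <- paste(df$TripID, df$VehicleID, sep = "_")

# The duplicated TripID now maps to two distinct trips.
unique(df$trip)
```

Using a separator also avoids accidental collisions, for example TripID 12 with VehicleID 345 versus TripID 123 with VehicleID 45.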
Step 2: Flag arrivals and departures
Next Bus doesn’t actually tell us when a bus arrives. We need to determine this from the time series we collect. This is one of the reasons I collect predictions from the API every 10 seconds. There is probably a more computationally efficient method for doing this using vectorized functions, but this works fine.
This gives us something like this:
Assumption 1: When Next Bus says a bus is arriving (Minutes==0) multiple times, I take the latest prediction as the actual arrival time.
In other words, I take the last prediction where Minutes==0 before the bus disappears off your Next Bus app as the arrival time.
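One way to apply Assumption 1 is sketched below. It is a vectorized base R variant built on `ave()`, so it is probably not the code used for this analysis. The toy rows are made up, and the `trip` column is assumed to hold the unique trip identifier from Step 1.

```r
# Toy data, made up for illustration: trip "a" arrives, trip "b" never does.
start <- as.POSIXct("2015-03-01 08:00:00", tz = "UTC")
df <- data.frame(
  trip    = c("a", "a", "a", "a", "b", "b"),
  time    = start + c(0, 60, 120, 130, 0, 60),
  Minutes = c(2, 1, 0, 0, 7, 6),
  stringsAsFactors = FALSE
)

# Keep the timestamp only on rows where Next Bus says "arriving".
zero_time <- ifelse(df$Minutes == 0, as.numeric(df$time), NA_real_)

# Per trip, take the latest such timestamp; NA if the trip never hits zero.
last_zero <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
arrival_num <- ave(zero_time, df$trip, FUN = last_zero)

# Convert back to a date-time and attach it to every row of the trip.
df$arrival <- as.POSIXct(arrival_num, origin = "1970-01-01", tz = "UTC")
```

Trips that never report Minutes==0 end up with an `NA` arrival, which makes them easy to spot later.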
Step 2: Filter
Assumption 2: Remove bus trips that never arrive. There are some ghost buses out there. It’s impossible to determine the error in a prediction if you don’t know the outcome (true arrival time), so these trips are removed.
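Assuming trips without an arrival carry `NA` in `df$arrival` (an assumption about how the flagging step stores them), the filter can be sketched in one line of base R. The rows below are made up for illustration.

```r
# Toy data, made up for illustration: trip "b" is a ghost bus with no arrival.
df <- data.frame(
  trip    = c("a", "a", "b", "b"),
  Minutes = c(1, 0, 7, 6),
  arrival = as.POSIXct(
    c("2015-03-01 08:02:00", "2015-03-01 08:02:00", NA, NA),
    tz = "UTC"
  ),
  stringsAsFactors = FALSE
)

# Drop every prediction that belongs to a trip with no known arrival time.
df <- df[!is.na(df$arrival), ]
```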
Step 3: Calculate error in Next Bus predictions
Now that we know the arrival times for each bus trip (df$arrival), the prediction in minutes until arrival (df$Minutes) and the time the prediction was made (df$time), we can calculate the prediction error (df$err) and actual time until arrival (df$est).
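A minimal sketch of that calculation follows. The sign convention is an assumption: here `err` is predicted minutes minus actual minutes, so a positive value means the bus arrived sooner than Next Bus predicted. The toy rows are made up for illustration.

```r
# Toy data, made up for illustration: one trip that arrives 3 minutes after
# the first prediction was recorded.
start <- as.POSIXct("2015-03-01 08:00:00", tz = "UTC")
df <- data.frame(
  time    = start + c(0, 60, 120, 180),
  Minutes = c(5, 3, 0, 0),
  arrival = start + 180
)

# Actual minutes until arrival at the moment each prediction was made.
df$est <- as.numeric(difftime(df$arrival, df$time, units = "mins"))

# Prediction error in minutes (assumed convention: predicted minus actual).
df$err <- df$Minutes - df$est
```

With these rows, the first prediction said 5 minutes when the bus was really 3 minutes away, so its error is 2 minutes.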
Step 4: Write out data to csv
Easy enough. This is the data that will feed the d3 visualizations built in the next post.
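For completeness, a sketch of the export using base R's `write.csv()`. The file name and the temporary directory are placeholders, not the path used in this project, and the rows are made up.

```r
# Toy data, made up for illustration.
df <- data.frame(
  trip    = c("a", "a"),
  Minutes = c(5, 3),
  est     = c(3, 2),
  err     = c(2, 1),
  stringsAsFactors = FALSE
)

# Placeholder output path; point this wherever the d3 page expects its data.
out_path <- file.path(tempdir(), "nextbus_errors.csv")

# row.names = FALSE keeps R's row index out of the file.
write.csv(df, out_path, row.names = FALSE)
```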