This post is about the “making of the analysis” … which might be rather boring to non data geeks (i.e., normal people). If you’re just interested in the story and the pictures, jump straight here!
I worked in R, as I usually do for most things statistical and graphical.
Our cleaned up data looks like this:
est is the actual time until arrival, measured from the time of the prediction. err is the prediction error in minutes (positive values are late buses, negative values are early buses). minutes is the prediction itself, made at the timestamp time.
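A rough sketch of what peeking at that data frame (df, the name used later in the post) might look like; the column roles, and the err = est − minutes relationship, are my reading of the description above:

```r
# Rough shape of the cleaned data frame (df), per the description above.
# Column roles (err = est - minutes is an assumption):
#   time    - POSIXct timestamp of when the prediction was made
#   minutes - the Next Bus prediction, in whole minutes
#   est     - actual minutes until arrival, measured from `time`
#   err     - prediction error in minutes (positive = bus arrived late)
str(df)
head(df)
```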
Next Bus predictions are discrete, that is they are made in whole numbers such as 1, 2, 3 … up to 99 minutes — no decimals or seconds.
Each prediction (5 minutes, 10 minutes, etc) is made many times. So we can calculate some statistics around each prediction.
First I simply calculated the average prediction error.
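Something like this, as a minimal sketch using base R’s aggregate:

```r
# Mean prediction error (minutes) for each predicted value
avg_err <- aggregate(err ~ minutes, data = df, FUN = mean)
head(avg_err)
```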
Then I calculated the quantiles of actual arrival times (df$est) for each possible prediction (0, 1, 2, 3 … 100 minutes) using 0.05 increments for flexibility.
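One way to sketch that calculation, using split() and sapply() over the prediction values:

```r
# Quantiles of actual arrival time (df$est) for each prediction value,
# in 0.05 probability increments
probs  <- seq(0, 1, by = 0.05)
quants <- sapply(split(df$est, df$minutes), quantile, probs = probs)
quants <- data.frame(minutes = as.numeric(colnames(quants)), t(quants),
                     check.names = FALSE)   # keep "5%", "50%", ... as column names
```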
which turns out to be this (only showing predictions 0-5 minutes):
So when the Next Bus app predicts 5 minutes:
50% of the time the bus arrives between 5.1 and 6.5 minutes.
70% of the time the bus arrives between 4.7 and 6.9 minutes.
90% of the time the bus arrives between 4.1 and 7.7 minutes.
We can visualize this using the code below to explore the confidence intervals for each possible prediction: 0, 1, 2, 3 … 60 minutes. I cut it off at 60 minutes because the data gets sparse after that and the lines generally flatten out.
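A sketch of that plot with ggplot2, built from the quants data frame above; the 25/75, 15/85, and 5/95 percentile columns form the 50%, 70%, and 90% bands:

```r
library(ggplot2)

q60 <- quants[quants$minutes <= 60, ]   # data gets sparse beyond 60 minutes

ggplot(q60, aes(x = minutes)) +
  geom_ribbon(aes(ymin = `5%`,  ymax = `95%`), fill = "red", alpha = 0.2) +  # 90% band
  geom_ribbon(aes(ymin = `15%`, ymax = `85%`), fill = "red", alpha = 0.3) +  # 70% band
  geom_ribbon(aes(ymin = `25%`, ymax = `75%`), fill = "red", alpha = 0.4) +  # 50% band
  geom_line(aes(y = `50%`)) +                                   # median actual arrival
  geom_abline(slope = 1, intercept = 0, linetype = "dashed") +  # perfect prediction
  labs(x = "Predicted minutes until arrival",
       y = "Actual minutes until arrival")
```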
First, note from the red bands above that predictions are consistently biased in the conservative direction. That is, buses usually (~80% of the time) arrive after their predicted time.
Second, note that there is something weird about predictions of 11, 12, and 13 minutes. As one would expect, prediction error is smaller when the bus is closer (<10 minutes away). As the bus gets closer, Next Bus only has to predict a couple of periods ahead rather than many. It’s well known among forecasters (and non-forecasters) that predicting the general happenings of tomorrow is a much easier task than predicting the general happenings of a specific day decades in the future. We have more information about tomorrow.

So why are predictions made a few months (11-13 minutes) into the future worse than those made years into the future (14-60 minutes)? My first thought was that forecasts are recalibrated when they start predicting in the 11-13 minute window.
I used to forecast macroeconomic indicators for a small emerging-market country when I worked at the Fed. GDP, CPI inflation, monetary policy, all that good stuff. If there’s one thing I learned from my role as a country analyst, it was that forecasting is hard and usually more art than science… at least for macro indicators. Forecasts more than a couple of years out are based on, well, usually not much. However, macroeconomic forecasts for the next quarter or two are more grounded. There is higher-frequency and more recent (useful) data on which to base forecasts: nowcasts, time-series models, and knowledge of policy, momentum, and so on.
So when presented with new information about the economy of my forecast country during the forecast period (every month or two), I often faced a choice: revise or stay the course. Knowing when to revise is hard. One doesn’t want to overreact to information too soon only to revise the forecast in the opposite direction next time. Consumers of forecasts value both accuracy AND consistency. Balancing the two is tricky; there is usually a tradeoff.
This revision behavior explains why the Next Bus errors trend downward as predictions decrease from 10 to 0 minutes. But why the spike in errors 11-13 minutes out? My second thought led me to create the visualization below, which allows for an investigation of individual buses and their predictions throughout the week. Possibly a forecast bug? An idiosyncrasy of this particular bus route?
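Roughly something like the sketch below, which assumes the data also carries a vehicle or trip identifier (here called bus_id, a hypothetical column not in the sample shown earlier):

```r
library(ggplot2)

# Each line traces one bus's predictions as it approaches the stop
# (bus_id is a hypothetical identifier column)
ggplot(df, aes(x = time, y = minutes, group = bus_id)) +
  geom_line(alpha = 0.5) +
  labs(x = "Time the prediction was made",
       y = "Predicted minutes until arrival")
```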
Surprisingly, the revision hypothesis appears to be dead wrong (or at least very poorly implemented).
Predictions often follow a consistent trend (a straight diagonal line) until some point where they prematurely jump to 11, 12, or 13 minutes, when in reality there is much longer to wait.
Below: exploring the variation around predictions of 0, 1, 2, 3 … 60 minutes, reaffirming that predictions of 11-13 minutes are the most volatile.
Are Next Bus predictions less reliable during certain parts of the day, like rush hour?
Short answer: yes. Predictions are off the mark (late) the most in the morning, between 8am and 10am.
Note this analysis is for the 64 bus heading from a residential neighborhood (Petworth) south to Federal Triangle (downtown DC), so peak ridership and traffic, and thus prediction errors, along the bus route would likely occur in the morning.
I’m particularly interested in assessing the relative accuracy of rush-hour predictions, so I’m only looking at predictions made on weekdays. The lubridate R package makes working with POSIXct dates easy. First, creating a weekday dummy and extracting the hour each prediction was made.
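Something like this (a sketch; note that lubridate’s wday() counts Sunday as day 1 by default):

```r
library(lubridate)

df$weekday <- wday(df$time) %in% 2:6   # Monday-Friday (wday: Sunday = 1)
df$hour    <- hour(df$time)            # hour of day the prediction was made

wk <- df[df$weekday, ]                 # weekday predictions only
```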
Calculating the median prediction error for each prediction increment (1,2,3 … 60 minutes) by hour of day the prediction was made.
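A sketch of that step, using the weekday-only subset wk from above:

```r
# Median prediction error by prediction value and hour of day
med_err <- aggregate(err ~ minutes + hour, data = wk, FUN = median)
med_err <- med_err[med_err$minutes <= 60, ]
```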
which returns something like this (sample below). So the median error for predictions made between 20:00 (8pm) and 21:00 (9pm) when Next Bus predicts 3 minutes is 0.59874 minutes, or about 36 seconds (late).
To make this easier to interpret and analyze, I want to reshape the results from long to wide format.
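The reshaping could be done with reshape2’s dcast, for example (a sketch):

```r
library(reshape2)

# One row per prediction value, one column per hour of day
med_wide <- dcast(med_err, minutes ~ hour, value.var = "err")
```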
which creates something like this (sample):
Now to create the graphic that puts it all together. I chose not to show the average errors for predictions of 1, 2, 3 … 60 minutes separately: firstly because the trend is the same, and secondly because it’s messy. So I averaged together the median errors for predictions of 5, 6, 7, 8, and 9 minutes in one group and 10, 11, 12, 13, and 14 minutes in a separate group.
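A sketch of that grouping and the final plot, working from the long-format med_err built earlier rather than the wide table:

```r
library(ggplot2)

# Two prediction bands: 5-9 minutes and 10-14 minutes
med_err$band <- cut(med_err$minutes, breaks = c(4, 9, 14),
                    labels = c("5-9 minute predictions", "10-14 minute predictions"))

# Average the median errors within each band, by hour of day
by_hour <- aggregate(err ~ hour + band,
                     data = med_err[!is.na(med_err$band), ], FUN = mean)

ggplot(by_hour, aes(x = hour, y = err, colour = band)) +
  geom_line() +
  labs(x = "Hour of day the prediction was made",
       y = "Average of median prediction errors (minutes)",
       colour = NULL)
```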
So predictions are wrong (late) the most between 8am and 10am on the weekdays. The errors jump around a bit throughout the day. Some of this could be the chosen frequency (hours)… it would likely be smoother if I chose 4 or 5 periods of the day spanning several hours each. Some of this could be the fact that I’m only analyzing one week of data for one bus line. Perhaps there was anomalous behavior around predictions made at 9pm for a couple buses that is driving the spike in prediction errors between 9pm and 10pm.