View on GitHub

131500

Analysing Sydney's new public transport timetable using knitR & ggplot

Download this project as a .zip file Download this project as a tar.gz file

Sydney new public transport timetable

Transport Data Exchange (TDX) program

You can download current timetable (and traffic) from TDX. Registration is required.

General Transit Feed Specification (GTFS)

GTFS contains several files saved as comma-delimited text and properly escaped i.e. CSV. File encoding should be UTF-8 and BOM is acceptable. The minimal feed consist of 6 files:

File	Defines
agency.txt	One or more transit agencies that provide the data in this feed
stops.txt	Individual locations where vehicles pick up or drop off passengers
routes.txt	Transit routes. A route is a group of trips that are displayed to riders as a single service
trips.txt	Trips for each route. A trip is a sequence of two or more stops that occurs at specific time
stop_times.txt	Times that a vehicle arrives at and departs from individual stops for each trip
calendar.txt	Dates for service IDs using a weekly schedule. Specify when service starts and ends, as well as days of the week where service is available

GTFS dataset structure

// you can use different engines to render chunks
// see http://yihui.name/knitr/demo/engines/
graph gtfs {
  node [shape = box]
  subgraph cluster_1 {
    label = "Transport Data Exchange (TDX) program"
    agency -- routes [label = "agency_id", constraint=false]
    routes -- trips [label = "route_id"]
    trips -- stop_times [label = "trip_id"]
    stop_times -- stops [label = "stop_id"]
    shapes -- trips [label = "shape_id"]
    calendar -- trips [label = "service_id"]
    calendar_dates -- trips [label = "service_id"]
  }
  stops -- transfers [label = "stop_id"]
  fare_attributes -- fare_rules [label = "fare_id"]
  fare_rules -- routes [label = "route_id"]
}

GTFS

General Transit Feed Specification (GTFS) is documented at Google Developers web site.

BOM (byte-order-mark)

If your imported data set has the first column beginning with 'X.' then the input file contains BOM. Fortunately, this was fixed in R 3.0:

readLines() and scan() (and hence read.table()) in a UTF-8 locale now discard a UTF-8 byte-order-mark (BOM). Such BOMs are allowed but not recommended by the Unicode Standard: however Microsoft applications can produce them and so they are sometimes found on websites.The encoding name “UTF-8-BOM” for a connection will ensure that a UTF-8 BOM is discarded.

make.names("\357\273\277stop_id")

## [1] "X.stop_id"

In older versions of R you need to rename the first column or remove BOM manually

Extent of Greater Sydney public transport area

GTFS dataset must have stops.txt file which contains stop location (WGS84), name and several other attributes. Let's read the previous timetable:

old <- gtfs.load("data/Oct2012")

stop_id	stop_code	stop_name	stop_lat	stop_lon	location_type	parent_station
254122	254122	Worrigee House Reception Centre Worrigee Rd	-34.91	150.6	NA	NA
254112	254112	Nowra High School Moss St	-34.87	150.6	NA	NA
254172	254172	Princes Hwy Before Central Av	-34.91	150.6	NA	NA
254162	254162	Princes Hwy Near Hillcrest Av	-34.90	150.6	NA	NA
254142	254142	Illaroo Road Ps	-34.86	150.6	NA	NA
2540317	2540317	Park Row Near Flora St	-34.91	150.8	NA	NA

and the new timetable:

new <- gtfs.load("data/Oct2013")

stop_id	stop_code	stop_name	stop_lat	stop_lon	location_type	parent_station
254122	254122	Worrigee House Reception Centre Worrigee Rd	-34.91	150.6	NA	NA
254112	254112	Nowra High School Moss St	-34.87	150.6	NA	NA
254172	254172	Princes Hwy Before Central Av	-34.91	150.6	NA	NA
254162	254162	Princes Hwy Near Hillcrest Av	-34.90	150.6	NA	NA
254142	254142	Illaroo Road Ps	-34.86	150.6	NA	NA
2540317	2540317	Park Row Near Flora St	-34.91	150.8	NA	NA

The sp package has gConvexHull function which returns the smallest convex polygon that contains all subgeometries, in our case all transit stops.

polygons <- rbind(
  data.frame(gtfs.bbox(new, "convex"), type = "new"),
  data.frame(gtfs.bbox(old, "convex"), type = "old")
)

Apparently, public transport coverage remains unchanged under the new time table.

# get base map
# base map is 640x640 (Google API limits) and is centered at [mean(x), mean(y)] and zoom level is 7
base <- with(polygons, get_map(paste(mean(y), mean(x), sep = " "), zoom = 7, color = "bw"))

# zoom in and keep margins around data points &uot; polygons
bbox <- coord_map(
  xlim = extendrange(polygons$x), 
  ylim = extendrange(polygons$y)
)

# final map: bounding rectangles and transit stops
ggmap(base) + 
  geom_path(data = polygons, aes(x, y, colour = type), size = 1) +
  scale_color_manual(values = c("blue", "red")) +
  geom_point(data = new[["stops"]], aes(x = stop_lon, y = stop_lat), size = 0.5, colour = "black") + # transit stops in black
  bbox +
  ggtitle("Greater Sydney public transport area")

plot of chunk show-bbox

Network overview

Shape data can be used to display densidity of public transport across all routes. Shape file mirros roads and tracks.

new.segments <- compute.segments(new[["shapes"]])
g <- ggplot(data = new.segments) +
  geom_segment(aes(x = x1, xend = x2, y = y1, yend = y2, colour = cut(count, 
    breaks = c(0, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000),
    labels = c("1", "10", "20", "50", "100", "200", "500", "1000", "2000")
  ))) +
  scale_colour_brewer(name = "Services", palette = "PuBuGn", guide = "legend")

g + bbox + ggtitle("Network overview\n(new timetable)")

plot of chunk network-overview

and Inner Sydney area:


g + coord_equal(xlim = c(151, 151.35), ylim = c(-33.95, -33.75)) +
  ggtitle("Network overview\n(new timetable)")

plot of chunk network-overview

Basic network statistics

Transit Capacity and Quality of Service Manual TCQSM, 3rd edition provides instructions on how to quantitatively review everything from the person-capacity of a rail transit line to the quality of a bus stop. Most of the calculations are complex and in many cases external datasets (e.g. census data) are required to perform those calculations.

GTFS dataset provided by TDX programme is rather a shapshot rather than a complete timetable. Hence, only single workday from each timetable (old and new) is analysed. The following days are used:

old timetable: 2013-10-09
new timetable: 2013-11-06

old.singleday <- gtfs.trips(old, as.POSIXct("2013-10-09"))
new.singleday <- gtfs.trips(new, as.POSIXct("2013-11-06"))

latlon <- old[["stops"]][, c("stop_id", "stop_lat", "stop_lon")]

# add location info to transit stops
old.singleday <- merge(old.singleday, latlon)
old.singleday$time <- with(old.singleday,
  truncDiffTime(min(arrival_time, na.rm  = T), arrival_time) %% 86400
)
old.singleday <- old.singleday[!is.na(old.singleday$arrival_time), ]

new.singleday <- merge(new.singleday, latlon)
new.singleday$time <- with(new.singleday,
  truncDiffTime(min(arrival_time, na.rm = T), arrival_time) %% 86400
)
new.singleday <- new.singleday[!is.na(new.singleday$arrival_time), ]

# http://www.csgnetwork.com/degreelenllavcalc.html
# Selected Specific Latitude = -33.8230854940154
# 1deg lat -> 110 919.17m
# 1deg lon ->  92 575.84m
# this give us 4km hex bins
lat.bin.width <- 4000 / 110919.17
lon.bin.width <- 4000 / 92575.84

xbnds <- c(
  floor(min(polygons$x) / lon.bin.width) * lon.bin.width, 
  ceiling(max(polygons$x) / lon.bin.width) * lon.bin.width
)
ybnds <- c(
  floor(min(polygons$y) / lat.bin.width) * lat.bin.width,
  ceiling(max(polygons$y) / lat.bin.width) * lat.bin.width
)

registerDoMC() # 4 cores, takes ~ 2 x 20 minutes
old.bins <- compute.hex.bins(old.singleday, xbnds, ybnds, lat.bin.width, lon.bin.width)
new.bins <- compute.hex.bins(new.singleday, xbnds, ybnds, lat.bin.width, lon.bin.width)

new.old.bins <- with(
  merge(old.bins, new.bins, by = c("ID", "time"), all = T, suffixes = c(".old", ".new")),
  data.frame(
    ID = ID,
    time = time,
    count.old = ifelse(is.na(count.old), 0, count.old),
    count.new = ifelse(is.na(count.new), 0, count.new),
    count.diff = ifelse(is.na(count.new), 0, count.new) - ifelse(is.na(count.old), 0, count.old),
    lat = ifelse(is.na(y.old), y.new, y.old),
    lon = ifelse(is.na(x.old), x.new, x.old),
    change.old.new = interaction(is.na(y.old), is.na(y.new))
  )
)

Sydney transport network has three modes: bus, train, and ferry.

	Old timetable	New timetable
Number of trips	42076	49123
- Bus	38548	45300
- Train	2776	2901
- Ferry	752	922
Longest trip
- Bus	Sydney to Port Stephens	Sydney to Port Stephens
- Train	Blue Mountains Line	Blue Mountains Line
- Ferry	Eco Hopper	Eco Hopper

Travel time (or speed) varies across different modes. Average car speed on Sydney's highways is between 31km/h and 42 km/h.

ggplot(data = old.summary, aes(y = travel_time, x = distance)) + 
  geom_point(aes(colour = route_type), alpha = 0.7) +
  geom_abline(intercept = 0, slope = 3600/31, linetype = "dotted") + # AM peak
  geom_abline(intercept = 0, slope = 3600/42, linetype = "dotted") + # PM peak
  scale_y_continuous(breaks = seq(0, 5, 1) * 3600, labels = seq(0, 5)) +
  xlim(0, 250) + 
  facet_grid(route_type ~ .) +
  ylab("Travel time [hours]") +
  xlab("Distance [km]") +
  ggtitle("Old timetable\n(cars: AM peak: 31km/h, PM peak: 42km/h)")


ggplot(data = new.summary, aes(y = travel_time, x = distance)) + 
  geom_point(aes(colour = route_type), alpha = 0.7) +
  geom_abline(intercept = 0, slope = 3600/31, linetype = "dotted") +
  geom_abline(intercept = 0, slope = 3600/42, linetype = "dotted") +
  scale_y_continuous(breaks = seq(0, 5, 1) * 3600, labels = seq(0, 5)) +
  xlim(0, 250) + 
  facet_grid(route_type ~ .) +
  ylab("Travel time [hours]") +
  xlab("Distance [km]") +
  ggtitle("New timetable\n(cars: AM peak: 31km/h, PM peak: 42km/h)")

Public transport availability during a workday

# collect all images and create animation
opts_chunk$set(animation.fun = hook_scianimator)
values <- unique(c(seq(min(new.old.bins$count.diff), 0, length.out = 6), seq(0, max(new.old.bins$count.diff), length.out = 6)))
mid = abs(min(new.old.bins$count.diff) / max(new.old.bins$count.diff))
rescaled <- unique(c(
  rescale(seq(min(new.old.bins$count.diff), 0, length.out = 6), c(0, mid)), 
  rescale(seq(0, max(new.old.bins$count.diff), length.out = 6), c(mid, 1))
))
pal <- brewer.pal(11, "RdYlBu")
pal[5] <- brewer.pal(9, "Greens")[4] 
# display.brewer.all()
g <- scale_fill_gradientn(
  limits = range(values),
  colours = pal,
  values = rescaled,
  breaks = round(values, -2),
  labels = round(values, -2),
  name="Difference in available services\n(reduced .. same .. increased)"
)
# l_ply can be used here as well
anim <- llply(sort(unique(new.old.bins$time)), function(t, d, grad) {
  ggmap(base) + 
    geom_hex(data = d[d$time == t, ], stat="identity", aes(x = lon, y = lat, fill = count.diff), alpha = 0.8) +
    grad + bbox +
    ggtitle(format(trunc(Sys.time(), "day") + t, "Public transport availability at %H:%M")) +
    theme(legend.position="bottom")
}, d = new.old.bins, grad = g)
print(anim)

Sources

Annual Speed and Traffic Volume Data in Sydney