Greetings from SeatGeek Research & Development!

I’m here today to take you behind the curtain of one of SeatGeek’s major features, Deal Score. For the uninitiated, Deal Score is a 0-to-100 rating that reveals whether a ticket is a great bargain or a major rip-off. We humbly believe it’s the best way to find tickets. I’d like to quickly tell you why and then spend most of this post discussing some of the math behind Deal Score’s calculation. This is the first in a series of two blog posts, the second coming soon.

Sorting vs. Searching

Why have Deal Score? The standard across ticket sites is, of course, sorting by price. On most ticket sites, a prospective buyer can select sections they want to sit in, filter tickets by price range, and spend a solid chunk of their day trying to figure out the best seats for the money. On most aggregators, listings from several ticketing websites are lumped together… and then sorted by price, whereupon the experience repeats itself with the added pleasure of more noisy data.

SeatGeek, however, is more than an aggregator; we’re a search engine. Using Deal Score, we sort tickets by value rather than price. As a quick example, let’s try to find some tickets for the Red Sox-Indians game on May 12th at Fenway Park. If I sort the tickets by price, I need to wade through dozens of cheap listings for standing-room-only tickets and obstructed-view seats. Cheap for sure, but anybody who’s been to Fenway Park can tell you there are some places you just don’t want to sit. I need to be vigilant to notice a listing for two tickets in the grandstand behind home plate for $53, the same price as a listing in the back of the bleachers and another for two neck-straining outfield grandstand seats.
[singlepic id=31 w=400 h=300]

How good of a deal is this? Sorted by price, these three listings look the same, but behind the scenes SeatGeek’s proprietary price prediction has pegged the bleacher seats as being worth $29, the outfield grandstand seats $34, and the infield seats $69. Deal Score compares every ticket’s expected price to its listed price and takes the mental legwork out of ticket shopping.
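To make the comparison concrete, here is a minimal sketch using the three $53 listings above. The actual Deal Score formula is not described in this post, so the `value_ratio` function below (predicted price divided by listed price) is a hypothetical stand-in, not SeatGeek’s calculation.

```python
# Illustrative only: value_ratio is a hypothetical stand-in for Deal Score,
# whose real formula is not given in this post.
listings = [
    {"seat": "bleachers", "listed": 53.0, "predicted": 29.0},
    {"seat": "outfield grandstand", "listed": 53.0, "predicted": 34.0},
    {"seat": "infield grandstand", "listed": 53.0, "predicted": 69.0},
]


def value_ratio(listing):
    """Predicted market value divided by asking price (assumed metric)."""
    return listing["predicted"] / listing["listed"]


# Sorting by value instead of price: ratios above 1 suggest a bargain.
ranked = sorted(listings, key=value_ratio, reverse=True)
for listing in ranked:
    print(f'{listing["seat"]}: {value_ratio(listing):.2f}')
```

Under this assumed metric, the infield listing is the only one priced below its predicted value, so it sorts to the top even though all three cost the same.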

The basic principle behind Deal Score is simple and intuitive: by searching rather than sorting, we can intelligently filter secondary market ticket listings, saving consumers large amounts of time and money.

How does it work?

The most important element of our Deal Score algorithm is an accurate estimate of the current market value of each ticket listed on the secondary market. Most marketplaces have large amounts of transactional data on their products, often with supply- and demand-side pricing signals. SeatGeek is in the undesirable position of trying to predict, on a daily basis, the price of millions of event tickets that have, by definition, never sold. Each seat at every event is a unique product; while its eventual price is informed by many other signals, the secondary market is both opaque and noisy.

Given our data constraints and the precision necessary, we made two assumptions about seats:

  1. Seat quality, within a given venue, has a consistent ordering. This means that for any given Red Sox game, we expect that Infield Grandstand 18, Row 12 is a better place to sit than Center Field Bleachers 37, Row 37.
  2. The relationship of seat price to seat quality follows a similar pattern across all events at a given venue. This means that a curve plotting sale price against seat quality for a weekend Red Sox-Yankees game at Fenway Park should look similar to a curve for a midweek Red Sox-Royals game, even though the market dynamics would be quite different.1

The first assumption allows us to use signals from many contexts to inform our predictions. The second assumption allows us to make confident predictions about prices after seeing as few as five or ten prices for each event. 

In today’s installment, I’m going to show you the math we use to derive a key metric called “Seat Rank,” the ordinal quality rank of all seats within a venue.

Seat Rank

In order to make the most of our first assumption, we determine the intrinsic “seat quality” of each seat relative to all others. Teams and promoters deal with this every day; they have to set face values for tens of thousands of seats in a stadium, but they have the advantage of only needing to compute a few dozen price levels, at most. In contrast, secondary markets have row-level pricing granularity, and thus require us to understand how much each row is going to sell for on the open market. Fenway Park, for example, has 4,022 distinct section/row pairs, and we must understand how they all rank on a relative basis. Using a little bit of cleverness along with vector coordinate data from SeatGeek’s venue maps, we reduce the problem slightly: we divide each venue into clusters of seats (we call them “seat groups”) whose physical locations and sale prices tend to be close enough to each other that they can be modeled together. These seat groups allow us to make use of less data to predict more prices.

Some venues have as few as twenty groups; others have well into the thousands. Fenway Park has 993.

To understand Seat Scores, consider a simple example where the set of listings \(\mathcal{S}\) consists of three seats indexed by \(i\): \[ s_i \in \mathcal{S} = \{s_1,s_2,s_3\} \]

Suppose these seats are equally priced, despite the fact that their quality \(\theta_{i} \) varies. In fact, \(s_1\) is twice as good as \(s_2\), which is twice as good as \(s_3\). Without loss of generality, we arbitrarily set \(\theta_{1} = 1\) and can define a vector \(\Theta\) of relative seat qualities: \[\Theta \qquad = \qquad \left[ \begin{array}{c} \theta_1 \\ \theta_2 \\ \theta_3 \end{array} \right] \qquad = \qquad \left[ \begin{array}{c} 1 \\ \tfrac{1}{2} \\ \tfrac{1}{4} \end{array} \right] \]

Unfortunately, while SeatGeek has a lot of data, we cannot directly observe the true relative quality \(\Theta\) of these seats. However, we use several different signals, including clicks on “buy” buttons and the physical location of a seat within a venue, to arrive at an estimated quality \( \hat{\Theta} \). One of these signals is pairwise comparison: shoppers constantly compare seats against one another, and we use this tendency to our advantage. In particular, we obtain our estimate of \( \Theta \) by assuming that users’ historical choices are proportional to the true relative quality of seats, revealing information about the true \( \Theta \). For simplicity’s sake, assume that \[ \Pr(\text{user chooses $s_i$ over $s_j$}) \propto \frac{\theta_i}{\theta_i + \theta_j} \quad \forall \, s_i \neq s_j \in \mathcal{S} \]

For example, when faced with a choice between \( s_1 \) and \( s_3 \), users will pick \( s_1 \) with probability \( \tfrac{\theta_1}{\theta_1 + \theta_3} = \tfrac{1}{1 + 1/4} = 80\% \). In reality, the data will be much noisier. Each data point is a random realization of one user’s perception of relative seat values. Some users pick the first listing they see, others have disparate opinions about what makes for a quality seat, etc.2
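The choice model above is easy to check numerically. The sketch below plugs the three example qualities into the formula, treating the proportionality as an equality, as the 80% example does; the function and variable names are ours, chosen for illustration.

```python
def choice_probability(theta_i, theta_j):
    """Probability that a user picks seat i over seat j under the model above."""
    return theta_i / (theta_i + theta_j)


# Relative qualities from the three-seat example: each seat is twice as good
# as the next one.
theta = {"s1": 1.0, "s2": 0.5, "s3": 0.25}

p_13 = choice_probability(theta["s1"], theta["s3"])  # 1 / (1 + 1/4)
p_12 = choice_probability(theta["s1"], theta["s2"])  # 1 / (1 + 1/2)
p_23 = choice_probability(theta["s2"], theta["s3"])  # (1/2) / (1/2 + 1/4)

print(f"s1 over s3: {p_13:.0%}")
print(f"s1 over s2: {p_12:.0%}")
print(f"s2 over s3: {p_23:.0%}")
```

Note that only the ratio of qualities matters: \( s_1 \) versus \( s_2 \) and \( s_2 \) versus \( s_3 \) yield the same probability because both pairs differ by a factor of two.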

Continuing with the Fenway Park example, after processing our input signals, we have a square matrix \(\mathbf{R}\) where each cell represents the processed results of pairwise comparisons between seat groups. In this matrix, we define each cell \(r_{i,j}\) as the observed relative quality of \(s_i\) as compared to \(s_j\).

The rough values for \(\mathbf{R} \) are fairly noisy, as shown in the matrix below, which is sorted left-to-right, top-to-bottom by the raw “winning percentage” of each seat in pairwise comparisons. Each cell represents, roughly, the fraction of the time that a user clicked on the seat in the row (y-axis) when the seat in the column (x-axis) was available at an equal or lesser price. A row with mostly red is a seat that “wins” many comparisons; a row with mostly green tends to lose.
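As a rough illustration of the raw “winning percentage” used to sort the matrix, the sketch below computes it from a small table of pairwise outcomes. The counts are invented for three hypothetical seat groups, and storing the comparisons as win counts is our assumption; the post does not specify how the processed signals are stored.

```python
# Hypothetical data: wins[i][j] is the number of times seat group i was
# chosen over seat group j. These counts are made up for illustration.
wins = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]


def raw_win_rates(wins):
    """Share of all comparisons involving each group that the group won."""
    rates = []
    for i, row in enumerate(wins):
        won = sum(row)
        lost = sum(wins[j][i] for j in range(len(wins)))
        total = won + lost
        rates.append(won / total if total else 0.0)
    return rates


rates = raw_win_rates(wins)
# Order the seat groups from most to least frequently chosen.
order = sorted(range(len(wins)), key=lambda i: rates[i], reverse=True)
print(rates, order)
```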

[singlepic id=52 w=350 h=350]

The initial \(\Theta\)’s implied by these raw winning percentages are a good start, but these data are far too noisy to be used as reliable estimates. This is a visual representation of what Fenway Park looks like with these raw seat scores:

[singlepic id=44 w=350 h=350]

To estimate \(\hat{\Theta}\) in the presence of noisy data, we use a method called maximum likelihood estimation, which iterates over candidate values for \( \hat{\Theta} \) to maximize the probability of observing the real data. We start with rough parameter values for \( \hat{\Theta} \) and repeat two steps until the likelihood stops improving:
(1) calculate the probability of observing the data conditional on these values3 \[L \ ( \hat{\Theta} \ | \ \mathbf{R}) = \prod_{i, j} \left ( \frac{\hat{\theta}_{i}}{\hat{\theta}_{i} + \hat{\theta}_{j} } \right )^{r_{i,j}}\]

(2) adjust the parameter values to increase this likelihood
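The two steps can be sketched in a few lines of Python. This post does not say which optimizer we use for step (2), so the update rule below is an assumption: it is the classic minorization-maximization update for likelihoods of this pairwise-choice (Bradley-Terry) form, which never decreases the likelihood from one pass to the next. The matrix `r` holds made-up comparison counts for the three-seat example, chosen to match the true qualities \(1, \tfrac{1}{2}, \tfrac{1}{4}\).

```python
import math


def log_likelihood(theta, r):
    """Log of the likelihood in step (1); diagonal cells are skipped."""
    n = len(theta)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and r[i][j] > 0:
                total += r[i][j] * math.log(theta[i] / (theta[i] + theta[j]))
    return total


def fit_seat_scores(r, iterations=500, tol=1e-10):
    """Step (2), repeated: assumed MM update, not necessarily SeatGeek's.

    Assumes every seat group wins at least one comparison, so no score
    collapses to zero.
    """
    n = len(r)
    theta = [1.0] * n  # rough starting values
    for _ in range(iterations):
        new_theta = []
        for i in range(n):
            wins_i = sum(r[i][j] for j in range(n) if j != i)
            denom = sum((r[i][j] + r[j][i]) / (theta[i] + theta[j])
                        for j in range(n) if j != i)
            new_theta.append(wins_i / denom if denom > 0 else theta[i])
        # Scores are only defined up to scale, so pin the best seat at 1.
        top = max(new_theta)
        new_theta = [t / top for t in new_theta]
        change = max(abs(a - b) for a, b in zip(new_theta, theta))
        theta = new_theta
        if change < tol:
            break
    return theta


# Hypothetical counts: r[i][j] is how often seat i was chosen over seat j.
r = [
    [0, 8, 8],
    [4, 0, 8],
    [2, 4, 0],
]
fitted = fit_seat_scores(r)
print([round(t, 3) for t in fitted])
```

Because the likelihood depends only on ratios of scores, rescaling after each pass leaves it unchanged; pinning the top seat at 1 mirrors the normalization \(\theta_1 = 1\) used in the example.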

Watch below as the seat scores converge from our initial values to the maximum likelihood (click the “play” link):

[portfolio_slideshow]

Presto! Once we’re finished, we end up with something that looks very similar to Fenway’s actual seating chart, only with much more granular distinctions between price levels. With these seat scores, we would expect \(\mathbf{R} \) to look like this filled-in matrix instead of the noisy, sparse mess from above. [singlepic id=51 w=350 h=350]

With these powerful seat scores in hand, we’re halfway to our goal of predicting accurate prices for live events at any venue in the country. Come back for our next post to see how we go from our seat scores to market value predictions for thousands of events every day.
UPDATE: View part 2 ‘Using a Kalman Filter to Predict Ticket Prices’

Credits

In case you’re wondering what technology we use for these projects, here’s a sampling:

  • pandas, a Python data analysis library, for signal processing
  • R for statistical analysis and postprocessing
  • ggplot2 to make the heatmaps seen above

Notes

1 If you read this far and wondered whether we were ever going to get around to this, then you’ll want to come back for part 2, when we explain how price predictions are derived from these seat scores.
2 Fenway Park is actually a good example of this phenomenon: our signals disagree heavily about the Green Monster seats in particular.
3 \( r_{i,j} = 0 \) whenever \( i = j \), so we need not exclude these cases.