Make Way for the Soccer Geeks

With Manchester City opening up its data to the masses, the golden age of soccer analytics is set to begin.


This month Manchester City, the younger brother (and rival) of the better-known Manchester United, announced that it will release detailed data about the team for public consumption.

The club's press release noted that "the speed of growth for the discipline of performance analytics is essentially in the clubs' hands -- it is they who have bought the data at significant cost and the rest of the analytics community simply do not have access to the data at the same level ... [But while] there are many people in the analytics community right now who have the skills, desire and vision to make a difference in the performance analytics space ... those people have no significant data to work with." By opening up this data and making it available to those within the analytics community, Manchester City hopes to "encourage and inspire the next generation of analytics."

This move, while essentially unprecedented in the soccer world, fits clearly within larger cross-sector trends of making data open to harness the distributed human capital and innovative potential of hobbyists, enthusiasts, and geeks with pro-level skills. The history of success of making data available to the wonks who want to use it bodes well for the future of soccer analytics; we may be at a watershed moment.

The move to promote innovation through openness is premised on the idea that innovation is often about cost. In particular, entry costs are important. For a pool of potential innovators (in basically any sector), the less costly the inputs required to begin innovating, the more likely it is that potential innovators will become actual innovators. If more equipment, materials, special skills or privileged information are required, fewer people will experiment, tinker, and discover. It follows that the more people are experimenting and trying to innovate, the more likely valuable innovation is to happen. This dynamic implies that in sectors in need of innovation, it is useful to assess the costs of entry and try to lower them.

A common explanation for the radically innovative tech scene of recent decades is that the Internet lowered barriers to market entry, as basically anyone with a computer and enough time could write some killer code. Yochai Benkler, a scholar at Harvard's Berkman Center for Internet and Society, has made a career of looking at how radically low barriers to entry in labor markets can change the cost structures and organizations of production. This trend is nowhere more evident than in the Open Data movement. This movement, which gets its philosophical inspiration from the older Open Source movement, holds that data should be freely available to anyone without restriction.

In knowledge discovery in datasets, the major barrier to entry is access to the data. When corporations, governments, or other organizations jealously guard their proprietary data, the number of people playing with the data and trying to discover valuable things, or putting that data to good use, will remain small. When data is made public, anyone can put that data to work. In recent years governments have begun making large troves of their data publicly accessible. The U.S. government's open-data project, for example, has begotten over 200 citizen-developed apps. Similarly, the city of Vancouver, an early mover in the municipal open-data space, opened up its data in 2009, spawning valuable mashups of transit data, the water grid, and common spaces.

A common adage in open-source development known as Linus's Law states that "given enough eyeballs, all bugs are shallow," indicating that if you can get enough people involved, hard problems become easier. This is what open data does for knowledge discovery and innovation. When looking for a needle in the haystack of data, it helps to have more people looking. The best way to get more people looking is to make it cheap to look.

Lowering the cost to look, and thus enabling more people to get involved, is precisely what Manchester City has begun to do. Opening the data up promises to lower barriers to entry for experimenting with new data-driven ways of understanding the game. With more eyeballs, this problem can become shallow.

Normally "the only data you can get [publicly] is the really basic stuff: goals, assists, cards... [which is] nothing you can really work from," says Graham MacAree, SBNation's soccer editor, and one of the leaders in the field of public soccer analytics.

According to the club, some data will be entirely available for public consumption, but the most detailed data -- "a time coded feed that lists all player action events within the game with a player, team, event type, minute and second for each action, together with the x/y/z co-ordinates for each event" -- will be sent only to analysts whose project submissions are approved by the club and its data provider Opta, the leader in soccer data mining.
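To make that description concrete, here is a rough sketch of what a single row of such a feed might look like and the kind of simple question it would let an analyst ask. The field names, the 0-100 pitch coordinates, the event labels, and the sample rows are all assumptions made for illustration; the club's release does not specify the actual format Opta uses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MatchEvent:
    """One row of a hypothetical time-coded event feed (layout is assumed)."""
    player: str
    team: str
    event_type: str   # assumed labels such as "pass", "dribble", "shot"
    minute: int
    second: int
    x: float          # assumed scale: 0-100 along the length of the pitch
    y: float          # assumed scale: 0-100 across the width of the pitch
    z: float = 0.0    # assumed: height of the ball at the event

    @property
    def clock(self) -> int:
        """Seconds elapsed since kickoff."""
        return self.minute * 60 + self.second


def events_in_final_third(events, team):
    """Count one team's events by type in the attacking third (x >= 66.7)."""
    return Counter(
        e.event_type for e in events if e.team == team and e.x >= 200 / 3
    )


# Invented sample rows, not real match data.
sample = [
    MatchEvent("Player X", "Home", "pass", 12, 5, 45.0, 30.0),
    MatchEvent("Player Y", "Home", "dribble", 12, 9, 70.0, 42.0),
    MatchEvent("Player Z", "Home", "shot", 12, 14, 88.0, 50.0, 1.2),
    MatchEvent("Player Q", "Away", "pass", 13, 40, 75.0, 20.0),
]

final_third = events_in_final_third(sample, "Home")
```

Even this toy version shows why location and timing matter: with coordinates attached, "how many actions did a team take near the opposing goal" becomes a one-line query rather than an afternoon of rewatching video.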

This more detailed data will be useful for experts like MacAree, a veteran of baseball's statistical revolution known as "sabermetrics" (think Moneyball), because it contains so much more information than can be gleaned from traditional soccer analysis, which has focused on individual actions in a vacuum -- that is, without context: Player X passes, Player Y dribbles, and Player Z shoots and scores.

"The most important thing for me is knowing where the ball is at all times, and where all the players are at all times," MacAree explains. "And City are proposing to release not just the what, but the where and when of the data. We're talking very much about space and time, which are very difficult to get out of the data set we've already had."

This is a foundational moment for the soccer-analytics community. The field of study, despite all the bluster about a soccer Moneyball or Jamesian moment (after the godfather of the sabermetric movement, baseball writer Bill James), has yet to progress past the equivalent of a box score. Large-scale advanced metrics are years of research away, especially because data has been so scarce. Most of the cutting-edge analytics have been painstakingly developed by hand. Until now, researchers without access to the kind of data Manchester City is making available have had to record every event in a match, watching frame-by-frame, then transcribe it to Excel and write the code themselves to analyze it. Single-match analyses like MacAree's radial-passing maps take more than a day of labor-intensive work to assemble.

In this data environment, researchers have little hope of coming up with testable, verifiable, predictive metrics.

"If you look at baseball, the sabermetric revolution came about because data was available before it was valuable," MacAree explains. In this environment the costs of entry to innovate were low, and Bill James, among others, was able to experiment. But "now that we know how valuable data is, there's no reason for it to be [freely] given to us... but our contribution [community analysts'] can also be valuable. And we've always been about showing that we're worth giving that data to."

This is what makes Manchester City's decision to open up one of its most valuable assets to the public, at least partially, so unusual. The club has decided to embrace the open-source nature of baseball's Jamesian revolution and bring it to soccer.

Their press release speaks directly to the analytics community, describing areas of performance analysis that City would "like to discuss with you": "We will work directly with those of you who came up with good concepts, and also connect you to others who are working in the same research area," they crow.

There is a long way to go in soccer analytics, and this is but a small first step into a larger world. City's data covers only one year; for predictive models to be valuable, they must be based on, and tested against, multiple years of data. And this type of scientific peer review will only be feasible if other teams and organizations follow in City's footsteps. But City's move to begin opening up its detailed data represents a strong first step in capitalizing on the power of peer production and decentralized expertise that we have seen yield meaningful results in other sectors. If the public proves that it can make something worthy of investment with this data -- be it a real predictive model, or even an interesting concept -- it seems likely that other teams will follow City's lead.

And that's a challenge that MacAree, and others, are more than ready for.