Normally "the only data you can get [publicly] is the really basic stuff: goals, assists, cards... [which is] nothing you can really work from," says Graham MacAree, SB Nation's soccer editor and one of the leaders in the field of public soccer analytics.
According to the club, some data will be freely available to the public, but the most detailed data -- "a time coded feed that lists all player action events within the game with a player, team, event type, minute and second for each action, together with the x/y/z co-ordinates for each event" -- will be sent to analysts who present a project submission that is approved by the club and its data provider, Opta, the leader in soccer data mining.
This more detailed data will be useful for experts like MacAree, a veteran of baseball's statistical revolution known as "sabermetrics" (think Moneyball), because it contains so much more information than can be gleaned from traditional soccer analysis, which has focused on individual actions in a vacuum -- that is, without context: Player X passes, Player Y dribbles, and Player Z shoots and scores.
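To make the difference concrete, here is a minimal Python sketch of what working with such a feed might look like. The comma-separated layout, the field order, the sample rows and the names `Event`, `parse_event` and `sequence_summary` are all assumptions made for illustration; neither the club nor Opta has published the actual format here. The point is only that once each action carries a time and a position, consecutive actions can be linked into a sequence instead of being counted in isolation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """One player action, using the fields named in the club's description."""
    player: str
    team: str
    event_type: str
    minute: int
    second: int
    x: float
    y: float
    z: float

    @property
    def clock(self) -> int:
        """Seconds elapsed since kickoff."""
        return self.minute * 60 + self.second


def parse_event(line: str) -> Event:
    """Parse one row of a hypothetical comma-separated feed (assumed layout)."""
    player, team, event_type, minute, second, x, y, z = line.split(",")
    return Event(player.strip(), team.strip(), event_type.strip(),
                 int(minute), int(second), float(x), float(y), float(z))


def sequence_summary(events):
    """Link consecutive events: (from_type, to_type, seconds apart, distance)."""
    ordered = sorted(events, key=lambda e: e.clock)
    summary = []
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur.clock - prev.clock
        dist = math.hypot(cur.x - prev.x, cur.y - prev.y)
        summary.append((prev.event_type, cur.event_type, gap, dist))
    return summary


# Invented sample rows: Player X passes, Player Y dribbles, Player Z shoots.
SAMPLE_FEED = """\
Player X,Home,pass,12,3,40.0,30.0,0.0
Player Y,Home,dribble,12,6,62.5,34.0,0.0
Player Z,Home,shot,12,9,88.0,45.0,0.0
"""

events = [parse_event(line) for line in SAMPLE_FEED.splitlines()]
for from_type, to_type, gap, dist in sequence_summary(events):
    print(f"{from_type} -> {to_type}: {gap}s later, {dist:.1f} units away")
```

With only goals, assists and cards to go on, none of these links between one action and the next could be reconstructed.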
"The most important thing for me is knowing where the ball is at all times, and where all the players are at all times," MacAree explains. "And City are proposing to release not just the what, but the where and when of the data. We're talking very much about space and time, which are very difficult to get out of the data set we've already had."
This is a foundational moment for the soccer-analytics community. The field of study, despite all the bluster about a soccer Moneyball or Jamesian moment (after the godfather of the sabermetric movement, baseball writer Bill James), has yet to progress past the equivalent of a box score. Large-scale advanced metrics are years of research away, especially because data has been so scarce. Most of the cutting-edge analytics have been painstakingly developed by hand. Until now, researchers without access to the kind of data Manchester City is making available have had to record every event in a match by watching it frame by frame, transcribe the results into Excel, and write their own code to analyze them. Single-match analyses like MacAree's radial-passing maps take more than a day of labor-intensive work to assemble.
In this data environment, researchers have little hope of coming up with testable, verifiable, predictive metrics.
"If you look at baseball, the sabermetric revolution came about because data was available before it was valuable," MacAree explains. In this environment the costs of entry to innovate were low, and Bill James, among others, was able to experiment. But "now that we know how valuable data is, there's no reason for it to be [freely] given to us... but our contribution [community analysts'] can also be valuable. And we've always been about showing that we're worth giving that data to."