Donald Trump’s surprise victory poses the question: How did we get this so wrong? From the myriad polls and poll aggregators to the vaunted oracles at Nate Silver’s FiveThirtyEight and the New York Times’s shiny forecasting interface, most serious predictors badly misjudged Trump’s chances of victory.

Though election night had the appearance of an unlikely come-from-behind victory for Trump, that narrative only exists because virtually all predictions—perhaps even those from the Trump camp—started from the assumption that Trump was an underdog. In reality, viewed with proper perspective, Trump sailed to a rather easy victory, challenged Clinton in several of her stronghold states, and realistically wrapped up the election well before midnight. That kind of result doesn’t come out of nowhere, but few pre-election polls even began to pick up effects of that size.

So what happened? A caveat: If pollsters don’t really know the answer, we probably won’t know it for some time. Also, as of this writing, Hillary Clinton is ahead in the popular vote, meaning that polls showing her ahead by a few points in head-to-head matchups with Trump were wrong in magnitude, but not in direction.

National polls don’t usually report Electoral College vote counts, and rarely have the granularity to support the kind of state-by-state predictions those projections require, so their usefulness even in aggregate for forecasting elections is limited. Given that electors are allocated by congressional representation, that representation is reapportioned only every 10 years, and that the overall number of seats has not increased in over 50 years, there is a growing discrepancy between the popular vote and the actual outcome of elections, one that will make national polls that simulate the popular vote less relevant to predictions over time.*

Forecasting sites and models have keyed into this discrepancy and had success over the past few election cycles by aggregating smaller state- and county-level polls, and then forecasting actual Electoral College votes from those aggregates. That approach has obvious advantages, but it sometimes suffers from a lack of available and reliable data. As a rule, state and local polls are newer and more volatile than national polls, and several necessarily rely on unorthodox methods to achieve adequate sample sizes, which are often much smaller than those of national polls anyway. Also, the baseline statistics from Census products and other large surveys used for “weighting” state and local results become less reliable as they drill down to smaller geographies.

Long story short: Statistical power matters, and any misrepresentation of the population in the sample or the weights can render results unusable.
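To make the weighting problem concrete, here is a minimal Python sketch of post-stratification, the standard technique for reweighting a lopsided sample to match population targets. All group names and numbers below are invented for illustration, not drawn from any actual poll:

```python
# Hypothetical illustration of post-stratification weighting.
# Respondents are grouped, and each group's result is weighted by its
# true share of the population rather than its share of the sample.

def weighted_estimate(responses, population_shares):
    """responses: {group: (n_respondents, support_rate)}
    population_shares: {group: share of the real population}"""
    return sum(population_shares[g] * support
               for g, (n, support) in responses.items())

# Suppose urban respondents are overrepresented in the raw sample:
responses = {
    "urban": (700, 0.60),  # 700 respondents, 60% support candidate A
    "rural": (300, 0.35),  # 300 respondents, 35% support candidate A
}

# The raw estimate reflects the lopsided sample...
raw = (700 * 0.60 + 300 * 0.35) / 1000

# ...while weighting to the assumed true mix (55% urban, 45% rural)
# shifts the estimate noticeably.
weighted = weighted_estimate(responses, {"urban": 0.55, "rural": 0.45})
print(raw, weighted)
```

The catch, of course, is the last argument: if the population shares themselves are stale or wrong, the "corrected" number is wrong too, no matter how large the sample.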

The problem of finding accurate, random samples of voters to poll has plagued polling since cell phones came into wide use. Before that technological shift, the ubiquity of landline telephones made finding reasonably random and representative samples easy: pollsters could pick random names out of phone books, call potential voters, and talk them through interviews, which supplied the rich context and human understanding necessary to properly analyze their responses. That method also ensured reasonably high response rates and helped control nonresponse bias, by which polls become skewed toward the kinds of people who tend to answer.

But the rise of cell phones, and the demographic differences in their adoption, meant that random samples of landlines became increasingly inadequate. The problem with moving to cell phones, or even attempting a hybrid approach, is that cell-phone numbers are not usually publicly listed, making representative samples ever harder to assemble. Various online survey methods have been used to supplement or supplant the more expensive and less expansive phone methods, but they often suffer from biases of their own and are generally considered lower quality than other polls.

The difficulties are illustrated by FiveThirtyEight’s final forecast model for Pennsylvania, in which only three of the constituent polls from the week before the election were rated “A-” or above by the site. The poll with the most weight in that model was the Remington Research Poll, a robo-call-powered poll run by former Ted Cruz campaign manager Jeff Roe that does not appear to publish its sampling or weighting methodology, and thus has not been given a rating by FiveThirtyEight.

The most recent poll in that model came from the mixed landline-and-online Gravis Marketing poll, which featured a whopping 3-percentage-point margin of error and a sample weighted not to Pennsylvania demographics but to national demographics. Another poll in the aggregate was the SurveyMonkey poll, which is likely limited by its reliance on a heavily skewed group of voters: people who respond to SurveyMonkey polls. Each of these showed Clinton leading a state that Donald Trump eventually won.
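For a sense of scale, the familiar 95-percent margin of error follows directly from the sample size. A quick sketch, assuming simple random sampling (which real polls only approximate):

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for a simple-random-sample proportion.
    Worst case is p = 0.5, the convention pollsters report."""
    return z * math.sqrt(p * (1 - p) / n)

# A roughly 3-point margin of error corresponds to about 1,000 respondents:
print(round(100 * margin_of_error(1067), 1))  # ≈ 3.0 points
```

The implication: a candidate’s 2-point lead inside a ±3-point interval is statistically consistent with a tie, and with a much larger lead, before any weighting problems are even considered.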

New aggregation-based forecasting models like FiveThirtyEight’s are marvels of predictive power, and work well at smoothing out the kinks of individual state polls by pooling them into groups with greater statistical power. But when those polls share the same problems, the models may amplify their errors rather than cancel them.
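A toy simulation makes the point: averaging many polls shrinks random sampling noise toward zero, but an error the polls share, such as stale likely-voter weights, survives the averaging intact. All numbers here are assumptions chosen for illustration:

```python
import random

random.seed(0)

TRUE_SUPPORT = 0.48   # hypothetical true support for a candidate
SHARED_BIAS = 0.03    # systematic error common to every poll
N_POLLS, N_RESP = 50, 800

def simulate_poll():
    """One poll: binomial sampling noise around the biased value."""
    p = TRUE_SUPPORT + SHARED_BIAS
    hits = sum(random.random() < p for _ in range(N_RESP))
    return hits / N_RESP

average = sum(simulate_poll() for _ in range(N_POLLS)) / N_POLLS

# Aggregating 50 polls nearly eliminates the per-poll sampling noise,
# but the average converges to TRUE_SUPPORT + SHARED_BIAS (~0.51),
# not to TRUE_SUPPORT (0.48).
print(round(average, 3))
```

This is the sense in which an aggregate can be precisely wrong: the tight band around the average invites more confidence, even though the center of the band never moved toward the truth.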

Namely, if polls weight Democratic or Republican likely voters and demographics based on 2012 election patterns or older demographic distributions, they will naturally miss big shifts in the composition of likely voters or in where they live. If large numbers of the wealthy, white, educated pieces of the Obama coalition turned out for Trump, and he also drew unprecedented turnout from rural voters, models that weight data to recent past elections might understate those effects. Many of these polls may simply be ill-suited to catching sudden changes in the electorate or in how it votes.

There are some solutions to this “likely-voter” problem, but many of them involve methods that would make cheap and accessible polls less so. Using advanced statistics, analyzing previous similar elections, applying machine learning, and building “kitchen-sink” models from voter rolls are all established ways to improve the underlying assumptions of polls. But those methods may be too costly and time-intensive for polls that rely on online surveys and publicly available annual Census data precisely because those inputs are cheaper than deep research.

Bad models happen, and the very nature of what appears to be the Trump constituency probably made most models worse. Forecasts are best at telling us what old data says about new data, and the trouble with relying on existing data is that large deviations in its underlying assumptions may go unnoticed. Those deviations are especially dangerous when they bolster existing confirmation bias among analysts and journalists, but the direction of that bias is often unclear. Did we all believe Clinton would win because of bad data, or did we ignore bad data because we believed Clinton would win? There’s the question for the ages.

Perhaps the lesson of the Trump presidency is that it was truly unpredictable. Good models often fail to accommodate events outside the bounds of their sensitivity, and sounding the alarm on their flaws would require knowing or suspecting more about elections than the data fed to the polls could reveal.

For unfortunate Cassandras like Silver himself, caution was roundly ridiculed thanks to that same lack of perspective. But if this is the new normal, pollsters will have to adapt in order to stay relevant.

* This article originally suggested that electoral votes had not been reapportioned in 50 years. We regret the error.