Ballot

Everything you wanted to know about election forecasting but were afraid to ask.

Srinivas Bhogle & Rajeeva Laxman Karandikar, 19 Mar 21

On 10 November last year, I settled down with a hot mug of coffee to track the results of the Bihar Assembly election. There was a sense of déjà vu: loud and pompous anchors competing for attention; studio guests trying to decide whether to smile or display gravitas; lurid pie charts and bar graphs in the background. The one difference was that the field reporters were wearing N95 masks.

Soon, the clash between the National Democratic Alliance (NDA) and the Mahagathbandhan (MGB) began to heat up, like a high-octane run chase in an IPL match. Scorecards filled up with rapidly swelling numbers. Discussions gained volume and acrimony. It was borderline pandemonium as the rival teams inched towards the winning target.

In the NDA versus MGB game, the winning target was 122 seats. But before this real contest, there had already been proxy contests between pollsters, each predicting how the total of 243 assembly seats would be split between the two coalitions. Many predicted a close race, but most were also saying that the MGB’s tally would surpass the NDA’s. The overwhelming impression, created by news channels and experts, was that it was curtains for incumbent chief minister Nitish Kumar.

Ultimately, the NDA pipped the MGB by 14 seats. So when Nitish Kumar returned as chief minister, the popular view was that the pollsters had got it wrong again.

Not true.

In fact, only two pollsters got it all wrong: one ^[1] predicted that Nitish Kumar would go down 80-150, while the other ^[2] declared that it would be 55-180. (Ironically, these were the two pollsters who had correctly predicted that the NDA would get 350 seats in the 2019 Lok Sabha elections.)

This is the exasperating part of the forecasting spectacle. Every election seems to mess up a perceived pattern, bring in fresh doubts and questions, open new debates or reopen old ones.

Do election forecasts really work? How do they work? Why do they work? Are they even supposed to work? Is the prediction methodology different in general and assembly elections?

I went to Professor Rajeeva Karandikar for answers. Rajeeva, currently the director of Chennai Mathematical Institute, has predicted Indian elections for over 20 years. During his most active phase (2005-2014), Rajeeva made 28 forecasts of Lok Sabha and Vidhan Sabha elections in association with the Centre for the Study of Developing Societies. ^[3] Of these, he rated 16 forecasts to be ‘better or much better than others,’ 8 to be ‘comparable with other forecasts’, and 4 to be ‘much worse than others.’

“Of course, election forecasts work,” he said. “This isn’t quackery or serendipity or just luck. It’s science. Science has limitations, but it also holds true power.”

If election forecasting seems more like a game than like science, it’s simply because a lot of new players have jumped into the contest. They throw their bats around at every opportunity and occasionally manage to send the ball soaring into the night skies. But it’s not a mindless game. Predictions are based on established laws of sampling and probability theory.

This essay explains why election forecasts can, and do, work. The explanation will involve talk about populations, samples, randomness, probabilities and error estimates. The eureka moment will be the discovery that even in a country with the largest number of voters in the world, and with a complicated parliamentary system, it is possible to predict the winner with a high degree of accuracy.

But there’s also a humbling caveat: this isn’t a magic show. The prediction can be good, or very good, but we must be mindful of both the power and the limitation of opinion polls.

Randomness

The election to the Lok Sabha is the world’s biggest poll. Over 90 crore citizens are eligible to vote to elect 543 members of Parliament. In the 2019 election, over 60 crore Indians actually cast their vote. Indian election forecasting attempts something that’s almost impossible to believe. We carefully select about 30,000 or more of these 60 crore voters (a ‘sample’ drawn from the ‘population’) and ask them which party or alliance they voted for. Based on their answers, we try to accurately predict who is going to win the actual election.

The way we choose our sample is vitally important, and also somewhat intriguing. Here’s an example to show how it’s done. Let’s suppose voting has been completed in the Bangalore Rural constituency for the Lok Sabha elections. About 8 lakh votes have been cast. The results are to be announced 7 days later, but, before the final count, we are asked to predict the likely winner using a quick ‘day-after’ survey. What should be the minimum sample size for this survey?

First alert: In India’s first-past-the-post system, winning margins can be very small. So, while choosing the sample size, we have to decide what kind of winning margin we are expecting: 5% more votes for the winner, or 2%, or 1%? Second alert: We’re playing with probabilities, and there’s always a chance that we’ll get our prediction wrong. How often do we want to get it right? 99 times out of 100, or 95 times out of 100?

Now, the answer to the sample size question. If we assume that the winner gets at least 2% more votes than the loser, and if we want to get our prediction right 99 times out of 100, then we can mathematically prove (using the Central Limit Theorem, for those interested) that a random sample of 4161 ^[4] voters will do the job.
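
For readers who want to see where a number like 4161 can come from, here is a minimal sketch using the textbook normal-approximation formula for a proportion. It assumes a worst-case 50-50 split, an error margin of 2 percentage points, and a multiplier of 2.58 for 99% confidence; these parameter choices and the function name are our assumptions for illustration, not a transcript of the calculation behind the figure.

```python
import math

def sample_size(margin, z=2.58, p=0.5):
    """Smallest n for which z * sqrt(p * (1 - p) / n) <= margin.

    margin: tolerated error as a fraction (0.02 means 2 percentage points)
    z:      normal multiplier; 2.58 is the usual value for 99% confidence
    p:      assumed vote share; 0.5 is the worst case
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

print(sample_size(0.02))
```

Notice that the size of the voting population does not appear anywhere in the formula.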

But here’s what’s truly mind-boggling: What would happen if, instead of 8 lakh, 16 lakh votes had been cast? Our sample size, with a 2% winner advantage and a 99% probability of success, would still be 4161. In other words, even if the population is doubled—or is multiplied by 10 or 100—the sample size would remain the same.

In all this, however, there is a non-negotiable proviso: The sample must always be chosen randomly.

That does not mean the choices must be made arbitrarily. Far too often, the popular perception is that random is just another word for ‘arbitrary.’ A sample is random, in the statistical sense, only if every voter from the population has an equal chance of being selected.

I must remind those who already know how to make this distinction that while the statistician venerates randomness, the layman is often scornful. On a national TV channel, someone once asked Rajeeva: “But how can you rely on CSDS data when they openly declare that they only do randomised surveys?”

Size

During the 2012 Uttar Pradesh Assembly election, Rajdeep Sardesai on CNN-IBN announced that most polls were predicting a hung state assembly. Rajeeva, also seated in the studio, appeared to surprise the others by saying that data from the CSDS ‘day-after’ poll suggested an absolute majority for Akhilesh Yadav’s Samajwadi Party. Not convinced, but wanting to keep the fun going, Sardesai announced: “If this prediction is right, I take the credit; if it’s wrong, Rajeeva takes the blame!”

Another situation often played out in the CNN-IBN studio. When one of the two contesting alliances established a significant lead, Sardesai would ask the spokesperson of the losing alliance to accept defeat: “Come on, you must now admit that you are going to lose this election!” The response would typically be on the lines of the following: “We haven’t counted even 5% of the total vote so far! By looking at a few thousand counted votes, how is it possible to know how 10 lakh voters have really voted?”

Intuitively, it is easier to believe that the larger the fraction of the vote counted, the more reliable the prediction is likely to be. The pollster sees things differently. If it’s a straight contest with the leader maintaining a 2% lead over his rival, even a random sample of 100 will correctly predict the winner 65% of the time.

If the sample size is increased to 1000, then the prediction will have a 90% chance of being correct. And when the sample size is raised to 4161, as we explained earlier, there’s a 99% chance of success. Even a ten-fold increase in the sample size thereafter will merely increase the chance by a fraction of a fraction. So, increasing the sampling fraction makes no sense at all: it only makes the game more expensive.
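
These percentages can be roughly reproduced with a normal approximation. The sketch below assumes that the ‘2% lead’ means the leader's true share is 2 points above the halfway mark, a 52-48 contest; that reading, like the function names, is an assumption made for illustration. Under a 51-49 reading the probabilities would be lower.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def win_call_probability(n, leader_share=0.52):
    """Chance that a simple random sample of n voters shows the true leader ahead."""
    std_error = math.sqrt(leader_share * (1.0 - leader_share) / n)
    return normal_cdf((leader_share - 0.5) / std_error)

for n in (100, 1000, 4161, 41610):
    print(n, round(win_call_probability(n), 4))
```

Under these assumptions the probability climbs from about 0.66 at 100 voters to about 0.90 at 1000 and beyond 0.99 at 4161, after which a ten-fold larger sample has very little left to add.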

The only situation when it will make more sense to increase the sample size is when the contest is very tight. In such situations, pollsters might want to work with tighter error margins. Instead of an error margin of plus or minus 2%, they could consider an error margin of plus or minus 1%. ^[5] With this tighter margin, the sample size for 99% success will go up from 4161 to 16,644. ^[6] In most cases, it simply isn’t feasible to insist on such high accuracy because of the cost factor.
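
The jump from 4161 to 16,644 follows from how the required sample size scales with the error margin. Under the usual normal approximation (an assumption here, with z the confidence multiplier, p the vote share and e the error margin), the sample size grows as the inverse square of the margin, so halving the margin quadruples the sample:

```latex
n(e) = \frac{z^{2}\,p(1-p)}{e^{2}}
\quad\Longrightarrow\quad
\frac{n(0.01)}{n(0.02)} = \left(\frac{0.02}{0.01}\right)^{2} = 4,
\qquad
4 \times 4161 = 16{,}644 .
```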

It’s now easy to guess what might have happened with a large number of seats in the Bihar Assembly poll. A simple count shows that the margin of victory was less than 2% in as many as 40 seats out of 243. ^[7]

Seats

In India, elections can be messy. ^[8] What matters here is not which party gets the largest number of votes, but which party gets the largest number of seats. It’s possible to imagine that more votes mean more seats, and the two are often sufficiently correlated. But there can be situations where a party with a lower popular vote percentage can win a larger number of seats.

Take the 2018 Karnataka Assembly elections, where the Indian National Congress (INC) got 38.14% of the popular vote and the Bharatiya Janata Party (BJP) got 36.35%. But the BJP bagged 24 more seats than the INC—104 to the INC’s 80. The INC won with unnecessarily big margins in some constituencies and lost more of the close races. To win, the winner needs only one vote more than the loser.

In many ways, this is the central issue in calling elections in India: how does one correctly predict the winner, and also the number of seats that each party or alliance will win?

Poll prediction in India has two stages. In the first stage, we conduct surveys to estimate the likely percentage of votes every party or alliance will get. In the second stage, we convert these vote percentages into seat estimates. If the gap in the vote percentage between the largest and the second largest party or alliance is sufficiently large, it is easy to guess who will win the seat; if the gap is smaller, then it becomes tricky.

It is practically impossible, or very expensive, to survey every constituency, especially if it is a Lok Sabha poll. ^[9] The pollster therefore has to be innovative in choosing which constituencies to survey, and, more generally, in devising a sampling scheme that most faithfully represents the trend in the state and the country. Surveys are a hard grind, and not sufficiently remunerative. ^[10] There is also irony in the fact that while election telecasts attract big sponsorship, election surveys seem to lack sufficient funding, even though they determine the quality of the prediction.

When it comes to converting a party’s vote percentage into its number of seats, all pollsters claim to have a formula. In India’s first few election forecasts in the 1980s, led by Prannoy Roy, Ashok Lahiri and others, the key ingredient was the so-called ‘Index of Opposition Unity’ (IOU). The IOU relied on the classical first-past-the-post electoral axiom: all that matters while making an election forecast is to see who’s winning the race.

And, since this is hardly ever a two-horse race, the winner doesn’t even need to poll 50% of the votes. With three serious parties in the fray, even polling 38% of the votes fetches a win in most cases.

The best way to make it harder for the leading party is for all or most of the other parties to come together. The IOU was an attempt to measure the degree of togetherness amongst the rivals; the higher the IOU, the harder it is for the first in the race to score a facile win.
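
The essay does not spell out how the IOU is calculated. A commonly cited definition, used here as an assumption, is the largest opposition party's vote share expressed as a percentage of the combined vote share of all opposition parties. The vote shares in this sketch are made up for illustration.

```python
def index_of_opposition_unity(opposition_shares):
    """Largest opposition share as a percentage of the total opposition share."""
    total = sum(opposition_shares)
    if total <= 0:
        raise ValueError("opposition vote share must be positive")
    return 100.0 * max(opposition_shares) / total

# Hypothetical seat: the leading party polls 38%, the opposition is split three ways.
fragmented = index_of_opposition_unity([30.0, 20.0, 12.0])
# The same opposition vote pooled behind a single candidate.
united = index_of_opposition_unity([62.0])
print(round(fragmented, 1), united)
```

In the fragmented case the index is a little under 50 and the leader wins comfortably with 38%; in the united case the index is 100 and the same 38% loses to 62%.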

Swing

We normally associate ‘swing’ with what Mohammed Shami or Jasprit Bumrah can do to the cricket ball. But come elections and the swing acquires a completely different spin. Imagine that the BJP’s popular vote percentage in a constituency is currently 39.7%, but, in the immediately preceding election, it had been 36.3%. There is therefore now a swing of 39.7 – 36.3 = 3.4% in favour of the BJP.

The swing is a key variable in making election forecasts, and must, therefore, be studied carefully: is the swing restricted to just the constituency, or does it also extend over a larger region?

The swing is often a double-edged sword. If there is a swing of +3.3% in favour of the BJP in a two-party contest, it means that the BJP has pulled away from its rival by 6.6%, which is significant.
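
In a two-party contest every point gained by one party is a point lost by the other, so the lead moves by twice the swing. Writing a and b for the two parties' previous vote shares (symbols introduced here for illustration) and s for the swing:

```latex
(a + s) - (b - s) = (a - b) + 2s,
\qquad
s = 3.3 \;\Longrightarrow\; 2s = 6.6 .
```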

Next, consider the scenario where the BJP has suffered a swing of -2%, but the corresponding gain is distributed among several other parties. In this case, the BJP isn’t hurt as much, because no single rival closes the gap by the full amount.

It is easy to understand why the IOU becomes so critical in such situations. But, and here’s the rub, the exact mathematical formula connecting IOU to the seat count prediction is not easy to find. I searched through the big and small print of The Verdict by Dorab Sopariwala and Prannoy Roy, but the formula remained elusive.

Rajeeva suggests that it was likely based on simple heuristics: something like ‘if the IOU is less than 25%, give the first-placed party 75% of the seats.’ It may also have involved intelligent tweaking based on current survey data, historical data, informal feedback, expert opinion, gut feeling, and so on.

The Last Mile

This narrative now enters deeper waters. We will now describe the method devised by Rajeeva to convert vote percentages into seats, which Yogendra Yadav christened the ‘Probabilistic Count Method.’

Let us explain by considering a simple situation. Imagine that there is a constituency with only two parties. In the previous election, the winner polled 55% of votes and the opponent got 45%. Imagine that our survey indicates that there is now a 4% swing away from the winner. Yesterday’s 55-45 margin could today have become 51-49.

The catch is that we absolutely cannot rule out the possibility that the candidate currently trailing 49-51 could eventually win the race. There could be sampling errors. This is why we need that plus-minus 2% margin of error.

Now we must ask: What is the probability that the trailing candidate becomes the winner? The candidate is clearly a little less likely to win than the rival, so the winning chance is below 50%. But it isn’t 0%. A reasonable guess could be that there is still a winning probability of, say, 44%, or “0.44” if we want to pretend to be a probabilist of pedigree.

How do we compute such probabilities? We work with probability distributions ^[11] derived from appropriate Bayesian models. ^[12] Such computations allow us to cross the last mile, and predict the seat count using rigorous methodology instead of relying on more subjective guesswork.

Here’s how we proceed. Imagine that there are 40 constituencies (seats) in a state, and pre-poll surveys come in with vote share percentage data for the top two parties as follows: Seat 1: 49-51, Seat 2: 48-52, Seat 3: 56-44; Seat 4: 61-39, and so on till Seat 40: 47-53.

For each seat, we compute the winning probability for a party using the appropriate probability distribution. Let’s say for Seat 1 the party has a winning probability of 0.44, Seat 2 is at 0.40, Seat 3 is at 0.69, and so on.
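
To make the idea concrete, here is a deliberately simplified stand-in for that computation. It is not the Bayesian model referred to above: it merely assumes that the true lead is normally distributed around the surveyed lead, with a hypothetical standard deviation `lead_sd` chosen for illustration. It will therefore not reproduce the illustrative figures of 0.44, 0.40 and 0.69.

```python
import math

def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def seat_win_probability(own_share, rival_share, lead_sd=10.0):
    """P(true lead > 0) when the true lead is normal around the surveyed lead.

    Shares are in percent; lead_sd (percentage points) is a hypothetical
    uncertainty on the lead, not a value taken from any real survey.
    """
    return normal_cdf((own_share - rival_share) / lead_sd)

for own, rival in [(49, 51), (48, 52), (56, 44)]:
    print(own, rival, round(seat_win_probability(own, rival), 2))
```

The qualitative behaviour is what matters: a trailing party gets a probability below 0.5 but well above zero, and the probability rises smoothly as the surveyed lead grows.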

Then, and here’s the icing on the cake, we add up these 40 individual probabilities to get a number like 24.52, rounded off to 25. So, a single point prediction could be: “The leading party will win 25 out of the 40 seats.”
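
Adding probabilities works because the expected number of seats won is simply the sum of the chances of winning each seat. A minimal sketch, with five made-up probabilities standing in for the 40 seats:

```python
def expected_seats(win_probabilities):
    """Expected number of seats won = sum of the per-seat winning probabilities."""
    return sum(win_probabilities)

# Hypothetical per-seat winning probabilities for a five-seat state.
probs = [0.44, 0.40, 0.69, 0.90, 0.25]
total = expected_seats(probs)
print(round(total, 2), "seats expected; point prediction:", round(total))
```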

I asked Rajeeva what he might do if the predicted number of 25 appeared “slightly off.” Would he ‘tweak’? He explained that mathematical models keep evolving, and do allow the option to vary parameters. But Rajeeva said he’s never tweaked the model in favour of a verdict. There’s only one true story and that’s the story the data is telling.

Instead of quoting just one number, it is customary to make a prediction over a range; something like “the number of seats is likely to be in the 21 to 29 range.” Every Indian pollster sends out such figures for the likely seat count. Lately, these intervals are getting wider because everyone wants to ‘trap’ the final seat count in their declared interval so that they can claim that they got it right. We are likely to see more of this range inflation as election prediction becomes a number-guessing game more than a scientific exercise.
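
One way such a range could be derived, sketched here under the strong assumption that seats are decided independently of one another, is to build the full distribution of the seat count from the per-seat probabilities and trim an equal amount from each tail. The function names, the 90% coverage and the probabilities are assumptions for illustration, not a description of any pollster's actual procedure.

```python
def seat_count_distribution(win_probabilities):
    """Exact distribution of seats won, assuming seats are independent."""
    dist = [1.0]  # dist[k] = probability of having won k seats so far
    for p in win_probabilities:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)  # this seat is lost
            nxt[k + 1] += mass * p      # this seat is won
        dist = nxt
    return dist

def central_range(dist, coverage=0.9):
    """Seat range that leaves at most (1 - coverage) / 2 in each tail."""
    tail = (1.0 - coverage) / 2.0
    cumulative, low = 0.0, 0
    for k, mass in enumerate(dist):
        cumulative += mass
        if cumulative > tail:
            low = k
            break
    cumulative, high = 0.0, len(dist) - 1
    for k in range(len(dist) - 1, -1, -1):
        cumulative += dist[k]
        if cumulative > tail:
            high = k
            break
    return low, high

probs = [0.44, 0.40, 0.69, 0.90, 0.25]  # hypothetical per-seat probabilities
dist = seat_count_distribution(probs)
print(central_range(dist))
```

Since seats within a state tend to move together, an interval computed under independence is likely to be too narrow; an honest range should reflect that, rather than being stretched simply to catch the final number.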

We now have a dozen predictions before every election that span all possible extremes. The spectator is confused and seeks objective answers. He is entitled to such answers; instead, he is now offered a ridiculous construct called the ‘poll of polls’ ^[13] that confirms that honest enquiry has truly metamorphosed into crass entertainment.

This is a matter of concern, and not only in India. Any pollster joining the arena must be mandated to follow some disclosure norms. People deserve to know who conducted the survey, when it was conducted, who supervised it, and what the sample size and sampling methodology were. Independent and impartial regulators must also have access to data, raw percentages obtained from the survey and computation outputs. While TV channels may be unable to display this information on air, there should be no difficulty in posting such information online.

The Challenge

Probability distributions can be a pollster’s best ally only if the forecast is truly based on a random sample. Violate the randomness, and the pitch is queered. In 2019, over 60 crore votes were cast in the Lok Sabha election. With as large a voting population as India’s, it can be overwhelming to consider how to draw a random sample, design a sampling scheme, and decide what a reasonable sample size would be.

If we choose a sample of 4161 from each of the 543 Lok Sabha constituencies, we are already looking at an impossibly large sample size of more than 22 lakh. Do we need such a large number? Can anyone afford such a large number? If this number has to be trimmed, what might be the best way to do it?

We are curious to know how other election pollsters handle this situation, because nobody discloses such information openly. Many rely on quota sampling, an idea borrowed from market research, where you identify attributes (age, gender, caste, religion, education level) that influence the population’s voting choice, and choose a sample that best represents this population. The benefit of quota sampling is that it is quick and relatively inexpensive. But it is definitely not random sampling. Quota sampling is probably a good option if you want to decide the colour of a new toothpaste brand. It doesn’t lend itself as well to a national poll.

Telephone surveys have been popular abroad, and are increasingly gaining currency in India. These are sufficiently inexpensive and infuse at least a modicum of randomness, but we can’t be sure if such surveys touch the entire voting population. It is not hard to imagine that this method would favour rich, urban and male voters. To be sure, tweaks and correction factors are possible, ^[14] and pollsters can get quite clever at this, but the overall integrity of the process is still somewhat compromised.

How, then, does one conduct an all-India election survey with a manageable sample size and while respecting the tenets of statistical randomness? In 1997, the team from CSDS, with Yogendra Yadav at the helm, was the first to consider the idea of circular random sampling for election surveys. But they needed a proven statistics expert to confirm that the scheme would work. They called Rajeeva, then of the Indian Statistical Institute in Delhi, to ask “Scheme chalegi?” Rajeeva replied in Delhi lingo: “Daudegi!” The scheme’s got wings.

When the I.K. Gujral government collapsed in November 1997, a surprise mid-term poll was announced. The CSDS team, who thought they had a couple of years to develop their election methodology, now discovered that they had only two months. “Could we take off in just two months?” they asked Rajeeva. That was the starting point of an association that would last for more than a decade.

In the CSDS circular sampling method, only every fifth Lok Sabha seat becomes part of the sample. As a result, we’re extrapolating the result for 543 seats from a sample seat size of 108. To choose these seats, we number Lok Sabha seats 1, 2, 3, and so on till 543. Then, we randomly pick a number between 1 and 543—378, for example—and then select seats with numbers 378, 383, 388 and so on. While numbering seats, we also make sure that all seats in a state are in a single bunch so that the sample size is proportional to the total number of seats: Uttar Pradesh, with 80 seats, will therefore have the most (16) seats in the sample.
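
The seat-selection step can be sketched in a few lines. The wrap-around rule (after seat 543, continue counting from seat 1), the function name and its parameters are our assumptions about how ‘circular’ is implemented; the text above fixes only the random start and the step of five.

```python
import random

def circular_sample(total_seats=543, step=5, sample_size=108, start=None, rng=None):
    """Pick a start seat, then every step-th seat, wrapping past the last seat."""
    rng = rng or random.Random()
    if start is None:
        start = rng.randint(1, total_seats)  # inclusive at both ends
    return [(start - 1 + j * step) % total_seats + 1 for j in range(sample_size)]

seats = circular_sample(start=378)
print(seats[:3])     # begins 378, 383, 388 as in the example above
print(seats[33:36])  # reaches seat 543, then wraps around to seat 5
print(len(seats))
```

Because seats are numbered state by state, stepping through the list at a fixed interval automatically gives each state a share of the sample roughly proportional to its number of seats.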

We continue (constituency -> booth -> voter) with the same circular process, randomly picking polling booths within a constituency, and voters within a polling booth. With 8 booths per constituency and 25 voters per booth, we get a sample size of 25 x 8 x 108 = 21,600; with 50 voters, the sample size doubles to 43,200. It can be proved that with a sample size of 21,600, the accuracy is at least 99%. With 43,200, the accuracy is at least 99.5%.

The first circular random sampling exercise succeeded. Based on the CSDS-conducted survey and Rajeeva’s prediction of the seat count, India Today forecast 214 seats for the Atal Bihari Vajpayee-led NDA coalition a week before the first phase of the 1998 polling. After polling was completed, the NDA seat count was revised to 251 based on the ‘day-after’ poll projections. The final NDA count was 252.

Finally, it’s time to put together all the pieces of the election prediction jigsaw puzzle. Let’s look at how the result of the 2019 Lok Sabha election was predicted. We’ll illustrate with the example of Karnataka, which sends 28 MPs to the Lok Sabha. For each of these seats, we already knew the percentage of votes obtained by each of the political parties in the 2014 Lok Sabha election.

The circular sampling exercise would’ve picked up the 2019 vote percent for about a fifth of these 28 seats, that is, 5 or 6 seats. For these seats, suppose that the BJP gained an average of 3.3% of the vote in comparison to 2014, and the INC lost an average of 2.2%. In the pollster’s language, this will be a +3.3% swing for the BJP and a -2.2% swing for the INC. We now make a crucial assumption: the swings of +3.3% for the BJP and -2.2% for the INC, observed in reality over only the 5 or 6 sampled seats, will also hold in the remaining seats in the state. In other words, we assume a uniform swing over all of Karnataka.

The rest of the calculation is now simple. To get the 2019 vote percentages for each of the 28 Karnataka seats, add 3.3% to the 2014 BJP vote percentage and subtract 2.2% from the 2014 INC vote percentage. Finally, use the Probabilistic Count Method to predict the party seat tally for Karnataka.
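
The uniform-swing step is plain arithmetic applied seat by seat. In the sketch below the 2014 vote shares are invented for illustration (they are not real results), and only the two swings come from the example in the text.

```python
def apply_uniform_swing(previous_shares, swings):
    """Add each party's state-wide swing to its previous share in every seat."""
    projected = []
    for seat in previous_shares:
        projected.append({
            party: round(share + swings.get(party, 0.0), 1)
            for party, share in seat.items()
        })
    return projected

# Hypothetical 2014 vote shares (%) for three seats; not real results.
shares_2014 = [
    {"BJP": 43.0, "INC": 41.0},
    {"BJP": 38.5, "INC": 45.2},
    {"BJP": 51.3, "INC": 36.8},
]
swings = {"BJP": 3.3, "INC": -2.2}  # the swings used in the example above

projected_2019 = apply_uniform_swing(shares_2014, swings)
for seat in projected_2019:
    print(seat)
```

The projected shares for each seat are then what a seat-level probability model takes as its input.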

We repeat this process for every state: first calculate the 2019 vote percentages for every party, and then use the Probabilistic Count Method to predict the number of seats for every party. By adding the party seat tallies over all the states, we arrive at an all-India seat prediction.

Details

The reader probably has a lot of questions. Since this is not an interactive story, we will guess the likely questions and attempt answers.

Is the assumption of a uniform state-wide swing tenable?

Yes, looking at past data, and recognising that many election issues are state-centric. If, however, there are reasons to believe that the swing percentages differ significantly within a state—between western and eastern UP, for instance—then we can assume a uniform swing only within a region. Another variable application of the method could be to assume different uniform swings for different voting phases. We could even combine the state, region, and phase swings for a sharper estimate.

Which works best: opinion poll, exit poll, or day-after poll?

Currently, pre-election opinion polls are not allowed once the campaigning for the first phase ends, because of the suspicion that they influence voting preferences. There is also strong evidence that, unlike the UK or US with their numerous ‘safe’ seats and steadfast red and blue loyalties, India sees a lot of volatility; studies have shown that as many as 30% of voters change their mind in the two to three weeks before polling day.

Opinion polls also fail to identify which of the voters in the sample will actually cast their vote. Therefore, opinion polls conducted a few weeks before the poll can, at best, indicate the mood of the nation at that point of time, but may be unable to accurately predict the final outcome.

Exit polls confirm that the voter has indeed voted, but it is hard to enforce random sampling. Additionally, respondents may be uncomfortable talking in a crowded public space.

Day-after polls are ideal, because they can be random, and the voter is more relaxed. Asking them to cast their vote in make-believe ballot boxes also ensures more truthful responses.

What are the questions one asks in an opinion poll?

The top two questions in an opinion poll are, in order: which party will you vote for, and who is your preferred Prime Minister or Chief Minister? A lot depends on how the questions are asked. CSDS’s face-to-face meetings with randomly chosen voters have shown significantly better results than anonymous phone calls. CSDS’s method takes time and resources and that is what makes their data more expensive than others’.

Are Indian and US poll predictions different? What exactly does that Nate Silver guy do?

Nate Silver doesn’t do his own surveys. He relies on data published by others and makes appropriate adjustments based on the track record of these agencies. His ‘popular-vote-to-electoral-college-vote’ method is comparable to our ‘vote-to-seat’ model, but instead of probability computations, he uses scenario simulations. While Nate Silver was successful in 2008, 2012 and 2020, don’t forget that he failed in 2016—very likely because of overwhelming reliance on the prior classification of states as ‘red’ (Republican), ‘blue’ (Democrat) and ‘battleground’ states.

EVMs

Electronic voting machines have been used in Indian elections since 2004. The EVM was a splendid creation: it was robust and reassuringly primitive. The EVM’s ‘look and feel,’ and the order in which it displays candidate names, are identical to the old ballot paper. It is impossible to connect EVMs to any external device or network, and the EVM’s sole capability is to accept the voter’s input, and produce an instant count of the number of votes polled by each candidate.

In the early years, it was not possible to physically verify if EVM counts were correct. This weakness was corrected by the introduction of the Voter Verifiable Paper Audit Trail (VVPAT) system, starting in 2014. Every vote cast by every voter on the EVM could now be audited.

This should have been the end of the matter, but every losing party started blaming the EVM for its defeat. At one point before the 2019 Lok Sabha election, there was the ridiculous demand that half the votes cast on the EVMs should be physically recounted using the VVPAT system.

A committee constituted by the Election Commission, which included Rajeeva, proposed a series of checks that could confirm, with a probability of over 99.9999%, that the poll outcome had not been compromised. In reality, nobody truly believed that EVMs could be tampered with; this was just idle bluster from the Opposition.

There is no denying that EVMs robbed the TV election coverage of its old romance. When ballot papers were manually counted, the drama used to continue well into the evening of the day after. With EVMs, it was all over by late afternoon the same day.

Rajeeva misses the good old days. His early seat projection model, based on probabilistic constructs, could predict the final seat tally faster, and with greater accuracy, than rival TV channels. Even with just 15-20% of the counting completed, one could see the end picture clearly.

How did that happen? In those days, they used to empty ballot boxes from all the different polling booths, mix the ballot papers well, and then start the counting. Because of this mixing, the ballot papers selected for the first round of counting were effectively equivalent to, say, a 10% random sample of all the votes cast. By the time the first 2-3 rounds of votes were completed, it was very easy to spot the winner unless things were really close.

Rajeeva’s partnership with CNN-IBN, CSDS and Yogendra Yadav was a long one. Among the most memorable polls he worked on was the Assam election of 2011. No one thought the incumbent INC government, led by the late Tarun Gogoi, would get more than 45 seats in a house of 126. So, when the CNN-IBN screens displayed a predicted vote share of 36% for INC, and an absolute majority with a range of 64-72 seats, the chief minister was delighted. INC eventually bagged 78 seats with a vote share of 39%.

When Gogoi later met Rajeeva at the inauguration of the Tezpur centre of the Indian Statistical Institute in July 2011, he told him in Bengali that only three people believed he would be re-elected: “Tumi, ami aar Yogendra!”

Rajeeva Laxman Karandikar is a mathematician, statistician and psephologist. He is currently the director of Chennai Mathematical Institute. He is a Fellow of the Indian Academy of Sciences and the Indian National Science Academy.

Srinivas Bhogle is an analytics enthusiast, aerospace watcher and a long-time teacher. He is currently honorary scientist at CSIR’s Fourth Paradigm Institute, Bengaluru.