Post-Election Reflection

Kelly Olmos

2024/11/18

My Model and Its Predictions

I ran a state-level ordinary least squares (OLS) regression. For the non-battleground states, I used lagged vote shares to predict which candidate would win each state. For the battleground states, I used polling data, fundamentals (Q2 GDP growth), and demographics (age, race, and education level) to predict the winner. Since there was not enough polling data for all states, I focused on the ones that had it: the battleground states. Because some variables, like race and education level, may be correlated, I ran an elastic net on my OLS specification to deal with multicollinearity. Additionally, I ran 10,000 simulations to create 95% confidence intervals.

The full list of predictors in my OLS regression is: latest_pollav_DEM, D_pv2p_lag1, D_pv2p_lag2, q2_gdp_growth, white, black, american_indian, asian_pacific_islander, less_than_college, college, hispanic, age_18_to_29, age_30_to_64, age_65_75plus.
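For concreteness, here is a minimal sketch of how this setup could look in Python with scikit-learn. The file name, the target column D_pv2p, and the pipeline details are my assumptions; only the predictor names come from the list above.

```python
import pandas as pd
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical input: one row per state, with the predictors listed above
# and the actual Democratic two-party vote share ("D_pv2p") as the target.
df = pd.read_csv("state_data.csv")  # assumed file name

predictors = [
    "latest_pollav_DEM", "D_pv2p_lag1", "D_pv2p_lag2", "q2_gdp_growth",
    "white", "black", "american_indian", "asian_pacific_islander",
    "less_than_college", "college", "hispanic",
    "age_18_to_29", "age_30_to_64", "age_65_75plus",
]
X, y = df[predictors], df["D_pv2p"]

# Elastic net blends L1 and L2 penalties, shrinking correlated predictors
# (e.g., race and education shares) rather than letting OLS inflate their
# coefficients; standardizing first puts all predictors on the same scale.
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9], cv=5, random_state=42),
)
model.fit(X, y)
df["my_prediction"] = model.predict(X)
```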

My model predicted a Trump victory with 285 Electoral College votes to 253 for Harris. It inaccurately predicted that Georgia and Arizona would go for Harris.

| State | Actual D two-party share (%) | Predicted D two-party share (%) | Winner | Predicted winner |
|----------------|---------:|---------:|-------|--------|
| Arizona        | 47.15253 | 50.24998 | Trump | Harris |
| Georgia        | 48.87845 | 50.04388 | Trump | Harris |
| Michigan       | 49.30156 | 48.95665 | Trump | Trump  |
| Nevada         | 48.37829 | 49.74023 | Trump | Trump  |
| North Carolina | 48.29889 | 49.29332 | Trump | Trump  |
| Pennsylvania   | 48.98235 | 48.92587 | Trump | Trump  |
| Wisconsin      | 49.53156 | 48.62988 | Trump | Trump  |

In the bar chart below, I have plotted my predicted Democratic two-party vote share compared to the actual Democratic two-party vote share in the battleground states.

Accuracy of My Model

To quantify the accuracy of my model, I calculated the MSE, RMSE, MAE, and Bias. The findings are presented in the table below.

The MSE shows that the average squared difference between predicted and actual Democratic vote shares is 2.104.

The RMSE indicates that the typical prediction error is approximately 1.45 percentage points.

The MAE shows that the average absolute difference between my predicted and actual Democratic vote shares is 1.13 percentage points.

The bias score indicates that I overestimated the Democratic two-party vote share by 0.76 percentage points on average.

| Metric | Value |
|--------|----------:|
| MSE    | 2.104484  |
| RMSE   | 1.450684  |
| MAE    | 1.131759  |
| Bias   | -0.759456 |
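As a quick check, these aggregate metrics can be reproduced directly from the predicted and actual battleground values in the first table; a minimal sketch follows. The sign convention here (actual minus predicted) means a negative bias is an overestimate of Harris's vote share.

```python
import numpy as np

def error_metrics(actual, predicted):
    """Accuracy metrics for predicted vs. actual D two-party vote share."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    err = actual - predicted  # negative = overestimated Harris
    return {
        "MSE": np.mean(err ** 2),            # average squared error
        "RMSE": np.sqrt(np.mean(err ** 2)),  # typical error, in points
        "MAE": np.mean(np.abs(err)),         # average absolute error
        "Bias": np.mean(err),                # average signed error
    }

actual = [47.15253, 48.87845, 49.30156, 48.37829, 48.29889, 48.98235, 49.53156]
predicted = [50.24998, 50.04388, 48.95665, 49.74023, 49.29332, 48.92587, 48.62988]
print(error_metrics(actual, predicted))
# ~ {'MSE': 2.104, 'RMSE': 1.451, 'MAE': 1.132, 'Bias': -0.759}
```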

I also performed a state-level analysis to test how well my model performed in each battleground state. The model performed worst in Arizona, where I overestimated the Democratic two-party vote share by a little over three percentage points. It performed best in Pennsylvania, where the MSE was only 0.003.

| State | Actual (%) | Predicted (%) | Winner | Pred. winner | MSE | RMSE | MAE | Bias |
|----------------|---------:|---------:|-------|--------|----------:|----------:|----------:|-----------:|
| Arizona        | 47.15253 | 50.24998 | Trump | Harris | 9.5942180 | 3.0974535 | 3.0974535 | -3.0974535 |
| Georgia        | 48.87845 | 50.04388 | Trump | Harris | 1.3582278 | 1.1654303 | 1.1654303 | -1.1654303 |
| Michigan       | 49.30156 | 48.95665 | Trump | Trump  | 0.1189596 | 0.3449052 | 0.3449052 | 0.3449052  |
| Nevada         | 48.37829 | 49.74023 | Trump | Trump  | 1.8548872 | 1.3619424 | 1.3619424 | -1.3619424 |
| North Carolina | 48.29889 | 49.29332 | Trump | Trump  | 0.9888821 | 0.9944255 | 0.9944255 | -0.9944255 |
| Pennsylvania   | 48.98235 | 48.92587 | Trump | Trump  | 0.0031896 | 0.0564765 | 0.0564765 | 0.0564765  |
| Wisconsin      | 49.53156 | 48.62988 | Trump | Trump  | 0.8130231 | 0.9016779 | 0.9016779 | 0.9016779  |

Below is a heat map with the error metrics visualized. The bias in my model did not run consistently in one direction. In four states (Arizona, Georgia, Nevada, and North Carolina) it overestimated the Democratic two-party vote share, while in Michigan, Pennsylvania, and Wisconsin it overestimated Trump's winning margin, although the bias toward Trump was smaller on average. I found this interesting because the states where I overestimated Trump's winning margin were also the ones political pundits hypothesized would go for Harris: Michigan and Wisconsin in particular provided hope for Harris because they used to make up the "blue wall." I knew immediately that my model was getting Arizona wrong because polling data showed that, of the seven battleground states, Arizona was one of the least likely to go for Harris.

To investigate why the bias in my model has no consistent direction, I will go back to my OLS regression results and explore which variables might be driving it.

I also created a confusion matrix. It shows that all of my errors ran in one direction: I never predicted a Trump win in a state Harris actually carried. The errors are easy to interpret because Trump swept the battleground states while my model predicted he would win only five of the seven, which means the errors come from predicting a Harris win in Arizona and Georgia, where Trump won.
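With only two outcome classes and seven states, the matrix is small; here is a minimal sketch of how it could be tabulated with pandas, using the values from the battleground table above:

```python
import pandas as pd

results = pd.DataFrame({
    "state": ["Arizona", "Georgia", "Michigan", "Nevada",
              "North Carolina", "Pennsylvania", "Wisconsin"],
    "winner": ["Trump"] * 7,  # Trump swept all seven battlegrounds
    "pred_winner": ["Harris", "Harris", "Trump", "Trump",
                    "Trump", "Trump", "Trump"],
})

# Rows = actual winner, columns = predicted winner.
print(pd.crosstab(results["winner"], results["pred_winner"]))
# pred_winner  Harris  Trump
# winner
# Trump             2      5
```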

Reflection

I used a combination of demographics, polling data, and economic indicators to predict the Electoral College results and the Democratic two-party vote share in the battleground states with an OLS regression. One thing that immediately stands out to me is that I may have made my model too complex. Although my model did not do a terrible job predicting the outcome of the election, I recognize from a statistical perspective that there were too many variables, and some of them are not independent of one another, which led to multicollinearity. For example, the race and education level variables in my linear regression are likely correlated. Additionally, my R-squared value was high, in the high nineties, which suggests that my model may be overfitting. To test whether OLS regression is effective for predicting electoral outcomes, I could create an ensemble model or test the performance of my model with fewer variables.
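One way to quantify the multicollinearity concern is to compute variance inflation factors (VIFs) for the predictors; a sketch, reusing the hypothetical df and predictors objects from the first code block:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Assumes the same df/predictors as in the model sketch above.
X = sm.add_constant(df[predictors])

# VIF > ~10 flags a predictor largely explained by the others,
# e.g., race shares and education shares moving together.
for i, name in enumerate(X.columns):
    if name != "const":
        print(f"{name:25s} VIF = {variance_inflation_factor(X.values, i):.1f}")
```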

In terms of the components of my model, I knew including demographic variables was a risk because the Kim & Zilinsky article we read a couple of weeks ago claims that five key demographic attributes can predict a voter's choice only about 68% of the time. I use just three of those key attributes, so my model's predictive capacity may be even lower. Post-election exit polls show that Trump made gains with most demographic groups, which indicates that demographic attributes may not have had great predictive power in this election. To test this, I removed the race variables from my model; the updated model's performance is shown below.
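The re-run itself is a small change to the earlier sketch (again assuming the hypothetical df, predictors, and model objects from above):

```python
# Re-run the same pipeline without the race-share predictors.
race_vars = ["white", "black", "american_indian",
             "asian_pacific_islander", "hispanic"]
no_race = [p for p in predictors if p not in race_vars]

model.fit(df[no_race], df["D_pv2p"])
df["update_1"] = model.predict(df[no_race])
```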

| Metric | Value |
|--------|----------:|
| MSE    | 0.6708597 |
| RMSE   | 0.8190603 |
| MAE    | 0.6860950 |
| Bias   | 0.2835597 |

| State | Actual (%) | Original pred. (%) | Winner | Pred. winner | Update 1 (%) | Update 2 (%) | MSE | RMSE | MAE | Bias |
|----------------|---------:|---------:|-------|--------|---------:|---------:|----------:|----------:|----------:|-----------:|
| Arizona        | 47.15253 | 50.24998 | Trump | Harris | 48.56140 | 46.71684 | 1.9849244 | 1.4088735 | 1.4088735 | -1.4088735 |
| Georgia        | 48.87845 | 50.04388 | Trump | Harris | 48.66703 | 49.58014 | 0.0446983 | 0.2114197 | 0.2114197 | 0.2114197  |
| Michigan       | 49.30156 | 48.95665 | Trump | Trump  | 49.15558 | 49.22085 | 0.0213088 | 0.1459752 | 0.1459752 | 0.1459752  |
| Nevada         | 48.37829 | 49.74023 | Trump | Trump  | 47.59146 | 48.67112 | 0.6190976 | 0.7868276 | 0.7868276 | 0.7868276  |
| North Carolina | 48.29889 | 49.29332 | Trump | Trump  | 47.90145 | 47.94593 | 0.1579621 | 0.3974445 | 0.3974445 | 0.3974445  |
| Pennsylvania   | 48.98235 | 48.92587 | Trump | Trump  | 48.33273 | 49.55108 | 0.4220015 | 0.6496165 | 0.6496165 | 0.6496165  |
| Wisconsin      | 49.53156 | 48.62988 | Trump | Trump  | 48.32905 | 49.61006 | 1.4460254 | 1.2025079 | 1.2025079 | 1.2025079  |

(The Winner and Pred. winner columns show the original model's calls; the error metrics here are computed against the update 1 predictions.)

After removing the race variables, the model performs better, accurately predicting a Trump victory in all seven battleground states, but it now underestimates the Democratic two-party vote share in six of the seven battleground states, which I thought was interesting. The model once again overestimates Harris's margin in Arizona, which makes me wonder what is so unique about Arizona that it is not being captured by the variables in my model. I suspect it has to do with the quality of Arizona polls potentially overestimating Harris's support there. The graphic below shows that Arizona substantially shifted to the right. To test whether the polls overestimated Harris's support in states like Arizona, we will have to wait until all votes are counted and final election results are available, then compare them to what the polls predicted.

Ultimately, it is not possible to know the impact of race until the voter files are available because exit polls are known to be less reliable. Once the voter file is available, I can test how demographic groups voted in the 2024 election.

I also looked back at my original OLS regression model and decided to select only the variables that were statistically significant. Interestingly, these turned out to be all of my race variables plus the variable containing the latest polling data. I ran the model again with only these variables; the outcome is shown below.
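A sketch of how this selection could be automated from the OLS p-values, again reusing the hypothetical objects from the earlier sketches:

```python
import statsmodels.api as sm

# Fit the original OLS and keep only predictors significant at p < 0.05
# (per the text, these turned out to be the race variables and
# latest_pollav_DEM).
ols = sm.OLS(df["D_pv2p"], sm.add_constant(df[predictors])).fit()
significant = [name for name, p in ols.pvalues.items()
               if name != "const" and p < 0.05]

model.fit(df[significant], df["D_pv2p"])
df["update_2"] = model.predict(df[significant])
```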
| Metric | Value |
|--------|-----------:|
| MSE    | 0.1755229  |
| RMSE   | 0.4189546  |
| MAE    | 0.3587306  |
| Bias   | -0.1103432 |

| State | Actual (%) | Original pred. (%) | Winner | Pred. winner | Update 1 (%) | Update 2 (%) | MSE | RMSE | MAE | Bias |
|----------------|---------:|---------:|-------|--------|---------:|---------:|----------:|----------:|----------:|-----------:|
| Arizona        | 47.15253 | 50.24998 | Trump | Harris | 48.56140 | 46.71684 | 0.1898228 | 0.4356865 | 0.4356865 | 0.4356865  |
| Georgia        | 48.87845 | 50.04388 | Trump | Harris | 48.66703 | 49.58014 | 0.4923693 | 0.7016903 | 0.7016903 | -0.7016903 |
| Michigan       | 49.30156 | 48.95665 | Trump | Trump  | 49.15558 | 49.22085 | 0.0065133 | 0.0807052 | 0.0807052 | 0.0807052  |
| Nevada         | 48.37829 | 49.74023 | Trump | Trump  | 47.59146 | 48.67112 | 0.0857508 | 0.2928324 | 0.2928324 | -0.2928324 |
| North Carolina | 48.29889 | 49.29332 | Trump | Trump  | 47.90145 | 47.94593 | 0.1245839 | 0.3529645 | 0.3529645 | 0.3529645  |
| Pennsylvania   | 48.98235 | 48.92587 | Trump | Trump  | 48.33273 | 49.55108 | 0.3234578 | 0.5687335 | 0.5687335 | -0.5687335 |
| Wisconsin      | 49.53156 | 48.62988 | Trump | Trump  | 48.32905 | 49.61006 | 0.0061626 | 0.0785021 | 0.0785021 | -0.0785021 |

(Here the error metrics are computed against the update 2 predictions.)

When I include only the variables that were statistically significant, the model performs better than both my original model and the update without race variables. The aggregate MSE is 0.17, which is close to 0, indicating very good accuracy. I think this is really interesting because the Kim & Zilinsky article did not find that demographic attributes had high predictive capacity, yet this latest iteration of my model shows that polling data and race variables together predict this election's outcome very well.

What does this say about Q2 GDP growth? It was the only economic variable in my model. I did consider including unemployment but ultimately opted to exclude it because it showed multicollinearity with Q2 GDP growth. When I added Q2 GDP growth to both of the updated models, it increased the Democratic two-party vote share to the point where it flipped states. I think this is the result of Trump's perception as a successful businessman. A couple of weeks ago, I listened to The Daily, and in one episode there were many clips of interviews with Nevada voters who said they were voting for Trump because of the economy. Inflation in 2022 and 2023 tainted the economic image of the Biden administration, and Harris was not able to escape that. I can test whether voters' perception of the economy is a better predictor of electoral outcomes by swapping Q2 GDP growth for inflation or a consumer sentiment index and seeing how the model performs.

In a future iteration of this model I would also like to account for incumbency and interact it with an economic indicator. Post-election discussion that I've heard suggests that Trump won because voters were dissatisfied with the economy and wanted change. I think an interaction variable could have captured that dissatisfaction with the incumbent party and improved performance; a sketch of how it could be specified follows.
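A minimal sketch using the statsmodels formula API, assuming hypothetical incumbent_party and cpi_yoy columns (neither is in my current data), with inflation swapped in for Q2 GDP growth as discussed above:

```python
import statsmodels.formula.api as smf

# Hypothetical columns: incumbent_party (1 if the president's party is
# the Democrats) and cpi_yoy (year-over-year inflation). In the formula
# API, a*b expands to a + b + a:b, so the interaction lets the economy's
# effect depend on which party holds the White House.
fit = smf.ols("D_pv2p ~ latest_pollav_DEM + incumbent_party * cpi_yoy",
              data=df).fit()
print(fit.summary())
```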