HaystaqDNA’s Automotive Microtargeting Engine


Leading microtargeting firm HaystaqDNA has developed an Automotive Microtargeting Engine that allows automotive OEMs and dealers to select conquest targets based on HaystaqDNA's powerful predictive models. Compared to traditional list providers, these models show up to 80% higher conversion rates on email and direct mail. When used with addressable TV, the HaystaqDNA models yield a 70% higher conversion rate.

Marketers for automotive OEMs in the United States must not only sell more cars to existing customers, but also capture new customers from other brands in order to maintain and increase sales. The traditional method of 'conquesting' is to buy lists of target customers from generic consumer data vendors such as Experian or Acxiom, or from US industry-specific vendors such as Polk or AutoIntenders. Unfortunately for the OEMs buying these lists, their competitors are often buying the exact same targets. These lists are based on basic demographics and targeted to a vehicle segment (for example, Luxury Compact Sedans) rather than a brand or specific carline (for example, the Mercedes-Benz C-Class).

Using the technologies and techniques developed in political microtargeting, HaystaqDNA can instead create specific targeting models for individual products. This is accomplished by ingesting existing customer data, augmenting it with original survey research, and using advanced data analytic methods to find the individuals most likely to purchase the target product. The conquest targeting provided by HaystaqDNA is specific not only to a given brand, but also to a given carline: we rank every single consumer (~260M individuals) on their likelihood of buying that specific car. Our modeling methods go far beyond basic demographics and use over 1,000 distinct indicators to find our conquest targets. This technique is far more accurate than relying solely on age/gender/income/location-based targets. Below is a case study of how a leading luxury automotive brand used HaystaqDNA's Automotive Microtargeting Engine to improve its conquest campaign results.

Like most automotive OEMs, our client traditionally bought lists for direct mail and email campaigns from commercial vendors. The brand was consistently delivering year-over-year sales growth thanks to regular significant new product introductions and excellent customer loyalty, but it understood it needed to dramatically increase conquest sales (sales to automobile buyers who do not currently own one of the brand's products) in order to achieve its future growth targets. The brand also required an easy-to-use interface to allow marketing staff across the organization to create and use lists of conquest customers. Based on HaystaqDNA's success over several automotive pilot projects, the brand partnered with HaystaqDNA to create and run the Automotive Microtargeting Engine (AME) in the fall of 2014.

A screenshot of HaystaqDNA's AME interface.

The AME project consisted of the following parts:

  1. Setting up and receiving recurring feeds from the brand’s internal CRM system.
  2. Matching existing customers to a commercial database.
  3. Surveying new car buyers.
  4. Using Machine Learning Algorithms to model likely buyers and their preferences.
  5. QA and Validation of these models.
  6. Scoring every individual in the country.
  7. Matching in vehicle ownership and in-market timing data.
  8. Creating an interface for marketers to identify, explore and pull conquest targets.

 

  1. Recurring Feeds: HaystaqDNA worked with the company's IT department to receive recurring sales, dealer territory, dealer service, accessories, options and event attendance feeds. In the future, Haystaq anticipates receiving additional feeds on inventory levels, inventory pipelines and financing received. These feeds arrive over secure channels into a firewalled Amazon AWS cloud environment, where they are cleaned and formatted. All of the data is linked together via customer, vehicle and dealer IDs.
  2. Consumer File: On behalf of the client, HaystaqDNA licensed the national infoGroup consumer file, consisting of roughly 260M US adult consumers. This file contains all of the Personally Identifiable Information (PII) as well as over 1,200 fields of additional data — Census data, property data, survey data, modeled data and aggregated data bought from sources like magazines, retailers, airlines, hotels, insurance companies, financial institutions, etc. All of this data is converted to 'indicator' or 'independent variable' form, where text fields are converted to binary flags and false numeric data is discarded. The historical brand sales data is then matched to this file using the PII in both.
  3. Surveys: Several times each year, HaystaqDNA conducts a survey of likely car buyers to find things like preferences for particular powertrains and lifestyle choices (such as sports attendance and participation). These surveys are conducted primarily through IVR calls to landlines and live calls to cell phones, with online panels and SMS surveys used to supplement where needed.
  4. Dependent Variables and Modeling: Both the customer and the survey data are then transformed into 'dependent variables' or 'DVs'. For example, for a specific car's dependent variable, a person who is known to have bought that vehicle is given a value of 1, while one who did not is given a value of 0. For a skiing DV, a survey respondent who indicates that they enjoy skiing is given a value of 1 and one who answered that they never ski is given a 0. Using our AWS infrastructure, we bring in both data sets of DVs along with our massive table of independent variables. We have also found that people's buying behaviors differ regionally, so we typically divide the US into four regions (Northeast, Southeast, Central and Western) and model the DVs for each region independently. We use Python (and its scikit-learn and pandas libraries) to model these dependent variables. We use a variety of algorithms (logistic regression, decision trees, nearest neighbor, neural networks, etc.) depending on the data sets, but we rely most heavily on logistic regression and decision trees. We often blend the results of multiple models, as we find that doing this amplifies the underlying signal and cancels out the noise. No single coefficient determines a person's score; our typical models use in excess of 100 coefficients. (An illustrative modeling sketch appears after this list.)
  5. Quality Assurance and Model Validation: We always withhold one third of the DV records to serve as a clean test set, and we validate all models against this set. Using this test set, we know what the model predicts for these individuals and can compare that to their actual behavior (their car ownership or survey answers). (See the validation sketch after this list.)
    An example of two of our QA checks against the test sample: a Hosmer-Lemeshow step chart and a Receiver Operating Characteristic chart.

    We also use visual tools to see how the different models correlate to one another.
    This visualization takes a sample of people and looks at their scores across a number of vehicle segments. We expect a high correlation between models at similar price points, and some deviations where the price points or car types are very different. Here we see the Luxury Midsize Hybrid SUV and Luxury Compact SUV scores highly correlate, while the Luxury Midsize Hybrid Sedan and Luxury Full-Size SUV scores do not.

    At this point, the models also provide either coefficient weights or indicator importance ranks, which allow us to see the attributes of individuals who score highly. We often report these attributes back to the client so they can gain insight into their customers.
    Some examples of recent positive and negative coefficients from a Full-Size Luxury SUV model.

  6. Scoring: Once the best-validated models are selected, our analysts set up a cluster environment on AWS. We use the Python models produced in the modeling phase, but rely on Spark and Parquet to take advantage of the clustered environment. We can assign a score for every car line and lifestyle filter to every individual in a region, and ultimately the country, in a couple of hours. By default, our scores come out as a value between 0 and 1, but we convert these scores to a rank ordering of individuals within each region. (A scoring sketch appears after this list.)
  7. Vehicle Ownership and In-Market Timing Data: Once we have our consumer file scored and ranked, we match in garage and auto-intender data from the client's chosen vendor. The garage data consists of over 168M individual vehicles that are or have been owned by over 110M individuals. Additionally, this vendor provides a database of auto intenders — an in-market timing file indicating individuals likely to buy a car within the next three, six, or twelve months. This file usually consists of 10M-12M individuals. Both the garage data and the in-market timing data become filters in the AME interface.
  8. The AME Interface: The client needed these conquest targets to be available to marketers at multiple levels (national staff, both brand and agency; regional offices; and individual dealers) for both exploration and list pulling, so HaystaqDNA created the Automotive Microtargeting Engine. The interface allows marketers to specify a geography (dealers are limited to their own boundaries), choose which car line or car lines they are interested in marketing, filter by lifestyle, car preference, demographic or market-timing attributes, indicate the desired list size, and optionally apply a distance-from-event limiter.

    A second screenshot of HaystaqDNA's AME interface.

    In addition to creating a list query, marketers can merge lists or exclude previously created lists, and they can assign individuals in a list to specific car lines or collateral. The target channel for the lists can also be specified through a number of templates — AME lists have been used for direct mail, email and digital outreach.
    Using this interface, marketers can explore their target areas in real time and pull lists in near real time, greatly shortening the time to deployment compared with what the client had experienced with traditional list providers.
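
To make step 4 concrete, the following is a minimal, illustrative sketch of how a car-line DV can be modeled against a wide indicator table with scikit-learn. The file name, column names, model parameters and blend weights are assumptions made for illustration; the case study states only that logistic regression and decision trees are the primary algorithms and that the results of multiple models are often blended.

```python
# Illustrative sketch of step 4: model a car-line dependent variable (DV)
# against a wide table of binary indicators, one model per region, then blend
# two algorithms. File name, column names and weights are assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

df = pd.read_parquet("consumer_indicators.parquet")      # hypothetical input file
indicator_cols = [c for c in df.columns if c.startswith("ind_")]

models = {}
for region, region_df in df.groupby("region"):           # Northeast, Southeast, Central, Western
    X, y = region_df[indicator_cols], region_df["bought_carline_x"]
    # Withhold one third of the DV records as a clean test set (used in step 5).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=1 / 3, stratify=y, random_state=0)

    logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    tree = DecisionTreeClassifier(max_depth=8, min_samples_leaf=200).fit(X_train, y_train)

    # Blend the two probability estimates; equal weights are an assumption.
    blend = (0.5 * logit.predict_proba(X_test)[:, 1]
             + 0.5 * tree.predict_proba(X_test)[:, 1])
    models[region] = {"logit": logit, "tree": tree, "test": (X_test, y_test, blend)}
```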
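
The step 5 checks can be sketched the same way, using the withheld test set from the sketch above. The ROC calculation and the decile roll-up stand in for the Receiver Operating Characteristic and Hosmer-Lemeshow step charts shown earlier; they are illustrative, not the production QA suite.

```python
# Illustrative sketch of step 5: validate blended scores on the withheld test set.
import pandas as pd
from sklearn.metrics import roc_auc_score, roc_curve

X_test, y_test, blend = models["Northeast"]["test"]        # from the modeling sketch

auc = roc_auc_score(y_test, blend)                         # area under the ROC curve
fpr, tpr, _ = roc_curve(y_test, blend)                     # points for the ROC chart

# Decile roll-up in the spirit of a Hosmer-Lemeshow step chart: the observed
# purchase rate should rise steadily with the predicted score.
deciles = pd.qcut(blend, 10, labels=False, duplicates="drop")
step_chart = (pd.DataFrame({"decile": deciles,
                            "actual": y_test.to_numpy(),
                            "predicted": blend})
              .groupby("decile").mean())
print(f"AUC = {auc:.3f}")
print(step_chart)
```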
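
Finally, step 6 can be sketched on Spark. The case study names Spark and Parquet but not the specific API, so the use of mapInPandas with a broadcast scikit-learn model, and the storage paths, are assumptions; the within-region percent rank mirrors the rank ordering described above.

```python
# Illustrative sketch of step 6: score every individual on a Spark cluster and
# convert raw probabilities to a within-region rank. Paths, and carrying over
# the broadcast scikit-learn model from the modeling sketch, are assumptions.
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("ame-scoring").getOrCreate()
people = spark.read.parquet("s3://example-bucket/consumer_indicators/")   # hypothetical path
bc_model = spark.sparkContext.broadcast(models["Northeast"]["logit"])     # from the modeling sketch

def score_batches(batches):
    # Runs on the executors: score each pandas batch with the broadcast model.
    model = bc_model.value
    for pdf in batches:
        pdf["score"] = model.predict_proba(pdf[indicator_cols])[:, 1]
        yield pdf

scored = people.mapInPandas(score_batches, schema=people.schema.add("score", "double"))

# Convert the 0-1 score into a rank ordering within each region.
w = Window.partitionBy("region").orderBy(F.col("score").desc())
ranked = scored.withColumn("region_rank", F.percent_rank().over(w))
ranked.write.mode("overwrite").parquet("s3://example-bucket/scores/carline_x/")
```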

Results: The client has continually tested AME against its traditional list providers, and time and again AME has achieved better campaign conversion. A recent test showed AME with 50-80% higher conversion, depending on the specific car line, while the targeting cost per sale was cut in half. In experiments with addressable TV, Haystaq has seen a 70% higher conversion rate on automotive campaigns versus tests with a leading addressable TV vendor.

Support for the Affordable Care Act


As the new Congress rushes toward a repeal of the Affordable Care Act, many members are working against the opinion of the voters in their own districts. Research conducted by HaystaqDNA during the 2016 campaign showed that a majority of Americans support the ACA. However, members of Congress are more concerned with the opinions of their constituents than with national numbers, so Haystaq looked at support levels by congressional district: in 253 of 435 districts (58%), a majority of voters support the ACA.

Not surprisingly, the majority of these pro-ACA districts are held by Democrats. However, 61 pro-ACA districts are currently held by Republicans. Many of these districts are relatively safe for Republicans, but in many of them the margin of support for the ACA is near or above the incumbent's margin of victory in the 2016 election. This suggests that voting to repeal the act puts these members at risk next year, even more so once voters realize how they will be personally affected by a repeal of the ACA.

The Haystaq microtargeting models have identified 98,942,762 likely ACA supporters nationwide, 41,697,492 of whom live in Republican districts.

METHODOLOGY

These numbers are based on a national survey of approximately 10,000 registered voters. The survey responses were used to build microtargeting models predicting how any individual voter would have answered the question had they been surveyed. The congressional district percentage in support of the ACA is based on the number of voters in each district with an ACA support score of 50% or higher. The ACA support score predicts the likelihood that a voter would say that they support the ACA if surveyed. These numbers differ from poll results in that they are not weighted: a poll is likely to be weighted based on assumptions about likely turnout, whereas the Haystaq models are applied to every registered voter.
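
As a minimal illustration of that district roll-up, the snippet below computes a district's support percentage as the share of its voters whose individual support score is at or above 50%. The file and column names are assumptions, not Haystaq's production pipeline.

```python
# Illustrative district roll-up: a district's "% of Voters Supporting ACA" is
# the share of its voters with an ACA support score of 50% or higher.
# File and column names are assumptions.
import pandas as pd

voters = pd.read_csv("voters_scored.csv")               # one row per registered voter
voters["supporter"] = voters["aca_support_score"] >= 0.50

district_support = (voters.groupby("congressional_district")["supporter"]
                          .mean()                        # share of voters at/above 50%
                          .mul(100).round(1)
                          .sort_values(ascending=False))
print(district_support.head())
print(f"{(district_support > 50).sum()} of {len(district_support)} districts show majority support")
```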

The microtargeting models were built using a combination of the survey results and nearly 1,000 fields of commercial marketing data, Census demographics and proprietary derived indicators. Haystaq combines a variety of statistical and machine learning algorithms including Penalized Logistic Regression and Random Forests. The predictive models were validated against a hold-out sample to confirm that they accurately predicted the likely survey responses of individuals whose responses were not used in building the models.

Following is the question wording used in the survey:

Which comes closest to your opinion on the Affordable Care Act or Obamacare: that it is beneficial but doesn’t go far enough, that it is about right, or that it goes too far and should be repealed? Please press 1 if you think Obamacare is beneficial but doesn’t go far enough, press 2 if you like the law as it is, press 3 if you think Obamacare goes too far and should be repealed, or press 4 if you are not sure.

The model predicts the likelihood that a voter with an opinion on the ACA would select option 1 (beneficial but doesn't go far enough) or option 2 (likes the law as it is) versus option 3 (goes too far and should be repealed). Because the model predicts support only among those with an opinion, respondents picking option 4 (unsure) are not included.
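
In code, that coding of the dependent variable might look like the following sketch, where the response column name is an assumption and respondents choosing option 4 are dropped from the modeling universe.

```python
# Illustrative coding of the survey response into the modeling DV: options 1
# and 2 count as support, option 3 as oppose, option 4 (unsure) is excluded.
import numpy as np
import pandas as pd

responses = pd.DataFrame({"aca_question": [1, 2, 3, 4, 2, 3]})   # toy responses
responses["aca_support_dv"] = responses["aca_question"].map({1: 1, 2: 1, 3: 0, 4: np.nan})
modeling_universe = responses.dropna(subset=["aca_support_dv"])
```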

The survey was conducted using a combination of live and IVR (automated) phone calls to a random sample of more than 10,000 voters nationwide.

Map: HaystaqDNA ACA support score by county.

| CD | Name | % of Vote in 2016 Election | % of Voters Supporting ACA |
| --- | --- | --- | --- |
| TX23 | Will Hurd | 50.90% | 72.40% |
| NY11 | Daniel Donovan | 63.30% | 70.40% |
| FL27 | Ileana Ros-Lehtinen | 54.90% | 67.20% |
| FL26 | Carlos Curbelo | 56.30% | 65.30% |
| WA8 | Dave Reichert | 60.00% | 64.90% |
| CA21 | David G. Valadao | 93.20% | 63.80% |
| IL12 | Mike Bost | 57.80% | 63.30% |
| MI11 | David Trott | 56.90% | 61.40% |
| VA10 | Barbara Comstock | 52.90% | 61.00% |
| KY6 | Andy Barr | 61.10% | 60.60% |
| IL13 | Rodney Davis | 59.70% | 60.50% |
| NJ11 | Rodney Frelinghuysen | 60.00% | 60.40% |
| NJ7 | Leonard Lance | 55.70% | 59.50% |
| VA2 | Scott Taylor | 61.70% | 59.10% |
| MI8 | Mike Bishop | 58.80% | 58.60% |
| IL6 | Peter J. Roskam | 59.50% | 58.40% |
| FL18 | Brian Mast | 55.50% | 58.10% |
| NM2 | Steve Pearce | 62.80% | 57.90% |
| FL25 | Mario Diaz-Balart | 62.40% | 57.90% |
| MI6 | Fred Upton | 61.70% | 57.60% |
| CA25 | Stephen Knight | 54.20% | 57.50% |
| CO6 | Mike Coffman | 54.70% | 56.70% |
| FL2 | Neal Dunn | 69.20% | 56.40% |
| NY24 | John Katko | 61.00% | 55.70% |
| NY19 | John Faso | 54.70% | 55.60% |
| AZ2 | Martha McSally | 56.70% | 54.80% |
| CA39 | Edward Royce | 57.70% | 54.60% |
| MI7 | Tim Walberg | 57.90% | 54.60% |
| MI1 | Jack Bergman | 58.20% | 54.60% |
| PA15 | Charles W. Dent | 60.60% | 54.30% |
| PA18 | Tim Murphy | 100.00% | 54.20% |
| PA8 | Brian Fitzpatrick | 54.50% | 54.10% |
| IL14 | Randy Hultgren | 59.60% | 54.10% |
| MI4 | John Moolenaar | 65.80% | 54.00% |
| IA1 | Rod Blum | 53.90% | 53.90% |
| WA5 | Cathy McMorris Rodgers | 59.50% | 53.90% |
| TX32 | Pete Sessions | 100.00% | 53.90% |
| NJ3 | Tom MacArthur | 60.60% | 53.70% |
| WA3 | Jaime Herrera Beutler | 61.40% | 53.60% |
| NJ4 | Chris Smith | 65.50% | 53.60% |
| NJ2 | Frank LoBiondo | 61.60% | 53.60% |
| MN3 | Erik Paulsen | 56.90% | 53.60% |
| PA12 | Keith Rothfus | 61.90% | 53.50% |
| KY1 | James Comer Jr. | 71.20% | 53.30% |
| MI3 | Justin Amash | 61.30% | 53.00% |
| ME2 | Bruce Poliquin | 54.90% | 52.70% |
| GA6 | Tom Price | 61.60% | 52.30% |
| VA5 | Thomas Garrett | 58.30% | 52.10% |
| TX27 | Blake Farenthold | 58.90% | 52.10% |
| LA4 | Mike Johnson | 65.20% | 52.00% |
| NY2 | Peter T. King | 62.40% | 51.90% |
| LA5 | Ralph Abraham | 100.00% | 51.80% |
| TX7 | John Culberson | 56.20% | 51.70% |
| NC13 | Ted Budd | 56.10% | 51.50% |
| CA49 | Darrell Issa | 51.00% | 51.40% |
| NY1 | Lee Zeldin | 59.00% | 51.40% |
| PA6 | Ryan Costello | 57.30% | 51.20% |
| FL15 | Dennis A. Ross | 57.50% | 51.10% |
| OH14 | David Joyce | 62.70% | 51.10% |
| GA12 | Rick Allen | 61.60% | 50.70% |
| OH1 | Steve Chabot | 59.60% | 50.40% |

HaystaqDNA and Bernie Sanders 2016


Modern presidential campaigns need to contact voters at high volume and on a sensitive timeline: to encourage persuadable voters to support them, to solicit donations, to engage volunteers, and to ensure that supportive voters turn out at the polls. Targeting each message to the most receptive audience is central to succeeding at all of these functions, not only because campaign resources are scarce, but also because contacting unsympathetic voters can produce harmful backlash. Bernie Sanders entered the race for the 2016 Democratic presidential nomination with little name recognition, little money, and little support. With HaystaqDNA's help, the Sanders campaign attracted the support of almost half of Democratic voters and a substantial share of convention delegates, and Bernie Sanders was able to bring his message to a national audience.

Microtargeting and predictive analytics have been a cornerstone of modern presidential campaigns ever since our CEO and founder, Ken Strasma, pioneered the approach on John Kerry's 2004 Democratic primary campaign. While the Democratic National Committee now offers some basic models for the use of Democratic campaigns, the Sanders campaign engaged Haystaq so that it could take advantage of state-of-the-art candidate- and state-specific modeling, data-informed delegate maximization strategies, and experimental design guidance. Haystaq's models were used for a wide variety of applications, including targeted deployment of distributed volunteer calls, addressable television and digital advertising, segmentation of fundraising email lists, turnout tracking, and field program optimization.

1. Sanders Support Models

Using survey data combined with advanced statistical and machine learning modeling techniques, Haystaq created state-specific models predicting the extent to which any eligible Democratic primary voter could be expected to support Bernie Sanders for president over Hillary Clinton. Once a state's support model was validated and optimized using a test set of survey records, it was applied to all eligible Democratic primary voters in the state, assigning each a probability score predicting the likelihood of that voter supporting Bernie Sanders. To ensure the models remained up to date and incorporated all available data, they were regularly refreshed to include new data from our daily tracking surveys and from the campaign's field program.

In some states, Haystaq also created specific support models for hard-to-reach demographic groups such as African Americans and Hispanics. Those groups of voters were often scored low on our support models, but it was important to the campaign to expand its appeal among minority voters, so these models allowed the campaign to find which African-American voters, for example, were relatively more likely to be open to Sanders. These were used for specific outreach to those groups, to maximize persuasive impact while minimizing the risks of contacting unsupportive voters.

2. Primary Turnout Models

Using a similar process, Haystaq also created scores predicting each individual voter's likelihood of participating in the Democratic primary. Primary elections typically draw significantly lower participation than general elections; they are often open only to registered partisans, and even then only the most engaged activist voters tend to participate. In some states, like Iowa, which is always first on the primary calendar, delegates are chosen not in a regular election but in a caucus. Since participating in these contests can require voters to stay at their caucus location for up to several hours, participation is extremely low. (In 2008, the year of the last contested primary on the Democratic side, Iowa caucus participation was a record high at 239,000.)

Creating turnout models for such unusual contests requires a state-specific approach, and it is a challenge because there can be no direct source for a 2016 turnout dependent variable before the election has taken place. The methodology we have honed treats the most recent similar election as a template. We "roll back" date-based indicators such as age and past election voting history in order to create a model "predicting" 2008 primary turnout using only what was known before 2008. Once this model is validated (that is, we verify that it tracked 2008 participation accurately), we "roll forward" the indicator set and apply it to 2016 voters.
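
A minimal sketch of this roll-back/roll-forward idea is below. The voter-file fields and the choice of prior elections are assumptions made for illustration; the real indicator set is far larger.

```python
# Illustrative "roll back / roll forward" turnout sketch: train a model that
# "predicts" 2008 caucus participation using only pre-2008 information, then
# apply the same indicator set measured relative to 2016. Field names are
# assumptions.
import pandas as pd
from sklearn.linear_model import LogisticRegression

voters = pd.read_parquet("voter_file.parquet")              # hypothetical voter file

# Roll back: indicators as they would have looked before the 2008 caucus.
train = pd.DataFrame({
    "age": voters["age"] - 8,                               # age as of 2008
    "voted_prior_primary": voters["voted_primary_2004"],    # last similar contest
    "voted_prior_general": voters["voted_general_2004"],
    "registered_dem": voters["registered_dem_2008"],
})
turnout_model = LogisticRegression(max_iter=1000).fit(train, voters["caucused_2008"])

# Roll forward: the same indicator set, now measured relative to 2016.
apply_set = pd.DataFrame({
    "age": voters["age"],
    "voted_prior_primary": voters["voted_primary_2008"],
    "voted_prior_general": voters["voted_general_2012"],
    "registered_dem": voters["registered_dem_2016"],
})
voters["turnout_2016_score"] = turnout_model.predict_proba(apply_set)[:, 1]
```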

This is the most valid method available, but it is flawed in that it assumes repetition of past voting patterns, and no turnout model can be validated until the election is over. The outputs of turnout models generally need to be artificially adjusted to account for expected changes in voting patterns. This was especially true in this case since we were relying on participation patterns of 2008, which could not be expected to be repeated in part because of the different groups of voters motivated by Bernie Sanders versus Barack Obama, and differences in the reception to Hillary Clinton in 2008 versus 2016. In the case of the Sanders campaign, we knew that our supporters were much less likely than Clinton supporters to have participated in the 2008 primaries, and less likely to be registered as Democrats, and so they would be rated as less likely to participate under the traditional metrics. For this reason, we also modeled voters’ “self-reported” turnout likelihood using responses to a survey question about intent to vote. Our studies have shown that people generally exaggerate their likelihood of voting when asked, but that a self-reported turnout score does work well as a relative measure. The self-reported models allowed us to account for the increased enthusiasm felt by the many Sanders supporters who were first-time primary voters.

3. Campaign Engagement and Other Models

Among the other models Haystaq created for the campaign were volunteer, email-responsiveness, and fundraising models, each of which was based on individuals' direct engagement with the campaign. Using the voter records of people who volunteered with the campaign's field team, canvassing or phone-banking to reach voters, we created scores to find people who had not yet volunteered with the campaign but were most likely to respond favorably if asked. These models allowed the campaign to more efficiently recruit the volunteers who were critical to its field program.

Similarly, we created "look-alike" models to help the campaign find the people most likely to donate, including multiple versions of this model to approximate the expected value of an individual's donation. In this way, we helped the campaign maximize its record-breaking small-donor fundraising success and expand the program from email-only to targeted digital ads.
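
One common way to build such an expected-value score (an assumption here, not necessarily the construction used on the campaign) is to multiply a donation-likelihood model by a conditional donation-amount model:

```python
# Illustrative expected-value-of-donation score: P(donates) multiplied by the
# expected amount given that the person donates. File and column names, and
# the choice of gradient boosting, are assumptions.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

people = pd.read_parquet("voter_indicators.parquet")        # hypothetical input
X = people.filter(like="ind_")                              # indicator columns
donated = people["donated"].astype(int)

p_donate = GradientBoostingClassifier().fit(X, donated)
amount = GradientBoostingRegressor().fit(X[donated == 1], people.loc[donated == 1, "amount"])

people["expected_value"] = p_donate.predict_proba(X)[:, 1] * amount.predict(X)
ask_universe = people.nlargest(100_000, "expected_value")   # top prospects for digital ads
```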

4. Delegate Strategy

Because the delegates that ultimately decided the nomination were, in many states, awarded by congressional district (or by precinct in Iowa), the campaign needed to optimize delegates rather than total votes. Delegates are awarded in small integers in proportion to district-level vote totals, so there are specific thresholds at which gaining or losing a small number of votes can make the difference of a delegate. Using our support and turnout models, we were able to predict which districts were likely to be divided close to these points and recommend that the campaign direct its resources there. The simplest example is that in a district with an odd number of delegates, the winning candidate gets an extra delegate no matter how small the margin, whereas in a district with an even number of delegates, 50% of the vote plus one is worth dramatically less effort, since the candidates will split the delegates evenly unless one wins by a much larger margin.
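
A toy calculation makes the odd-versus-even point concrete. Simple proportional rounding with two viable candidates is assumed here; the actual DNC rules also involve a 15% viability threshold and statewide delegate pools.

```python
# Toy illustration of delegate math with two viable candidates and simple
# proportional rounding (an assumption; real DNC allocation rules are more
# involved).
def delegates_won(vote_share: float, district_delegates: int) -> int:
    """Delegates for a candidate under simple proportional rounding."""
    return round(vote_share * district_delegates)

for n in (3, 4):
    for share in (0.51, 0.55, 0.60, 0.63):
        print(f"{n}-delegate district, {share:.0%} of the vote -> "
              f"{delegates_won(share, n)} delegates")

# In the 3-delegate district, 51% already yields 2 of 3 delegates; in the
# 4-delegate district the split stays 2-2 until a candidate clears 62.5%.
```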

5. Election Day Turnout Tracking

Haystaq also managed the campaign’s election day turnout tracking operation, which involved crowdsourcing reports of how many ballots had been cast at particular precincts at various times on election day and aggregating those reports upward to project voting trends. Precincts were categorized according to the average support score of eligible primary voters as favorable to Clinton or Sanders. Our turnout tracking system compared the actual number of ballots cast at base Sanders and Clinton precincts to the baseline expectation during the day to give the campaign an indication of whether it was likely to win the state, and then perhaps more importantly, how delegates apportioned at district levels were likely to be distributed. This allowed the campaign to direct its resources to the places where additional campaign phone calls or canvassing were most likely to make a difference in the delegate total.
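
A minimal sketch of that comparison follows: crowd-sourced ballot counts are rolled up by whether a precinct leans Sanders or Clinton on average support score, then compared with the baseline expectation. The column names and the 0.5 cut point are assumptions.

```python
# Illustrative election-day tracking roll-up: compare ballots reported so far
# with the baseline expectation, split by precinct lean. Column names are
# assumptions.
import pandas as pd

precincts = pd.read_csv("precinct_reports.csv")     # hypothetical crowd-sourced feed
precincts["lean"] = precincts["avg_sanders_support"].gt(0.5).map(
    {True: "Sanders", False: "Clinton"})

tracking = precincts.groupby("lean")[["ballots_reported", "ballots_expected"]].sum()
tracking["pct_of_expected"] = tracking["ballots_reported"] / tracking["ballots_expected"] * 100
print(tracking)
```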

6. Fundraising Models

Haystaq and the campaign also conducted a series of experiments aimed at optimizing the response to its fundraising emails by varying the “ask amount” referenced in the email text. In general, asking for larger donations does yield larger donations, but asking for smaller donations yields more donations. We found in a series of A/B tests conducted in early April that asking people who had not previously contributed to the campaign for $2.70 (1/10 the campaign’s oft-cited average contribution of $27) consistently produced a higher expected return than the campaign’s original practice of asking for $3. While the average donation amount decreased marginally with the smaller ask, this effect was moderated by the fact that people often donated more than the amount asked for once they reached the donation page.
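
The logic of that comparison reduces to a simple expected-return calculation. The response rates and average gifts below are placeholders, not the campaign's actual test results; the point is that a smaller ask can win even if the average gift slips slightly.

```python
# Expected return per email is response rate times average gift; placeholder
# numbers illustrate how a smaller ask can produce a higher expected return.
def expected_return(response_rate: float, avg_gift: float) -> float:
    return response_rate * avg_gift

ask_270 = expected_return(response_rate=0.012, avg_gift=9.50)   # $2.70 ask
ask_300 = expected_return(response_rate=0.010, avg_gift=10.25)  # $3.00 ask
print(f"$2.70 ask: ${ask_270:.4f} per email; $3.00 ask: ${ask_300:.4f} per email")
```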

An email used by the Sanders campaign after applying the HaystaqDNA fundraising models.

 

7. Direct Mail

Haystaq microtargeting models allowed for the creation of nuanced and precisely targeted direct mail universes. We were able to increase the volume of mail sent to precincts or congressional districts where our modeling showed the campaign was near the tipping point for winning additional delegates, and to target specific messages to the voters most likely to be sympathetic. Using a suite of more than 50 issue models, ranging from climate change to school choice support, we were able to direct messaging on climate change, for example, to the voters most likely to strongly agree with Sanders' climate message. We were also able to direct additional mail to individuals with low television viewing or low social media usage scores, improving reach among voters unlikely to see the campaign's messaging in other media.

8. Television Targeting

Microtargeting is often thought of as a tool only for direct voter contact such as phones, door knocking or direct mail. However, given the share of resources spent on television advertising, TV may be the most important application of microtargeting. Haystaq used ratings data broken out by various demographic and lifestyle indicators to calculate the likelihood of any individual voter watching a particular show, network or daypart. We then overlaid this data with our modeled lists of persuadable voters and likely supporters to calculate the cost of reaching any one of the campaign's targets via television. This often allowed us to identify far more efficient buys than would have been possible using traditional metrics like cost per point or cost per adult viewer.
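
The comparison against cost per point or cost per adult viewer reduces to a simple cost-per-target calculation, sketched below with placeholder spot costs and audience figures.

```python
# Illustrative cost-per-target calculation: divide the spot cost by the number
# of viewers who are modeled targets. Spot costs, audiences and target shares
# are placeholders.
def cost_per_target(spot_cost: float, viewers: int, target_share: float) -> float:
    """Cost of reaching one modeled target with a given spot."""
    return spot_cost / (viewers * target_share)

# A cheap spot on a broad show can be less efficient than a pricier spot whose
# audience is rich in modeled persuadable voters.
broad = cost_per_target(spot_cost=5_000, viewers=200_000, target_share=0.04)
niche = cost_per_target(spot_cost=3_000, viewers=60_000, target_share=0.15)
print(f"broad show: ${broad:.2f} per target; niche show: ${niche:.2f} per target")
```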

Even greater precision was available through addressable television. We generated lists of individuals scored as highly persuadable for persuasion ads, and lists of likely supporters with lower turnout likelihood for turnout ads. These were also segmented by demographics and by congressional district so that we could concentrate ad spend in the districts closest to the tipping point for winning additional delegates.

Results

While Sanders ultimately lost the nomination narrowly, the campaign dramatically outperformed expectations at its launch, engaged segments of voters previously ignored by the Democratic Party, and eventually garnered the support of 47% of Democrats. The campaign elevated Sanders' stature within the party and led to an ongoing engagement for Haystaq with the campaign's successor organization, Our Revolution.