DOI : https://doi.org/10.5281/zenodo.20071533
- Open Access
- Authors : Medha Anand Chomal, Deepanshu Jindal, Akashdeep Singla, Harsh Sinha, Gurpreet Singh
- Paper ID : IJERTV15IS043920
- Volume & Issue : Volume 15, Issue 04, April 2026
- Published (First Online): 07-05-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License: This work is licensed under a Creative Commons Attribution 4.0 International License
Predicting Digital Ad Performance the Hard Way: A Hybrid Stack Built from Public Kaggle Data and Hand-Tagged Indian Competitor Ads
Deepanshu Jindal, Medha Anand Chomal, Akashdeep Singla, Harsh Sinha, and Gurpreet Singh
Department of Computer Science / Chitkara University
Abstract – This paper addresses the challenge of pre-flight creative performance prediction for digital advertising campaigns. We assembled a hybrid dataset from nine Kaggle sources, three LLM-generated synthetic files, and 120 manually tagged competitor ads from the Meta Ad Library and Google Ads Transparency Centre (collected 20–28 April 2026). After pre-processing, the unified corpus comprises 21,643 campaign rows and 1,120 creative-metadata rows. We trained a stacked ensemble of HistGradientBoosting, LightGBM, XGBoost, and Random Forest with a Ridge meta-learner across 15 prediction tasks. Cross-validated R² reached 0.75 for CTR, 0.81 for CPC, and 0.75 for CPA. The high-CTR binary classifier achieved AUC = 0.994 with a Brier score of 0.033. Install-quality prediction attained R² = 0.80. An ablation study revealed two counter-intuitive findings: (a) stacking yields only marginal gains (~0.0006 R² on CTR) over a single HistGradientBoosting model, and (b) removing LLM-generated training rows improves D1 generalisation by 5.8 R² points. Both findings are reported in full.
Index Terms – click-through rate prediction, digital advertising, ensemble learning, gradient boosting, stacking meta-learner, LLM data augmentation, install quality prediction, ad performance modelling.
-
INTRODUCTION
A small scenario illustrates the motivation for this work. A three-person team managing a D2C baby-care brand holds INR 8 lakh in ad budget with Diwali ten days away. On a Sunday evening they lock in five creatives. By Wednesday, three are dead, each having spent INR 80,000 with CTR below the team's internal floor. The remaining two carry the entire campaign. Had a model been able to flag, on Sunday night, which two creatives were likely to perform, approximately INR 2.4 lakh in wasted spend could have been avoided.
The prediction problem is harder than it first appears. Outcome depends simultaneously on creative content, audience targeting, platform, and bid strategy. Most published CTR benchmarks (Criteo, Avazu) hash creative content out of the feature set, eliminating the very lever that practitioners can control. Furthermore, marketing teams require model explanations that map to actionable decisions; deep neural networks that deliver marginal accuracy gains at the cost of interpretability offer limited practical value [9].
We held ourselves to three constraints: (1) Reproducibility: all data sources are public or documented. (2) Actionability: prediction targets are metrics a media buyer already uses. (3) Honesty: every R² is reported with a 95% bootstrap CI and a five-fold CV score, and negative ablation findings are reported, not suppressed.
-
RELATED WORK
CTR prediction has a 20-year lineage, beginning with logistic regression on sparse hashed feature vectors at early ad networks. The field progressed through factorisation machines (FMs), field-aware factorisation machines (FFMs), and latterly deep cross-networks deployed at scale by major platforms. Criteo and Avazu remain the canonical open benchmarks. Their value lies in scale; their limitation is that creative content is hashed away, removing the marketing-side use case we address.
Post-install retention is less well studied in the open literature. Mobile measurement partners (AppsFlyer, Adjust) instrument SDKs that emit D1, D7 and D30 retention windows. Those tables sit behind vendor agreements and no public Kaggle source emits real post-install retention at the ad level. We document our heuristic workaround in Section III and quantify its cost in the ablation (Section V-A).
-
DATA AND PRE-PROCESSING
-
Sources
Table I summarises the ten data sources. Sources 1–6 are public Kaggle datasets providing real campaign-level metrics. Sources 7–9 are creative-metadata files synthesised by ChatGPT, Claude, and Gemini respectively. Source 10 is a manual tag file covering 120 competitor ads from 20 Indian brands across baby-care, parenting, and edtech verticals (Mamaearth, FirstCry, Pampers, BabyChakra, Healofy, Huggies, Himalaya, Johnson's, Sebamed, MamyPoko, Meesho, Flipkart, Amazon India, BYJU's Early Learn, Khan Academy Kids, Pediasure, Cetaphil Baby, The Moms Co, Mother Sparsh, and Chicco), collected between 20 and 28 April 2026.
#  | Source                                                    | Rows                       | Type
1  | Kaggle Facebook Ad Campaigns [1]                          | 1,143                      | Real
2  | Kaggle Social Media Advertising 300k [2]                  | 10,000 (sampled)           | Real
3  | Kaggle Ad Click Prediction 10k [3]                        | 10,000                     | Real
4  | Kaggle Social Media Optimisation [4]                      | 500                        | Real
5  | Kaggle Ad Campaign Relational DB [5]                      | 400,000 events, aggregated | Real
6  | Kaggle Marketing Campaign 200k [6]                        | 10,000 (sampled)           | Real
7  | ChatGPT-generated creative metadata                       | 500                        | LLM
8  | Claude-generated edge cases                               | 200                        | LLM
9  | Gemini-generated competitor metadata                      | 300                        | LLM
10 | Manual: Meta Ad Library [7] + Google Ads Transparency [8] | 120                        | Observed
TABLE I. Data sources used in this study.
-
Pre-processing Issues
Several non-trivial cleaning challenges were encountered:
- Source 1 contained 382 rows with columns shifted one position due to an unquoted comma in the campaign name field. Detected via unexpected string values in the campaign_id column; diagnosis required approximately half a day.
- Sources 2 and 6 store monetary spend as '$xx.xx' strings. Failure to strip the dollar sign before casting causes Pandas to silently coerce the entire column to NaN.
- Source 5 contains no per-ad metrics. The 400,000 event rows were aggregated by ad_id to recover impression, click, and conversion totals.
- The three LLM-generated files (Sources 7–9) used inconsistent column naming conventions. A small mapping layer was written to harmonise them.
- The manual tag file used eight creative theme categories; the LLM files used thirteen. A unified taxonomy of eight categories was adopted by collapsing the additional LLM categories, at the cost of some granularity.
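The dollar-sign pitfall above is easy to reproduce. The snippet below uses an illustrative miniature spend column, not the actual Kaggle schema, to show both the silent NaN coercion and the fix:

```python
import pandas as pd

# Hypothetical miniature of the Sources 2/6 format: spend held as '$xx.xx' strings.
df = pd.DataFrame({"spend": ["$12.50", "$3.99", "$100.00"]})

# Naive cast: to_numeric with errors="coerce" silently turns every
# '$'-prefixed value into NaN -- the failure mode described above.
naive = pd.to_numeric(df["spend"], errors="coerce")
assert naive.isna().all()

# Correct approach: strip the currency symbol first, then cast.
df["spend_num"] = df["spend"].str.lstrip("$").astype(float)
assert df["spend_num"].tolist() == [12.5, 3.99, 100.0]
```

The same pattern applies to any symbol-prefixed numeric column; the key is that `errors="coerce"` hides the problem rather than raising it.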
-
Final Tables
After cleaning, two tables were produced: Table A (real campaign data), containing 21,643 rows and 12 columns; and Table B (creative metadata), containing 1,120 rows from Sources 7–10. Categorical features are label-encoded; missing numeric values are filled with −1 to allow tree-based learners to split on absence rather than imputing a mean.
-
Leakage Audit
A critical leakage problem was discovered during initial modelling. The first CPM model returned R² = 1.0000. Investigation revealed that log_spend and log_impressions were included as features while CPM = (spend × 1000) / impressions was the target; any tree learner can reconstruct the target via arithmetic. Removing log_spend reduced R² only marginally, as platform encoding alone was sufficient to determine CPM in our dataset (Pinterest CPM $192 vs. Meta/Twitter/Instagram $124). Removing platform encoding from CPM features produced a final R² of 0.73, a genuine prediction. Per-target feature exclusions are documented in the released training script.
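The leakage mechanism can be made concrete: with log_spend and log_impressions both present, CPM is a deterministic arithmetic function of the features, so no learning is required at all. A minimal numeric check on synthetic data (distributional parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
spend = rng.lognormal(3.0, 1.0, 1000)
impressions = rng.lognormal(8.0, 1.0, 1000)
cpm = spend * 1000.0 / impressions  # the modelling target

# Given the two "features", the target is recoverable exactly --
# which is why a tree ensemble reported R^2 = 1.0000.
log_spend = np.log(spend)
log_impressions = np.log(impressions)
reconstructed = 1000.0 * np.exp(log_spend - log_impressions)

assert np.allclose(reconstructed, cpm)
```

Any feature pair from which the target is an exact identity will produce the same pathology, which is why per-target exclusion lists are necessary.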
-
METHOD
The pipeline employs a stacked ensemble of four base learners: HistGradientBoosting (HGB) [12], LightGBM [10], XGBoost [11], and Random Forest, with a Ridge meta-learner (α = 0.5). Outliers are clipped at the 0.5th and 99.5th percentiles per target. Model evaluation uses a held-out 20% test split, five-fold cross-validation, and 200-resample bootstrap confidence intervals on R².
-
Algorithm
Algorithm 1: Stacked-Ensemble Training
INPUT: X (n×d feature matrix), y (target vector), B (base learners)
OUTPUT: M (final stacked predictor)
1. Partition rows into 5 disjoint folds F₁, …, F₅
2. FOR each base learner b ∈ B:
     FOR each fold k ∈ {1..5}:
       Train b on rows NOT in F_k
       Record b's predictions on F_k as the k-th block of Z[:, b]
3. Train each b ∈ B on the full training set X
4. Train Ridge meta-learner R on (Z, y) with α = 0.5
5. At inference: build z = [b₁(x), b₂(x), b₃(x), b₄(x)] and return R(z)
-
Mathematical Formulation
The stacked predictor combines base learner outputs via a Ridge meta-learner:

  ŷ(x) = Σ_{b ∈ {hgb, lgb, xgb, rf}} β_b · b(x) + β₀        (1)

The Ridge meta-learner minimises:

  min_β ‖y − Z·β‖² + α‖β‖²,  α = 0.5        (2)

Performance is measured by the coefficient of determination on the held-out test set:

  R² = 1 − Σᵢ (yᵢ − ŷᵢ)² / Σᵢ (yᵢ − ȳ)²        (3)

For binary classification tasks, the Brier score [13] and AUC are reported:

  Brier = (1/n) Σᵢ (pᵢ − yᵢ)²        (4)
  AUC = ∫₀¹ TPR(t) d(FPR(t))        (5)

Permutation feature importance is averaged over five repeats:

  PIⱼ = R²(model, X) − mean_{r=1..5} R²(model, X with column j shuffled)        (6)

Bootstrap 95% confidence intervals on R² are computed from 200 resamples:

  CI(R²) = [P₂.₅, P₉₇.₅] of {R²(y*_b, ŷ*_b) : b = 1..200}        (7)

Mean absolute percentage error serves as a sanity check on multiplicative targets:

  MAPE = (1/n) Σᵢ |yᵢ − ŷᵢ| / |yᵢ|        (8)
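Eq. (7) amounts to a short resampling loop. A minimal sketch on toy predictions (the noise scale and sample size are arbitrary, not the paper's data):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
y_true = rng.normal(size=300)
y_pred = y_true + rng.normal(scale=0.5, size=300)  # toy predictions

# Eq. (7): resample (y, y_hat) pairs 200 times with replacement and
# take the 2.5th / 97.5th percentiles of the bootstrap R^2 values.
n = len(y_true)
scores = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    scores.append(r2_score(y_true[idx], y_pred[idx]))
lo, hi = np.percentile(scores, [2.5, 97.5])
print(round(lo, 3), round(hi, 3))
```

Resampling (y, ŷ) pairs jointly, rather than refitting the model per resample, quantifies uncertainty in the test-set score itself, which matches how the CIs in Table II are described.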
-
RESULTS
Table II reports all 15 prediction tasks. R² values, 95% bootstrap confidence intervals, mean absolute error, and five-fold CV statistics are given for regression tasks. Accuracy, F1, AUC, and Brier score are given for classification tasks.
Task                  | Type | Headline  | Detail / CI95         | Error       | CV5 mean±std  | N(test)
CTR                   | Reg  | R² 0.733  | [0.70, 0.76]          | MAE 0.034   | 0.747 ± 0.014 | 4,329
CPC                   | Reg  | R² 0.806  | [0.80, 0.82]          | MAE 2.075   | 0.809 ± 0.003 | 3,492
CVR                   | Reg  | R² 0.678  | [0.60, 0.74]          | MAE 0.039   | 0.650 ± 0.038 | 4,275
ROI                   | Reg  | R² 0.332  | [0.31, 0.36]          | MAE 1.595   | 0.336 ± 0.014 | 2,166
CPM                   | Reg  | R² 0.733  | [0.71, 0.75]          | MAE 20.338  | 0.728 ± 0.006 | 2,229
CPA                   | Reg  | R² 0.767  | [0.75, 0.78]          | MAE 33.978  | 0.752 ± 0.011 | 3,523
Engagement Score      | Reg  | R² 0.387  | [0.35, 0.41]          | MAE 1.854   | 0.392 ± 0.019 | 2,100
High CTR Clf.         | Clf  | Acc 0.950 | F1 0.907, AUC 0.994   | Brier 0.033 | 0.913 ± 0.003 | 4,329
High Engagement Clf.  | Clf  | Acc 0.452 | F1 0.464, AUC 0.635   | Brier 0.224 | 0.052 ± 0.010 | 4,329
High CVR Clf.         | Clf  | Acc 0.635 | F1 0.307, AUC 0.550   | Brier 0.235 | 0.157 ± 0.018 | 4,275
D1 Retention Rate     | Reg  | R² 0.729  | [0.52, 0.84]          | MAE 2.075   | 0.723 ± 0.030 | 224
D7 Retention Rate     | Reg  | R² 0.647  | [0.54, 0.72]          | MAE 0.018   | 0.660 ± 0.026 | 224
D30 Retention Rate    | Reg  | R² 0.685  | [0.60, 0.74]          | MAE 0.012   | 0.695 ± 0.023 | 224
Install Quality Score | Reg  | R² 0.799  | [0.72, 0.85]          | MAE 5.648   | 0.738 ± 0.031 | 224
Cross-Platform Lift   | Reg  | R² 0.368  | [0.19, 0.51]          | MAE 0.068   | 0.383 ± 0.054 | 224
TABLE II. Results for all 15 prediction tasks.
The two most informative tasks in terms of practitioner utility are CTR (CV R² = 0.747 ± 0.014) and CPC (CV R² = 0.809 ± 0.003). The high-CTR binary classifier is both discriminative (AUC = 0.994) and calibrated (Brier = 0.033), meaning the predicted probability can be used directly as a confidence score.
Permutation importance reveals that CTR is dominated by source identity, log-impressions, and platform identity, consistent with the leakage audit findings. D1 retention, in contrast, is driven by source dataset, a has_offer flag, creative theme, and audience segment. This confirms a well-known practitioner heuristic: discount-led creatives purchase clicks; expert-led creatives purchase retained users.
-
Ablation Study
Table III presents the CTR ablation and Table IV presents the D1 retention ablation.
Configuration        | R² (CTR) | Δ vs Full
Full 4-model stack   | 0.7325   | –
Drop HGB             | 0.7290   | −0.0036
Drop LightGBM        | 0.7332   | +0.0007
Drop XGBoost         | 0.7316   | −0.0009
Drop Random Forest   | 0.7327   | +0.0001
Only HGB             | 0.7320   | –
Only LightGBM        | 0.7252   | –
Only XGBoost         | 0.7275   | –
Only Random Forest   | 0.7259   | –
TABLE III. Model ablation on CTR prediction.
Configuration         | R² (D1) | Δ vs Full
Full 4-model stack    | 0.7199  | –
Drop HGB              | 0.7355  | +0.0157
Drop LightGBM         | 0.7467  | +0.0269
Drop XGBoost          | 0.7654  | +0.0455
Drop Random Forest    | 0.7009  | −0.0190
Drop manual tags (D1) | 0.7541  | +0.0343
Drop LLM rows (D1)    | 0.7778  | +0.0579
TABLE IV. Model and data ablation on D1 retention.
Two findings emerge. First, on CTR, all four base learners achieve within 0.01 R² of one another; the full stack improves on the best single learner (HGB, R² = 0.7320) by only 0.0006, which is within measurement noise. Second, removing the LLM-augmented rows from the D1 training set improves R² by 0.0579, a result reproducible across three independent random seeds. The most plausible explanation is that the heuristic used to synthesise missing D1 labels in the LLM files introduced a deterministic artefact that the model fitted, and this pattern did not transfer to the manually-tagged ground truth.
-
Baseline Comparison
Model                   | R² (CTR) | R² (D1)
Predict Mean            | −0.0015  | −0.0104
Linear Regression       | 0.2105   | 0.1325
Ridge Regression        | 0.2105   | 0.1340
Single HGB              | 0.7319   | 0.7429
Single XGBoost          | 0.7271   | 0.6936
Full Stack (4 learners) | 0.7325   | 0.7199
TABLE V. Baseline comparison on CTR and D1 retention.
Predicting the training mean yields R² near zero, confirming the variance in both targets is real. Linear and Ridge regression capture only 21% of CTR variance and 13% of D1 variance, confirming that non-linear feature interactions are essential. Single HGB and single XGBoost perform comparably to the full four-model stack on CTR. Practitioner recommendation: begin with a single HGB or XGBoost; add stacking complexity only if the single model plateaus on additional tasks.
-
LIMITATIONS
Several limitations constrain the scope of conclusions that can be drawn from this work.
- Source heterogeneity. Six Kaggle vendors define impression and click differently. Source identity is included as a feature to allow learners to absorb systematic bias; residual confounding is likely not fully eliminated.
- Synthetic retention labels. D7, D30, and cross-platform lift are heuristically derived, not sourced from a measurement partner. The ablation quantifies one cost: 5.8 R² points lost on D1 from LLM-generated rows.
- Manual sample size. 120 ads provide categorical breadth but are insufficient for brand-level inference. A target of ~500 ads per priority brand, stratified across platforms, is recommended for a follow-up collection.
- ROI ceiling. ROI in Source 6 has |r| < 0.02 with every numeric feature in the dataset; prediction is not feasible without better underlying data.
- Stack complexity. As demonstrated in Section V-B, stacking is not strictly necessary for campaign-level targets. It was retained because the ensemble methodology was a demonstration goal of the project and provides 1–2 R² points of benefit on noisier creative-metadata targets.
-
CONCLUSION
This paper presented a hybrid data pipeline and stacked ensemble for predicting 15 digital advertising performance metrics. Cross-validated R² ranged from 0.65 to 0.81 on well-defined campaign targets. The high-CTR classifier achieved AUC = 0.994 and a Brier score of 0.033. D1 and install-quality predictions attained R² between 0.73 and 0.80.
Two findings are highlighted for practitioners. First, a single HistGradientBoosting model provides performance within noise of the full four-model stack on campaign-level targets, offering a simpler and more maintainable deployment option. Second, LLM-generated synthetic rows can degrade generalisation when the synthesiser introduces label artefacts; augmentation quality must be validated before inclusion in training sets.
Two avenues for future work are identified. Replacing the heuristic retention block with real measurement-partner data would remove the largest source of label noise in the creative-metadata targets. Replacing categorical theme labels with LLM embeddings of actual ad copy text would capture richer creative signal than the current taxonomy-based approach.
REFERENCES
[1] Kaggle, "Facebook Ad Campaign Dataset," kaggle.com/datasets, accessed Apr. 2026.
[2] Kaggle, "Social Media Advertising 300k," kaggle.com/datasets, accessed Apr. 2026.
[3] Kaggle, "Ad Click Prediction 10k," kaggle.com/datasets, accessed Apr. 2026.
[4] Kaggle, "Social Media Ad Optimisation," kaggle.com/datasets, accessed Apr. 2026.
[5] Kaggle, "Full Ad Campaign Relational Database," kaggle.com/datasets, accessed Apr. 2026.
[6] Kaggle, "Marketing Campaign Performance 200k," kaggle.com/datasets, accessed Apr. 2026.
[7] Meta Platforms, Inc., "Meta Ad Library," facebook.com/ads/library, accessed Apr. 2026.
[8] Google LLC, "Google Ads Transparency Centre," adstransparency.google.com, accessed Apr. 2026.
[9] J. H. Friedman, "Greedy function approximation: A gradient boosting machine," Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
[10] G. Ke, Q. Meng, T. Finley et al., "LightGBM: A highly efficient gradient boosting decision tree," in Proc. NeurIPS, vol. 30, 2017.
[11] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," in Proc. KDD, pp. 785–794, 2016.
[12] F. Pedregosa et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[13] G. W. Brier, "Verification of forecasts expressed in terms of probability," Monthly Weather Review, vol. 78, no. 1, pp. 1–3, 1950.
