Application of Negative Binomial Regression in Analyzing Health Data of Air Pollution: A Comparative Study

Deepali N. Bramhpurkar; Dr. Swati Desai; Dr. Sangita Patil

doi:10.17577/IJERTCONV14IS020148

NCRTCS - 2026 (Volume 14 – Issue 02)


Open Access
Authors : Deepali N. Bramhpurkar, Dr. Swati Desai, Dr. Sangita Patil
Paper ID : IJERTCONV14IS020148
Volume & Issue : Volume 14, Issue 02, NCRTCS – 2026
Published (First Online) : 21-04-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Deepali N. Bramhpurkar, Statistics

Shri Jagdisprasad Jhambarmal Tibrewala University, Jhunjhunu, Rajasthan.

Dr. Swati Desai, Statistics

Shri Jagdisprasad Jhambarmal Tibrewala University, Jhunjhunu, Rajasthan.

Dr. Sangita Patil, Statistics

MIT Arts, Commerce & Science College, Alandi, Pune

Abstract – This study investigates the application of Negative Binomial regression for analyzing hospital admission data in the presence of air pollution. Daily hospital visits were modeled as count responses and examined using Generalized Linear Models implemented in MATLAB. The performance of the Negative Binomial approach was compared with Ordinary Least Squares and transformation-based methods, including logarithmic and square-root transformations. Model evaluation was carried out using confidence interval length, coverage probability, and goodness-of-fit measures. The results reveal substantial overdispersion in the health data, for which traditional methods provide inefficient and unstable inference. In contrast, the Negative Binomial model demonstrates superior reliability by producing adaptive confidence intervals and balanced coverage, making it a robust framework for environmental health assessment.

Keywords – Negative Binomial Regression, GLM, Overdispersion, Hospital Admissions, Air Pollution, Confidence Interval

INTRODUCTION

Hospital admission data are commonly recorded as count outcomes and are widely used in environmental epidemiology to assess the impact of air pollution on public health. In the present study, daily hospital visits were obtained by aggregating individual patient admission records over 727 consecutive days. The corresponding environmental data included daily measurements of Air Quality Index (AQI), PM2.5, and PM10 concentrations.

Preliminary exploratory analysis of the dataset revealed a mean daily hospital visit count of 21.32 and a variance of 58.43, yielding a dispersion ratio (variance divided by mean) of 2.74. Since the variance substantially exceeds the mean, the data exhibit clear overdispersion relative to the Poisson assumption that the variance equals the mean. Count data of this kind also tend to violate the normality and constant-variance assumptions underlying traditional Ordinary Least Squares (OLS) regression. When applied to overdispersed count data, OLS may produce inefficient estimates, misleading standard errors, and unreliable confidence intervals.

To address non-normality and heteroscedasticity, transformation-based methods such as logarithmic and square-root transformations are often employed. Although these techniques attempt to stabilize variance, they introduce interpretational challenges and may result in biased back-transformed estimates, particularly in the presence of zero counts and high variability.

Generalized Linear Models (GLMs) provide a principled framework for modeling non-normal response variables. In particular, the Negative Binomial regression model extends the Poisson model by incorporating a dispersion parameter, allowing the variance to exceed the mean. This flexibility makes it especially suitable for environmental health data characterized by heterogeneity and unobserved risk factors. Previous methodological studies have demonstrated, through simulation experiments, that the Negative Binomial GLM outperforms OLS and transformation-based methods in factorial experimental settings involving overdispersed responses. However, validation of these findings using real-world environmental health datasets remains limited.
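For reference, the mean-variance relationship that separates the two count models can be written as follows. This is a sketch that assumes the common NB2 parameterization with dispersion parameter k and a log link; the paper does not state which parameterization its MATLAB implementation used, so both choices are assumptions made here for illustration.

```latex
\begin{align*}
\text{Poisson:} \quad & \operatorname{Var}(Y_i) = \mu_i, \\
\text{Negative Binomial:} \quad & \operatorname{Var}(Y_i) = \mu_i + \frac{\mu_i^{2}}{k},
\qquad \log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta}.
\end{align*}
```

Under this form the Negative Binomial variance always exceeds the mean for finite k, and the model approaches the Poisson model as k grows large.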

The present study extends prior simulation-based evidence by applying OLS, logarithmic transformation, square-root transformation, and Negative Binomial GLM approaches to real hospital admission and air pollution data. All analyses were implemented in MATLAB, enabling direct comparison of confidence interval performance, expected interval length, and coverage probability through Monte Carlo simulation. The primary objective is to identify the most reliable and efficient inferential framework for modeling pollution-related hospital admissions under overdispersion.

OBJECTIVES

To examine the presence of overdispersion in daily hospital admission data associated with air pollution exposure.
To apply and compare different regression approaches, namely Ordinary Least Squares (OLS), logarithmic transformation, square-root transformation, and Negative Binomial Generalized Linear Models, for modeling pollution-related hospital visits.
To evaluate the efficiency of confidence intervals obtained from each method by computing the Expected Length of Confidence Intervals (ELOCI).
To assess the reliability of statistical inference by estimating coverage probabilities through Monte Carlo simulation.
To investigate the association between air pollution indicators (AQI, PM2.5, and PM10) and hospital admissions using real-world health data.
To validate previous simulation-based findings by extending them to an applied environmental health context.
To identify the most appropriate modeling framework for overdispersed health count data based on empirical performance measures.

METHODOLOGY
Monte Carlo samples were generated from the fitted Negative Binomial model using estimated dispersion parameters. For each simulated dataset, all four models were refitted and confidence intervals were recalculated. Coverage probability was computed as the proportion of intervals containing the true mean response.
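The simulation procedure described above can be sketched in Python (the study itself used MATLAB, and its code is not reproduced here). This is a deliberately simplified, intercept-only illustration, not the authors' implementation. The function names `draw_nb`, `wald`, `intervals`, and `coverage` are hypothetical; the generating values mu = 21.32 and k = 71.453 are borrowed from the reported summary statistics; the sample size, number of replications, and seed are arbitrary; k is treated as known; the LOG method is assumed to use log(y + 1); and normal quantiles are used throughout. Its output is therefore not expected to match the coverage values reported in this paper.

```python
import math
import random
from statistics import NormalDist, mean, stdev

Z = NormalDist().inv_cdf(0.975)  # two-sided 95% normal quantile


def draw_nb(rng, mu, k):
    """One NB(mu, k) draw as a gamma-Poisson mixture (variance mu + mu^2/k)."""
    lam = rng.gammavariate(k, mu / k)          # gamma-distributed rate, mean mu
    limit, count, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:                        # Knuth's Poisson sampler
        count += 1
        prod *= rng.random()
    return count


def wald(values):
    """Normal-theory interval for the mean of `values`."""
    centre = mean(values)
    half = Z * stdev(values) / math.sqrt(len(values))
    return centre - half, centre + half


def intervals(y, k):
    """95% intervals for the mean response from the four approaches."""
    n, ybar = len(y), mean(y)
    lo_log, hi_log = wald([math.log(v + 1) for v in y])   # assumed log(y + 1)
    lo_sq, hi_sq = wald([math.sqrt(v) for v in y])
    se_nb = math.sqrt((1 / ybar + 1 / k) / n)  # delta-method SE of log(ybar)
    return {
        "OLS": wald(y),
        "LOG": (math.exp(lo_log) - 1, math.exp(hi_log) - 1),
        "SQRT": (max(lo_sq, 0.0) ** 2, hi_sq ** 2),
        "NBGLM": (ybar * math.exp(-Z * se_nb), ybar * math.exp(Z * se_nb)),
    }


def coverage(mu=21.32, k=71.453, n=200, reps=300, seed=1):
    """Proportion of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = {m: 0 for m in ("OLS", "LOG", "SQRT", "NBGLM")}
    for _ in range(reps):
        y = [draw_nb(rng, mu, k) for _ in range(n)]
        for method, (lo, hi) in intervals(y, k).items():
            hits[method] += int(lo <= mu <= hi)
    return {m: h / reps for m, h in hits.items()}


if __name__ == "__main__":
    print(coverage())
```

In this intercept-only setting, the back-transformed LOG and SQRT intervals target the back-transformed mean of the transformed counts rather than the arithmetic mean, which is one reason their coverage of the true mean can fall below the nominal level.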

RESULT AND DISCUSSION

Analysis of the real data set using MATLAB:

1. Mean = 21.32
2. Variance = 58.43
3. Dispersion = 2.74
4. Estimated dispersion k = 71.453
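The arithmetic behind the dispersion ratio can be checked with a short Python sketch. The helper `dispersion_summary` is a hypothetical name, and the method-of-moments value of k assumes the NB2 variance form Var(Y) = mean + mean²/k, which the paper does not confirm. This marginal moment estimate ignores the pollution covariates, so it need not agree with the estimated k listed above, which presumably comes from the fitted regression model.

```python
def dispersion_summary(mean, variance):
    """Return the variance-to-mean ratio and a method-of-moments k.

    Assumes Var(Y) = mean + mean**2 / k (NB2 form); k is None when the
    variance does not exceed the mean, since no finite k fits that case.
    """
    ratio = variance / mean
    k_mom = mean ** 2 / (variance - mean) if variance > mean else None
    return ratio, k_mom


# Summary statistics reported for the daily hospital visit counts.
ratio, k_mom = dispersion_summary(21.32, 58.43)
print(f"dispersion ratio = {ratio:.2f}")   # 2.74
print(f"moment-based k   = {k_mom:.2f}")
```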

Overdispersion:

The variance of hospital admissions exceeded the mean, confirming overdispersion. The presence of overdispersion justifies the application of the Negative Binomial model, which explicitly accounts for extra-Poisson variation.

PM2.5 and AQI showed significant positive associations with admissions.

The NBGLM provided realistic standard errors, whereas OLS underestimated the variance.

Individual Confidence Interval:

OLS: [19.44, 21.49], [19.81, 21.82], ...

LOG: [-1, ∞]

SQRT: [10¹², 10¹³]

NBGLM: [19.88, 21.10], [20.21, 21.41], ...

The OLS confidence intervals are symmetric and reasonable but relatively wide. The log-transformed intervals contain negative lower bounds and infinite upper bounds, which are meaningless for count data. The square-root method produces unrealistically large values due to numerical explosion after back-transformation. The Negative Binomial intervals remain within realistic ranges and preserve the count nature of the response variable. This highlights the practical superiority of the GLM approach.

Coverage Probability Analysis (Corrected Simulation):

Method	Coverage
OLS	0.949
LOG	0.743
SQRT	0.891
NBGLM	0.916

The OLS method achieved coverage close to the nominal 95% level. However, this high coverage is mainly due to excessively wide confidence intervals, indicating conservative inference. The logarithmic transformation performed poorly, covering the true mean in only 74% of cases, demonstrating severe undercoverage. The square-root transformation showed moderate performance but still failed to reach the nominal level. The Negative Binomial GLM achieved coverage of approximately 92%, which is reasonably close to the target level, reflecting reliable inferential properties. These results demonstrate that NBGLM offers a better balance between precision and reliability.
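When comparing these coverage estimates with the nominal 95% level, it helps to keep their Monte Carlo uncertainty in mind. The sketch below uses the usual binomial standard error of a simulated proportion. The number of replications is not stated in this paper, so the value of 1000 is purely hypothetical, and `coverage_se` is an illustrative helper name.

```python
import math


def coverage_se(p_hat, reps):
    """Binomial standard error of a coverage proportion from `reps` runs."""
    return math.sqrt(p_hat * (1 - p_hat) / reps)


reported = {"OLS": 0.949, "LOG": 0.743, "SQRT": 0.891, "NBGLM": 0.916}
reps = 1000  # hypothetical: the paper does not report the replication count

for method, p_hat in reported.items():
    se = coverage_se(p_hat, reps)
    # Distance from the nominal level, measured in Monte Carlo standard errors.
    gap = (0.95 - p_hat) / se
    print(f"{method}: coverage={p_hat:.3f}, SE={se:.4f}, gap={gap:.1f} SE")
```

With a different replication count the standard errors scale by the square root of the ratio, so the gaps should be read only as a rough guide.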

Expected Length of Confidence Intervals (ELOCI):

Method	ELOCI
OLS	3.433
LOG	∞
SQRT	8.88 × 10
NBGLM	2.087
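As a minimal illustration of how an expected interval length is formed, the Python sketch below averages interval lengths over a set of intervals. In the study the average is taken over Monte Carlo replications; here only the individual intervals listed earlier are used, so the printed numbers are illustrative and differ from the ELOCI values in the table. The function name `expected_length` is hypothetical.

```python
import math


def expected_length(intervals):
    """Average length of a collection of (lower, upper) confidence intervals."""
    lengths = [upper - lower for lower, upper in intervals]
    return sum(lengths) / len(lengths)


# Individual intervals reported in the results section.
ols = [(19.44, 21.49), (19.81, 21.82)]
nbglm = [(19.88, 21.10), (20.21, 21.41)]
log_ci = [(-1.0, math.inf)]  # back-transformed LOG interval with infinite upper bound

print(round(expected_length(ols), 2))    # 2.03
print(round(expected_length(nbglm), 2))  # 1.21
print(expected_length(log_ci))           # inf
```

A single interval with an infinite bound is enough to make the average infinite, which is consistent with the infinite ELOCI reported for the LOG method.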

The OLS method produced wide confidence intervals, indicating low efficiency. The log-transformation method resulted in infinite interval lengths, reflecting numerical instability during back-transformation. Similarly, the square-root transformation produced extremely large interval lengths, indicating serious computational and statistical limitations. In contrast, the Negative Binomial model generated moderate and stable interval lengths. This confirms that NBGLM yields more efficient and interpretable confidence intervals compared to transformation-based approaches.

Comparative Performance of Methods:

Criterion	OLS	LOG	SQRT	NBGLM
Handles Overdispersion		Partially acceptable		Appropriate
Coverage Accuracy	High (Overwide)	Poor	Moderate	Good
CI Stability	Moderate	Poor	Poor	Good
Interpretability	Moderate	Low	Low	High
Overall Performance				Appropriate

CONCLUSION:

This study investigated the suitability of different regression approaches for modeling pollution-related hospital admission data characterized by overdispersion. Using real-world health and environmental datasets, Ordinary Least Squares, logarithmic transformation, square-root transformation, and Negative Binomial Generalized Linear Models were systematically compared through empirical analysis and Monte Carlo simulation.

Preliminary data exploration revealed substantial overdispersion, with the variance exceeding the mean by more than twofold. This violated the fundamental assumptions of conventional linear regression and justified the use of distribution-based models. Although OLS achieved near-nominal coverage probability, it produced excessively wide confidence intervals, leading to inefficient and conservative inference. Transformation-based methods exhibited numerical instability, infinite or inflated confidence intervals, and severe undercoverage, thereby limiting their practical usefulness.

In contrast, the Negative Binomial GLM consistently demonstrated superior inferential performance. It generated stable and interpretable confidence intervals, achieved reasonably high coverage probabilities, and maintained moderate interval lengths by explicitly accounting for extra-Poisson variation. The balanced performance of the Negative Binomial model highlights its theoretical and practical advantages for analyzing overdispersed health count data.

LIMITATIONS

OLS: Does not explicitly model overdispersion.
LOG: Cannot handle zero values directly without adding an arbitrary constant.
SQRT: Back-transformation introduces bias in estimated mean responses. Only partially stabilizes variance for highly dispersed data.
NBGLM: Assumes a specific mean-variance relationship. More computationally intensive than OLS.

CITATION:

Analysis of 2 Factorial Experiments with Negative Binomial Response Variables: A Comparative Study. International Conference on Bridging Disciplines, Shaping the Future: Integrative Approaches to Global Challenges; SJJTU/CONF/CSE/PAR/011/2025/257.

