Strategic Insights into Vehicles Fuel Consumption Patterns: Innovative Approaches for Predictive Modeling and Efficiency Forecasting

DOI : 10.17577/IJERTV13IS060063


Kajal Sheth, New York Tech, Dhvanil Patel, Texas A&M University,

Gautam Swami, Tulane University

Abstract: This study explores significant trends in fuel efficiency and carbon emissions within the passenger vehicle sector, utilizing extensive data from the U.S. Department of Energy and Environmental Protection Agency. Analyzing vehicle data from 1984 to the present, we identified key patterns in miles-per-gallon performance, tailpipe CO2 emissions, and economic impacts of vehicle efficiency. Our research highlights advancements in vehicle technology and shifts in consumer choices, underscoring their implications for environmental policy and sustainable transportation strategies. Employing advanced statistical analyses and machine learning in R and visualizations in Tableau, this study provides insights that support informed policy-making and strategic decisions in the transportation sector.

Keywords: Fuel efficiency; Carbon emissions; Transportation sector; Environmental policy; Predictive analytics


    The transportation sector is a significant contributor to global carbon emissions, accounting for nearly a quarter of direct CO2 emissions from fuel combustion [1]. The urgency to reduce these emissions has intensified under the pressures of climate change and environmental degradation. Advances in vehicle technology have shown promise in improving fuel efficiency and reducing emissions, yet the pace of improvement and adoption varies widely across vehicle types and regions. This study leverages a comprehensive dataset from the U.S. Department of Energy's official fuel economy information portal [2] to examine the evolution of fuel efficiency and carbon emissions within the passenger vehicle sector.

    Despite abundant data on vehicle performance, there remains a gap in understanding the long-term trends and their implications for policy and consumer behavior. Most existing studies focus on short-term impacts or specific vehicle types, lacking a holistic view of the transportation sector's progress toward sustainability goals. This research aims to fill this gap by systematically analyzing fuel consumption patterns and emission trends over several decades, highlighting the technological and regulatory shifts that have influenced these trends. Specifically, the study seeks to identify key trends in fuel efficiency improvements, analyze the variance in emissions among different vehicle classes, and assess the economic implications of evolving fuel economy standards.

    Using data curated and maintained by the U.S. Department of Energy's (DOE) Office of Energy Efficiency and Renewable Energy and supplemented by the U.S. Environmental Protection Agency (EPA), this analysis spans a broad spectrum of vehicles from 1984 to the present. The dataset includes detailed metrics such as miles-per-gallon in various driving contexts, tailpipe CO2 emissions, and potential cost savings across diverse vehicle classes, including sedans, SUVs, and trucks [2]. This rigorous dataset, derived from standardized laboratory tests, provides a reliable foundation for comparing vehicle performance equitably. By exploring these comprehensive data, the study aims to provide strategic insights that can inform both policy-making and consumer decisions, ultimately supporting the DOE and EPA's mandate under the Energy Policy Act of 1992 to deliver precise and actionable fuel economy data to the public [1].


    1. Exploratory Data Analysis

      In our exploratory data analysis (EDA), we thoroughly examined the dataset, which consists of 43,156 entries across 83 distinct attributes. This extensive dataset provided a robust framework for our analysis. Initially, we focused on generating summary statistics for key variables that are critical to our research questions. These statistics offered initial insights into the data's central tendencies and variability, setting the stage for a deeper, more focused analysis. The subsequent section presents these summary statistics, detailing the measures of central tendency, dispersion, and the distribution of these selected variables.

      Fig. 1. Summary Statistics for EDA

      To better understand the distribution of our categorical variables, namely Make, FuelType1, FuelType2, Supercharger, Turbocharger, and Number of Cylinders, we conducted a frequency analysis. Frequency tables were generated to depict the occurrence of each category within these variables, providing a clear visual representation of the data distribution.

      Table I. Frequency Table for FuelType1 (categories include Regular Gasoline, Premium Gasoline, Midgrade Gasoline, and Natural Gas)

      Table II. Frequency Table for FuelType2 (categories include Natural Gas)

      Table III. Frequency Table for Number of Cylinders

      The frequency tables provide a straightforward means to examine the distribution of categorical variables within the dataset. For instance, Table III shows that 4- and 6-cylinder engines are the most common, whereas only 14 vehicles are equipped with 16 cylinders, highlighting the rarity of such configurations.
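The frequency analysis described above can be sketched in a few lines. The study used R; the snippet below is an illustrative Python equivalent, and the cylinder counts in it are made-up sample values, not figures from the dataset.

```python
from collections import Counter

# Hypothetical sample of cylinder counts; not values from the study's dataset.
cylinders = [4, 6, 4, 8, 6, 4, 12, 6, 4, 16]

def frequency_table(values):
    """Return (category, count) pairs sorted by descending count."""
    return Counter(values).most_common()

table = frequency_table(cylinders)
for category, count in table:
    print(f"{category:>3} cylinders: {count}")
```

The same helper applies unchanged to any categorical column, such as Make or FuelType1.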

      Following the analysis of categorical data, we proceed to examine continuous variables through visual means. Box plots have been utilized to depict the distribution of the Miles Per Gallon (MPG) attributes, tailpipe emissions, and fuel consumption rates. These visualizations are essential for understanding the spread and central tendencies of these variables, as well as for identifying outliers that may influence subsequent analyses.

      Fig. 2. Spread of MPG Attributes

      Fig. 3. Spread of Tailpipe Emissions

      Fig. 4. Average Fuel Economy by Fuel Type

      From the spread of the MPG attributes, we notice that the mean values of City MPG, Highway MPG, and Combined MPG are very close to each other, at around 20 miles per gallon. The spread of tailpipe emissions shows that the mean emissions in the dataset are around 400 grams per mile, and the average fuel consumption is around 15 barrels.

      The other variables we use in our dataset are the Model and the Make of the vehicles. The chart below shows the top manufacturers by the number of models. The top three manufacturers in the dataset are Mercedes-Benz, BMW, and Chevrolet. Among the manufacturers shown, Cadillac, Suzuki, and Subaru have the fewest models.

      Fig. 5. Top Manufacturers by Count of Model

      Two other key parameters for the analysis are the average MPG and the primary fuel type. Therefore, we plot the yearly trend of average MPG by fuel type. Intuitively, this trend is indicative of the improvement in efficiency in the transportation sector and vehicle technology in general. The results are shown below.

      Fig. 6. Average Fuel Economy by Fuel Type

      As can be seen in the image above, gasoline and electric vehicles show the greatest improvement in fuel economy, while diesel and CNG vehicles show fluctuations. Although the scale on the y-axis is relative for each graph, in general, electric vehicles have the highest fuel economy, followed by CNG vehicles and diesel vehicles, with gasoline vehicles showing the lowest fuel economy. This conforms with the current trends in the transportation sector.

    2. Data Manipulation and Cleaning

    The raw dataset contained 83 variables, of which only 21 were relevant for this analysis. These were selected and stored in a new data frame. Because vehicle volume was a parameter of interest, the six attributes for luggage and passenger volumes were summed to obtain the total volume of each vehicle; replacing those six attributes with the single total leaves 16 variables. The final dataset used for all analyses therefore contains 43,023 records and 16 variables. All records were kept, not just the distinct ones, since some manufacturers release models with the same name for multiple years.
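As a rough illustration of the volume calculation, the sketch below sums six volume fields for one record. It is written in Python rather than the R used in the study. The column names are assumptions, since the paper does not list them, and the sample record is invented.

```python
# Assumed names of the six luggage/passenger volume columns;
# the paper does not list them, so treat these as placeholders.
VOLUME_COLUMNS = ("hlv", "hpv", "lv2", "lv4", "pv2", "pv4")

def total_volume(record):
    """Sum the six volume fields, treating missing or blank cells as 0."""
    return sum(float(record.get(col) or 0) for col in VOLUME_COLUMNS)

# Hypothetical record for a four-door sedan (cubic feet).
vehicle = {"make": "Example", "lv4": 14, "pv4": 96, "hlv": "", "hpv": None}
vehicle["volume"] = total_volume(vehicle)
print(vehicle["volume"])
```

Treating blanks as 0 mirrors the cleaning step in which empty cells are replaced before the summary statistics are computed.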

    Next, the categorical variables were converted to factors for the analysis. These include Make, Model, FuelType1, FuelType2, Supercharger, and Turbocharger.

    Next, the empty cells in various attributes were replaced with 0 or blank values to ensure consistency in the data. The summary statistics after cleaning the data are shown below:

    Fig. 7. Summary Statistics after Data Manipulation

    Next, two categorical variables, FuelType1 and the vehicle class, were regrouped into fewer levels using regular expressions to narrow the scope of the analysis. The change in FuelType1 is described below:

    Table IV. Cleaning of FuelType1 Variable (surviving entries: Regular Gasoline, Premium Gasoline, Natural Gas, Midgrade Gasoline)

    The vehicle Class variable was also grouped from 34 unique values down to 5. These new levels are Other, Cars, Vans, Pickup Trucks, and SUVs.
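A minimal sketch of regex-based regrouping is shown below, in Python rather than the R used in the study. The patterns, the raw class strings, and the target label "CNG" for natural gas are assumptions made for illustration; the paper does not list its actual regular expressions or mappings.

```python
import re

# Hypothetical patterns; the study's actual regular expressions are not given.
FUEL_PATTERNS = [
    (re.compile(r"gasoline", re.IGNORECASE), "Gasoline"),
    (re.compile(r"natural gas", re.IGNORECASE), "CNG"),
    (re.compile(r"diesel", re.IGNORECASE), "Diesel"),
    (re.compile(r"electricity", re.IGNORECASE), "Electricity"),
]
CLASS_PATTERNS = [
    (re.compile(r"sport utility", re.IGNORECASE), "SUVs"),
    (re.compile(r"pickup", re.IGNORECASE), "Pickup Trucks"),
    (re.compile(r"van", re.IGNORECASE), "Vans"),
    (re.compile(r"cars|wagons|two seaters", re.IGNORECASE), "Cars"),
]

def regroup(value, patterns, default="Other"):
    """Return the label of the first pattern found in value, else default."""
    for pattern, label in patterns:
        if pattern.search(value):
            return label
    return default

print(regroup("Premium Gasoline", FUEL_PATTERNS))
print(regroup("Minivan - 2WD", CLASS_PATTERNS))
```

Pattern order matters: the first match wins, so more specific patterns should be listed before broader ones.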


    1. Efficiency of Superchargers vs. Turbochargers

      This part aims to investigate the impact superchargers and turbochargers have on IC engine vehicles. Both these components are used to increase the combustion efficiency of the engine and increase the power output. While the supercharger draws power directly from the crankshaft, the turbocharger uses the exhaust gases created by combustion to power the compressor. This is an inferential type question which compares 2 independent attributes.

      For the purpose of the analysis, the vehicles with either a turbocharger or a supercharger are filtered. The summary statistics are shown below:

      Table V. Summary Statistics for Superchargers and Turbochargers

      Fig. 8. Comparison of Superchargers and Turbochargers

      It can be observed that there is some overlap between the two attributes, i.e., some vehicles have both components. These overlapping records are eliminated to ensure that the two groups are independent. The MPG values can be assumed to be normally distributed; hence, we can use a t-test for the hypothesis test. For this test, the hypotheses are:

      • H0: $\mu_{\text{Turbocharger}} = \mu_{\text{Supercharger}}$ (i.e. there is no difference in average fuel economies)

      • Ha: $\mu_{\text{Turbocharger}} \neq \mu_{\text{Supercharger}}$ (i.e. there is a difference in average fuel economies)

        The results of the t-test are shown below:

        Fig. 9. T-Test on Superchargers and Turbochargers

        Due to the extremely small p-value, the null hypothesis is rejected. Therefore, there is statistically significant evidence that fuel economy differs between vehicles with superchargers and vehicles with turbochargers. Of the two, vehicles with turbochargers have significantly better fuel economy.
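For readers who want to see what a two-sample t-test computes, here is a small Python sketch. It implements the Welch (unequal-variance) form, which is an assumption: the paper does not state which variant was run in R. The MPG samples are invented, and the sketch stops at the test statistic and degrees of freedom rather than the p-value.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    na, nb = len(sample_a), len(sample_b)
    # Squared standard error contributed by each sample.
    va, vb = variance(sample_a) / na, variance(sample_b) / nb
    t = (mean(sample_a) - mean(sample_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (na - 1) + vb ** 2 / (nb - 1))
    return t, df

# Hypothetical combined-MPG samples, not values from the dataset.
turbo_mpg = [24, 26, 23, 27, 25, 28]
super_mpg = [19, 21, 18, 20, 22, 20]

t_stat, dof = welch_t(turbo_mpg, super_mpg)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

A large positive t here would indicate a higher mean for the first sample; the p-value then follows from the t distribution with the computed degrees of freedom.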

    2. Fuel Economy Comparison

      This section aims to investigate the difference in fuel economies for different classes of vehicles such as cars, vans, pickup trucks, SUVs, and others. This analysis is performed on IC engine vehicles only to ensure comparison between similar entities. This question aims to check what effect the utility of the vehicle has on the MPG. This is an inferential type question which compares the means of a common attribute between different groups.

      Vehicles with FuelType1 as Gasoline, Diesel, or CNG and FuelType2 as E85, Propane, or CNG are selected. This ensures that no hybrid EV or battery EV is included in the analysis. The summary statistics are shown below:

      Table VI. Summary Statistics for FuelType1 (rows include Pickup Trucks)

      Fig. 10. Comparison of MPG by Vehicle Class

      From the figure above, the average MPG values do not appear to differ greatly; however, the data contain outliers, which may skew this observation. The data are independent and can be assumed to be normally distributed; therefore, we use the ANOVA method to statistically model the data and test the hypothesis. For this test, the hypotheses are:

      • H0: $\mu_{\text{Vans}} = \mu_{\text{SUV}} = \mu_{\text{Trucks}} = \mu_{\text{Cars}} = \mu_{\text{Others}}$ (i.e. all means are the same)

      • Ha: not all of $\mu_{\text{Vans}}, \mu_{\text{SUV}}, \mu_{\text{Trucks}}, \mu_{\text{Cars}}, \mu_{\text{Others}}$ are equal (i.e. at least some of the means are different)

        To perform the ANOVA test, a linear regression is performed, and an ANOVA table is created using the anova function. The result is shown below:

        Fig. 11. ANOVA table

        Due to the extremely small p-value returned, the null hypothesis is rejected. Therefore, there is statistically significant evidence that fuel economy differs between classes of IC engine vehicles. This difference also sheds light on the relation between MPG and the load-carrying capacity of the vehicle. Cars, which are mainly passenger vehicles, show the highest MPG, whereas trucks and vans, which are mainly cargo vehicles, show the lowest MPG values.
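The one-way ANOVA used above reduces to comparing between-group and within-group variation. The Python sketch below computes the F statistic by hand on invented MPG samples; it is an illustration of the calculation, not the R code used in the study, and it does not compute the p-value.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F statistic, df_between, df_within) for a dict of samples."""
    all_values = [x for sample in groups.values() for x in sample]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(s) * (mean(s) - grand_mean) ** 2 for s in groups.values()
    )
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(s)) ** 2 for s in groups.values() for x in s)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical combined-MPG samples per class, not values from the dataset.
mpg_by_class = {
    "Cars": [24, 26, 25],
    "Vans": [16, 17, 18],
    "Pickup Trucks": [15, 16, 17],
}
f_stat, df_b, df_w = one_way_anova(mpg_by_class)
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F means the class means differ by much more than the scatter within each class would explain, which is the situation the small p-value in the ANOVA table reflects.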

    3. Engine Displacement and Vehicle Volume Correlation

      This part aims to find out if there is a correlation between the engine displacement and the volume of the vehicle. Before performing a correlation test on these variables, we choose all the vehicles which are not electric vehicles and consider a vehicle volume of more than 20 cubic feet but less than 230 cubic feet to eliminate outliers. The null values in the engine displacement attribute are also ignored.

      The summary statistics for the vehicle volume attribute show a total of 22,788 rows with a mean of 120.23 cubic feet, a standard deviation of 36.36 cubic feet, and a median of 111 cubic feet. These summary statistics are visualized using a box plot, as shown below.

      Fig. 12. Box Plot for Vehicle Volume

      We assume that the two attributes are not normally distributed and hence use the Kendall rank correlation test. The results of the test are shown below.

      Fig. 13. Kendall's Rank Correlation Test

      Tau is the Kendall correlation coefficient. The p-value for the test is 0.1385 and the tau value is -0.0068. Since tau is very close to zero and the p-value exceeds 0.05, there is no significant correlation between the two variables. A scatter plot of the two variables, shown below, confirms this.

      Fig. 14. Correlation between Engine Displacement and Vehicle Volume

      The above scatter plot is made using the ggpubr library function ggscatter which displays the correlation coefficient and the significance level on the plot. A regression line is added to the chart shown in red and the points are grouped by the volume of the vehicle.
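To make the Kendall coefficient concrete, the sketch below counts concordant and discordant pairs directly. It is a plain Python illustration with invented displacement and volume values, and it uses the tau-b tie correction, which is an assumption about the variant computed in the study. The quadratic pair loop is only suitable for small samples.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b from pairwise comparisons (O(n^2), small samples only)."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from both factors
            if dx == 0:
                tied_x_only += 1
            elif dy == 0:
                tied_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom

# Hypothetical engine displacement (litres) and vehicle volume (cubic feet).
displacement = [1.6, 2.0, 2.5, 3.0, 3.5]
volume = [110, 95, 120, 100, 115]
tau = kendall_tau_b(displacement, volume)
print(f"tau = {tau:.2f}")
```

Values near zero, as reported in the study, mean that concordant and discordant pairs occur about equally often.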

    4. Forecast of Tailpipe CO2 Emissions

      As of 2018, the transportation sector in the US accounted for ~28% of the total greenhouse gas (GHG) emissions. While electric vehicles (EVs) are expected to be widely adopted in the near future, it is necessary to reduce the tailpipe emissions from internal combustion (IC) vehicles in the meantime. Thus, the aim of this question was to analyze the trend in the tailpipe CO2 emissions from 1984 until 2020 and forecast the emissions until 2030.

      To perform this analysis, the autoregressive integrated moving average (ARIMA) forecasting method was used. ARIMA models are widely used in time series forecasting. In this method, lagged values of the time series as well as lagged forecast errors are used to predict future values. Since EVs do not have tailpipe CO2 emissions, they were filtered out of the data frame using the filter function. The five makes with the highest vehicle counts in the raw dataset were considered: Chevrolet, Ford, Dodge, GMC, and Toyota. The tailpipe emissions were averaged for each year (1984-2020) using the aggregate function. The yearly averages were then converted into a time series with a frequency of one observation per year using the ts function. The auto.arima function from the forecast package was applied to the time series because it selects the best-fitting ARIMA model automatically, rather than requiring a custom ARIMA specification. Finally, the forecast function was used to predict tailpipe emissions for the next 10 years, and the results were plotted.

      Fig. 15. Tailpipe CO2 Emissions Forecast for Top 5 Manufacturers

      As seen from the plots above, Toyota and Ford show a decreasing trend in tailpipe CO2 emissions, while the forecasts for Chevrolet, Dodge, and GMC stay constant until 2030. Of the makes considered, Toyota has the lowest tailpipe CO2 emissions, with the forecast expected to reach ~230 grams/mile by 2030. It is followed by Ford, which is expected to reach ~350 grams/mile. Chevrolet, Dodge, and GMC are forecasted to have tailpipe CO2 emissions in the range of 450-500 grams/mile by 2030. It should be noted that the dataset only considers new models released each year. Thus, it can be concluded that the new cars released by Toyota are cleaner on average compared to other makes considered in this analysis.
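The aggregation and forecasting pipeline can be illustrated with a deliberately simplified stand-in. The Python sketch below averages emissions by year and then applies a random walk with drift, which is the ARIMA(0,1,0)-with-drift special case. It does not reproduce the automatic model selection of auto.arima, and the records are invented.

```python
from collections import defaultdict
from statistics import mean

def yearly_average(records):
    """Average tailpipe CO2 (g/mile) per model year, ordered by year."""
    by_year = defaultdict(list)
    for year, co2 in records:
        by_year[year].append(co2)
    return {year: mean(values) for year, values in sorted(by_year.items())}

def drift_forecast(series, horizon):
    """Random walk with drift: last value plus h times the mean first difference."""
    # The mean of the first differences telescopes to (last - first) / (n - 1).
    drift = (series[-1] - series[0]) / (len(series) - 1)
    return [series[-1] + h * drift for h in range(1, horizon + 1)]

# Hypothetical (model year, grams of CO2 per mile) records for one make.
records = [(2018, 400), (2018, 420), (2019, 400), (2020, 390)]

yearly = yearly_average(records)
series = list(yearly.values())
forecast = drift_forecast(series, horizon=3)
print(yearly)
print(forecast)
```

A flat forecast, like those reported for Chevrolet, Dodge, and GMC, corresponds to a drift near zero in this simplified model.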

    5. City MPG and Highway MPG values for Gasoline vs Electric Vehicles

      For this question, we wanted to find out whether there is a difference between the city and highway MPG values for gasoline and electric vehicles in the dataset. To analyze this, we use only Toyota vehicles to represent the gasoline vehicles and Tesla vehicles to represent the electric vehicles. The city and highway MPG values are combined into a single ratio (City MPG/Highway MPG) so that both factors are accounted for. The statistical analysis is performed on a random sample of 100 gasoline vehicles and the first 100 electric vehicles. The summary statistics are shown below.

      Table VII. Summary Statistics for Gasoline and EV MPG Values

      From the summary statistics shown in the table above, we notice the mean MPG ratio for electric vehicles is higher than the mean MPG ratio for gasoline vehicles. This is shown in the density chart below

      Fig. 16. Density plot for Gasoline and EVs MPG Ratios

      The plot above confirms that the MPG ratios for electric vehicles are centered at a higher value than those for gasoline vehicles.

      To perform the analysis for this question, we use the F-test and the T-test to check if the two groups have the same variances and if there is a significant difference between gasoline and electric vehicle MPG ratios.

      The F-test is done to check if the two populations have the same variances. The hypotheses for this test are shown below

      • H0: $\sigma^2(\text{Ratio of City MPG/Highway MPG for Gasoline}) = \sigma^2(\text{Ratio of City MPG/Highway MPG for Electric Vehicles})$ [The variances of the two groups are the same]

      • Ha: $\sigma^2(\text{Ratio of City MPG/Highway MPG for Gasoline}) \neq \sigma^2(\text{Ratio of City MPG/Highway MPG for Electric Vehicles})$ [The variances of the two groups are different]

        The T-test is done to check if the two populations have the same means. The hypotheses for this test are shown below

      • H0: $\mu(\text{Ratio of City MPG/Highway MPG for Gasoline}) = \mu(\text{Ratio of City MPG/Highway MPG for Electric Vehicles})$ [The means of the two groups are the same]

      • Ha: $\mu(\text{Ratio of City MPG/Highway MPG for Gasoline}) \neq \mu(\text{Ratio of City MPG/Highway MPG for Electric Vehicles})$ [The means of the two groups are different]

        The results of the F-test and T-test performed are shown below

        Fig. 17. F-Test Results for Gasoline vs EV MPG Ratios

        Fig. 18. T-test Result for Gasoline vs EV MPG Ratios

        The p-value of the F-test is 0.3987, which is greater than the significance level alpha = 0.05. We therefore fail to reject the null hypothesis: there is no significant evidence that the variances of the gasoline and electric vehicle ratios differ. Consequently, we can use the T-test that assumes equal variances. The T-test gives a p-value less than 0.05, from which we conclude that the mean ratios of the two groups are significantly different. Thus, we can also infer that the variability in the City MPG and Highway MPG values for electric vehicles is low compared to that of gasoline vehicles.
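The two-step procedure above, a variance-ratio F-test followed by an equal-variance t-test, can be sketched as follows. This is a Python illustration on invented City/Highway MPG ratios, not the study's R code, and it reports only the test statistics and degrees of freedom, not the p-values.

```python
from math import sqrt
from statistics import mean, variance

def variance_ratio(a, b):
    """Return (F statistic, numerator df, denominator df) for two samples."""
    return variance(a) / variance(b), len(a) - 1, len(b) - 1

def pooled_t(a, b):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df

# Hypothetical City MPG / Highway MPG ratios, not values from the dataset.
gasoline_ratio = [0.70, 0.75, 0.80, 0.72, 0.78]
ev_ratio = [1.05, 1.10, 1.12, 1.08, 1.15]

f_stat, df_num, df_den = variance_ratio(gasoline_ratio, ev_ratio)
t_stat, df = pooled_t(gasoline_ratio, ev_ratio)
print(f"F({df_num}, {df_den}) = {f_stat:.3f}")
print(f"t({df}) = {t_stat:.2f}")
```

An F statistic close to 1 supports pooling the variances, which is the precondition for the equal-variance form of the t-test.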

    6. The Trend of New Vehicle Models Each Year by Fuel Type

      While gasoline has been the most popular fuel when it comes to passenger vehicles, the need to reduce GHG emissions in the transportation sector has led auto makers to diversify their portfolios to include cars that use alternative fuels. The aim of this question was to study the trend of vehicles by fuel types from 1984-2020.

      To perform this analysis, the fuel type of each vehicle was identified. The fuel types included in this analysis were gasoline, hybrid electric vehicle (HEV), battery electric vehicle (BEV), diesel, natural gas, and hybrid. Since the original dataset contained two variables for fuel types, depending on whether the vehicle runs on one or two fuels, the mutate function was used to create a new column that determined the fuel type of the vehicle. In this column, if the fuelType1 variable was gasoline and fuelType2 was empty, the new column would return gasoline.

      Similar code was written for diesel and natural gas. If the fuelType1 variable was gasoline and fuelType2 was electricity, then the result in the new column would return HEV. If the fuelType1 variable was electricity, then the result in the new column would return BEV. If the fuelType1 variable was gasoline and fuelType2 was E85 or propane or natural gas, then the result in the new column would return Hybrid. This was done using the if_else function which checks the logical condition of the inputs.
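The classification rules described above translate naturally into a small function. The Python sketch below is an illustrative counterpart of the mutate and if_else logic; it assumes fuelType1 has already been grouped into values such as "Gasoline" or "Natural Gas", and the "Unclassified" fallback is a hypothetical addition for combinations the text does not cover.

```python
def classify_fuel(fuel_type1, fuel_type2=""):
    """Map the two fuel columns to one category, following the rules in the text."""
    f1 = fuel_type1.strip().lower()
    f2 = (fuel_type2 or "").strip().lower()
    if f1 == "electricity":
        return "BEV"
    if f1 == "gasoline" and f2 == "electricity":
        return "HEV"
    if f1 == "gasoline" and f2 in {"e85", "propane", "natural gas"}:
        return "Hybrid"
    if f1 in {"gasoline", "diesel", "natural gas"} and f2 == "":
        return f1.title()
    return "Unclassified"  # hypothetical fallback, not part of the study

for pair in [("Gasoline", ""), ("Gasoline", "Electricity"),
             ("Electricity", ""), ("Gasoline", "E85")]:
    print(pair, "->", classify_fuel(*pair))
```

Checking the electricity cases first keeps the later single-fuel rule simple, since any remaining vehicle with an empty second fuel is a conventional one.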

      The data for year 2021 was filtered out as it did not have enough data points and thus could have misled the result. The count of vehicles was found using the tally function and each fuel type was filtered and plotted over time to see the trend. A linear regression line was added for gasoline, HEV and BEVs for further analysis.

      Fig. 19. Count of Vehicle Type with Model Year and the Linear Model

      As seen from the plot, vehicles were fueled by either gasoline or diesel until the early 1990s. After 2000, the portfolio of automakers widened in terms of the fuels used. Alternative fuels for vehicles were introduced around this time. From the plot, a steep increase in the number of HEVs and BEVs can also be seen. The regression line shows that there is a steep increase in the new models of HEVs and BEVs while new gasoline vehicles are on the decline. This shows the big picture in the auto industry which suggests a shift towards electrification and the use of cleaner fuels in general.
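The counting and trend-line steps can be sketched as a tally per model year followed by an ordinary least-squares fit. The Python code below is an illustration with invented records; the study performed these steps in R with the tally function and a linear regression line on the plot.

```python
from collections import Counter

def count_by_year(records, fuel):
    """Return sorted (year, count) pairs for one fuel category."""
    counts = Counter(year for year, category in records if category == fuel)
    return sorted(counts.items())

def least_squares(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (model year, fuel category) records, not from the dataset.
records = [
    (2017, "BEV"), (2018, "BEV"), (2018, "BEV"),
    (2019, "BEV"), (2019, "BEV"), (2019, "BEV"),
    (2019, "Gasoline"),
]
bev_counts = count_by_year(records, "BEV")
slope, intercept = least_squares(bev_counts)
print(bev_counts)
print(f"slope = {slope:.1f} new models per year")
```

A positive slope corresponds to the rising HEV and BEV lines in the figure, and a negative slope to the declining gasoline line.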

    7. Groups based on Average MPG and Fuel Savings

    This analysis was performed using k-means clustering, which is a machine learning method. K-means is a centroid-based clustering method that groups data points based on their distances to cluster centers. Tableau Desktop was used to perform this analysis due to its ease of use and strong data visualization features. The variable youSaveSpend was placed on the Columns shelf and comb08 on the Rows shelf, and the averages of both were taken. The vehicle make was placed on Label and Detail under the Marks section. This showed the vehicle makes based on their average MPG and average fuel savings. Finally, a cluster was added to this model from the Analytics tab. Reference lines showing the averages of both variables were added to see which vehicle makes are above or below the average.

    Note: The youSaveSpend variable stands for 5-year savings/spending compared to an average car. Negative values indicate money spent as opposed to money saved.
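For intuition about what the clustering step does, here is a minimal k-means sketch (Lloyd's algorithm) in Python. It is not Tableau's implementation: seeding with the first k points, the absence of any variable scaling, and the sample (average savings, average MPG) pairs are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm on 2-D points; the first k points seed the centroids."""
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2
                + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids

# Hypothetical (average 5-year savings in $, average MPG) per make.
# No scaling is applied here, so the savings axis dominates the distance.
points = [(-9000, 14), (500, 27), (5000, 100), (-8500, 15), (800, 29)]
labels, centroids = kmeans(points, k=3)
print(labels)
print(centroids)
```

In practice the two variables would usually be put on a comparable scale first, since otherwise the one with the larger numeric range drives the grouping.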

    Table VIII. Summary Diagnostics for Cluster Analysis (fields: Number of Clusters, Number of Points, Between-group Sum of Squares, Within-group Sum of Squares, Total Sum of Squares)

    Table IX. Cluster Analysis Results (columns: Number of Items, Average MPG, Average Savings ($); rows: Cluster 1 through Cluster 6 and Not Clustered)

    Fig. 20. K-Means Clustering for Avg. MPG and Savings by Vehicle Make

    Six clusters were formed to group distinct characteristics of the data, which also helped identify some of the outliers. Cluster 1 consists of only 2 vehicle makes and represents the makes with the lowest average MPG and the lowest savings. On the other hand, cluster 6 consists of only 1 make, Tesla, and represents the vehicle make with the highest MPG as well as the most savings over 5 years. Cluster 5 includes makes which mainly have EVs in their portfolio. Cluster 3 has some of the popular makes such as GM, Fiat, Nissan, etc., which have decent mileage and are not very expensive to maintain either.

    Cluster 2 includes makes such as BMW, Porsche, Audi, etc., which have some high-performance cars in their portfolio. Cluster 4 includes makes such as Lamborghini, Pagani, Bentley, etc., which have low-mileage cars due to their high-horsepower engines and are also expensive to maintain on average.


    This study has meticulously analyzed the EPA fuel economy dataset to discern key trends and changes within the transportation sector over an extended period. The data, notably clean with minimal inconsistencies, facilitated a robust analysis of fuel economy trends, primarily focusing on miles-per-gallon (MPG) distributions and tailpipe CO2 emissions.

    Our findings indicate that MPG values follow an approximately normal distribution with a right skew caused by outliers, namely electric vehicles, which exhibit exceptionally high MPG. These outliers were selectively filtered from the analysis to maintain focus on internal combustion engine vehicles. Statistical and machine learning models were employed to test hypotheses and yielded statistically significant results, underpinning the robustness of our analytical methodologies.

    Key insights from the study include:

    • Vehicle Performance: Vehicles equipped with turbochargers generally demonstrated better fuel economy compared to those with superchargers.

    • Class Comparison: Fuel economy varied significantly across different vehicle classes, inversely related to their load-carrying capacities. Passenger cars showed higher MPG compared to heavier vehicles like vans and trucks.

    • Engine and Volume Correlation: There was a negligible correlation between engine displacement and vehicle volume, with a slightly negative trend.

    • Fuel Type Trends: The analysis of fuel types showed that electric vehicles consistently outperformed others in terms of fuel economy, aligned with global shifts towards more sustainable vehicle technologies.

    Additionally, the forecast for tailpipe CO2 emissions indicated a declining trend for manufacturers like Toyota and Ford, highlighting industry-leading practices in emissions reduction.

    However, some manufacturers like Chevrolet, Dodge, and GMC showed little to no reduction, pointing to areas where further improvements are necessary.

    The transportation sector's shift towards Battery Electric Vehicles (BEVs) and Hybrid Electric Vehicles (HEVs) represents a promising trend towards reducing greenhouse gas emissions. This shift is further evidenced by the growing proportion of these vehicles each year, contrasting with a decline in conventional gasoline vehicles.

    Lastly, our cluster analysis revealed that electric vehicles, particularly those from manufacturers like Tesla, not only offer superior fuel economy but also present considerable savings over five years compared to average internal combustion engine cars.


  1. US EPA, Office of Air and Radiation. (2015, December 29). "Sources of Greenhouse Gas Emissions." Overviews and Factsheets, US EPA. Available: emissions

  2. U.S. Department of Energy. (n.d.). "Fuel Economy Web Services." Retrieved December 11, 2020, from

  3. Selva, P. (2019, February 18). "ARIMA Model - Complete Guide to Time Series Forecasting in Python." ML+. Available: series-forecasting-python/

  4. Kassambara, A. (n.d.). "Correlation Analyses in R." Easy Guides - Wiki - STHDA. Retrieved December 11, 2020, from

  5. Kassambara, A. (n.d.). "Correlation Test Between Two Variables in R." Easy Guides - Wiki - STHDA. Retrieved December 11, 2020, from variables-in-r

  6. Kassambara, A. (n.d.). "F-Test: Compare Two Variances in R." Easy Guides - Wiki - STHDA. Retrieved December 11, 2020, from

  7. Kassambara, A. (n.d.). "ggpubr: Publication Ready Plots." Easy Guides - Articles - STHDA. Retrieved December 11, 2020, from plots/

  8. Kassambara, A. (n.d.). "Unpaired Two-Samples T-test in R." Easy Guides - Wiki - STHDA. Retrieved December 11, 2020, from