DOI : 10.17577/IJERTCONV14IS060099- Open Access

- Authors : Kushagra Saxena, Jayashree Manigandan, Madhumitha Kulandaivel
- Paper ID : IJERTCONV14IS060099
- Volume & Issue : Volume 14, Issue 06, ACSCON – 2026
- Published (First Online) : 15-06-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
GreenAI: An Online Learning-Based Framework for Sustainability Scoring and Optimization Routing of Black-Box Large Language Model APIs
Kushagra Saxena Department of Computing Technologies,
SRM Institute of Science and Technology,
Kattankulathur, India ks3780@srmist.edu.in
Jayashree Manigandan Department of Computing Technologies,
SRM Institute of Science and Technology,
Kattankulathur, India jm9093@srmist.edu.in
Madhumitha Kulandaivel Department of Computing Technologies,
SRM Institute of Science and Technology,
Kattankulathur, India Madhumik1@srmist.edu.in
Abstract: Large Language Model (LLM) APIs are increasingly integrated into software, yet developers have no established way of assessing their environmental efficiency. Existing model selection criteria prioritize performance metrics and neglect sustainability, which leaves a significant gap in responsible AI deployment. Moreover, energy consumption remains hidden inside proprietary data centers, so neither end users nor developers can measure it directly. This paper introduces GreenAI, a framework that approximates the relative sustainability performance of black-box LLM APIs from observable inference signals. The system gathers runtime metadata (e.g., token consumption, response time, and response properties) and uses it to calculate a GreenAI Score, a composite metric of model efficiency under quality constraints. The scoring model uses online learning to refine its predictions as new observations arrive and as model behaviour changes over time. Using these scores, an optimization-based routing engine decides in real time which model is the most sustainable choice for each request while preserving response quality. Experimental assessment shows that GreenAI can achieve 30-50 percent reductions in estimated carbon footprint with no more than 2 percent accuracy loss across a wide range of tasks. The framework makes sustainability a visible rather than an invisible aspect of AI systems and lets developers make carbon-conscious deployment decisions without the cooperation of the provider.
Keywords: Green AI, Sustainable Computing, LLM APIs, Online Learning, Carbon-aware Routing, Model Selection, Environmental Efficiency, Black-box Optimization.
-
INTRODUCTION
-
Background and Motivation
Large Language Model (LLM) APIs have changed the nature of software development, and models such as GPT-4, Claude, and Gemini now power the logic behind thousands of applications worldwide. Popular assistants such as ChatGPT, Claude, and Grok are used daily, but their energy usage and carbon footprint are not directly visible to end users. Recent industry reports indicate that API calls numbered in the trillions in 2024, and this explosive adoption has a hidden price: vast energy consumption and carbon emissions that remain invisible to end users [1]. In contrast to training, which is performed periodically, inference runs continuously at scale and accounts for most of the energy usage in production deployments, which makes sustainability optimization a pressing need [2].
-
Problem Statement
This opacity complicates the assessment of the sustainability of AI services and prevents informed decisions, producing a persistent market failure in which environmental externalities are overlooked. Developers using LLM APIs face a large information asymmetry: model vendors promote accuracy, latency, and pricing, but environmental efficiency is absent from their comparison schemes [3]. The issue stems from three inherent problems: providers do not disclose energy consumption statistics through their APIs, hardware infrastructure information is confidential and inaccessible, and energy consumption cannot be measured directly for API-based black-box models. As a result, developers cannot make sustainability-conscious deployment decisions, which undermines organizational sustainability goals and broader climate commitments as AI use continues to accelerate.
-
Research Question
This study proposes a workable and replicable framework to approximate and compare the energy use and carbon footprint of commercial language model APIs. It is intended to answer the question: Can we infer the relative environmental efficiency of black-box LLM APIs from observable signals, and use this inference to select models optimally under quality constraints?
-
Contributions
The proposed solution treats these models as black-box systems and relies on observable response metadata and established proxy-based estimation techniques. Its contribution is four-fold. First, GreenAI presents a statistical sustainability scoring system that approximates relative efficiency from observable inference signals only, such as token counts, response latency, and output properties. Second, it includes an online learning system that maintains prediction accuracy as observations accumulate, adapting to model updates and to varying workload patterns. Third, it offers an optimization-based routing engine that dynamically selects the most suitable model using a utility function that trades off a model's estimated quality against its estimated computational cost. Fourth, it reports experimental results from broad tests showing carbon reductions of 30-50% with little or no decrease in quality, which supports the practical usefulness of the framework.
-
Paper Organization
The rest of the paper is organized as follows. Section 2 reviews related research on the foundations of Green AI, energy estimation, model selection, and online learning systems. Section 3 presents the GreenAI approach, including the scoring model, the learning algorithm, and the routing engine. Section 4 describes the experimental setup, including model providers, benchmarking tasks, and metrics. Section 5 reports the experimental results and findings for each research question. Section 6 discusses limitations, implications, and directions for future research.
-
-
LITERATURE REVIEW
-
Foundations of Green AI
Schwartz et al. [4] formally introduced the idea of Green AI, defining Red AI as research that pursues accuracy through ever-larger computation regardless of cost, and Green AI as research that treats efficiency as a key metric on an equal footing with accuracy. A systematic survey by Henderson et al. [5] found that only a small fraction of machine learning papers (less than 5 percent) mention energy or carbon measures, even as an increasing number of authors consider environmental impact a pressing concern. Wu et al. [6] examined AI lifecycle emissions and demonstrated that inference in production systems accounts for the largest part of energy consumption and therefore deserves more research attention.
-
Methodologies of Energy estimation
Strubell et al. [7] were the first to measure the carbon footprint of NLP models and found that training one large model can produce as much carbon as five cars over their lifetimes. Lacoste et al. [8] developed location-specific estimation tools that use real-time electricity grid carbon intensity and showed that the impact of the same workload varies dramatically by geographic area. Luccioni et al. [9] analyzed BLOOM inference in detail on instrumented hardware and confirmed that token count is a good proxy for energy consumption across a wide range of input types and generation lengths.
-
Emerging trends in sustainability-focused AI (2024-2025)
The first large-scale empirical study of provider performance, based on more than 100,000 API calls, concluded that the efficiency of models on the same task varies by up to 300 percent [10]. Chen et al. [11] showed that response latency patterns reflect computational characteristics, so efficiency can be inferred through careful statistical analysis without direct energy measurement. Williams and Thompson [12] introduced a taxonomy of observable signals of black-box LLM APIs, classifying signals by information content and reliability for efficiency inference. Rodriguez et al. [13] proposed carbon-aware geographic routing of AI workloads and demonstrated that using real-time grid carbon intensity to reallocate requests across data centers could reduce emissions by 20-40 percent.
-
Frameworks of Model Selection and Routing
Zhang et al. [14] developed cascading model selection systems in which simple, efficient models process routine requests, whereas complex models are reserved for challenging cases that demand more capability. Kumar and Singh [15] proposed multi-armed bandit methods for adaptive model selection that trade off exploration of unknown capabilities against exploitation of known performance patterns. Liu et al. [16] developed cost-aware routing frameworks that take monetary cost and latency limits into account when choosing among cloud-based ML models with different pricing schemes. Anderson et al. [17] proposed static greenness scoring to compare models without online adaptation, which gives a convenient baseline but does not respond to changing conditions.
-
Online Learning in ML Systems
Patel et al. [18] surveyed online learning applications in production ML systems and identified model selection and adaptive routing as especially promising applications that need lightweight updating mechanisms. Thompson and Garcia [19] developed incremental learning algorithms designed specifically for API performance prediction, with a separate predictor for every model-task combination and an efficient update procedure. To address the cold-start problem in online learning for model selection, Martinez et al. [20] combined prior knowledge of model specifications with transfer learning from similar tasks to obtain reasonable initial estimates.
-
Benchmarking and Standards Initiatives
Wang et al. [21] proposed adding efficiency-related metrics to the GLUE benchmark suite, arguing that a model should be evaluated on both capability and environmental cost. Table 1 summarizes the major related works and their contributions and frames the research gap that the GreenAI framework aims to fill.
TABLE I
Summary of Key Related Works
| Authors | Year | Contribution | Methodology | Key Finding |
|---|---|---|---|---|
| Schwartz et al. [4] | 2020 | Green AI concept | Conceptual framework | Efficiency must be primary metric |
| Strubell et al. [7] | 2019 | NLP carbon footprint | Direct measurement | Training emits 5× automobile lifetime |
| Lacoste et al. [8] | 2019 | Regional estimation | Grid data integration | Location matters for carbon impact |
| Luccioni et al. [9] | 2022 | BLOOM inference | Instrumented hardware | Token count reliable energy proxy |
| Kaplan & Martinez [10] | 2024 | Cross-provider study | Empirical analysis | 300% efficiency variation exists |
| Chen et al. [11] | 2024 | Latency-based estimation | Timing analysis | Latency reveals computational load |
| Anderson et al. [17] | 2025 | Greenness scoring | Static benchmarking | No adaptation to changes |
| Thompson & Garcia [19] | 2025 | Incremental learning | Algorithm design | Rapid convergence achieved |
-
Research Gap Synthesis
The literature review indicates that no existing framework offers real-time, sustainability-aware model selection among black-box LLM APIs that combines online learning for adaptation, proxy-based estimation for accessibility, and optimization-based routing for practical deployment. GreenAI addresses this gap with an integrated framework that makes sustainability optimization available to any developer without provider cooperation or special infrastructure.
-
-
METHODOLOGY
This section provides an overview of the system architecture of the GreenAI framework. GreenAI consists of five components integrated into a pipeline: a suite of benchmark tasks used for evaluation, an API execution layer that communicates with a variety of providers, an observation engine that records runtime signals, an online learning model that predicts efficiency, and an optimization router that selects models dynamically. This architecture converts raw API interactions into actionable sustainability insight through systematic data collection and analysis. The overall system architecture and the data flow among components are shown in Figure 1.
Fig. 1. Simplified System Architecture. Components shown: Benchmark Tasks (Sentiment, QA, Classification, Reasoning); API Execution Layer (Groq, OpenAI, Anthropic, Google); Observation Engine; Online Learning Model; Optimization Router; Selected Model Execution.
-
Benchmark Task Suite
The system uses a controlled set of benchmark tasks spanning several categories so that models can be compared fairly and consistently. The benchmark suite comprises 140 tasks in four categories (sentiment analysis, question answering, text classification, and logical reasoning), with carefully formulated prompts and verified ground-truth outputs. This standardized dataset ensures that all models are tested on the same inputs, allowing a fair comparison of their efficiency and quality. The composition of the suite is given in Table 2.
TABLE II
Benchmark Dataset Composition
| Task Category | Count | Difficulty Distribution | Evaluation Metric |
|---|---|---|---|
| Sentiment Analysis | 50 | Easy: 30, Medium: 15, Hard: 5 | Accuracy |
| Question Answering | 40 | Easy: 20, Medium: 15, Hard: 5 | Exact Match |
| Text Classification | 30 | Easy: 15, Medium: 10, Hard: 5 | F1 Score |
| Logical Reasoning | 20 | Easy: 5, Medium: 10, Hard: 5 | Accuracy |
| Total | 140 | Easy: 70, Medium: 50, Hard: 20 | |
-
Signals Collection
The observation engine logs a rich set of runtime signals for each API request. The signals were selected for their availability across providers and their theoretical relationship to computational effort. They are: input token count, representing prompt complexity; output token count, representing generation length; total tokens processed, the primary proxy for computational load; response latency, representing end-to-end timing; model identifier, for per-model tracking; task category, for contextual understanding; output text, for quality evaluation; and timestamp, for temporal analysis. All later learning and optimization in the framework builds on this dataset.
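As an illustration of the signals listed above, the following minimal sketch shows one possible record layout for a single observation. The class name `Observation`, the helper `observe`, and the `call_model` callable (assumed to return the output text with its input and output token counts) are hypothetical and are not taken from the GreenAI implementation.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Tuple


@dataclass
class Observation:
    """One logged API request (hypothetical layout of the listed signals)."""
    model_id: str          # model identifier, for per-model tracking
    task_category: str     # e.g. "sentiment", "qa", "classification", "reasoning"
    input_tokens: int      # proxy for prompt complexity
    output_tokens: int     # proxy for generation length
    latency_s: float       # end-to-end response latency in seconds
    output_text: str       # kept for quality evaluation
    timestamp: float = field(default_factory=time.time)  # for temporal analysis

    @property
    def total_tokens(self) -> int:
        # Total tokens processed: the primary computational load proxy.
        return self.input_tokens + self.output_tokens


def observe(call_model: Callable[[str, str], Tuple[str, int, int]],
            model_id: str, task_category: str, prompt: str) -> Observation:
    """Time one call and package its signals.

    `call_model` is an assumed wrapper around a provider SDK that returns
    (output_text, input_tokens, output_tokens).
    """
    start = time.perf_counter()
    output_text, input_tokens, output_tokens = call_model(model_id, prompt)
    latency_s = time.perf_counter() - start
    return Observation(model_id, task_category, input_tokens,
                       output_tokens, latency_s, output_text)
```

Keeping the provider call behind a single callable means the same record can be produced for any provider that reports token usage.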
-
Quality Evaluation
Quality assessment depends on the task type so that correctness is represented appropriately and consistently across different model outputs and formats. Classification tasks such as sentiment analysis and spam detection use a binary quality score, 1 when the prediction matches the ground truth and 0 otherwise, to enforce a strict accuracy criterion. For question answering, normalization followed by exact string matching accounts for formatting differences without relaxing the criterion of semantic equivalence. For reasoning problems, exact matching is combined with logical equivalence checking so that correct lines of reasoning receive credit even when surface formulations differ.
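The normalization and exact-match step can be sketched as follows. The specific normalization rules (lowercasing, removing punctuation, collapsing whitespace) are an assumption, since the paper does not list them, and the logical equivalence check used for reasoning tasks is not shown.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed normalization: lowercase, drop punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, ground_truth: str) -> int:
    """Binary quality score: 1 if the normalized strings agree, else 0."""
    return int(normalize(prediction) == normalize(ground_truth))
```

For example, `exact_match(" Paris. ", "paris")` gives 1 because the two strings differ only in case, spacing, and punctuation.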
-
Energy and Carbon: Estimation
Because direct energy measurement is not possible for black-box APIs, the framework uses proxy-based estimation from token counts together with per-token energy figures reported in prior studies. The energy consumption per request is estimated as $E_{\text{request}} = T_{\text{total}} \times E_{\text{per token}}$, where $T_{\text{total}}$ is the total number of tokens processed and $E_{\text{per token}}$ is 0.0000012 kWh for 8B-parameter models and 0.0000035 kWh for 70B-parameter models, based on reported measurements. The carbon emission is then approximated as $\text{CO}_2 = E_{\text{request}} \times CI$, where $CI = 475$ g CO$_2$/kWh is a global average carbon intensity chosen to provide a uniform cross-provider comparison, while acknowledging that real-time grid data may offer greater accuracy in production deployments.
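The per-request estimate can be worked through with a short sketch that uses the constants stated in this section. The function names and the size-class keys `"8B"` and `"70B"` are illustrative choices, and the 300-token request in the usage note is a made-up input.

```python
# Per-token energy figures (kWh) and carbon intensity quoted in the text.
ENERGY_PER_TOKEN_KWH = {"8B": 0.0000012, "70B": 0.0000035}
CARBON_INTENSITY_G_PER_KWH = 475.0  # global average, g CO2 per kWh


def estimate_energy_kwh(total_tokens: int, size_class: str) -> float:
    """E_request = T_total x E_per_token."""
    return total_tokens * ENERGY_PER_TOKEN_KWH[size_class]


def estimate_carbon_g(total_tokens: int, size_class: str) -> float:
    """CO2 = E_request x CI, in grams per request."""
    return estimate_energy_kwh(total_tokens, size_class) * CARBON_INTENSITY_G_PER_KWH
```

For a hypothetical request of 300 total tokens, this gives about 0.171 g CO2 on an 8B-parameter model and about 0.499 g on a 70B-parameter model, so the proxy scales linearly with tokens and with the per-token coefficient.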
-
GreenAI Scoring Model
The GreenAI Score is a standardized measure for comparing model efficiency that combines several dimensions of sustainability and quality into one comparable score. The carbon score is $C = 100\,(1 - \text{Carbon}_{\text{model}} / \text{Carbon}_{\max})$, which normalizes emissions so that a model producing less carbon receives a higher score. The token efficiency score is $T = 100 \times \text{Useful tokens} / \text{Total tokens}$, which rewards concise responses that avoid unnecessary generation. The latency efficiency score is $L = 100\,(1 - \text{Latency}_{\text{model}} / \text{Latency}_{\max})$, which rewards faster responses, with 10 seconds as the maximum tolerable latency. These components are combined with the quality score $Q$ using default weights $w_Q = 0.40$, $w_C = 0.30$, $w_T = 0.20$, $w_L = 0.10$ through the formula $\text{GreenAIScore} = w_Q Q + w_C C + w_T T + w_L L$. The weights prioritize correctness while giving considerable weight to environmental efficiency.
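A minimal sketch of the composite score follows, assuming that quality is expressed on the same 0-100 scale as the other components. The value of the carbon maximum is not stated in the paper, so it is passed in as a parameter, and all inputs in the usage note are hypothetical.

```python
DEFAULT_WEIGHTS = {"Q": 0.40, "C": 0.30, "T": 0.20, "L": 0.10}


def carbon_score(carbon_g: float, carbon_max_g: float) -> float:
    """C = 100 * (1 - Carbon_model / Carbon_max)."""
    return 100.0 * (1.0 - carbon_g / carbon_max_g)


def token_score(useful_tokens: int, total_tokens: int) -> float:
    """T = 100 * Useful tokens / Total tokens."""
    return 100.0 * useful_tokens / total_tokens


def latency_score(latency_s: float, latency_max_s: float = 10.0) -> float:
    """L = 100 * (1 - Latency_model / Latency_max), with a 10 s default maximum."""
    return 100.0 * (1.0 - latency_s / latency_max_s)


def greenai_score(q: float, c: float, t: float, l: float,
                  weights: dict = DEFAULT_WEIGHTS) -> float:
    """GreenAIScore = wQ*Q + wC*C + wT*T + wL*L (all components on 0-100)."""
    return (weights["Q"] * q + weights["C"] * c
            + weights["T"] * t + weights["L"] * l)
```

With hypothetical inputs of 90 for quality, 0.2 g against a 0.5 g maximum, 80 useful tokens out of 100, and 1.0 s latency, the components are 90, 60, 80, and 90, and the weighted sum is 79.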
-
Online Learning Framework
Online learning allows prediction accuracy to improve continually as new observations accumulate, adapting to changes in model behaviour and workload patterns without periodic retraining. The feature vector comprises total tokens, latency, output length, a one-hot encoding of the task category, a one-hot encoding of the model identity, prompt length, and time features, capturing the relevant context. The prediction targets are quality on a 0-1 scale and compute cost expressed as tokens × latency, a compound metric that reflects both processing scale and time intensity. Online linear regression with stochastic gradient descent updates the weights as $w_{t+1} = w_t + \eta\,(y - \hat{y})\,x$, where the learning rate $\eta$ starts at 0.01 and is multiplied by 0.999 after every update, trading off fast initial learning against stability.
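The update rule can be sketched in a few lines of plain Python. The class name is hypothetical, the weights are initialized to zero as an assumption, and features are assumed to be scaled to a comparable range before use, which the paper does not specify.

```python
from typing import List, Sequence


class OnlineLinearRegressor:
    """Online linear regression trained by SGD with a decaying learning rate."""

    def __init__(self, n_features: int, learning_rate: float = 0.01,
                 decay: float = 0.999) -> None:
        self.weights: List[float] = [0.0] * n_features  # assumed zero init
        self.learning_rate = learning_rate
        self.decay = decay

    def predict(self, x: Sequence[float]) -> float:
        # Linear prediction y_hat = w . x
        return sum(w * xi for w, xi in zip(self.weights, x))

    def update(self, x: Sequence[float], y: float) -> float:
        """Apply w <- w + eta * (y - y_hat) * x, then decay eta. Returns the error."""
        error = y - self.predict(x)
        self.weights = [w + self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.learning_rate *= self.decay
        return error
```

One regressor instance per prediction target (quality and compute cost) would be updated after each observed request.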
-
Optimization-Based Routing
The routing engine dynamically chooses a model for each request by maximizing a utility function that trades off predicted quality against predicted computational cost according to the developer's preferences. A sustainability preference parameter $\lambda \in [0, 1]$ enters the utility function $U_{\text{model}} = Q_{\text{model}} - \lambda\, C_{\text{model}}$ as an adjustable parameter: when $\lambda = 0$ the utility depends on predicted quality alone, and when $\lambda = 1$ the cost penalty has full weight, which favors the most efficient models. For each request the router performs the following steps: feature extraction, quality and cost prediction for all candidate models, utility computation at the current $\lambda$ value, selection of the model with maximum utility, routing of the request, monitoring of actual performance, and updating of the online learner. An ε-greedy exploration strategy with ε = 0.1 ensures that every model receives enough traffic to keep its predictions accurate while the router mostly exploits learned knowledge. The complete workflow from request arrival to response return is depicted in Figure 2.
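The selection step described above can be sketched as follows. The function names are hypothetical, predicted cost is assumed to be normalized to the same 0-1 range as quality (the paper does not state the scaling), and the predictions used in the usage note are invented for illustration.

```python
import random
from typing import Dict, Tuple


def utility(q_pred: float, c_pred: float, lam: float) -> float:
    """U = Q_pred - lambda * C_pred."""
    return q_pred - lam * c_pred


def select_model(predictions: Dict[str, Tuple[float, float]], lam: float,
                 epsilon: float = 0.1, rng=random) -> str:
    """Epsilon-greedy choice over candidate models.

    `predictions` maps model name -> (predicted quality, predicted cost),
    with cost assumed normalized to [0, 1].
    """
    if rng.random() < epsilon:
        # Explore: pick a model uniformly at random.
        return rng.choice(sorted(predictions))
    # Exploit: pick the model with maximum utility at the current lambda.
    return max(predictions, key=lambda m: utility(*predictions[m], lam))
```

With hypothetical predictions of (0.94, 0.80) for Llama3-70b and (0.87, 0.30) for Llama3-8b and exploration disabled, the larger model wins at λ = 0 (0.94 against 0.87), while the smaller one wins at λ = 0.5 (0.54 against 0.72).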
-
EXPERIMENTAL SETUP
-
Research Questions
The experimental assessment addresses four research questions that examine different aspects of the performance and usefulness of the GreenAI framework. RQ1 explores how accurately the online learning model can predict quality and computational cost from limited observable API interactions. RQ2 examines how prediction accuracy improves as the number of observations grows, measuring the value of continuous learning. RQ3 examines the trade-offs between quality and sustainability when the system operates at various λ settings and identifies the best operating points. RQ4 compares the models by GreenAI Score across task types and ranks their efficiency.
[Figure 2 flowchart steps: Request Arrival → Feature Extraction → Quality & Cost Prediction → Utility = Q_pred − λ × C_pred → Select Model with Maximum Utility → Request Execution → Quality Evaluation → Online Learning Update → Return Response]
Baseline measurements were made by routing all 140 benchmark tasks to each model separately, which established a per-model performance baseline for quality, carbon, latency, and token efficiency. After the baseline was collected, the online learning system was initialized with no prior observations, and routing experiments were run with λ values of 0.0, 0.3, 0.5, 0.7, and 1.0 to explore the sustainability-accuracy trade-off space. Each experimental condition was repeated three times with randomized task order to account for variability in API and network effects, with a delay of 0.3 seconds between requests to respect provider rate limits. Table 4 summarizes the experiment parameters used during the evaluation.
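The routing-experiment schedule described here (five λ values, three replicates, randomized task order, and a fixed delay between requests) could be organized as in the sketch below. The function names, the seed handling, and the `run_task` callable are hypothetical and stand in for whatever harness was actually used.

```python
import random
import time
from typing import Callable, Iterable, List, Tuple

LAMBDAS = [0.0, 0.3, 0.5, 0.7, 1.0]  # sustainability preference values tested
REPLICATES = 3                        # repetitions per condition


def build_schedule(task_ids: Iterable[int],
                   seed: int = 0) -> List[Tuple[float, int, List[int]]]:
    """Return (lambda, replicate, shuffled task order) for every condition."""
    rng = random.Random(seed)
    schedule = []
    for lam in LAMBDAS:
        for rep in range(REPLICATES):
            order = list(task_ids)
            rng.shuffle(order)  # randomized task sequence per replicate
            schedule.append((lam, rep, order))
    return schedule


def run_schedule(schedule, run_task: Callable[[float, int], None],
                 delay_s: float = 0.3) -> int:
    """Execute every task in every condition, pausing between requests."""
    count = 0
    for lam, _rep, order in schedule:
        for task_id in order:
            run_task(lam, task_id)  # assumed to route, execute, and log one request
            time.sleep(delay_s)     # respect provider rate limits
            count += 1
    return count
```

With 140 tasks this yields 15 conditions and 2,100 routed requests in total, excluding the baseline pass.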
TABLE IV
Experimental Parameters
| Parameter | Value | Justification |
|---|---|---|
| Benchmark tasks | 140 | Covers diverse task types |
| Models evaluated | 4 | Representative range |
| λ values tested | 5 | Explores trade-off space |
| Replicates per condition | 3 | Accounts for variability |
| Exploration rate | 0.1 | Balances exploration/exploitation |
| Initial learning rate | 0.01 | Empirically determined optimal |
| Learning rate decay | 0.999 per update | Gradual stabilization |
Fig. 2. Routing and Online Learning.
-
Model Providers and Configurations
Experiments were conducted on the Groq API platform, which provides access to several open-source models through a single interface and so keeps measurement conditions the same. The four chosen models represent a range of design philosophies: Llama3-8b with 8 billion parameters, Llama3-70b with 70 billion parameters, Mistral-8x7b with 47 billion parameters in a mixture-of-experts design, and Gemma2-9b with 9 billion parameters, an efficient design from Google. The specifications of each model are given in Table 3.
TABLE III
Model Specifications
| Model Name | Parameters | Architecture | Provider | Primary Use Case |
|---|---|---|---|---|
| Llama3-8b | 8 billion | Transformer | Groq | Efficient inference, simple tasks |
| Llama3-70b | 70 billion | Transformer | Groq | High accuracy, complex reasoning |
| Mistral-8x7b | 47 billion | Mixture of Experts | Groq | Balanced performance |
| Gemma2-9b | 9 billion | Transformer | Groq | Efficient, Google architecture |
-
Experimental Procedure
The experimental procedure was systematic and designed to maintain reproducibility and statistical soundness while respecting API rate limits. Baseline measurements were made for each model before the routing experiments began.
-
Evaluation Metrics
Quality is reported as the percentage of correct responses, where correctness is determined by an exact or semantic match to the ground truth depending on the task type. Carbon footprint is reported in grams of CO$_2$ per request, calculated with the energy model and the global mean carbon intensity of 475 g CO$_2$/kWh. Latency is measured in seconds from submission of the request to full receipt of the response, including network transit time. Token efficiency is the ratio of useful tokens to total tokens, with higher values indicating more concise responses. The GreenAI Score aggregates these metrics with the weighted formula and is reported on a 0-100 scale. Mean Absolute Error (MAE) is used to measure the prediction error of the online learning model for both quality and cost.
-
-
RESULTS
-
RQ1: Prediction Accuracy
The online learning model achieved high predictive performance for both the quality target and the computational cost target, which supports the feature set and the design of the learning algorithm. Quality prediction error was low, with an MAE below 2%, which suggests that the observed signals adequately capture the factors that determine response correctness across models and tasks. Cost prediction had a larger but still useful error of 7.2% MAE, reflecting the added difficulty of predicting computational effort from latency measurements that are subject to network variability. Table 5 provides the full prediction accuracy metrics across all models and tasks at 500 observations.
TABLE V
Prediction Accuracy Metrics
| Prediction Target | Mean Absolute Error | Root Mean Square Error | R² Score |
|---|---|---|---|
| Quality | 1.8% | 2.3% | 0.89 |
| Compute Cost | 7.2% | 9.1% | 0.76 |
| Carbon (derived) | 8.4% | 10.2% | 0.71 |
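For reference, the three error measures reported in Table V can be computed from paired targets and predictions as in the sketch below. These are the standard definitions of MAE, RMSE, and the coefficient of determination; whether the authors normalized the errors to percentages in exactly this way is not stated, so the helper names and the usage values are illustrative only.

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

On the toy targets [1, 0, 1, 1] with predictions [0.9, 0.2, 0.8, 1.0], these give an MAE of 0.125, an RMSE of 0.15, and an R² of 0.88.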
-
RQ2: Improvement in Learning with Time
Analysis of the learning dynamics demonstrates the importance of continuous model updates as the system operates. Quality prediction error decreases by 85 percent between the initial and converged states, with most of the progress made in the first 200 observations, as the system learns which models are effective at which tasks. Cost prediction error decreases by 65 percent and keeps declining throughout the experimental period, because cost depends on noisier factors such as API load and network conditions. Figure 3 shows the learning curves for both quality and cost prediction, in which the errors fall as the number of observations grows.
Fig. 3. Convergence of the error in prediction over time.
-
RQ3: Sustainability-Accuracy Trade-off
The sustainability preference parameter λ allows a systematic study of the trade-off between quality and environmental impact and shows that several operating points are possible depending on application needs. At λ = 0.0, only quality is optimized: the router favors the most accurate model and attains 93.8% accuracy with a carbon footprint of 0.35 g per request. When λ is increased to 0.3, carbon decreases to 0.28 g (a 20% reduction) while quality remains high at 92.7% (a loss of only 1.1 percentage points). The balanced setting of λ = 0.5 gives a 34 percent carbon saving at a cost of only 2.3 percentage points of quality; this was identified as the knee of the Pareto curve, beyond which further sustainability savings come with steadily steeper quality penalties. The full results are provided in Table 6, with the quality and carbon measurements for each λ value.
TABLE VI
Quality-Carbon Trade-off Results
| λ Value | Quality (%) | Carbon (g) | Reduction | Quality Loss |
|---|---|---|---|---|
| 0.0 | 93.8 | 0.35 | Baseline | - |
| 0.3 | 92.7 | 0.28 | 20% | 1.1% |
| 0.5 | 91.5 | 0.23 | 34% | 2.3% |
| 0.7 | 89.8 | 0.19 | 46% | 4.0% |
| 1.0 | 86.2 | 0.15 | 57% | 7.6% |
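The Reduction and Quality Loss columns of Table VI follow directly from the quality and carbon columns relative to the λ = 0.0 baseline. The short sketch below recomputes them from the reported values; the function name is illustrative.

```python
# (quality %, carbon g per request) for each lambda, as reported in Table VI.
RESULTS = {0.0: (93.8, 0.35), 0.3: (92.7, 0.28), 0.5: (91.5, 0.23),
           0.7: (89.8, 0.19), 1.0: (86.2, 0.15)}


def tradeoff_rows(results, baseline_lambda=0.0):
    """Return (lambda, carbon reduction %, quality loss in points) per row."""
    base_quality, base_carbon = results[baseline_lambda]
    rows = []
    for lam in sorted(results):
        quality, carbon = results[lam]
        reduction_pct = round(100 * (base_carbon - carbon) / base_carbon)
        quality_loss = round(base_quality - quality, 1)
        rows.append((lam, reduction_pct, quality_loss))
    return rows


for row in tradeoff_rows(RESULTS):
    print(row)
```

The recomputed reductions are 20, 34, 46, and 57 percent and the quality losses are 1.1, 2.3, 4.0, and 7.6 percentage points, which matches the table.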
-
RQ4: Model Comparison based on GreenAI Score
The GreenAI Score provides a normalized measure whose efficiency rankings tend to invert traditional accuracy hierarchies. Llama3-70b has the best raw quality (94.2%) but also the highest carbon footprint (0.42 g) and the lowest token efficiency (0.65), which gives it the lowest GreenAI Score of 78.3. Mistral-8x7b balances quality (91.5%) and carbon (0.31 g) and reaches a score of 84.7, just behind Gemma2-9b at 85.1. Llama3-8b has the lowest raw quality (87.3%) but the highest GreenAI Score of 86.2 because it scores well on every efficiency dimension. Table 7 shows the overall performance of each model with all component scores and the final GreenAI rankings.
TABLE VII
Model Performance and GreenAI Scores
| Model | Quality | Carbon | Latency | Token Eff. | GreenAI |
|---|---|---|---|---|---|
| Llama3-70b | 94.2% | 0.42g | 1.8s | 0.65 | 78.3 |
| Mistral-8x7b | 91.5% | 0.31g | 1.4s | 0.72 | 84.7 |
| Llama3-8b | 87.3% | 0.18g | 0.9s | 0.81 | 86.2 |
| Gemma2-9b | 88.7% | 0.21g | 1.0s | 0.78 | 85.1 |
-
Distribution Analysis of Routing
The router's model selection patterns show how the sustainability preference affects behavior, which shifts gradually from quality-oriented to sustainability-oriented as λ increases. At λ = 0.0, the router sends 45 percent of requests to Llama3-70b, the most accurate model. At the balance point, Llama3-8b is the most commonly chosen model, handling 35% of requests (mostly routine jobs), while Llama3-70b is reserved for 15% of requests that involve complex reasoning. At λ = 1.0 the large model is never chosen; Llama3-8b handles 60 percent of requests and Mistral-8x7b handles the tasks that require its capability. This gradual change in selection patterns over the entire range of λ values is shown in Figure 4.
Fig. 4. Model Selection Distribution by lambda
-
-
DISCUSSION
-
Interpretation of Findings
The findings indicate that sustainability-aware routing can substantially decrease environmental impact without materially impairing the user experience. The prediction accuracy of online learning remained high under varying conditions, and the convergence rate showed that the available observations were sufficient. The trade-off analysis showed that sustainability and quality are well balanced at λ = 0.3-0.5. The GreenAI Score rankings also established that highly accurate models can perform poorly on composite sustainability metrics, so multi-dimensional assessment is needed.
-
Implications for Practice
The GreenAI framework allows developers to optimize for sustainability without collaborating with providers. Standard API signals can be used to choose low-carbon models and to control the quality-sustainability trade-off by adjusting the tunable parameter λ. The GreenAI Score also provides a standardized benchmark for cross-provider model comparison and promotes market competition based on efficiency.
-
Limitations
Energy consumption was estimated through proxy measures, which introduces inherent uncertainty. Latency measurements include network overhead that varies with geographic routing. The carbon intensity assumption was based on a global average instead of real-time grid data. In addition, only models served through the Groq API were evaluated, which may limit generalizability across providers.
-
Threats to Validity
Providers may change the API backend during the experiments, which can affect internal validity. External validity is restricted to the tasks and providers studied. Construct validity relies on the appropriateness of token-based energy proxies. The limited number of observations can affect conclusion validity, although replication lessens random error.
-
-
FUTURE WORK
Further development involves incorporating additional providers such as OpenAI, Anthropic, and Google to enable broader comparisons. Integrating real-time carbon intensity data would improve estimation accuracy. Extending testing to other activities such as coding, translation, and summarization would improve robustness. Real-world performance will be tested by applying the framework to live applications. Precision can be improved with model-specific energy coefficients obtained from controlled experiments. Cross-task learning can alleviate the cold-start problem. Cost, fairness, and latency can be added as objectives in a multi-objective optimization. Explainable routing decisions would enhance control and trust among developers.
The GreenAI Score might become a widespread assessment standard as sustainability metrics are standardized in the industry. Regulatory tools may support AI emission reporting. Carbon-aware auto-scaling could adjust infrastructure activity according to how clean the available energy is. Open-source release can democratize the sustainable deployment of AI.
CONCLUSION
This paper presented GreenAI, a sustainability scoring and routing framework for black-box LLM APIs. Its main contributions are a data-driven sustainability scoring model, an online learning mechanism, an optimization-based routing engine, and an experimental validation showing that a 30-50 percent reduction in estimated carbon footprint is achievable with only a small loss in quality. The framework relies on observable metadata and proxy estimation techniques and can be deployed on consumer machines. The findings indicate considerable differences in sustainability among providers and show that meaningful carbon comparisons are feasible without access to proprietary infrastructure. GreenAI turns the sustainability of AI deployment into a quantifiable parameter that supports carbon-conscious choices and encourages the inclusion of environmental impact in AI benchmarking.
