DOI : 10.17577/IJERTCONV14IS010036- Open Access

- Authors : Nidhi, Rakshitha P, Mr. Hareesh B
- Paper ID : IJERTCONV14IS010036
- Volume & Issue : Volume 14, Issue 01, Techprints 9.0
- Published (First Online) : 01-03-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
Yield Prediction Through Scalable Tree-Based Learning Frameworks
Nidhi
Dept. of Computer Applications, St Joseph Engineering College, Mangalore, India
Rakshitha P
Dept. of Computer Applications, St Joseph Engineering College, Mangalore, India
Mr. Hareesh B
Associate Professor
Department of Computer Applications, St Joseph Engineering College, Vamanjoor, Mangalore, Karnataka
Abstract: Precise prediction of crop yield is essential for sustainable agriculture and optimal resource utilization. In this paper, a machine learning-based model for wheat yield prediction based on historical and present climate data is proposed. In contrast to recent work on hybrid models with low interpretability and high mean absolute error (MAE 345), our model employs interpretable models, Random Forest and Decision Tree, that yield much improved performance. The data were split based on crop type, and the wheat data were isolated for in-depth analysis. The units of yield were converted from hectograms per hectare (hg/ha) to kilograms per hectare (kg/ha), resulting in a considerable drop in MAE. Historical climate data from Kaggle were preprocessed and normalized, and outlier values were removed. A location-aware, API-based Flask web application supplies present climate data. Random Forest provided the best performance among all the models tried, with an R² score of 0.9549. These results indicate that applying interpretable models to real-time data is highly effective for smart agriculture purposes.
Index Terms: Crop Yield Prediction, Random Forest, Decision Tree, Machine Learning, Real-Time Prediction, Flask Web App, Smart Agriculture.
I. INTRODUCTION
Crop yield is among the factors determining food security and agricultural planning. In our project, wheat yield prediction is accomplished with high-precision, flexible machine learning models. Such methods are superior to conventional estimation because they use environmental and input parameters like rainfall, temperature, pesticides, and area [3]. For ease of use, we have developed a real-time web interface for entering parameters and retrieving predictions from the trained models. Climate data provided to the model are augmented with mean temperature and rainfall levels derived from the NASA POWER API [7].
II. LITERATURE REVIEW
Five research articles [6]–[10], two of which were published in 2025 [9], [10], addressed hybrid models, deep learning, multi-crop scalability, and real-time prediction. Paper [10], which used a hybrid ML model for time-series prediction, reported a high MAE and poorly interpretable output. Paper [9] dealt with global yield mapping but lacked flexible frameworks.
A key challenge in existing models lies in their limited transparency and tightly coupled preprocessing steps. We took this prior work as a reference and addressed its limitations by carrying out crop-wise dataset splitting and yield conversion to kg/ha. We adopted decision tree-based models to enhance accuracy while retaining model transparency and adaptability.
III. PROPOSED SYSTEM
A. System Overview
The system predicts wheat crop yield through preprocessing of publicly accessible data, feature extraction, and training and testing of machine learning models. A web application built on Flask acts as the interface: users enter agricultural data in its input fields and receive real-time yield predictions.
The rest of the research [6]–[8] dealt with multi-view fusion, crop selection, and land suitability.
Fig. 1. System architecture for wheat yield prediction
B. Feature Descriptions
The data includes parameters like year, country, crop type (item), temperature, rainfall, application of pesticide, and yield. Initially yield was recorded in hg/ha, which was later converted to kg/ha for uniformity. The data was brought into a single DataFrame (denoted as yield_df) and was divided into several CSV files based on crop type to enable concurrent processing.
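The unit conversion and crop-wise split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the miniature `yield_df`, its column names (`Area`, `Item`, `hg/ha_yield`), and the helper names are assumptions. The only fixed fact is that 1 kg equals 10 hg, so a value in hg/ha is divided by 10 to obtain kg/ha.

```python
import pandas as pd

# Hypothetical miniature of yield_df; column names and values are assumptions.
yield_df = pd.DataFrame({
    "Area": ["India", "India", "France", "France"],
    "Item": ["Wheat", "Maize", "Wheat", "Maize"],
    "Year": [2010, 2010, 2010, 2010],
    "hg/ha_yield": [28000, 21000, 70000, 90000],
})

HG_PER_KG = 10  # 1 kilogram = 10 hectograms


def add_kg_per_ha(df, source_col="hg/ha_yield"):
    """Return a copy with an extra column holding the yield in kg/ha."""
    out = df.copy()
    out["yield_kg_ha"] = out[source_col] / HG_PER_KG
    return out


def split_by_crop(df, crop_col="Item"):
    """Map each crop name to its own DataFrame (one per future CSV file)."""
    return {crop: part.reset_index(drop=True) for crop, part in df.groupby(crop_col)}


converted = add_kg_per_ha(yield_df)
per_crop = split_by_crop(converted)
wheat_df = per_crop["Wheat"]
print(wheat_df[["Area", "yield_kg_ha"]])
```

Each subset could then be written out with `part.to_csv(...)`, which gives one file per crop as the text describes.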
IV. MACHINE LEARNING MODEL IMPLEMENTATION
A. Dataset
The dataset utilized in this project was downloaded from Kaggle [13]. It contains global crop yield information from 1990 to 2013, with features such as country, crop type (Item), yield in hg/ha (converted to kg/ha), average rainfall, pesticide use, and temperature.
B. Data Preprocessing
This preprocessing was motivated by issues seen in a 2025 hybrid model study [10], in which the original hg/ha units were retained and the reported Mean Absolute Error (MAE) was correspondingly high. We first cleaned our data into a master DataFrame, yield_df, which was then automatically split into individual CSV files by crop type. This modular organization allows more accurate models to be trained and allows filtering for crop-wise yield prediction, in this case for wheat.
C. Algorithm and Code
We employed a range of regression models to compare the predictive performance of different machine learning approaches for wheat yield prediction. For baseline evaluation, the models tested included linear regression along with its regularized forms, ridge and lasso, as well as k-nearest neighbors (KNN), decision tree, and random forest. The models were chosen considering the trade-off between computational complexity, interpretability, and predictive performance. Linear, ridge, and lasso regressions served as linear baselines. KNN was included for its non-parametric, local-approximation character. When all the tested models were compared, decision tree and random forest offered clear interpretability, and random forest achieved the highest predictive accuracy.
Decision tree algorithms are widely preferred for their straightforward logic and interpretability, which makes them suitable for industries such as agriculture, where stakeholders need to understand how a decision was reached. The random forest model, which is a collection of decision trees, significantly improved predictive accuracy and reduced variance, thereby enhancing performance on all test metrics. We also explored XGBoost [8], a robust and mature gradient boosting library; however, it was ultimately not selected for implementation due to its increased training complexity and slower processing speed. Our system achieves a balance between simplicity and effectiveness by combining the benefits of both random forest and decision tree methods in terms of performance, interpretability, and computational efficiency.
Algorithm 1 Wheat Yield Preprocessing and Model Training Pipeline
1: Load the Kaggle dataset
2: Filter crop where Item == Wheat
3: Convert yield from hg/ha to kg/ha
4: Remove extreme outliers
5: Split the data into crop-specific CSV files
6: Perform feature scaling and encoding
7: Train RF and DT models
8: Measure using MAE, R2, and RMSE metrics
This pseudocode represents the modular preprocessing phase in which the global dataset is divided into individual crop-specific CSV files. This structured approach facilitates more targeted and efficient model training, significantly improving prediction accuracy and reducing error, especially for wheat yield forecasting.
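The steps of Algorithm 1 could look roughly like the following scikit-learn sketch. It runs on synthetic data, so the printed numbers do not reproduce the paper's results. Several details are assumptions made only for illustration: the column names, the 1.5 × IQR rule for "extreme outliers", the 80/20 split, and the hyperparameters. The paper does not state which of these the authors used.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the Kaggle data (assumed columns and value ranges).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "Area": rng.choice(["India", "France", "Canada"], size=n),
    "Item": rng.choice(["Wheat", "Maize"], size=n),
    "Year": rng.integers(1990, 2014, size=n),
    "rainfall_mm": rng.uniform(300, 1500, size=n),
    "pesticides_tonnes": rng.uniform(10, 500, size=n),
    "avg_temp": rng.uniform(5, 30, size=n),
})
df["hg/ha_yield"] = (20000 + 15 * df["rainfall_mm"] + 20 * df["pesticides_tonnes"]
                     - 300 * (df["avg_temp"] - 18).abs() + rng.normal(0, 1500, size=n))

# Filter wheat and convert hg/ha to kg/ha (1 kg = 10 hg).
wheat = df[df["Item"] == "Wheat"].copy()
wheat["yield_kg_ha"] = wheat["hg/ha_yield"] / 10

# Remove extreme outliers with an IQR fence (assumed rule).
q1, q3 = wheat["yield_kg_ha"].quantile([0.25, 0.75])
iqr = q3 - q1
wheat = wheat[wheat["yield_kg_ha"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

numeric = ["Year", "rainfall_mm", "pesticides_tonnes", "avg_temp"]
categorical = ["Area"]
X_train, X_test, y_train, y_test = train_test_split(
    wheat[numeric + categorical], wheat["yield_kg_ha"], test_size=0.2, random_state=0)


def make_pipeline(model):
    """Scale numeric columns, one-hot encode categorical ones, then fit the model."""
    pre = ColumnTransformer([
        ("num", StandardScaler(), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    return Pipeline([("pre", pre), ("model", model)])


models = {
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}
results = {}
for name, model in models.items():
    pred = make_pipeline(model).fit(X_train, y_train).predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
        "R2": r2_score(y_test, pred),
    }
    print(name, results[name])
```

Tree-based models do not need feature scaling, so the scaler is kept here only to mirror the listed steps and to allow linear baselines to reuse the same preprocessing.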
Algorithm 2 Flask Route for Real-Time Prediction
1: Define route:
@app.route('/predict', methods=['POST'])
2: Read user inputs: crop, year, area, rainfall, temperature, pesticide
3: If climate values missing:
4: Fetch location using IP address
5: Predict yield using trained Random Forest model
6: Return predicted result to frontend
This algorithm describes the backend process of the Flask web application for real-time prediction of wheat yield. When a POST request arrives on the /predict path, the server reads the farm features submitted by the user through the form, including crop type, area, temperature, rainfall, and pesticide usage.
When certain climate values such as temperature and precipitation are missing, the system uses the user's IP address to estimate the location and obtains the corresponding historical weather data from the NASA POWER API. The completed feature set is then passed to the previously trained random forest or decision tree regression model, which produces the yield estimate. The estimate is then returned to the frontend and displayed to the user.
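A minimal Flask sketch of this request flow is given below, under stated assumptions. `StubModel` stands in for the trained random forest and returns a made-up linear score. `lookup_climate` is a hypothetical helper that returns fixed values where a real version would geolocate the IP address and query NASA POWER. The form field names follow Algorithm 2, but their exact spelling in the authors' application is not known.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


class StubModel:
    """Placeholder for the trained model; the coefficients are invented."""

    def predict(self, rows):
        return [1000.0 + 2.0 * rain + 50.0 * temp + 1.0 * pest
                for (_year, rain, temp, pest) in rows]


model = StubModel()


def lookup_climate(ip_address):
    """Hypothetical helper: a real version would resolve the IP to a location
    and fetch historical rainfall and temperature for it."""
    return {"rainfall": 800.0, "temperature": 22.0}


@app.route("/predict", methods=["POST"])
def predict():
    form = request.form
    crop = form.get("crop", "Wheat")
    area = form.get("area", "")
    # type= makes .get() return None when a field is absent or not numeric.
    year = form.get("year", type=int)
    pesticide = form.get("pesticide", type=float)
    rainfall = form.get("rainfall", type=float)
    temperature = form.get("temperature", type=float)

    if year is None or pesticide is None:
        return jsonify(error="year and pesticide are required"), 400

    # Fill only the climate values the user left out.
    if rainfall is None or temperature is None:
        climate = lookup_climate(request.remote_addr)
        rainfall = climate["rainfall"] if rainfall is None else rainfall
        temperature = climate["temperature"] if temperature is None else temperature

    predicted = model.predict([[year, rainfall, temperature, pesticide]])[0]
    return jsonify(crop=crop, area=area,
                   predicted_yield_kg_ha=round(float(predicted), 2))
```

The route can be exercised without starting a server through `app.test_client().post("/predict", data={...})`, which is convenient for checking the missing-value fallback.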
D. Results and Visualizations
The performance of all machine learning models was measured using common regression metrics: MAE, R² score, and RMSE. Of the tested models (linear regression, lasso, ridge, KNN, decision tree, and random forest), random forest produced the best results. Its mean absolute error is the lowest and its R² of 0.9549 is the highest, indicating close agreement between predicted and actual yields. The improved performance is explained by the ensemble's ability to capture complex relationships while reducing variance.
TABLE I
MODEL EVALUATION METRICS
Model | MAE | R2 Score | RMSE
Linear Regression | 328.38 | 0.9331 | 458.59
Lasso Regression | 350.21 | 0.9267 | 480.13
Ridge Regression | 341.87 | 0.9297 | 470.14
KNN | 292.46 | 0.9363 | 447.69
Decision Tree | 247.28 | 0.9283 | 474.88
Random Forest | 214.56 | 0.9549 | 376.46
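For reference, the three metrics reported in Table I can be computed as in the short sketch below. The sample values are illustrative only and are not taken from the study. The function name is an assumption.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MAE, RMSE, and R2 for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return {"MAE": mae, "RMSE": rmse, "R2": r2}


# Illustrative yields in kg/ha (made up for this example).
actual = [3000.0, 3500.0, 4000.0, 4500.0]
predicted = [3100.0, 3400.0, 4200.0, 4500.0]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

MAE and RMSE are expressed in the unit of the target (kg/ha here), so they scale with the unit chosen, whereas R² is unitless.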
Fig. 2. Comparison of model performances based on MAE
As depicted in Figure 2, the Random Forest model demonstrated superior performance over the other algorithms, achieving greater accuracy and lower prediction errors. Although the Decision Tree regressor maintained relatively consistent results with a Mean Absolute Error (MAE) of 247.28, Random Forest further lowered the MAE to 214.56 by leveraging ensemble learning, averaging outputs from multiple trees to mitigate overfitting and enhance generalization.
In addition to comparing models, a diverging correlation heatmap was generated to examine the relationships among the primary input features. This visualization showed that temperature, rainfall, and pesticide usage are closely linked to crop yield and were therefore significant factors in predicting it. The heatmap supported the significance of these features, which was also validated by the feature importance scores in the random forest model.
Fig. 3. Diverging correlation heatmap
V. RESULT ANALYSIS
Random Forest yields the lowest MAE (214.56) and the highest R² value (0.9549), outperforming all the baseline models and the 2025 hybrid study [10]. The strong performance of Random Forest highlights its robustness in capturing complex, non-linear patterns within the wheat yield dataset. Compared to the Decision Tree, which achieved an MAE of 247.28, Random Forest reduced the error by over 13%, indicating improved generalization across varying climatic and agricultural conditions.
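The "over 13%" figure follows directly from the two MAE values in Table I, as this small check shows. The helper name is an assumption.

```python
def relative_reduction(baseline, improved):
    """Percentage by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline * 100.0


dt_mae = 247.28  # Decision Tree MAE from Table I
rf_mae = 214.56  # Random Forest MAE from Table I
reduction_pct = relative_reduction(dt_mae, rf_mae)
print(f"{reduction_pct:.2f}%")
```

The absolute gap is 32.72 kg/ha, which is roughly 13.2% of the Decision Tree error.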
The feature importance analysis based on the random forest regressor also supports the conclusion that pesticide use, temperature, and rainfall play crucial roles in crop production. The results are not only statistically meaningful but also make practical sense in the agricultural context, because these three factors directly influence plant health, growth rate, and pest control, all of which are crucial for maximizing crop yield.
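A hedged sketch of how such importance scores can be read from a fitted scikit-learn random forest is shown below. The data are synthetic, with an extra pure-noise column added for contrast, so the scores say nothing about the paper's dataset. The feature names and coefficients are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic inputs: three informative features plus one irrelevant column.
rng = np.random.default_rng(42)
n = 500
feature_names = ["rainfall", "temperature", "pesticide", "noise"]
X = np.column_stack([
    rng.uniform(300, 1500, n),
    rng.uniform(5, 30, n),
    rng.uniform(10, 500, n),
    rng.normal(0, 1, n),
])
# Target depends only on the first three columns (invented relationship).
y = 2.0 * X[:, 0] + 40.0 * X[:, 1] + 3.0 * X[:, 2] + rng.normal(0, 50, n)

forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances; they are non-negative and sum to 1.
importances = dict(zip(feature_names, forest.feature_importances_))
for name, score in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {score:.3f}")
```

Impurity-based importances can favor high-cardinality features, so a permutation-based check is a common complement when the ranking matters.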
Random Forest's robustness is rooted in its ensemble nature: many decision trees are combined, which limits overfitting and prediction variability. This property is especially valuable for agricultural data sets affected by noise, missing values, and spatial heterogeneity. By comparison, the Linear and Lasso Regression models could not capture this complexity as well and thus registered higher MAE and lower R² values.
Additionally, the correlation heatmap (Figure 3) visually supports these findings: rainfall and temperature are strongly positively correlated with yield. Pesticide use showed a moderate correlation, indicating an indirect but significant influence on the level of production. The model's credibility is enhanced by this statistical evidence and by its consistency with agronomic principles, making it more reliable in practical applications.
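The correlation matrix behind a heatmap of this kind can be computed with pandas, as sketched below on synthetic data. The column names and the generating relationship are assumptions, so the values will not match Figure 3.

```python
import numpy as np
import pandas as pd

# Synthetic features and yield (invented relationship, for illustration only).
rng = np.random.default_rng(1)
n = 300
frame = pd.DataFrame({
    "rainfall": rng.uniform(300, 1500, n),
    "temperature": rng.uniform(5, 30, n),
    "pesticide": rng.uniform(10, 500, n),
})
frame["yield_kg_ha"] = (2.0 * frame["rainfall"] + 60.0 * frame["temperature"]
                        + 1.0 * frame["pesticide"] + rng.normal(0, 200, n))

# Pearson correlation between every pair of columns.
corr = frame.corr(method="pearson")

# Correlation of each input with yield, strongest first.
yield_corr = corr["yield_kg_ha"].drop("yield_kg_ha").sort_values(ascending=False)
print(yield_corr.round(3))
```

To draw it, the matrix can be passed to a plotting call that uses a diverging colormap centered on zero, so that positive and negative correlations get opposite colors.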
The random forest model did better than other machine learning methods at predicting results and gave clear explanations that helped people make smart decisions in farming. This tool can be used in real-time systems that help farmers, government officials, and businesses in agriculture.
To check if the random forest model was reliable, a scatter plot was made to compare the predicted wheat harvest with the real harvest. As shown in figure 4, the predicted numbers are very close to the real ones, showing that the model is trustworthy and has little error.
Figure 4 graphically represents the proximity between actual and predicted values of wheat yield. The points are tightly grouped around the diagonal reference line, suggesting that the random forest model predicts with minimal deviation and high accuracy. This is evidence of its capability to generalize well across diverse environmental conditions and regions in the data. This consistency makes it suitable for inclusion in real-time decision-making systems in agriculture.
Fig. 4. Predicted vs. Actual wheat yield
In this study, we introduce a complete system for precisely forecasting wheat crop yields using explainable machine learning techniques, including decision trees and random forests. When we tested various models, we found that the random forest model performed better than other common methods, such as linear regression, ridge regression, lasso, k-nearest neighbors, and decision trees, in both accuracy and reliability. The random forest model had the lowest mean absolute error (MAE) of 214.56 and the highest R² value of 0.9549, which is better than the hybrid model from 2025 mentioned in [10].
Table I and figure 2 clearly show the big difference in how well ensemble models work compared to simpler regression models.
Besides making sure the backend model is accurate, we also created a real-time web app using Flask. This app gives users an easy-to-use interface where they can enter data. The user interface, shown in Figure 5, offers:
- Cleanly labeled fields (with Country first)
- Auto-filling support for rainfall, temperature, and country based on location access
- Rounded output prediction values displayed in kg
- Side-by-side comparison of both Decision Tree and Random Forest model predictions
Fig. 5. Flask-based web application interface for real-time wheat yield prediction
This makes the system usable not only by researchers and policymakers but also by farmers and agricultural workers, helping to bridge the gap between AI and real-world agriculture.
Future Work:
- Support for additional crops such as corn and rice
- Real-time integration of weather data
- User-friendly mobile deployment for rural areas and regions with limited connectivity
- Seamless integration of modules and data sources
- Integrating feedback from farmers and providing support in multiple languages
REFERENCES
[1] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
[2] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[3] S. Khaki and L. Wang, "Crop yield prediction using deep neural networks," Front. Plant Sci., 2019.
[4] K. Jha, A. Doshi, P. Patel, and M. Shah, "AI in agriculture: review," Artificial Intelligence in Agriculture, vol. 2, pp. 1–12, 2019.
[5] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel et al., "Scikit-learn: Machine Learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[6] Anonymous, "Adaptive Fusion of Multi-view RS Data for Crop Yield Prediction," 2024.
[7] NASA POWER Project, "Prediction Of Worldwide Energy Resources (POWER)," NASA Langley Research Center.
[8] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," KDD, 2016.
[9] Anonymous, "Mapping Global Yields of Four Major Crops at 5-minute Resolution," 2025.
[10] Anonymous, "Crop Yield Time-Series Data Prediction Based on Hybrid ML," arXiv preprint arXiv:2502.10405v1, 2025.
[11] J. You, X. Li, M. Low, D. Lobell, and S. Ermon, "Deep Gaussian Process for Crop Yield Prediction Based on Remote Sensing Data," in Proceedings of AAAI, vol. 31, no. 1, 2017.
[12] A. Shukla, A. Dadhich, R. Kumar, and A. Sharma, "Application of machine learning in agriculture: Review and perspectives," Materials Today: Proceedings, vol. 46, pp. 9766–9770, 2021.
[13] R. Patel, "Crop Yield Prediction Dataset," Kaggle. Available: https://www.kaggle.com/datasets/patelris/crop-yield-prediction-dataset. Accessed: July 2025.
