Forecasting Crop Yield Using Market Price Trends Using Machine Learning Approach

DOI : 10.17577/IJERTCONV14IS010073

Varshitha Rai

Student, St Joseph Engineering College, Vamanjoor, Mangaluru, India

Sumangala N

Assistant Professor, St Joseph Engineering College, Vamanjoor, Mangaluru, India

Abstract – Agriculture plays a major role in sustaining economies and ensuring food security, especially in developing nations where millions rely on stable crop production and fair market prices. Reliable predictions of crop yields help farmers, traders, and communities manage resources efficiently and respond to market changes with greater confidence. This study examines how historical market price trends can be used to forecast future crop yields, providing an additional signal for agricultural planning. While established statistical models such as ARIMA are commonly used to analyze time series data, they often fall short in capturing the complex, nonlinear patterns found in agricultural systems. To address this gap, this research combines established statistical methods with tree-based techniques such as Random Forest and XGBoost. Using twenty years of crop and market data, this study compares multiple prediction models to assess accuracy and reliability. The results show that the machine learning models outperform traditional statistical models by achieving lower prediction errors and higher goodness-of-fit scores. By highlighting the connection between price signals and yield outcomes, this research demonstrates how integrated forecasting can support better planning and decision-making in agriculture. The findings can inform policies, guide cultivation choices, and contribute to more resilient food systems.

Index Terms – Crop Yield Prediction, Market Price Analysis, ARIMA, Random Forest, XGBoost, Agricultural Forecasting

  1. INTRODUCTION

    Agriculture remains a crucial contributor to economic growth and food security across many developing countries, providing employment and income to millions of rural households. Farmers and supply chain stakeholders depend on accurate information about expected harvest volumes and commodity prices to plan production, allocate resources efficiently, and reduce exposure to market risks. Sudden shifts in output or prices can disrupt entire communities, strain food distribution systems, and affect national food security targets. Historically, crop yield and price forecasts have relied on well-known statistical approaches such as AutoRegressive Integrated Moving Average (ARIMA) models, valued for their simplicity and interpretability. However, the real-world behavior of agricultural systems is rarely linear. Yield and price trends are shaped by unpredictable factors like seasonal weather, pest infestations, policy shifts, and global market changes, making simple models insufficient on their own. Recent advances in data science have opened the door to modern machine learning methods that can handle these complex patterns more effectively. Algorithms like Random Forest and Extreme Gradient Boosting (XGBoost) are designed to process large, multivariate datasets and detect non-linear interactions among influencing factors. These methods are widely adopted for agricultural forecasting tasks that demand greater precision than traditional time series models can deliver alone.

    One key idea explored in this research is the link between historical market prices and future production outcomes. Commodity price trends often influence farmers' planting decisions: a steady increase in the market price of a crop may encourage more farmers to shift land use to that crop, affecting total output in the following season. Integrating price signals with yield forecasting models can help capture this adaptive behavior and produce forecasts that better reflect real-world dynamics. This study proposes a hybrid forecasting framework that brings together conventional statistical models and advanced machine learning algorithms. A twenty-year dataset covering crop production and market prices was analyzed to train, test, and compare multiple models. Performance was evaluated using standard measures of predictive accuracy, including Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and the coefficient of determination (R²). By demonstrating the benefits of combining price trends with modern modeling techniques, this work aims to support better planning for farmers, market analysts, and policymakers alike. Improved forecasting can help stabilize farm incomes, balance supply with demand, and strengthen the resilience of the agricultural sector against market shocks.

  2. LITERATURE REVIEW

    Accurate forecasting of agricultural yield and market prices has received increasing attention in recent years as researchers aim to address the limitations of traditional statistical approaches through advanced machine learning (ML) and hybrid models. [1] introduced a hybrid ARIMA-LSTM model enhanced with a random forest lag selection mechanism to forecast volatile price indices for pulses. Their study demonstrated significant improvements over standard ARIMA and ARIMA-GARCH models, achieving 8–25% lower RMSE and comparable gains for MAPE and MASE, showcasing the potential of combining statistical and deep learning approaches for volatile time series.

    Similarly, [2] proposed an integrated Temporal Convolutional Network (TCN)-XGBoost model to capture both linear and nonlinear dependencies in agricultural price fluctuations. By leveraging TCN's strength in extracting temporal features and XGBoost's robust nonlinear modeling capability, their approach surpassed traditional ARIMA and LSTM models as well as other hybrid frameworks like Transformer-XGBoost and CNN-XGBoost. The model demonstrated an RMSE of 0.26 and MAPE of 5.3%, indicating its robustness in handling seasonal and abrupt price changes in crops like rice, wheat, and corn.

    In the domain of yield prediction, [4] compared Extreme Gradient Boosting (XGBoost) with conventional deep learning techniques for soybean yield estimation using remote sensing satellite images. Their work highlights XGBoost's capacity to process complex, high-dimensional remote sensing data while offering interpretable feature importance insights, such as the influence of the near-infrared spectrum. This demonstrates how feature-based ML pipelines can offer practical alternatives to black-box deep learning models, especially where limited data or interpretability is a concern.

    A related study by [5] focused on forecasting annual rice production in Bangladesh by comparing ARIMA with XGBoost. Their experiments over a 60-year time series found that XGBoost consistently outperformed ARIMA in key error measures, including MAPE, MAE, and RMSE. The XGBoost model achieved a lower MAPE (5.38%) than ARIMA (7.23%) and projected a steady upward trend in annual rice production over the next decade. This study provides evidence that tree-based ensemble learning can outperform ARIMA for time series prediction in agriculture.

    Addressing the challenge of limited labeled data for supervised learning, [6] developed a self-training random forest algorithm for winter wheat yield prediction. By iteratively expanding the annotated dataset using initial model outputs, they demonstrated notable improvements in prediction accuracy, achieving an R² of 0.84 and an RMSE of about 628 kg/ha on the test set. Compared to the standard random forest, the self-training approach reduced prediction errors and showcased a scalable solution for yield prediction where annotated data is scarce.

    Together, these studies highlight the growing shift towards hybrid and machine learning models that combine the strengths of statistical baselines and advanced nonlinear techniques. They underline how robust forecasting models can help communities, farmers, and agribusiness stakeholders make informed, data-driven decisions to improve food security and resource allocation.

  3. DATA AND METHODOLOGY

    1. Data Description

      The data for this research was collected from publicly available sources, including the Ministry of Agriculture & Farmers Welfare, Government of India, and the AGMARKNET database. The time span covers two decades, from 2000 to 2020. The dataset consists of annual crop production statistics (in tons per hectare) and associated wholesale market prices (in INR per quintal) for key staple crops such as rice, wheat, and maize. To expand the scope, climate-related indicators such as yearly rainfall and average temperatures were sourced from the India Meteorological Department (IMD). By combining market trends with weather patterns, the study aims to better capture the factors influencing crop yield variation over time.

    2. Data Preparation

      Before modeling, the raw data underwent cleaning and preparation to ensure accuracy and reliability. Datasets were merged on a year-crop key so that each record holds the production, price, and climate values for one crop in one year. Missing figures were filled using time-series linear interpolation, and mean substitution was applied where appropriate. Outliers, especially in price values that can be volatile due to sudden market shocks, were managed through the Interquartile Range (IQR) technique to reduce distortion. To ensure all variables contributed equally, numerical features were scaled between 0 and 1 using Min-Max normalization. The combined dataset was then split into training (80%) and testing (20%) portions in chronological order, mimicking real-world forecasting where past data is used to predict future values. To strengthen the models' ability to detect time-based effects, lag features such as the previous year's prices and yields were added.
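The preparation steps described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: the column names (`year`, `price`, `yield`), the `prepare` helper, and the demo values are all hypothetical, outliers are capped at the IQR fences (one common way to apply the IQR technique), and the scaling ranges are taken from the training rows only so that no information from the test period leaks backward, which is an assumption about ordering that the text does not spell out.

```python
import numpy as np
import pandas as pd

def prepare(df: pd.DataFrame, train_frac: float = 0.8):
    """Clean, add lag features, split chronologically, and min-max scale."""
    df = df.sort_values("year").reset_index(drop=True)
    cols = ["price", "yield"]

    # Fill gaps with linear interpolation along the time axis.
    df[cols] = df[cols].interpolate(method="linear", limit_direction="both")

    # Cap price outliers at the IQR fences (1.5 * IQR beyond the quartiles).
    q1, q3 = df["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    df["price"] = df["price"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    # Lag features: last year's price and yield become predictors.
    df["price_lag1"] = df["price"].shift(1)
    df["yield_lag1"] = df["yield"].shift(1)
    df = df.dropna().reset_index(drop=True)

    # Chronological split: earlier years train, later years test (no shuffling).
    n_train = int(len(df) * train_frac)
    train, test = df.iloc[:n_train].copy(), df.iloc[n_train:].copy()

    # Min-max scale the features using the training-set range only.
    feats = ["price_lag1", "yield_lag1"]
    lo, hi = train[feats].min(), train[feats].max()
    train[feats] = (train[feats] - lo) / (hi - lo)
    test[feats] = (test[feats] - lo) / (hi - lo)
    return train, test

# Made-up numbers for illustration only (one missing price, one missing
# yield, and one price spike); they are not taken from the study's dataset.
demo = pd.DataFrame({
    "year": range(2000, 2012),
    "price": [900, 950, np.nan, 1020, 1100, 5000, 1180, 1250, 1300, 1340, 1400, 1450],
    "yield": [2.0, 2.1, 2.1, 2.2, np.nan, 2.3, 2.4, 2.4, 2.5, 2.6, 2.6, 2.7],
})
train, test = prepare(demo)
```

Because the test rows are scaled with the training range, scaled test values may fall slightly outside [0, 1]; that is expected when later years exceed anything seen in training.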

    3. Models Applied

      To meet the research goal of predicting future crop yield based partly on past market prices, four models were developed and compared:

      Multiple Linear Regression (MLR): Acts as the baseline by estimating yield and price outcomes using straightforward relationships between explanatory variables like rainfall, fertilizer usage, and past yields and prices. Its simple structure makes the results easy to interpret.

      Equation:

      $Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \varepsilon$

      Where:

      • $Y$ = Predicted yield or price

      • $X_1, X_2, \dots, X_n$ = Independent input variables (e.g., rainfall, fertilizer use, prior yield, prior market price)

      • $\beta_0$ = Intercept term

      • $\beta_1, \dots, \beta_n$ = Coefficients learned by the model

      • $\varepsilon$ = Error term
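As a small illustration of how the intercept and coefficients of a multiple linear regression are estimated, the sketch below fits them by ordinary least squares with NumPy. The helper names `fit_mlr` and `predict_mlr` and the toy data are hypothetical; the study itself lists Scikit-learn among its tools, so this is only meant to show the mechanics.

```python
import numpy as np

def fit_mlr(X, y):
    """Return [intercept, coef_1, ..., coef_n] minimizing squared error."""
    X = np.asarray(X, dtype=float)
    # A leading column of ones lets the solver estimate the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, np.asarray(y, dtype=float), rcond=None)
    return beta

def predict_mlr(beta, X):
    """Apply Y = intercept + sum(coef_i * X_i) to each row of X."""
    X = np.asarray(X, dtype=float)
    return beta[0] + X @ beta[1:]

# Toy data generated exactly from Y = 1.0 + 0.5 * X1 + 2.0 * X2 (no noise),
# so the fit should recover those three numbers.
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.0]])
y = 1.0 + 0.5 * X[:, 0] + 2.0 * X[:, 1]
beta = fit_mlr(X, y)
```

With real data the error term is not zero, so the fitted coefficients only approximate the underlying relationship.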

        Autoregressive Integrated Moving Average (ARIMA): This time-series model was used to capture repeating trends and time-dependent patterns. Optimal parameters (p, d, q) were chosen using autocorrelation and partial autocorrelation plots. Seasonal factors were included where clear patterns existed.

        Equation:

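        $Y_t = c + \phi_1 Y_{t-1} + \phi_2 Y_{t-2} + \dots + \phi_p Y_{t-p} + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t$

        Where:

      • $Y_t$ = Value at time t

      • $c$ = Constant term

      • $\phi_1, \dots, \phi_p$ = Autoregressive coefficients

      • $\theta_1, \dots, \theta_q$ = Moving average coefficients

      • $\varepsilon_t$ = Random error at time t

      • p = Number of AR terms

      • d = Differencing order

      • q = Number of MA terms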
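To make the roles of p, d, and q concrete, here is a deliberately simplified sketch of the special case ARIMA(1,1,0): the series is differenced once (d = 1), a single autoregressive coefficient is fitted to the differences by least squares (p = 1), and there are no moving average terms (q = 0). This is a stand-in for illustration, not the estimation routine the study used; a full ARIMA fit, for example with the Statsmodels library listed under the tools, would also estimate the moving average part. The function names and the toy series are hypothetical.

```python
import numpy as np

def fit_arima_110(series):
    """Estimate the constant and AR(1) coefficient on first differences."""
    y = np.asarray(series, dtype=float)
    dy = np.diff(y)  # d = 1: model the year-to-year changes
    A = np.column_stack([np.ones(len(dy) - 1), dy[:-1]])
    (c, phi), *_ = np.linalg.lstsq(A, dy[1:], rcond=None)
    return c, phi

def forecast_arima_110(series, c, phi, steps):
    """Roll the fitted recursion forward and undo the differencing."""
    y = np.asarray(series, dtype=float)
    level, last_diff = y[-1], y[-1] - y[-2]
    out = []
    for _ in range(steps):
        last_diff = c + phi * last_diff  # AR(1) step on the differenced series
        level = level + last_diff        # integrate back to the original scale
        out.append(level)
    return np.array(out)

# Toy series whose differences follow diff_t = 0.1 + 0.5 * diff_{t-1} exactly,
# so the fit should recover c = 0.1 and phi = 0.5.
diffs = [1.0]
for _ in range(7):
    diffs.append(0.1 + 0.5 * diffs[-1])
series = 2.0 + np.concatenate([[0.0], np.cumsum(diffs)])

c, phi = fit_arima_110(series)
next_two = forecast_arima_110(series, c, phi, steps=2)
```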

        Random Forest Regressor: This ensemble learning method combines many decision trees to model complex interactions in the data. Hyperparameters such as the number of trees and their maximum depth were fine-tuned with Grid Search to boost prediction accuracy.

        Equation:

        $\hat{Y} = \frac{1}{M} \sum_{m=1}^{M} h_m(X)$

        Where:

      • $\hat{Y}$ = Final predicted yield or price

      • $M$ = Number of individual trees

      • $h_m(X)$ = Prediction of the m-th tree

      • The final prediction is the average of all individual tree predictions.
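The averaging rule can be checked directly with Scikit-learn. The sketch below tunes the number of trees and the maximum depth with a grid search and then confirms that the forest's prediction equals the mean of its individual trees' predictions. The data are random numbers invented for the example, the grid values are arbitrary, and the use of `TimeSeriesSplit` for the search folds is an assumption made here to respect chronological order; the text does not say which folds the grid search used.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

# Synthetic stand-in data: 40 "years" of three features and a nonlinear target.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(40, 3))
y = 2.0 + X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(0, 0.05, 40)

# Grid search over the two hyperparameters named in the text.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=TimeSeriesSplit(n_splits=3),  # each fold trains on earlier rows only
    scoring="neg_root_mean_squared_error",
)
search.fit(X, y)
forest = search.best_estimator_

# Rebuild the forest prediction by hand: average the M tree predictions.
per_tree = np.stack([tree.predict(X) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)
```

`manual_average` matches `forest.predict(X)` up to floating-point rounding, which is the averaging formula in action.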

        XGBoost Regressor: XGBoost was chosen for its strong performance with structured datasets and its ability to handle non-linear relationships efficiently. The learning rate, tree depth, and the number of boosting iterations were optimized through repeated trials.

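        Prediction Equation: $\hat{y}_i = \sum_{k=1}^{K} f_k(X_i)$

        Objective Function: $\mathrm{Obj} = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$

        Regularization Term: $\Omega(f) = \gamma T + \tfrac{1}{2} \lambda \sum_{j=1}^{T} w_j^2$

        Where $f_k$ is the k-th regression tree, $l$ is the loss function, $T$ is the number of leaves in a tree, $w_j$ is the weight of leaf $j$, and $\gamma$ and $\lambda$ are regularization parameters.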
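For a feel of how the penalty enters the objective, the sketch below evaluates the standard XGBoost form, a loss term plus γT + ½λ Σ w² for each tree, where T is the number of leaves and w are the leaf weights. Squared error is assumed for the loss purely for illustration, and the leaf weights, γ, and λ are made-up values; nothing here reproduces the study's tuned settings or calls the XGBoost library.

```python
import numpy as np

def regularization(leaf_weights, gamma, lam):
    """Penalty for one tree: gamma * T + 0.5 * lam * sum of squared leaf weights."""
    w = np.asarray(leaf_weights, dtype=float)
    return gamma * w.size + 0.5 * lam * np.sum(w ** 2)

def objective(y_true, y_pred, trees, gamma=1.0, lam=1.0):
    """Squared-error loss plus the penalty summed over all trees."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    loss = np.sum((y_true - y_pred) ** 2)
    penalty = sum(regularization(w, gamma, lam) for w in trees)
    return loss + penalty

# Two hypothetical trees described only by their leaf weights.
trees = [[0.5, -0.5], [1.0, 0.0, -1.0]]
obj = objective([2.0, 3.0], [2.5, 2.0], trees, gamma=1.0, lam=2.0)
# loss = 0.25 + 1.0 = 1.25; penalties = 2.5 and 5.0; total = 8.75
```

Larger γ discourages trees with many leaves, and larger λ shrinks the leaf weights, which is how the penalty limits overfitting.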

    4. Evaluation Metrics

      Each model's performance was evaluated using Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and R-squared (R²). K-fold cross-validation was used to check that the models performed well on unseen data and did not simply memorize training examples.
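For reference, the three metrics can be computed as in the sketch below, using their standard definitions. The function names and the four example values are hypothetical and are not results from the study.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error: penalizes large misses more heavily."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the miss, in the target's units."""
    err = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(err)))

def r2(y_true, y_pred):
    """R-squared: share of variance in y_true explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative yields in tons per hectare.
y_true = [2.0, 2.5, 3.0, 3.5]
y_pred = [2.0, 2.0, 3.0, 4.0]
scores = {"RMSE": rmse(y_true, y_pred), "MAE": mae(y_true, y_pred), "R2": r2(y_true, y_pred)}
```

Lower RMSE and MAE indicate smaller errors, while an R² closer to 1 indicates that more of the variation in the observed values is explained.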

    5. Tools Used

      Python was the primary tool for data handling and model building. Key libraries included NumPy and Pandas for cleaning and managing data, Scikit-learn for machine learning, Statsmodels for time series analysis, and Matplotlib and Seaborn for creating plots and charts.

      Through this robust methodology, the study aims to compare the predictive capabilities of traditional statistical models with modern machine learning algorithms to provide actionable insights for yield and price forecasting in the Indian agricultural sector.

  4. RESULT ANALYSIS

    The performance of the four forecasting models, namely Multiple Linear Regression (MLR), ARIMA, Random Forest, and XGBoost, was evaluated using the prepared dataset covering crop yield and market price trends from 2000 to 2020. After model training and testing, results were visualized through time-series plots, scatter plots, and error comparisons to highlight differences in prediction accuracy and reliability.

    1. Market Price Trends

      Figure 1: Market Price Trend

      Figure 1 illustrates how the average market price per quintal has changed over the period from 2000 to 2024. The price trend shows seasonal shifts and unexpected spikes that reflect factors like supply-demand gaps, policy decisions, and economic events. Understanding this pattern is useful for stakeholders to identify when prices tend to rise, which can guide decisions about what and when to plant or sell.

    2. Crop Yield Trends

      Figure 2: Crop Yield Trend

      Similarly, Figure 2 shows the yield trend for the same period, measured in tons per hectare. Fluctuations in yield levels can be linked to changes in weather, farming methods, and the use of inputs. Studying the yield pattern alongside prices reveals whether production levels are aligned with market demand and helps detect possible mismatches that affect farmers' income.

    3. Combination of Market Price and Yield

      Figure 3: Combined Yield & Market Price

      Figure 3 merges both actual and predicted trends for crop yield and market prices. Historical data is marked with solid lines, while dotted lines show the future estimates for 2025 and 2026. The graph uses two axes: one for yield and another for price. A red horizontal line shows a price point that can be considered a profit margin threshold. If the predicted market price remains above this line while yield stays stable or grows, farmers could choose to continue or increase cultivation for better returns. This combined view demonstrates how integrating production and price forecasts supports smarter planning.
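The threshold reading of the combined chart can be expressed as a simple rule. The sketch below is one hypothetical way to encode it: the function name, the 2% tolerance used to define a "stable" yield, and all numbers are assumptions made for illustration rather than values from the study.

```python
def favourable_outlook(price_forecast, yield_forecast, last_yield,
                       price_threshold, tolerance=0.02):
    """True if every forecast price beats the threshold and yield holds up.

    Yield counts as stable when it stays within `tolerance` (a hypothetical
    2% by default) of the last observed yield, or grows beyond it.
    """
    price_ok = all(p > price_threshold for p in price_forecast)
    yield_ok = all(y >= last_yield * (1 - tolerance) for y in yield_forecast)
    return price_ok and yield_ok

# Illustrative two-year forecasts (INR per quintal, tons per hectare).
continue_or_expand = favourable_outlook([2100, 2150], [3.0, 3.1],
                                        last_yield=3.0, price_threshold=2000)
price_dips = favourable_outlook([1900, 2150], [3.0, 3.1],
                                last_yield=3.0, price_threshold=2000)
yield_falls = favourable_outlook([2100, 2150], [2.5, 2.6],
                                 last_yield=3.0, price_threshold=2000)
```

A rule like this only summarizes the forecasts; it inherits their uncertainty and does not replace judgment about costs, weather risk, or local conditions.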

    4. Actual vs. Predicted Yield

      Figure 4: Actual vs. Predicted Yield

      Figure 4 shows how actual recorded crop yields compare with predicted values over time. The black line marks the real yield, while the colored lines show the forecasts. When the lines stay close, the predictions follow real trends well. Visible gaps highlight where estimates differ from actual yields. This simple comparison helps show how well yield changes are captured and supports better farming and planning decisions.

    5. Model Performance Comparison

      To quantify predictive accuracy, key performance metrics were computed. The MLR and ARIMA models delivered reasonable accuracy for yield forecasting but lagged behind in capturing abrupt price swings. Random Forest reduced prediction errors by approximately 20% compared to ARIMA, while XGBoost consistently achieved the lowest RMSE, demonstrating its robustness in handling high-dimensional, non-linear data.

      Figure 5 presents a bar chart comparing the RMSE values of all models. Shorter bars signify smaller prediction errors, indicating better model accuracy. This figure visually confirms that ensemble learning methods (Random Forest and XGBoost) outperform traditional linear and time-series models.

      Figure 6 shows the R² scores across the models. Higher bars imply a stronger ability to explain the variation in the yield data. Here, XGBoost achieves the highest R², reaffirming its predictive strength.

      Cross-validation results further validated that Random Forest and XGBoost generalize better on unseen data, an essential attribute for real-world deployment.

      Figure 5: RMSE Comparison Across Models

      Figure 6: R² Comparison Across Models

    6. Overall Findings

      The results demonstrate that:

        • Combining historical yield and price data improves prediction reliability.

        • Simple linear or purely time-series models alone are insufficient for capturing complex agricultural dynamics.

        • Modern tree-based ensemble models like Random Forest and XGBoost can better learn hidden patterns and deliver more reliable yield forecasts.

        • By forecasting whether future prices will remain above an economic threshold, these models provide valuable decision support for farmers, helping them choose whether to expand or reduce cultivation of specific crops.

    7. Limitations and Practical Considerations

    While the machine learning models demonstrated strong predictive power, certain limitations persist. The absence of real-time weather data, soil moisture levels, and pest incidence restricts the models from capturing micro-level variations. Moreover, the interpretability of ensemble models remains a challenge for direct field-level adoption by farmers without technical support.

    Nevertheless, the demonstrated improvement in forecasting accuracy highlights the significant potential of integrating modern analytics into agriculture. By combining advanced models with user-friendly interfaces and decision-support tools, stakeholders can leverage these insights for better planning and risk management.

  5. FUTURE SCOPE

    In the coming years, integrating satellite imagery and remote sensing can help improve forecasting accuracy by providing real-time data on crop health and soil conditions. Combining this with deep learning methods like LSTM can better capture changing patterns over time. The use of IoT devices and smart farm sensors can further add precise data on soil nutrients, moisture, and weather at the field level. This can enable more accurate, location-specific predictions to help farmers make timely choices. Developing simple mobile applications can ensure that even small and remote farmers benefit from forecast insights and receive practical advice on planting, watering, and when to sell their crops. Future work can also explore combining statistical, machine learning, and simulation models for better performance. Open data sharing and collaboration among farmers, scientists, and policymakers will be key to making these tools widely useful. Advancing these technologies will help make farming more resilient, profitable, and sustainable in the long run.

  6. CONCLUSION

Accurate forecasting of crop yields and market prices plays an important role in supporting sustainable agriculture and stable farm incomes. This study explored how combining historical production data, price trends, and climatic conditions can improve prediction quality. By comparing different statistical and machine learning methods, the analysis showed that modern models can better handle the complex and changing nature of agricultural data.

The results highlight how valuable it is to use multiple methods and data sources together. Reliable predictions can guide farmers in selecting suitable crops, planning inputs, and making informed selling decisions based on expected price trends. For policymakers, better forecasts can support early warnings for supply shortages, help stabilize markets, and guide strategic planning.

Going forward, improving the availability and quality of agricultural data, along with user-friendly tools, can make advanced forecasting accessible to even small and medium farmers. With continued research and practical applications, data-driven yield and price forecasting can help create a more resilient, profitable, and sustainable farming system.

REFERENCES

  1. S. Ray, A. Lama, P. Mishra, T. Biswas, S. S. Das, and B. Gurung, "An ARIMA-LSTM model for predicting volatile agricultural price series with random forest technique," Applied Soft Computing, vol. 149, Art. no. 110939, Dec. 2023.

  2. T. Zhao, G. Chen, S. Suraphee, T. Phoophiwfa, and P. Busababodhin, "A hybrid TCN-XGBoost model for agricultural product market price forecasting," PLOS ONE, vol. 20, no. 3, Art. no. e0322496, 2025. [Online]. Available: https://doi.org/10.1371/journal.pone.0322496

  3. S. Yang, L. Li, S. Fei, M. Yang, Z. Tao, Y. Meng, and Y. Xiao, "Wheat Yield Prediction Using Machine Learning Method Based on UAV Remote Sensing Data," Drones, vol. 8, no. 7, Art. no. 284, Jun. 2024.

  4. F. Huber, A. Yushchenko, B. Stratmann, and V. Steinhage, "Extreme Gradient Boosting for Yield Estimation Compared with Deep Learning Approaches," Computers and Electronics in Agriculture, vol. 202, Art. no. 107466, 2022.

  5. M. Noorunnahar, A. H. Chowdhury, and F. A. Mila, "A tree-based eXtreme Gradient Boosting (XGBoost) machine learning model to forecast the annual rice production in Bangladesh," PLOS ONE, vol. 18, no. 3, Art. no. e0282754, Mar. 2023.

  6. Y. Shen, B. Mercatoris, Q. Liu, H. Yao, Z. Li, Z. Chen, and W. Wang, "Use Self-Training Random Forest for Predicting Winter Wheat Yield," Remote Sensing, vol. 16, no. 24, Art. no. 4723, Dec. 2024.

  7. C. Zhang, Z. Tao, J. Xiong, S. Qian, Y. Fu, J. Ji, et al., "Research and application of a novel weight-based evolutionary ensemble model using principal component analysis for wind power prediction," Renewable Energy, vol. 232, Art. no. 121085, 2024, doi: 10.1016/j.renene.2024.121085.

  8. M. G. Z. Hashemi, P.-N. Tan, E. Jalilvand, B. Wilke, H. Alemohammad, and N. N. Das, "Yield estimation from SAR data using patch-based deep learning and machine learning techniques," Computers and Electronics in Agriculture, vol. 226, Art. no. 109340, 2024, doi: 10.1016/j.compag.2024.109340.

  9. S. Guan, Y. Wang, L. Liu, J. Gao, Z. Xu, and S. Kan, "Ultra-short-term wind power prediction method based on FTI-VACA-XGB model," Expert Systems with Applications, vol. 235, Art. no. 121185, 2024, doi: 10.1016/j.eswa.2023.121185.

  10. S. Ghimire, R. C. Deo, D. Casillas-Pérez, and S. Salcedo-Sanz, "Boosting solar radiation predictions with global climate models, observational predictors and hybrid deep-machine learning algorithms," Applied Energy, vol. 316, Art. no. 119063, 2022, doi: 10.1016/j.apenergy.2022.119063.

  11. L. Wang, W. An, and F. Li, "Text-based corn futures price forecasting using improved neural basis expansion network," Journal of Forecasting, vol. 43, no. 6, pp. 2042–63, 2024, doi: 10.1002/for.3119.