 Open Access
 Authors : Riya Agrawal, Ulka Chaudhari, Mrunal Gund, Vitthal Gutte
 Paper ID : IJERTV12IS120026
 Volume & Issue : Volume 12, Issue 12 (December 2023)
 Published (First Online): 13-12-2023
 ISSN (Online) : 2278-0181
 Publisher Name : IJERT
 License: This work is licensed under a Creative Commons Attribution 4.0 International License
Optimizing Clustering Performance: A Comparative Study of Traditional K-Means and Hybrid K-Means with Sunflower Optimization
Riya Agrawal
SCET MITWPU
Pune, India
Mrunal Gund
SCET MITWPU
Pune, India
Ulka Chaudhari
SCET MITWPU
Pune, India
Vitthal Gutte
Department of CET MITWPU Pune, India
Abstract: This paper examines the effectiveness of a hybrid approach that combines K-Means and the Sunflower Optimization Algorithm (SOA) to improve clustering performance. Through experiments on various datasets, we compare the capabilities of this hybrid approach with those of traditional K-Means. Our initial findings show that the Hybrid K-Means (HSFO) algorithm surpasses K-Means in clustering quality and convergence speed, although its performance may vary depending on the dataset. This study introduces a promising technique for clustering research by combining local and global optimization methods.
Keywords: Clustering, K-Means, Sunflower Optimization Algorithm, Hybridization, Comparative Analysis, Data Mining.

INTRODUCTION
In this research paper, an in-depth examination of the HSFO algorithm is conducted, exploring its potential impact on clustering. Clustering is the grouping of unlabeled patterns into meaningful clusters; it is one of the most significant procedures of data mining and facilitates the analysis of data.[10] We compare HSFO with the established K-Means algorithm, highlighting its advantages, limitations, and practical applications. Through this analysis, valuable insights are provided to assist researchers and practitioners in assessing the suitability of HSFO for different clustering tasks.[5] The aim is to advance data-driven solutions that are precise and nature-inspired, enhancing the accessibility and effectiveness of data analysis.

CLUSTERING
Clustering in machine learning groups data based on similarity, revealing patterns and supporting accurate predictions. Various algorithms exist, each with different strengths and weaknesses.[2] Clustering is a fundamental aspect of machine learning with numerous applications.

K-Means Clustering:

K-Means efficiently divides data into K clusters through iterative point assignment and centroid updates, but it may converge prematurely, affecting the final result.[7]
Working:

Begin by specifying the desired value of K.

Select K items from the dataset and place the starting centers randomly.

Allocate every individual data point to the cluster whose center is closest in proximity.

Update each cluster center to the mean of the data items within that cluster.

Continue to iterate through steps 3 and 4 until there are no more alterations in cluster assignments.
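The steps above can be sketched in NumPy; this is a minimal illustration on toy data, not the paper's implementation, and the function name and parameters are our own:

```python
import numpy as np

def kmeans(data, k, max_iters=100, seed=0):
    """Minimal K-Means following the steps above."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct random points as the starting centers
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 3: assign each point to the nearest center
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: update each center to the mean of its assigned points
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 5: stop once the centers (and hence assignments) are unchanged
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

data = np.array([[1, 2], [1, 3], [10, 10], [11, 9]], dtype=float)
labels, centers = kmeans(data, k=2)
```

On this toy input the two left-hand points end up in one cluster and the two right-hand points in the other, whichever random initialization is drawn.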


Advantages:

Simple to apply and computationally efficient.

Can be used for a variety of tasks.

Can be used to cluster data of different types.
Sunflower Optimization:
Sunflower optimization (SFO) is a swarm intelligence algorithm inspired by the behavior of sunflowers tracking the sun. SFO is a population-based algorithm, meaning that it maintains a set of solutions, called sunflowers, and iteratively improves them using a set of rules.[1]
Fig 1: Diagram specifying K-Means Clustering
The SFO algorithm works as follows:

Initialize a population of sunflowers.

Calculate the fitness of each sunflower.

Move each sunflower towards the sunflower with the highest fitness.

Iterate over steps 2 and 3 until the algorithm reaches a stable state.
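The loop above can be illustrated with a simplified sketch (the full SFO update also involves pollination and mortality rules not shown here; the function name and parameters are our own):

```python
import numpy as np

def sfo_minimize(fitness, dim, pop_size=20, step=0.1, iters=300, seed=0):
    """Simplified SFO-style loop: keep a population of 'sunflowers'
    and pull every solution toward the current best (the 'sun')."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-5, 5, size=(pop_size, dim))     # step 1: initialize
    for _ in range(iters):
        scores = np.apply_along_axis(fitness, 1, pop)  # step 2: fitness
        best = pop[scores.argmin()]
        # step 3: move toward the best, plus a small random perturbation
        pop = pop + step * (best - pop)
        pop += 0.05 * rng.standard_normal(pop.shape)
    scores = np.apply_along_axis(fitness, 1, pop)
    return pop[scores.argmin()]                        # step 4: stable best

# Example: minimize the sphere function; the optimum is at the origin
best = sfo_minimize(lambda x: float(np.sum(x ** 2)), dim=2)
```

The returned solution settles near the origin, the minimum of the sphere function, because points that drift closer to it become the new "sun" and attract the rest of the population.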
For clustering, SFO evaluates each sunflower by the total within-cluster variance between data points and their assigned cluster centers.[5] Each sunflower is moved toward the sunflower with the highest fitness until stability is attained and the optimal clustering is achieved.
SFO is an effective algorithm for clustering data. It is particularly well-suited to clustering problems with large datasets or complex data distributions.
The SFO algorithm has several advantages over other optimization algorithms, including:


It is robust to noise and outliers.

It's an adaptable tool that can be used to optimize a variety of problems.

It's easy to apply.

Disadvantages:

Can be computationally expensive for large datasets.

May not find the global optimum for complex problems.
Fig 2: Working of Sunflower Optimization Algorithm


LITERATURE REVIEW
A new clustering algorithm called HSFO outperforms traditional methods and achieves high accuracy and improved clustering performance. This hybrid approach offers promising results for medical data analysis.[1]
When it comes to cluster analysis, Competitive K-Means has been shown to be both faster and more precise than Streaming K-Means. However, there is a promising solution available that offers a substantial increase in speed over serial K-Means++. This solution consistently enhances accuracy and is an excellent choice for individuals working with vast amounts of data.[7]
Clustering is useful for finding patterns in machine learning. Different distance metrics impact accuracy, with city block distance reaching 98.1%. Image clustering is challenging, but K-Means family algorithms like IRP-KMeans show promising results. Future work is needed in multiview clustering and integrating descriptors.[2]
A study found that the accuracy of clustering may vary depending on the distance metrics employed, as evidenced by the impact of correlation distance. The most precise metric was city block distance, with an accuracy rate of 98.1%. More generalized metrics could be used for different problems, and different approaches can be used to find optimal clustering solutions.[3]

CHALLENGES

Clustering can be tough without optimal centroids. Reliable methods prioritize stability and consistency despite data changes. Metrics like ARI, NMI, and Davies-Bouldin ensure a robust algorithm.

The second major hurdle involves data preprocessing, encompassing issues like missing data, data redundancy, and data inconsistencies, which pose significant obstacles to effective data clustering.

The computational demands of clustering algorithms can be substantial, particularly when working with large or high-dimensional datasets. Optimizing algorithms for runtime efficiency is crucial.


DESCRIPTIVE
Clustering is a technique used in machine learning to group together similar data points, which can assist in identifying patterns and forecasting future outcomes. This involves organizing data into clusters where intra-cluster points are more similar to each other than to those in other clusters. Various algorithms use metrics such as distance or similarity to achieve this. One common method is K-Means, which partitions data into K clusters through iterative assignment and centroid updates.[6]
Another interesting approach is called Sunflower Optimization (SFO), which mimics the behavior of sunflowers to find clustering solutions. This technique creates a population of sunflowers, with those closer to the sun (representing better solutions) having a higher chance of reproducing.[5]
Hybrid K-Means-based Sunflower Optimization (HSFO): The HSFO algorithm merges K-Means and Sunflower Optimization (SFO), offering advantages like robustness to premature convergence, improved solutions for complex datasets, and effective handling of outliers and noise.[1] The HSFO algorithm works as follows:

Initialize the SFO algorithm with a population of sunflowers.

Calculate the fitness of each sunflower.

Move each sunflower towards the sunflower with the highest fitness.

Repeat steps 2 and 3 until the algorithm converges.

Use the centroids of the best sunflower as the initial K-Means centroids.

Execute the K-Means algorithm to assign data points to clusters.
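Assuming each sunflower encodes a full set of candidate centroids scored by total within-cluster variance, the HSFO steps might be sketched as follows (illustrative only; `hsfo_cluster`, its parameters, and the toy data are our own, not the paper's code):

```python
import numpy as np
from sklearn.cluster import KMeans

def hsfo_cluster(data, k, pop_size=20, iters=50, seed=0):
    """Sketch of HSFO: an SFO-style search picks initial centroids
    (minimizing within-cluster variance), then K-Means refines them."""
    rng = np.random.default_rng(seed)

    def wcss(centroids):
        # total within-cluster sum of squares for a candidate centroid set
        d = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        return float(np.sum(d.min(axis=1) ** 2))

    # step 1: population of "sunflowers", each a full set of k centroids
    pop = np.array([data[rng.choice(len(data), size=k, replace=False)]
                    for _ in range(pop_size)], dtype=float)
    for _ in range(iters):
        scores = np.array([wcss(p) for p in pop])      # step 2: fitness
        best = pop[scores.argmin()]
        pop = pop + 0.1 * (best - pop)                 # step 3: move to best
        pop += 0.05 * rng.standard_normal(pop.shape)   # step 4: repeat
    scores = np.array([wcss(p) for p in pop])
    init = pop[scores.argmin()]                        # step 5: best centroids
    km = KMeans(n_clusters=k, init=init, n_init=1).fit(data)  # step 6
    return km.labels_, km.cluster_centers_

data = np.array([[1, 2], [1, 3], [10, 10], [11, 9], [20, 1], [21, 2]],
                dtype=float)
labels, centers = hsfo_cluster(data, k=3)
```

On this toy input the three well-separated pairs each receive their own cluster, since the SFO stage steers the initial centroids toward a low-variance configuration before K-Means refines them.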


FACTORS FOR COMPARISON

Accuracy
Clustering accuracy is assessed through metrics like purity (intra-class similarity), completeness (correct class assignment), F-measure (the harmonic mean of purity and completeness), and the silhouette measure (data point-cluster alignment). A highly accurate clustering algorithm achieves high scores on these metrics.

Clustering Stability
For reliable clustering, stability is key. Metrics like ARI and NMI measure the similarity between clusterings, while the Davies-Bouldin index checks cluster separation. Consistency in results despite data changes indicates a robust algorithm.
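These three metrics are available in scikit-learn; the toy labels and points below are our own illustration:

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             davies_bouldin_score)

# Two clusterings of the same six points; the second swaps the
# label names (0 <-> 1) and misplaces one point.
labels_a = [0, 0, 0, 1, 1, 1]
labels_b = [1, 1, 1, 0, 0, 1]

ari = adjusted_rand_score(labels_a, labels_b)            # permutation-invariant
nmi = normalized_mutual_info_score(labels_a, labels_b)   # likewise

# Davies-Bouldin scores separation, so it needs the data, not just labels
data = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
db = davies_bouldin_score(data, labels_a)
```

ARI and NMI equal 1.0 for identical clusterings regardless of how the labels are named, and drop toward 0 as agreement decays; a lower Davies-Bouldin value indicates better-separated clusters.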
Fig 3: Different factors used for Comparison

Computational Efficiency
HSFO clustering requires careful consideration of factors such as data points, complexity, and hyperparameters. It is comparable to popular algorithms like K-Means and GMM in terms of running time, making it a good option for data analysis. Relevant factors include:

The size and dimensionality of the dataset.

The desired value of K

The hyperparameters of the algorithm, such as the number of iterations and the population size.


Ease of implementation
Ease of implementation is key when choosing a clustering algorithm. Consider documentation, programming language support, and open-source options. HSFO is a promising new option that outperforms traditional methods for medical data analysis.

Flexibility
Customizable clustering algorithms are essential for optimal performance. Consider hyperparameters, distance metrics, and clustering criteria. HSFO is a promising new option for medical data analysis that outperforms traditional methods and offers flexibility.


COMPARATIVE ANALYSIS

Clustering Accuracy
The input used here for the comparison:
This dataset contains an array of 10 data points with two features each.
Data: array = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]]
Labels give the cluster each data point belongs to.
Labels for the data array = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
Clustering is carried out by calculating the distance between each data point and the cluster centroids, and then assigning the data point to the cluster with the least within-cluster variance.
Accuracy is assessed as the percentage of data points that were correctly assigned to their clusters.
Implementing HSFO and traditional K-Means with the above data, we obtain the HSFO accuracy and the K-Means accuracy.
The output for the sample input is:
HSFO accuracy: 1.0
K-Means accuracy: 0.9
The output shows that HSFO clustering achieves higher accuracy than the traditional algorithm. This is because HSFO is better optimized to find the best cluster centroids.
However, the K-Means algorithm may perform better when the data points have a smaller number of features.
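Because cluster labels are arbitrary, "percentage correctly assigned" is usually computed after finding the best one-to-one matching between predicted cluster labels and true labels. A sketch using SciPy's Hungarian solver follows; the helper function and the predicted labels are our own illustration (chosen so that one of ten points is misplaced, reproducing an accuracy of 0.9), not the paper's code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, pred_labels):
    """Fraction of points whose cluster maps to the right class,
    under the best one-to-one label matching (Hungarian algorithm)."""
    true_labels = np.asarray(true_labels)
    pred_labels = np.asarray(pred_labels)
    k = int(max(true_labels.max(), pred_labels.max())) + 1
    agreement = np.zeros((k, k), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        agreement[t, p] += 1            # contingency table of label pairs
    rows, cols = linear_sum_assignment(-agreement)  # maximize agreement
    return agreement[rows, cols].sum() / len(true_labels)

true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]        # labels from the sample input
pred = [1, 1, 0, 0, 2, 2, 3, 3, 4, 3]        # relabeled, one point misplaced
acc = clustering_accuracy(true, pred)         # -> 0.9
```

The matching step is what makes the score fair: `pred` names the first two clusters differently from `true`, yet only the genuinely misplaced tenth point costs accuracy.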

Clustering Stability
To calculate clustering stability, the Jaccard index is used here to assess the likeness between a pair of datasets.[3]
Jaccard index = |A ∩ B| / |A ∪ B| (1)
where A and B are the two datasets being compared.
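Equation (1) is straightforward to compute over sets; the example items below (pairs of points co-clustered by two hypothetical runs) are our own illustration:

```python
def jaccard_index(a, b):
    """|A ∩ B| / |A ∪ B| for two collections of items."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Pairs of points placed in the same cluster by two hypothetical runs
run1 = {(0, 1), (2, 3), (4, 5)}
run2 = {(0, 1), (2, 3), (4, 6)}
stability = jaccard_index(run1, run2)  # 2 shared pairs of 4 total -> 0.5
```

An index of 1.0 means the two runs agree exactly; values near 0 mean the cluster assignments share almost nothing.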

1) Case 1:
In this case, there is no noise, i.e. no random data irrelevant to the underlying pattern is added.
The input used here for the comparison:
Dataset1: array=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
Dataset2: array = [[1, 2.1], [3.1, 4.1], [5.1, 6.1], [7.1, 8.1], [9.1, 10.1]]
Steps for calculating clustering stability for the traditional K-Means algorithm:

Dataset1 and Dataset2 are clustered using the traditional K-Means algorithm.

The Jaccard index between the clustered assignments for the Dataset1 and Dataset2 is calculated

The clustering stability is the average Jaccard Index for all cluster assignments.

Stability for traditional KMeans algorithm:
For the given input, the stability for the traditional K-Means algorithm is 0.9, which signifies that the traditional K-Means algorithm produced similar cluster assignments for Dataset1 and Dataset2; the clustering is stable.
Steps for calculating clustering stability for the HSFO algorithm:

Dataset1 and Dataset2 are clustered using the HSFO algorithm.

The Jaccard index between the clustered assignments for the Dataset1 and Dataset2 is calculated

The clustering stability is the average Jaccard Index for all cluster assignments.
Stability for HSFO algorithm:
For the given input, the clustering stability for the HSFO algorithm is 0.9, which signifies that the HSFO algorithm produced similar cluster assignments for Dataset1 and Dataset2; the clustering is stable.
2) Case 2:
In this case, unlike Case 1, noise is added, i.e. random data irrelevant to the underlying pattern in the data.
The input used here for the comparison:
Dataset1: array=[[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
Dataset2: array = Dataset1 + np.random.randn(len(Dataset1), 2) * 0.2
For the HSFO algorithm the Jaccard index, i.e. the stability, is 0.8, whereas for the traditional algorithm it is 0.7. Based on the extracted output, this implies that when noise is added to the datasets, HSFO is more stable than the traditional algorithm.
Comparing Case 1 and Case 2, the difference obtained between the Jaccard indices of the HSFO algorithm and the traditional K-Means algorithm implies that when noise is added to the datasets, stability decreases.
However, it also signifies that HSFO is more robust than traditional K-Means.


Computational Efficiency
The efficiency of an algorithm is calculated based on two factors:

Runtime Complexity

Space Complexity

Runtime Complexity:

Asymptotic Complexity:
The asymptotic complexity of HSFO is O(n^2), where n is the number of data points, whereas that of traditional K-Means is O(kn), where k is the number of clusters and n is the number of data points.

Empirical Measurement
The empirical measurement is done by calculating the start time and end time.[4]
Sample dataset to calculate empirical runtime:
Dataset: array = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]]
Based on the start and end times, the empirical runtime for HSFO is 0.123456 seconds and for K-Means is 0.098765 seconds.
The runtime efficiency of traditional K-Means is higher than that of HSFO, as K-Means is much less complex than HSFO. However, the HSFO algorithm can be more efficient when the data has a high noise level.
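The start-time/end-time measurement can be reproduced with a snippet like the following; timings are machine-dependent and will not match the figures above exactly, and the K-Means call stands in for whichever algorithm is under test:

```python
import time
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10],
                 [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]],
                dtype=float)

start = time.perf_counter()                 # record the start time
KMeans(n_clusters=5, n_init=10).fit(data)   # algorithm under test
elapsed = time.perf_counter() - start       # end time minus start time
print(f"K-Means runtime: {elapsed:.6f} s")
```

`time.perf_counter()` is preferred over `time.time()` for short intervals because it uses a monotonic high-resolution clock.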


Space Complexity:
Here the space complexity is calculated by summing the size of the dataset, the size of cluster centroids, and the data structures used by the algorithm.[3]
Breakdown of calculating space complexity for HSFO:
Population size: pop_size * n_clusters (2)
Dataset size: data.shape[0] * data.shape[1] (3)
Cluster centroid size: n_clusters * data.shape[1] (4)
Total space complexity: pop_size * n_clusters + data.shape[0] * data.shape[1] + n_clusters * data.shape[1] (5)
Breakdown of calculating space complexity for traditional K-Means:
Dataset size: data.shape[0] * data.shape[1] (6)
Cluster centroid size: n_clusters * data.shape[1] (7)
Total space complexity: data.shape[0] * data.shape[1] + n_clusters * data.shape[1] (8)
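The element counts in equations (2)-(8) can be computed directly; the `n_clusters` and `pop_size` values below are hypothetical settings for illustration, not the paper's configuration:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6], [7, 8], [9, 10],
                 [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]])
n_clusters, pop_size = 5, 10   # hypothetical settings

# Traditional K-Means: dataset plus centroids, eqs. (6)-(8)
kmeans_space = data.shape[0] * data.shape[1] + n_clusters * data.shape[1]

# HSFO additionally stores the sunflower population, eqs. (2)-(5)
hsfo_space = (pop_size * n_clusters
              + data.shape[0] * data.shape[1]
              + n_clusters * data.shape[1])
```

Whatever the exact settings, the extra `pop_size * n_clusters` population term means HSFO's total is always larger than K-Means' for the same data.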
Using these calculations for the sample input array = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10], [11, 12], [13, 14], [15, 16], [17, 18], [19, 20]], the space complexity for the HSFO algorithm is 400 (0.4) and for traditional K-Means it is 300 (0.3).
Based on these calculations, K-Means requires less memory than HSFO. This is because the HSFO algorithm stores a population of sunflower solutions in memory, whereas K-Means only needs to store the cluster centroids.
K-Means surpasses HSFO in terms of both space complexity and runtime complexity.


Ease of implementation
When choosing between the traditional K-Means algorithm and Hybrid Sunflower Optimization (HSFO), the former is often preferred for its ease of implementation. K-Means is a relatively simple algorithm, and many open-source implementations are available, making it the go-to choice for many users. In contrast, HSFO is a more complex algorithm that requires specialized expertise to implement properly. Additionally, fewer open-source implementations are available for HSFO, further contributing to its reputation as a more challenging option.
Table 1: Tabular Representation of Implementation Factors
Table 2: Tabular Representation of Flexibility Factors

Factors | Traditional K-Means | HSFO
Number of clusters (collections of similar data points) | Traditional K-Means requires the number of clusters to be known in advance | HSFO does not require a predetermined number of clusters
Cluster shape | Traditional K-Means assumes that the clusters are spherical | HSFO can handle clusters of any shape
Noise and outliers | Traditional K-Means is sensitive to noise and outliers | HSFO is more robust to noise and outliers
Overall, Hybrid SFO's flexibility makes it a versatile choice for various data analysis tasks, but its complexity demands expertise. In contrast, Hybrid K-Means, while less flexible, offers a straightforward implementation, making it accessible for those prioritizing simplicity in their clustering approach.
F. Representation of Comparison

Factors | Traditional K-Means | HSFO
Number of steps | Traditional K-Means is a relatively simple algorithm, with only a few steps | HSFO is a more complex algorithm, with many steps
Prior knowledge | Traditional K-Means does not require any prior knowledge of optimization algorithms | HSFO requires a basic understanding of optimization algorithms, such as Sunflower Optimization
Availability of open-source implementations | There are many open-source implementations of traditional K-Means available | There are fewer open-source implementations of HSFO available
Hybrid K-Means offers a more accessible implementation process, thanks to its familiarity and available resources. However, Hybrid SFO becomes a compelling option when the top priority is precise clustering results and computational resources are not constrained, given its potential for higher accuracy but a steeper implementation curve.
E. Flexibility
HSFO is more flexible than Hybrid K-Means because it is suitable for solving a wider range of clustering problems.
Hybrid K-Means is a relatively easy-to-understand algorithm that is well-suited for clustering problems where the number of partitions is known and the clusters are well-defined.[6] HSFO, on the other hand, is suitable for solving more complex clustering problems, such as those with unknown numbers of clusters or clusters that are not well defined.
Fig 4: Result Bar Graph

Factors for Comparison | Traditional K-Means | HSFO
Accuracy | 0.9 | 1.0
Stability (Case 1) | 0.9 | 0.9
Stability (Case 2) | 0.7 | 0.8
Computational Efficiency (Runtime, s) | 0.098765 | 0.123456
Computational Efficiency (Space) | 0.3 | 0.4

Table 3: Tabular Representation of Performance Comparison of Traditional K-Means Clustering and HSFO Algorithms
Table 4: Tabular Representation of Features of Traditional K-Means Clustering and HSFO Algorithms

Feature | Traditional K-Means | HSFO
Accuracy | Less accurate | More accurate
Robustness to noise and outliers | Less robust | More robust
Flexibility | Less flexible | More flexible
Computational efficiency | More efficient | Less efficient


CONCLUSION
When considering the appropriate approach for clustering tasks, it is important to take into account the specific requirements and factors involved. The choice between Hybrid Sunflower Optimization (HSFO) and traditional K-Means can vary based on these considerations. While HSFO is known for its accuracy in clustering, particularly in dealing with noisy data, it requires higher computational complexity in terms of both runtime and space, which may make it less efficient in situations where resources are limited. The traditional K-Means, by contrast, is easier to implement and performs better in runtime and space complexity, making it a fitting option when computational resources are restricted and the number of clusters is known in advance. However, it may be less robust to noisy data and less flexible in dealing with complex clustering problems. Ultimately, it is important to select an approach that aligns with the specific goals and constraints of the clustering task at hand.

REFERENCES
Singapore. https://doi.org/10.1007/9789811920691_34
[14] Vitthal S. Gutte, Maharudra Gitte, "A survey on recognition of plant disease with a help of algorithm," International Journal of Engineering Science and Computing, Vol. 6, Issue 6, June 2016, pp. 131139, ISSN 2321-3361, DOI 10.4010/2016.1691
[15] O. Kadu, S. Sihasane, S. Naik, V. Katariya, V. S. Gutte, "Intelligent Healthbot for Transforming Healthcare," International Journal of Trend in Scientific Research and Development (IJTSRD), Volume 3, Issue 3, Mar-Apr 2019.
[16] Vitthal Gutte, Dr. Kamatchi Iyer, "Cost and Communication Efficient Framework for Privacy Assured Data Management Cloud," International Journal of Engineering and Advanced Technology (IJEAT), ISSN: 2249-8958, Volume 8, Issue 4, April 2019.
[17] Vitthal Gutte, Dr. Devulapalli Sita, "Achieving Cloud Security Using a Third Party Auditor and Preserving Privacy for Shared Data Over a Public Cloud," International Journal of Knowledge and Systems Science (IJKSS), Volume 11, Issue 1, January-March 2020, DOI: 10.4018/IJKSS.2020010104