DOI : 10.5281/zenodo.20282761
- Open Access

- Authors : Dr. Pradeep Kumar Atulker, Dr Rajendra Gupta, Dr. Ankur Khare
- Paper ID : IJERTV15IS051611
- Volume & Issue : Volume 15, Issue 05 , May – 2026
- Published (First Online): 19-05-2026
- ISSN (Online) : 2278-0181
- Publisher Name : IJERT
- License:
This work is licensed under a Creative Commons Attribution 4.0 International License
SCIE_EOC – A Sinusoidal Chaotic and Information Entropy based Elephant-Herding Optimization for Clustering the Large Datasets to Improve the Data Search Capability
Dr Pradeep Kumar Atulker
Assistant Professor, Department of Computer Science and Application, Govt. College Khimlasa, Sagar, (M.P.), India
Dr Rajendra Gupta
Associate Professor, Department of CS & IT, Rabindranath Tagore University, Raisen (M.P.), India, Dr Ankur Khare
Assistant Professor, Department of CS & IT, Rabindranath Tagore University, Raisen (M.P.), India,
Abstract – The data clustering is extensively employed for a vast diversity of applications in many sectors, like learning, curative practices, innovation, and enetrprises in diverse decentralized scenarios. The substantial diverse data are managed and scrutinized using multiple categorization techniques to optimize the excellence of crucial information exchange across decentralized scenarios. In this paper, a Sinusoidal Chaotic and Information Entropy based Elephant-Herding Optimization (SCIE_EOC) scheme is developed and applied to classify the significant volume of diverse datasets, aiming to optimize the data search capability. The primary focus of SCIE_EOC is to attain proficient and precise dispersal of data values in dataset by incorporating the principles of information entropy and employing sinusoidal chaotic for population formation. The incorporation of sinusoidal chaotic and information entropy are intended to boost the assessment and utilization abilities of search agents in optimization scheme and to achieve optimal and exact assortment of cluster center and members. The execution of SCIE_EOC scheme utilizing MATLAB 2022a software on six extensive datasets reveals the remarkable efficacy of SCIE_EOC scheme in comparison to earlier schemes like K-Means, PSO, GWO, and EO as evidenced by factors including Root Mean Square Error, F-Measure, Standard Deviation, Purity Index, Accuracy, Intra-Cluster Distance, Time Complexity and Statistical Exploration.
Keywords: Clustering, Elephant-Herding Optimization, F-Measure, Information Entropy, Purity Index, Sinusoidal Chaotic, Statistical Exploration.
-
INTRODUCTION
The researchers discuss the handling and exploration of big data [1, 2] in many applications, leveraging aspects like storage, performance, and database administration. Pre-processing is utilized to reduce data degradation and boost data accuracy and productivity. A dominant dispersed architecture is employed, enabling fast and economical data examination for both serial and parallel communication [3, 4]. A number of population based optimization scehems [22], like Grey Wolf Optimization (GWO) [5], Genetic Algorithm (GA) and Particle Swarm Optimization (PSO), are used to optimally and accurately examine the vast volume of data in several scenarios [6, 7, 8]. The ornamental cognitive and deep learning using neural networks are applied in big data exploration; in which, case related perceptive is employed for query handling, aiming to moderate query retrieving rate and boost retrieving time. Health information is examined to depict illness with qualitative and quanitative justifications, guranteeing most appropriate medical care [10, 11].
Convex optimization is emphasized as one of the top techniques for large-scale data examination in dissiminated environment, particularly for diverse sources. The convex signal handling techniques is leveraged with the tiniest square scheme for sampling and estimation of millions of data. The proximal operator, unpredictability and disseminated estimation are essential aspects in the optimization of data examination [13, 26]. The deep learning is helped in the execution of dispersed clustering on diverse data utilizing an auto-encoder. The auto-encoder autonomously determines the count of centers and assigns members of clusters with the help of deep learning scheme. The outcomes are examined with the renowned clustering scheme, K-Means [15, 27].
Additionally, a amalgamation of Convolutional Neural Network (CNN) and machine learning schemes is employed to enhance transparency and strictness in disease category picking and handling. The parallel computation, including the auto-encoder, is employed to choose the best grouping strategies for medical health administration and perception established learning around sicknesses [16]. The decision making regarding sickness is executed with the help of artificial intelligence, verified and validated on medical database. Various forms of cancers are examined utilizing support vector machine to ensure appropriate and exact disease handling [17]. The machine learning is additionally utilized with the help of tensor and central operators to efficiently handle and validate the datasets; where, the data validation is carried out through deep learning, applied in diverse platforms like Yahoo and dynamic applications [19, 20].
Grouping, akin to clustering, involves classify numerous datsets into categories for diverse aims. In the medical context, a decision support scheme is employed to group clinical data and forecast diseases [21, 29]. For categorization of medical information, a CNN is utilized with an innovative loss function. The multi-pool system is utilized to confinement the spatial information of primary features in healthcare information, assisting in assessing the severity of disease and deciding suitable therapies [23].
The above literature shows the use of various clustering schemes and optimization schemes like GWO [5], GA and PSO, and K-Means [15, 27] on different datasets to improve data extraction and data searching capability. While some clustering techniques show promising results on particular datasets, their efficiency may vary on other datasets due to the no-free-lunch theorem, indicating that no single clustering scheme is universally optimal for all problems. The rate of convergence is also a primary consideration of optimization methods.
To address these challenges, a new approach called a Sinusoidal Chaotic and Information Entropy based Elephant-Herding Optimization for Clustering (SCIE_EOC) scheme is introduced. This scheme is applied to cluster extensive datasets and is an upgraded version of the Elephant-Herding Optimization (EO) scheme. Unlike traditional qualitative analyses of prior works, the proposed SCIE_EOC scheme focuses on a quantitative consideration of clustering schemes for large datasets.
The main contribution of SCIE_EOC implementation is as follows:
-
The significant objective of SCIE_EOC is to attain an effective dissemination of data component values across diverse resources in a distributed scenario.
-
The data dissemination is accomplished by leveraging sinusoidal chaotic [24, 25] for population generation and information entropy for data values distribution.
-
The ultimate goal is to optimize the distribution of data elements, ensuring efficient utilization of heterogeneous resources within the distributed system.
-
The implementation of SCIE_EOC scheme utilizing MATLAB 2022a software on six extensive datasets as evidenced by factors including Root Mean Square Error, F-Measure, Standard Deviation, Purity Index, Accuracy, Intra-Cluster Distance, Time Complexity and Statistical Exploration in comparison to earlier schemes like K-Means, PSO, GWO, and EO.
Part 1 provides an introduction, while Part 2 presents a literature survey on data clustering and classification schemes, and big data examination. This section explores various parameters related to these fields. In Part 3, the Elephant-Herding Optimization (EO) scheme is discussed, laying the foundation for the proposed SCIE_EOC
approach, which is explained in Part 4. Part 4 includes complete details such as flowcharts, and algorithms related to the SCIE_EOC approach. Part 5 focuses on datasets used in the experiments, performance metrics, and the results obtained from applying the SCIE_EOC approach. It sheds light on how well the proposed method performs in practice. Lastly, Part 6 comprises the conclusions drawn from the study, summarizing the key findings and implications of the research.
-
-
LITERATURE REVIEW
The Decisiion support systems (DSS) [28] discusses the expansion of machinery to manage numerous complex data retrieving from distributed sources. It introduces the concepts of Big Data [32, 34], which are high-tech multidimensional information controlling structures that aid investors in using up-to-date data-driven schemes for problem resolving. The ubiquity of Big Data in nutrition protection is emphasized, as data in the nutrition supply chain is dispersed and diverse in setup, measure, and terrestrial source. Overall, the examination and conversation delivered purpose to encourage the use of Big Data and DSS in nutrition protection, ultimately improving risk impost and ensuring nutrition protection in the context of climate variation.
The clustering [12] is as a widely utilized strategy for grouping data into comparable clusters. However, most existing systems for clustering steadly describe clusters with instances values, and concentration, lacking perfect semantic sense. To address this issue, a new Evolutionary Clustering Algorithm (ECA) is developed that integrates communal class position, population based schemes, Levy flight optimization, and K-Means scheme. The evaluation involves using 32 diverse datasets, assessing their efficiencies using centre and outside clustering actions. In conclusion, the proposed ECA algorithm presents a significant improvement in clustering heterogeneous and multiple-featured datasets with better cluster identification and increased robustness across various dataset characteristics.
The study [30] focuses on association rule mining strategies and their application in college student quality valuation schemes. These strategies are used to discover probable relations between diverse abilities of college students based on their existence and knowledge data. This helps teachers identify individual students’ powers and weaknesses and enables personalized teaching. The objective is to address issues related to associative rule removal schemes and explore the influence of implementing such schemes in college student excellence valuation. The experiment confirms the strength and achievability of the scheme for college student excellence valuation, supporting its practical application in real-world scenarios. Overall, the paper highlights the benefits of using association rule mining strategies in college student excellence valuation systems, paving the way for more effective and personalized teaching strategies.
The paper introduces a novel heuristic approach for data clustering based on the Moth Flame Optimizer (MFO) [9], a new metaheuristic inspired by the intelligence of moths in nature. While the k-means scheme is widely recognized for clustering, its procedure is highly reliant on primary cluster centers and may get trapped in local optima because of weak search proficiency. The MFO is proposed as a potential solution to manage complex problems, and its act has been establish acceptable in numerous researches. To evaluate the MFO effectiveness, experiments are accompanied employing UCI datasets.
The paper [14] discusses the cumulative proficiency to gather a enormous volume of diverse data, with a particular focus on time-series data, which holds a significant amount of untapped information. Previous data mining schemes often have limitations when examining time-series, particularly when dealing with multi-dimensional time-series that need to be examined together to remove meaningful information. In response, the paper implements a different clustering scheme called K-MDTSC (K-Multi-Dimensional Time-Series Clustering), exactly intended to address multi-dimensional time-series clustering. The outcomes illustrate that K-MDTSC produce decent clustering outputs, particularly when dealing with more complicated synthetic datasets. Overall, the paper highlights the significance of multi-dimensional time-series clustering and demonstrates the effectiveness of the proposed K-MDTSC algorithm in both synthetic and real-world scenarios. The findings suggest that K-MDTSC can be a valuable tool in extracting knowledge from complex multi-dimensional time-series data, particularly in industrial applications such as predictive maintenance.
The PSO [31] is a commonly used method to calculate the universal finest of multivariable objevtive functions. In this study, a new PSO scheme is established to exactly calculate the finest value of multidimensional objective functions. The PSO repetition procedure contains of numerous huge repetition steps, every including two phases. In 1st phase, a development practice is employed to successfully discover the multidimensional adjustable area. The 2nd phase employs the traditional PSO scheme to calculate the universal finest value of the function. Overall, the paper presents a new PSO algorithm designed for multidimensional function optimization, demonstrating its effectiveness and superior accuracy through rigorous testing and analysis. This algorithm has the potential to be valuable in various scientific and engineering applications.
The paper [33] focuses on the operational description of clusters to understand their magnitude and composition related features. As tentative schemes only may not offer a whole image of cluster organizations, autonomous theoretical examinations are desired to attain a comprehensive explanation of the symmetrical procedure and features of the clusters. To determine numerous minima and eventually realize the universal minima for the clusters, potential energy surfaces (PES) are discovered. Various optimization schemes are developed and utilized for resolving geometrical isomers in clean essential groups, like Genetic Algorithm (GA). In this article, the optimization schemes, evaluation methods and precision of outcomes achieved by employing the strategies for various cluster type is to be examined.
The paper [18] addresses the serious concern of the Dimensionality Curse, which delays progress in various fields, including bioinformatics. To overcome this challenge, the authors propose a corporation solution that combines scrambling based dimensionality decline scheme with PSO based evaluation methods. The paper focuses on feature assortment for cancer forecast using a complex Spark Distributed scheme called Spark Distributed PSO (SDPSO). The proposed approach integrates Binary PSO (BPSO) with PSO scheme to achieve efficient and effective feature selection. Overall, the proposed SDPSO approach proves to be highly efficient and accurate in feature selection for cancer forecast, showcasing its potential to address the Dimensionality Curse and improve performance in bioinformatics applications.
-
THE ELEPHANT-HERDING OPTIMIZATION (EO) SCHEME
The widespread optimization problems are deteched by the EO, which is a bio-inspired method copied of the nature of elephant herding. There are three general conceptions in EO; (a) Clans are integrated by uniting numerous predeterminated elephants, and the elephant`s population is attained by uniting multiple clans; (b) the movement of different predeterminated male elephants from their clans and their strength of movement from the optimal elephant swarm at every establishment are required; (c) in the clan, the elephants are similarly stayed with the support of a matriarch.
-
Clan Updation Scheme
The location of succeeding elephants in clan Ln is obstructed through matriarch. The location of th elephant is reformed in clan Ln using eq (1).
,, = , + × (, ,) × (1)
Where,
,, & , = The new and current locations of th elephant in clan Ln respectively. = A magnitude term attaning the influence of matriarch Ln on XLn,. { [0,1]}
, = The matriarch (an elephant having highest fitness value in clan Ln).
= Random number. { [0,1]}
The location of th elephant having highest fitness value is reformed in every clan using eq.
(2). (Here, reformation is not performed using eq. (1), i.e., , = , )
,, = × , (2)
Where,
= A term attaining the rein of , on ,,. { [0,1]}
,= The center value of Clan Ln.
The location of th elephant (eq. (3)) is reformed by adding th dimension in eq. (2).
{1 , = }
Where,
1
,, =
× ,,
=1
(3)
= Overall elephants in clan Ln. (Tcn = Overall clans in elephant`s population)
, , = The center value of elephant location in clan Ln with th dimension.
,,= The location of th elephant in clan Ln (th dimension).
-
Separating Scheme
Onward, enhancing the investigation strength of EO scheme, comply that the separating scheme will identify the elephants devising worst fitness value at each iteration as expressed in eq. (4).
, = + ( + 1) × (4)
Where,
& = Extreme and Lowest limit of elephant location.
= Notional and reliable distribution. { [0,1]}
,= Elephant having worst fitness value in the clan Ln.
-
-
SCIE_EOC – A SINUSOIDAL CHAOTIC AND INFORMATION ENTROPY BASED ELEPHANT-HERDING OPTIMIZATION FOR CLUSTERING
The EO scheme is modified with the help of sinusoidal chaotic and information entropy. The effectual consequences for numeric optimization are attained by the EO scheme, which is right fit for disquisitioning and deployment in a massive searching region. However, the data clustering scheme is genuinely incongruent from numeric optimization. At present, the adjustment of numerous essential features is performed, and the application of a sinusoidal chaotic and information entropy for data clustering is carried out using a SCIE_EOC.
-
Component Information Entropy
Within the data clustering tactic, the dispersion of data points occurs in a multi-dimensional region. The level of ambiguity of a probabilistic variable is computed using information entropy; subsequently, this is employed for measurements of components to determine the distribution. The entropy (P) is assessed to separate the values of components by approximating each value to its nearest whole number using eq. (5).
Where,
= log() (5)
=
Pq = qth component entropy (q = 1,2, ,T)
T = Total components in dataset
Mq & mq = qth component maximum and minimum value when categorized. Gu = uth component numerical percentage value.
The maximum entropy is computed for each component in dataset according to eq. (6) during the clan updation scheme of EO population.
Where,
() = log (
1
) (6)
+ 1
MAX(Pq) = The qth component maximum entropy in dataset.
Finally, eq. (7) is used to compute the normalized entropy for each component.
() = ( ) (7)
Where,
NORM(Pq) = The qth component normalized entropy in dataset.
Eq. (7) is iteratively applied to complete data components in dataset, resulting in the generation of a set of normalized entropy (NORM(P)).
NORM (P) = (NORM(P1), NORM(P2), ,NORM(PT))
-
Information Entropy based Clan Updation in EO Scheme
In this context, two mechanisms are incorporated in clan updation to enhance the efficiency of EO scheme. The initial one is original clan updation scheme of EO elucidated in the preceding section. The second mechanism (Algorithm 1) is the revised version of the original clan updation, where O individuals are randomly selected from the current population (1 O OP), and the best individual is designated as the reference individual (Xref,Ln,). The clan direction is steered by ,, , while information entropy is employed to establish an equivalence between population diversity and convergence rate. The increased information entropy of a component denotes the heightened indeterminacy, which adversely affects the rate of convergence. Therefore, the velocity is
improved for a component by relocating it to the position on the basis of the reference individual, universally with the highest probability in contrast to the component with the least information entropy.
Algorithm 1: Information Entropy based Clan Updation in EO Scheme
Algorithm Number of Operations
START
FOR y = 1 to OP DO (OP+1)
P
Pick the optimal choice from O random selections as Xref,Ln, O
FOR z = 1 to Ti DO OP*(Ti+1)
IF arbitrary < NORM (Pq) then OP*Ti
Xnext,Ln, = Xref,Ln,
ELSE
Revise , and generate ,, (eq. (1))
END IF END FOR
END FOR
STOP
-
Sinusoidal Chaotic based Selection of Population of EO Scheme
A upgraded EO is devised using a sinusoidal chaotic to counter imprudent convergence calamity. The inclusion of a sinusoidal chaotic in EO scheme is intended to counter the issue of local optima through the utilization of pseudorandom numbers. The sinusoidal chaotic is assessed to acquire the three parameters (, 1, 2) utilizing eq.
(8) to eq. (11).
= ()2 sin() (8)
+1
= ( 2
) (9)
1,+1
1, )
sin(1,
= ( 2
) (10)
2,+1
2, )
sin(2,
(, , ) [0,1], (0,4] (11)
Where,
0 1,0
2,0
m = Overall iterations
The primary values of EO population (OP) are determined by assessing the sinusoidal chaotic (eq. (8) to eq. (11)), aiming to enhance the effectiveness of EO scheme through suitable utilization of a vast solution space.
-
The Entire SCIE_EOC Scheme for Data Clustering
The amalgamation of information entropy-driven clan updation and sinusoidal chaotic-driven selection of population is employed to achieve an optimal solution for data clustering in SCIE_EOC (Fig. 1) (Algorithm 2). To begin with, eq. (7) is iteratively applied to complete data components in dataset, ensuing in the production of a set of normalized entropy (NORM(P)). The sinusoidal chaotic (eq. (8) to eq. (11)) is employed to prepare the population (OP) of the EO scheme, where each individual is represented as a vector comprising = × , (Ti
= Individual dimension, T = Total components in dataset, R = Total cluster centers). The R cluster centers
positions are set in a vector, with each cluster center being linked to its respective set of T attributes. The initial specific vector values are randomly and uniformly created within the range of lower and upper component values present in dataset for a iteration threshold (ITR). Next, SCIE_EOC is employed to compute the fitness values of all individuals using eq. (1) to eq. (4).
Algorithm 2: The Entire SCIE_EOC Scheme
Algorithm Number of Operations
START
Allocate iteration counter w =1 and iteration threshold (ITR) 1
Compute the entropy for each component to generate normalized entropy (NORM(P)) 1
<>Prepare the population OP of EO by employing sinusoidal chaotic (eq. (8) to eq. (11)) 1
Determine fitness for every elephant OP
WHILE (w < ITR) DO ITR+1
Do Sorting to overall elephants through their fitness ITR
FOR Ln = 1 to Tcn DO ITR*(Tcn+1)
FOR = 1 to DO ITR*Tcn*(TLn+1) Do information entropy based clan updation (Algorithm 1) 2*OP*Ti+3*OP+1
Revise , and generate ,, (eq. (1)) ITR*Tcn*TLn IF , = , THEN ITR*Tcn*TLn Revise , and generate ,, (eq. (2)) ITR*Tcn*TLn
END IF END FOR
END FOR
FOR Ln = 1 to Tcn DO ITR*(Tcn+1)
Adjust the Ln clan elephant with worst fitness (eq. (4)) ITR*Tcn
END FOR
Compute population by freshly revised positions
ITR
w = w+1;
ITR
END WHILE
Return optimal elephant with maximum fitness
1
STOP
Start
Prepare the population by employing sinusoidal chaotic
-
Allocate iteration counter (w) and compute entropy of each component
Determine fitness of every elephant
w = 0
Do information entropy based clan updation (Algorithm 1)
Revise the fitness values of next individuals
w = w +1
Is end condition reached?
No
Yes
End
Return the optimal solution
Fig. 1: The flowchart of SCIE_EOC Scheme
-
RESULTS AND ANALYSIS
This segment presents a concize overview of the six extensive datasets (Table 1) and performance metrics for the SCIE_EOC scheme. All the schemes are implemented within the MATLAB 2022a tool running on a windows 10 operating system and validated on six comprehensive datasets. The results for the SCIE_EOC scheme are acquired through 500 iterations, considering metrics like Root Mean Square Error, F-Measure, Standard Deviation, Purity Index, Accuracy, Intra-Cluster Distance, Time Complexity and Statistical Exploration. Comparative analysis is conducted against earlier scheme like K-Means, PSO, GWO, and EO over 25 distinct runs.
-
Datasets
The SCIE_EOC scheme is executed on six substantial, separate datasets obtained from UCI repository. The datsets are HCV, Blood Transfusion, Thoraic Surgery, HTRU, Diabetic Retinopathy and Cancer (Table 1).
Table 1: Datasets
Sr.
No.
Datasets
Total
Components/ Instances
Total
Features/Attribu tes
Total
Classes/Cluster s
1
HCV
1385
29
4
2
Blood Transfusion
748
4
2
3
Thoraic Surgery
470
17
2
4
HTRU
12330
9
2
5
Diabetic Retinopathy
1151
18
2
6
Cancer
683
9
2
-
Performance Metrics
The SCIE_EOC scheme is assessed considering metrics like Root Mean Square Error, F-Measure, Standard Deviation, Purity Index, Accuracy, Intra-Cluster Distance, Time Complexity and Statistical Exploration.
-
Standard Deviation
The fixed data grouping within the mean value domain is attained using a geometric characteristic reffered to as Standard Deviation (V). The finest grouping is achieved by employing the minimum value of V (eq. (12)).
= ( )
||
(12)
Here,
||= Dataset Length
A = Dataset Components Values
= Mean Value of Components in Dataset
Table 2: Standard Deviation
Datasets
K-Means
PSO
GWO
EO
SCIE_EOC
HCV
0.13534
0.05263
0.03253
0.03036
0.00892
Blood Transfusion
0.00463
0.18393
0.14342
0.17386
0.00543
Thoraic Surgery
0.00463
0.18364
0.00327
0.00183
0.00352
HTRU
0.27831
0.19373
0.16384
0.14253
0.11854
Diabetic Retinopathy
0.23748
0.11623
0.06274
0.04854
0.00546
Cancer
0.04253
0.12643
0.11837
0.08743
0.00846
Standard Deviation
0.3
0.25
0.2
0.15
0.1
0.05
0
K-Means
PSO GWO EO
SCIE_EOC
Datasets
Standard Deviation
According to Table 2 and Fig. 2, the SCIE_EOC determined the least value of V for the six comprehensive dataset. In Standard Deviation (V), the SCIE_EOC achieves 13%, 24%, 38% and 54% higher performance than EO, GWO, PSO and K-Means respectively, across all six datasets.
Fig. 2: Standard Deviation
-
Root Mean Square Error
The value of Root Mean Square Error (E) quantifies the discrepancy between predicted and computed values. The finest grouping is achieved by employing the minimum value of E (eq. (13)).
= 1
||
||
(
=1
2
)
(13)
Here,
& = bth component computed nad predicted value in dataset.
Table 3: Root Mean Square Error
Datasets
K-Means
PSO
GWO
EO
SCIE_EOC
HCV
0.473
0.273
0.098
0.036
0.016
Blood Transfusion
0.427
0.275
0.294
0.193
0.074
Thoraic Surgery
0.538
0.285
0.189
0.154
0.034
HTRU
0.395
0.296
0.353
0.173
0.045
Diabetic Retinopathy
0.486
0.376
0.186
0.039
0.032
Cancer
0.546
0.365
0.276
0.143
0.017
Root Mean Square Error
0.6
0.5
0.4
0.3
0.2
0.1
0
K-Means
PSO GWO EO
SCIE_EOC
Datasets
Root Mean Square Error
The data in Table 3 and Fig. 3 reveal that the SCIE_EOC yielded the least value of E for the six comprehensive dataset. According to Root Mean Square Error (E), the SCIE_EOC surpasses EO, GWO, PSO and K-Means by 17%, 59%, 63% and 84% respectively, across all six datasets.
Fig. 3: Root Mean Square Error
-
Intra-cluster Distance
To begin with, the distances between data components within a cluster are computed. Afterward, the intra-cluster distance is determined as the mean of these distances. The finest grouping is achieved by retaining the minimum value of intra-cluster distance. By computing the mean of distances from the group center to entire data componentes within a group, he average distance is obtained, and this process is repeated for each group. Ultimately, the mean intra-cluster distance is computed by aggregating the mean distances from all groups.
Table 4: Average Rank (Intra-Cluster Distance)
Dataset
K-Means
PSO
GWO
EO
SCIE_EOC
HCV
362.46 (5)
283.24
(4)
212.74 (3)
184.34 (2)
142.37 (1)
Blood Transfusion
4.7352 (4)
5.6345
(5)
4.3654 (3)
2.8643(2)
2.6453 (1)
Thoraic Surgery
2152.8 (5)
1432.2
(3)
1927.8 (4)
1214.2 (2)
1164.3 (1)
HTRU
74.372 (3)
82.383
(4)
94.287 (5)
64.382 (2)
53.251 (1)
Diabetic Retinopathy
8.2872 (5)
6.3423
(3)
7.2537 (4)
5.3627 (2)
3.2836 (1)
Cancer
36.382 (5)
27.386
(4)
19.283 (3)
15.374 (2)
9.8353 (1)
Average Rank ()
4.5
3.83
3.67
2
1
Average Rank
5
4
3
2
1
0
K-Means
PSO GWO EO
SCIE_EOC
Datasets
Average Rank
According to Table 4 and Fig. 4, the SCIE_EOC determined the least value of intra-cluster distance for the six comprehensive dataset. In intra-cluster distance, the SCIE_EOC achieves 18%, 32%, 43% and 51% higher performance than EO, GWO, PSO and K-Means respectively, across all six datasets. The schemes are ranked based on the average of their intra-cluster distances, ranging from the minimum to maximum (from 1 to 4.5).
Fig. 4: Average Rank (Intra-Cluster Distance)
-
Purity Index
Purity (Pr) serves as a measure of the effectiveness of grouping (clustering) schemes, reflecting the precision of generated clustering for data components. Consequently, all components belonging to a specific group can be given to a specific group. The Purity Index (Pi) is computed using eq. (14) and eq. (15) based on purity (Pr). Maximum purity is attained with the Pi value being close to 1.
() =
(||)
||
(14)
=
=1
(||())
||
(15)
Here,
() = th group/cluster purity.
||= th group/cluster Length.
||= Overall data components of µth class consigned to th group/cluster.
Table 5: Purity Index
Dataset
K-Means
PSO
GWO
EO
SCIE_EOC
HCV
0.76
0.83
0.82
0.87
0.91
Blood Transfusion
0.73
0.83
0.84
0.89
0.93
Thoraic Surgery
0.78
0.82
0.84
0.88
0.91
HTRU
0.76
0.85
0.85
0.89
0.92
Diabetic Retinopathy
0.75
0.83
0.85
0.86
0.9
Cancer
0.77
0.81
0.83
0.87
0.9
Purity Index
1
0.8
0.6
0.4
0.2
0
K-Means
PSO GWO EO
SCIE_EOC
Datasets
Purity Index
The data in Table 5 and Fig. 5 reveal that the SCIE_EOC yielded the extreme value of Pi for the six comprehensive dataset. According to Purity Index (Pi), the SCIE_EOC exceeds better than EO, GWO, PSO and K-Means by 5%, 12%, 14% and 21% respectively, across all six datasets.
Fig. 5: Purity Index
-
F-Measure
To begin with, Precision (N) and Recall (RC) are determined to retrieve the data (eq. (16) an eq. (17)). Afterward, N and RC are integrated to assess F-Measure (F) (eq. (18) and eq. (19)).
(, ) = ||
||
(, ) = ||
||
(16)
(17)
(, ) =
2 × (, ) × (, )
(, ) + (, ) (18)
= || {(, )}
||
(19)
Where,
=1
||= µth class length.
Table 6: F-Measure
Dataset
K-Means
PSO
GWO
EO
SCIE_EOC
HCV
0.74
0.81
0.8
0.85
0.9
Blood Transfusion
0.7
0.8
0.81
0.86
0.91
Thoraic Surgery
0.75
0.79
0.82
0.85
0.89
HTRU
0.73
0.84
0.83
0.85
0.9
Diabetic Retinopathy
0.74
0.82
0.83
0.83
0.89
Cancer
0.75
0.78
0.8
0.84
0.88
F-Measure
1
0.8
0.6
0.4
0.2
0
K-Means
PSO GWO EO
SCIE_EOC
Datasets
F-Measure
According to Table 6 and Fig. 6, the SCIE_EOC determined the extreme value of F for the six comprehensive dataset. In F-Measure (F), the SCIE_EOC attains 6%, 13%, 16% and 20% greater performance than EO, GWO, PSO and K-Means respectively, across all six datasets.
Fig. 6: F-Measure
-
Accuracy
The Accuracy (Acy) is determined based on the distribution of precise clusters (i.e., the assessment of percentage of correct decisions) and detailing the percentage of clusters in the dominant group (eq. (20)).
=
=1
||
100% (20)
Table 7: Accuracy
Dataset
K-Means
PSO
GWO
EO
SCIE_EOC
HCV
78
85
86
90
93
Blood
Transfusion
79
83
84
87
91
Thoraic
Surgery
77
81
80
86
90
HTRU
76
84
83
87
92
Diabetic
Retinopathy
78
82
84
87
91
Cancer
79
85
86
90
93
Accuracy
100
80
60
40
20
0
K-Means
PSO GWO EO
SCIE_EOC
Datasets
Accuracy (%)
The data in Table 7 and Fig. 7 reveal that the SCIE_EOC yielded the extreme value of Acy for the six comprehensive dataset. According to Accurcy (Acy), the SCIE_EOC exceeds better than EO, GWO, PSO and K-Means by 6%, 11%, 15% and 22% respectively, across all six datasets.
Fig. 7: Accuracy
The utilization of information entropy in conjunction with the EO scheme for clustering ensures precise and proficient dispersal of data values within the dataset. The sinusoidal chaotic is incorporated into preliminary population production of EO scheme to augment the search and utilization capabilities of elephants within clan, considerably employed for the cluster member distribution scheme. Consequently, the SCIE_EOC scheme yields optimally produced cluster centers and members of excellent quality. As a result, SCIE_EOC delivers better outcomes than EO, GWO, PSO and K-Means schemes.
-
-
SCIE_EOC Scheme Time Complexity
The time complexity is assessed by performing the clustering and enumerating the total operations. The step cost is established as 1 unit for each operation. The Full Operation Cost (FOC) is derived by applying algorithm 2 along with eq. (21) and eq. (22).
= 1 + 1 + 1 + + + 1 + + ( + 1) + 2 + 3 + 1 +
( + 1) + + + +
( + 1) + + + + 1 (21)
= 4 + 4 + 2 + 6 + 4 + 6 (22)
Gather that all parameters are alike in eq. (22) in worst case; accordingly, eq. (23) is derived as follows:
= 43 + 62 + 10 + 6 (23)
In the worst case scenarios, for SCIE_EOC, EO, GWO and PSO schemes, the time complexities are O(N3) and for K-Means, it is O(N2). As a result, all schemes exhibit polynomial time complexity.
-
Statistical Exploration
A statistical assessment is carried out to ascertain the magnitude of noteworthy distinctions among the impacts of clustering schemes. In the context, a distribution free Friedman Assessment (FA) is conducted to explore distinctions among the set of sequential relevant variables. Every clustering scheme is equivalently productive, supporting the null hypothesis (0).
The Friedman Assessment (FA) is expressed through eq. (24).
=
12
( + 1)
5
[()2=1
( + 1)2 4
] (24)
Where,
K = Total Datasets
C = Total Clustering Schemes
= th Scheme Average Rank
The FA critical value of 2.24893 is determined from F-distribution table [35] by means of (C – 1) and (C 1) * (K 1) degrees of freedom lying within in the range (5 – 1) = 4 and (5 1) * (6 1) = 20 for applying 5 clustering schemes (C = 5) to 6 datasets (K = 6) having = 0.10 (confidence point). The estimated FA value exceeds the critical value for null hypothesis (0) rejection; otherwise 0 is accepted. The FA estimated value (eq. (24)) is 20.13072 for applying 5 clustering schemes (C = 5) to 6 datasets (K = 6) having = 0.10. Consequently, the estimated FA value is above the crtical FA value, resulting in the rejection of 0.
Thus, a post hoc evaluation is carried out using Holm scheme; in which proposed SCIE_EOC is statistically compared alongside other clustering schemes in this assessment. At the beginning, the Z-score is derived through eq. (25) and eq. (26) and subsequently, Probability (H) is determined utilizing Z-score and normal distribution
table [36]. At the end, value (Table 8) is derived through .
Where, 0.91
=
= ( + 1)
6 ×
( )
(26)
(25)
Table 8: Holm Strategy Results
Approaches
Z-value
H-value
( )
Hypothesis
1
K-Means
-3.8461
0.00006
0.025
Rejected
2
PSO
-3.1099
0.00071
0.033
Rejected
3
GWO
-2.9341
0.00169
0.05
Rejected
4
EO
-1.0989
0.13786
0.10
Accepted
According to Table 8, the ( )
value exceeds the
value, suggesting the rejection of the hypothsis for
overall cases. Consequently, as per the above examination, the proposed SCIE_EOC and EO schemes show better clustering capabilities in comparison to GWO, PSO and K-Means schemes.
-
-
CONCLUSIONS AND FUTURE WORK
In this paper, the extensive use of data clustering is discussed in various sectors, highlighting its importance in decentralized scenarios for optimizing information exchange. Here, a Sinusoidal Chaotic and Information Entropy based Elephant-Herding Optimization based Clustering (SCIE_EOC) is introduced, aiming at efficiently classifying diverse datasets and enhancing data search capabilities. The SCIE_EOC focuses on achieving precise data value distribution through the integration of information entropy and sinusoidal chaotic techniques for population formation. The SCIE_EOC scheme is implemented on MATLAB 2022a software on six extensive datasets and demonstrated the remarkable efficacy, outperforming traditional schemes like K-Means, PSO, GWO and EO across multiple evaluation factors, including Root Mean Square Error, F-Measure, Standard Deviation, Purity Index, Accuracy, Intra-Cluster Distance, Time Complexity and Statistical Exploration. Overall, SCIE_EOC is shown great promise in accurately clustering diverse datasets and improving optimization performance. Going forward, the proposed scheme is predicted to forecast and validate with vast databases within the domain of bigdata. Moreover, interconnected communication is expected to provide and verify in the structure of the Internet of Things (IoT) in future.
REFERENCES
-
R. J and K. K., Diabetes data classification using whale optimization algorithm and back propagation neural network, Int. Res. J. Pharm., Vol. 8(11), pp. 219-222, 2017. http://dx.doi.org/10.7897/2230-8407.0811242
-
S. Bhaskaran, R. Marappan and B. Santhi, Design and Analysis of a Cluster-Based Intelligent Hybrid Recommendation System for E-Learning Applications, Mathematics, MDPI, Vol. 9 (107), pp. 1-21, 2021. (https://doi.org/10.3390/math9020197)
-
S. Kiruthika, EVRM. Kalaimani, R. Sudha and K. Suganthi, ACO Feature selection and Novel Black Widow meta-heuristic Learning rate optimized CNN for Early diagnosis of Parkinson`s disease, Turkish Journal of Computer and Mathematics Education, Vol. 12 (7),
pp. 809-817, 2021.
-
D. Giordano, M. Mellia and T. Cerquitelli, K-MDTSC: K-Multi-Dimensional Time-Series Clustering Algorithm, Electronics, MDPI, Vol. 10 (1166), pp. 1-21, 2021. (https://doi.org/10.3390/electronics10101166)
-
A. Khare, R. Gupta and P. K. Shukla, A Grey Wolf Optimization Algorithm (GWOA) for Node Capture Attack to Enhance the Security of Wireless Sensor Network, International Journal of Scientific & Technology Research, Vol. 9 (3), pp. 206-209, 2020.
-
D. Sonntag, and H. J. Profitlich, An architecture of open-source tools to combine textual information extraction, faceted search and information visualisation, Artificial Intelligence In Medicine, Elsevier, Vol. 93, pp-13-28, 2019. (https://doi.org/10.1016/j.artmed.2018.08.003)
-
X. Cui, Y. Li, J. Fan, T. Wang and Y. Zheng, A Hybrid Improved Dragonfly Algorithm for Feature Selection, IEEE Access, Vol. 8,
pp. 155619-155629, 2020.
-
F. L. Vinmalar and A. K. Kombaiya, An Improved Dragonfly Optimization Algorithm based Feature Selection in High Dimensional Gene Expression Analysis for Lung Cancer Recognition, International Journal of Innovative Technology and Exploring Engineering (IJITEE), Vol. 9 (8), pp. 896-908, 2020.
-
T. Singh, N. Saxena, M. Khurana, D. Singh, M. Abdalia and H. Alshazly, Data Clustering Using Moth-Flame Optimization Algorithm, Sensors, MDPI, Vol. 21 (4086), pp. 1-19, 2021.
-
M. Daouda, and M. Mayo, A survey of neural network-based cancer prediction models from microarray data, Artificial Intelligence in Medicine, Elsevier, Vol. 97, pp-204-214, 2019. (https://doi.org/10.1016/j.artmed.2019.01.006)
-
X. Cui, Y. Li, J. Fan, T. Wang and Y. Zheng, A Hybrid Improved Dragonfly Algorithm for Feature Selection, IEEE Access, Vol. 8,
pp. 155619-155629, 2020.
-
B. A. Hassan and T. A. Rashid, Multi-disciplinary Ensemble Algorithm for Clustering Heterogeneous Datasets, Neural Computing and Applications, pp. 1-30, 2021.
-
H. Chantar, M. Tubishat, M. Essgaer and S. Mirjalili, Hybrid Binary Dragonfly Algorithm with Simulated Annealing for Feature Selection, SN Computer Science, Vol. 2(295), pp. 1-11, 2021.
-
D. Giordano, M. Mellia and T. Cerquitelli, K-MDTSC: K-Multi-Dimensional Time-Series Clustering Algorithm, Electronics, MDPI, Vol. 10 (1166), pp. 1-21, 2021.
-
V. M. Herrera, T. M. Khoshgoftaar, F. Villanustre and B. Furht, Random forest implementation and optimization for Big Data analytics on LexisNexiss high performance computing cluster platform, Journal of Big Data, pp-1-36, 2019.
-
Y. Denga, A. Sanderb, L. Faulstichb, and K. Deneckea, Towards automatic encoding of medical procedures using convolutional neural networks and autoencoders, Artificial Intelligence in Medicine, Vol. 93, pp-29-42, 2019. (https://doi.org/10.1016/j.artmed.2018.10.001)
-
N. Shahid, T. Rappon, and W. Berta, Applications of artificial neural networks in health care organizational decision-making: A scoping review, PLOS ONE, pp-1-22, 2019. (https://doi.org/10.1371/journal.pone.0212356)
-
K. Tadist, F. Mrabti, N. S. Nikolov, A. Zahi and S. Najah, SDPSO: Spark Distributed PSO-based approach for feature selection and cancer disease prognosis, Journal of Big Data, Vol. 8 (19), pp. 1-22, 2021.
-
S. Mitta, Cognitive Computing Architectures for Machine (Deep) Learning at Scale, Proceedings, MDPI, Vol. 1 (186), 2019.
-
T. B. Nun and T. Hoefler, Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis, cs.LG, pp-1-47, 2019.
-
B. A. Tama and S. Lim, A Comparative Performance Evaluation of Classification Algorithms for Clinical Decision Support Systems, Mathematics, MDPI, Vol. 8 (1814), pp. 1-25, 2020. (doi:10.3390/math8101814)
-
A. N. Habowski, T. J. Habowski and M. L. Waterman, GECO: gene expression clustering optimization app for non-linear data visualization of patterns, BMC Bioinformatics, Vol. 22 (29), pp. 1-13, 2021.
-
B. Hea, Y. Guana, and R. Dai, Classifying medical relations in clinical text via convolutional neural networks, Artificial Intelligence In Medicine, Elsevier Vol. 93, pp-43-49, 2019. (https://doi.org/10.1016/j.artmed.2018.05.001)
-
A. Khare, P. Shukla and S. Silakari, Secure and Fast Chaos based Encryption System using Digital Logic Circuit, I. J. Computer Network and Information Security, MECS, Vol. 6, pp. 25-33, 2014.
-
A. Khare, P. K. Shukla, M. A. Rizvi and S. Stalin, An Intelligent and Fast Chaotic Encryption using Digital Logic Circuits for Ad-Hoc and Ubiquitous Computing, Entropy, MDPI, Vol. 18 (201), pp. 1-27, 2016.
-
R. Krishnamurthi, A. Kumar, D. Gopinathan, A. Nayyar and B. Qureshi, An Overview of IoT Sensor Data Processing, Fusion, and Analysis Techniques, Sensors, MDPI, Vol. 20 (6076), pp. 1-23, 2020. (doi:10.3390/s20216076)
-
K. Demertzis, K. Rantos and G. Drosatos, A Dynamic Intelligent Policies Analysis Mechanism for Personal Data Processing in the IoT Ecosystem, Big Data and Cognitive Computing, MDPI, Vol. 4 (9), pp. 1-16, 2020. (doi:10.3390/bdcc4020009)
-
G. Talari, E. Cummins, C. McNamara and J. O`Brien, State of the art review of Big Data and web-based Decision Support Systems (DSS) for food safety risk assessment with respect to climate change, Trends in Food Science & Technology, Elsevier, Vol. 126, pp. 192-204, 2022.
-
Y. Yao, H. Lei, T. Gedeon and L. Zheng, Large-scale Training Data Search for Object Re-identification, arXiv, cs.CS, pp. 1-13, 2023.
-
J. Lei, Association Rule Mining Algorithm in College Students` Quality Evaluation System, Journal of Electrical and Computer Engineering, Hindawi, Vol. 2022, pp. 1-9, 2022.
-
G. Li, J. Sun, M. N. A. Rana, Y. Song, C. Liu and Z. Y. Zhu, Optimizing High-Dimensional Functions with an Efficient Particle Swarm Optimization Algorithm, Mathematical Problems in Engineering, Hindawi, Vol. 2020, pp. 1-10, 2020.
-
L. Abualigah, A. H. Gandomi, M. A. Elaziz, H. A. Hamad, M. Omari, M. Alshinwan and A. M. Khasawneh, Advances in Meta-Heuristic Optimization Algorithms in Big Data Text Clustering, Electronics, MDPI, Vol. 10 (101), pp. 1-30, 2021.
-
R. Srivastava, Application of Optimization Algorithms in Clusters, Frontiers in Chemistry, pp. 1-17, 2021.
-
S. L. M. Mosharraf and M. A. Adnan, Improving lookup and query execution performance in distributed Big Data systems using Cuckoo Filter, Journal of Big Data, Vol. 9 (12), pp. 1-30, 2022.
-
F Distribution Table, 2018 Mar 18. Retrieved from http://www.socr.ucla.edu/applets.dir/f_table.html.
-
Normal Distribution Table. Retrieved from http://math.arizona.edu/~rsims/ma464/standardnormaltable.pdf.
