Social Networking Unstructured Big Data Clustering using K-Means on R


Seema Jangid1, Reshu Grover2, Dr. Pratap Singh Patwal3

1Research Scholar M.Tech, Computer Science & Engineering, LIET Alwar, Rajasthan, India

2,3 Computer Science & Engineering, LIET Alwar, Rajasthan, India

Abstract: The main goal of this paper is to describe the implementation of out-of-core techniques for the analysis of a very large social networking dataset using sequential and parallel versions of clustering algorithms. We evaluate the performance of the K-means and hierarchical clustering algorithms on social media datasets using the Spark framework with R programming on Cloudera. For the implementation we used a YouTube social media dataset collected from the UCI Machine Learning Repository. The algorithms are compared on the results they produce and on the time they take to solve the problem, which serves as the benchmark. Our main aims are therefore the mining and clustering of large volumes of unstructured data, and the testing and comparison of the results of the different clustering algorithms.

Keywords: Social networking, clustering, R programming, Cloudera, K-means, Big data


Big data is a collection of large datasets that cannot be processed using traditional computing techniques. It is not merely data; it has become a complete subject involving various tools, techniques and frameworks. These data are generated from online transactions, email, videos, images, logs, posts, search queries, social networking and their applications.

Big data is characterized as a vast amount of information that requires new technologies and architectures to make it possible to extract meaningful information from it through capture and analysis. The term implies a dataset that keeps growing until it becomes extremely hard to handle with existing database management concepts and tools. The difficulty can relate to data capture, storage, search, sharing, analysis, and so forth. Because of properties such as volume, velocity, variety, value and complexity, big data poses many challenges, since managing these properties is quite impractical. Big data mining refers to the collective data mining or extraction techniques performed on large datasets. It is primarily done to extract and retrieve desired information or patterns from large quantities of heterogeneous data. In big data processing it is also necessary to know the mix of structured and unstructured content that makes up the data.

  • Fully structured data: Fully structured data follows a predefined schema. An instance of such a schema is data that conforms to its specification. A typical example of fully structured data is a relational database system.

  • Unstructured data: The term unstructured refers to the fact that no identifiable structure is available within this type of data. Storing data in unstructured form, without any defined data schema, is a common way to file information. Examples of unstructured data include Twitter, Facebook and YouTube data.

  • Semi-structured data: Semi-structured data is often described as schema-less or self-describing, terms that indicate that there is no separate description of the type or structure of the data. Semi-structured data does not require a schema definition.

Given a large dataset, whether text or numeric, it is frequently helpful to group comparable items together automatically. In many implementations of clustering, the items to be grouped are represented as vectors in an n-dimensional space. The actual clusters can then be determined by grouping together the items that are close in distance.
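To make the idea of grouping vectors by distance concrete, the following is a minimal Python sketch (the paper itself works in R with SparkR; Python is used here only for a self-contained illustration). The points, the two centers and the function names are hypothetical and are not taken from the paper's dataset.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_center(point, centers):
    """Index of the center that is closest to `point`."""
    return min(range(len(centers)), key=lambda i: euclidean(point, centers[i]))

# Hypothetical 2-dimensional items and two hypothetical group centers.
items = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.0, 1.5), (8.5, 9.0)]

# Each item joins the group whose center is nearest.
groups = [nearest_center(p, centers) for p in items]
print(groups)  # [0, 0, 1, 1]
```

The same distance function works for any number of dimensions, which is why text items are usually converted to numeric vectors before clustering.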

  1. How can unstructured data be stored in a local repository or on a server?

  2. How can unstructured data be mapped and analyzed in big data?

  3. How can social networking or streamed data be analyzed in big data on Hadoop?

  4. How can information be extracted from big data?

  5. How is clustering used on big data?

  6. How can a better clustering be found for unstructured data?


      S. Vikram Phaneendra et al. [1] showed that in the past data volumes were smaller and easily handled by an RDBMS, but that it has recently become difficult to handle huge volumes of data with RDBMS tools, which is why such data is referred to as "big data". They noted that big data differs from other data in five dimensions: volume, velocity, variety, value and complexity. Dnvsl Indira and Kiran kumara Reddi [2] explained that big data is a combination of structured, semi-structured and unstructured, homogeneous and heterogeneous data. The authors proposed using the Nice model to handle the transfer of huge amounts of data over the network. Under this model, transfers are relegated to low-demand periods, when ample idle bandwidth is available.

      Albert Bifet and Wei Fan et al. [3] presented big data mining as the ability to extract useful information from large datasets or streams of data, which was previously not possible because of their volume, variability and velocity. Albert Bifet et al. [4] stated that streaming data analysis is becoming the fastest and most efficient way to obtain useful knowledge, allowing organizations to react quickly when problems appear and to improve their performance. Bernice Purcell et al. [5] stated that conventional systems cannot handle big data because it consists of very large datasets, which may be structured, semi-structured or unstructured. The storage techniques used for big data include clustered network-attached storage and object-based storage. Sameer Agarwal et al. [6] presented BlinkDB, an approximate query engine for running interactive SQL queries in parallel on large datasets. The engine is built on two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from the original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response time requirements. N. Lal et al. [11, 17, 18] proposed a framework for heterogeneous data generated from various sources such as digital audio, video and images, from domains including healthcare and retail, and from day-to-day utilities; for instance, 30 billion web pages are accessed on the World Wide Web (WWW). Yingyi Bu et al. [9] introduced HaLoop, a modified version of the Hadoop MapReduce framework, motivated by the fact that MapReduce lacks built-in support for iterative programs. HaLoop is built on top of Hadoop and extends it with a new programming model and several important optimizations, including (1) a loop-aware task scheduler, (2) loop-invariant data caching, and (3) caching for efficient fixpoint verification.

      Osama Abu Abbas [10] compared several techniques used for clustering. The algorithms under scrutiny were K-means, hierarchical clustering, expectation-maximization and the self-organizing map; they were compared with each other on the basis of dataset size, number of clusters formed, type of dataset and type of software used. He drew his conclusions on the basis of performance, quality and accuracy. Bao Rong Chang et al. [14] discussed the crucial problem in integrating multiple platforms, namely how to adapt to each platform's own computing features so as to execute assignments most efficiently and obtain the best outcome. Their paper introduced two approaches to big data platforms, RHadoop and SparkR, and integrated them into a high-performance, multi-platform big data analytics system, as part of business intelligence (BI), for rapid data retrieval and analytics with R programming. Simon Mulwa Kiio et al. [15] developed an Apache Spark based forensic tool to stream and analyze social media data for hate speech and cyberbullying, while investigating relevant artifacts found on the Twitter social network and ways to collect, preserve and ensure the authenticity of the evidence. The study employed the Naïve Bayes algorithm within the Spark ML API to automatically classify and categorize hate speech and cyberbullying found on Twitter. Agnivesh et al. [16] discussed one of the critical problems with K-means clustering: it converges only to a local optimum, which is easier than solving for the global optimum but can lead to a less optimal result. This is particularly true for big data, as the initial centers play a very important role in the performance of the algorithm.
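The sensitivity of K-means to its initial centers can be seen on a toy example. The sketch below is a plain-Python version of the standard Lloyd iteration, not the code from [16] or from this paper; the four points and both sets of starting centers are hypothetical. It runs the same algorithm from two initializations and compares the within-cluster sum of squared errors (SSE).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            for p in points]

def kmeans(points, init_centers, max_iter=100):
    """Lloyd's iteration from the given starting centers."""
    centers = [tuple(float(v) for v in c) for c in init_centers]
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for i, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == i]
            # Move the center to the mean of its members (keep it if empty).
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else old
            )
        if new_centers == centers:  # no center moved: converged
            break
        centers = new_centers
    labels = assign(points, centers)
    sse = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
    return centers, labels, sse

# Four hypothetical points forming a wide rectangle.
points = [(0, 0), (0, 1), (4, 0), (4, 1)]

_, good_labels, good_sse = kmeans(points, [(0, 0), (4, 0)])
_, bad_labels, bad_sse = kmeans(points, [(2, 0), (2, 1)])
print(good_labels, good_sse)  # [0, 0, 1, 1] 1.0
print(bad_labels, bad_sse)    # [0, 1, 0, 1] 16.0
```

Both runs converge, but the second stops at a local optimum whose SSE is sixteen times larger. A common remedy is to run the algorithm from several starting configurations and keep the result with the lowest SSE.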


      The tremendous use of new technologies has led to extensive growth of miscellaneous data, which is termed big data. Information is growing faster today than in the last decade because people are increasingly connected to social networks and communicate through connected devices. Some studies argue that handling and using this enormous volume of data wisely could become a new pillar of economics as well as of scientific research, experimentation and simulation.

      1. Clustering

        Clustering is a process that divides data into small clusters on the basis of the similarity between data items, as shown in Figure 1 and Figure 2. One objective of clustering is to extract patterns and information from raw datasets. Another objective is to build a compact representation of a dataset by creating a set of models that represent it.

        Figure 1 Data items

        Figure 2 Cluster formation

        There are two general kinds of clustering: supervised and unsupervised. Supervised clustering uses a set of example data to classify the rest of the data points. This can be called classification, and the task is to learn to assign instances to predefined classes. Unsupervised clustering, by contrast, groups the data without labelled examples. Data items in the same cluster are similar to each other and share certain properties. Clustering algorithms can have different properties:

        1. Hierarchical: These methods include techniques in which the input data is not divided into the desired number of classes in a single step. Instead, a series of successive merges of the data is performed until the final number of clusters is obtained.

        2. Non-hierarchical or iterative: These methods include techniques in which the desired number of clusters is assumed at the starting step. Data items are then reassigned to clusters to improve them.

        3. Hard and soft: Hard clustering assigns each instance to exactly one cluster. Soft clustering assigns each instance a probability of belonging to each cluster.

        4. Disjunctive: Instances can be a part of more than one group.
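The difference between hard and soft assignment in the list above can be sketched in a few lines of Python. The soft memberships here are computed with a softmax over negative squared distances, which is only one simple choice made for illustration (an assumption, not a method named in this paper); the centers and the point are hypothetical.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def hard_assign(point, centers):
    """Hard clustering: exactly one cluster index per point."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def soft_assign(point, centers):
    """Soft clustering: one membership probability per cluster.

    Assumption for illustration: weights are exp(-squared distance),
    normalised so that they sum to 1.
    """
    weights = [math.exp(-sq_dist(point, c)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

centers = [(0.0, 0.0), (2.0, 0.0)]   # hypothetical cluster centers
point = (0.5, 0.0)                   # closer to the first center

hard = hard_assign(point, centers)
soft = soft_assign(point, centers)
print(hard)
print(soft)
```

The hard result is a single index, while the soft result keeps the information that the point also belongs, with lower probability, to the second cluster.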

          Figure 3 below illustrates the properties of clustering.

          Figure 3 Various properties of clustering

          Clustering is a division of data into groups of similar items, as depicted in Figure 4; our proposed model for the clustering algorithms is shown in Figure 5. Good clusters possess two properties:

          1. High intra-cluster similarity, which means that the similarity between data items within a cluster must be high.

          2. Low inter-cluster similarity, which means that data items of two different clusters must be dissimilar.
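These two properties can be checked numerically. The following Python sketch uses the mean pairwise Euclidean distance as the measure (an illustrative choice, not a metric prescribed by the paper) and two hypothetical clusters; `math.dist` requires Python 3.8 or later.

```python
import math
from itertools import combinations

def mean_intra_distance(clusters):
    """Average distance between pairs of items in the same cluster."""
    ds = [math.dist(a, b) for c in clusters for a, b in combinations(c, 2)]
    return sum(ds) / len(ds)

def mean_inter_distance(clusters):
    """Average distance between pairs of items in different clusters."""
    ds = [math.dist(a, b)
          for c1, c2 in combinations(clusters, 2)
          for a in c1 for b in c2]
    return sum(ds) / len(ds)

# Two hypothetical, well-separated clusters.
clusters = [[(0, 0), (0, 1)], [(5, 0), (5, 1)]]

intra = mean_intra_distance(clusters)
inter = mean_inter_distance(clusters)
print(intra, inter)
```

A small intra-cluster distance (high similarity within clusters) together with a large inter-cluster distance (low similarity between clusters) indicates a good clustering.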

          Big data clustering is categorized into two types: (a) single-machine clustering and (b) multi-machine clustering.

          [Diagram: Big Data Clustering splits into Single Machine Clustering Algorithm and Multi Machine Clustering Algorithm; lower-level labels mention mining algorithms, dimension reduction, parallel clustering and reduce-based clustering.]

          Figure 4 Clustering Categorization

          Figure 5 Design Model for the Clustering Algorithm

      2. R language

        R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment. R provides a wide variety of statistical and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an open-source route to participation in that activity.

      3. Spark

        Spark began life in 2009 as a project within the AMPLab at the University of California, Berkeley. Spark is now a top-level project of the Apache Software Foundation, designed to be used with a range of programming languages and on a variety of architectures. It is an open-source big data processing framework built around speed, ease of use, and sophisticated analytics. Spark has several advantages compared to other big data and MapReduce technologies such as Hadoop and Storm. It supports a range of programming languages, including Java, Scala, Python and R. Although Spark is often closely associated with HDFS, Hadoop's underlying storage system, it can also work with other storage systems.

      4. SparkR

      The SparkR project was initially started in the AMPLab as an effort to explore different techniques for integrating the usability of R with the scalability of Spark. It is an R package that provides a lightweight frontend for using Apache Spark from R. SparkR provides a distributed DataFrame implementation that supports operations such as selection, filtering and aggregation. SparkR DataFrames present an API similar to local R data frames but can scale to large datasets using Spark's support for distributed computation. SparkR can read data from a variety of sources, including Hive tables and JSON files. Operations executed on SparkR DataFrames are automatically distributed across all the cores and machines available on the Spark cluster. As a result, SparkR DataFrames can be used on terabytes of data and run on clusters with thousands of machines.


      Text mining is the analysis of data contained in natural language text. The application of text mining techniques to text sources (email, Facebook, Twitter, YouTube) to solve business problems is called text analytics. Text mining is one of the most complex kinds of analysis in the analytics industry, because it deals with unstructured data such as email, Facebook, Twitter and YouTube data. Clustering is the task of grouping a set of objects so that objects in the same group are more similar to each other than to objects in other groups, as shown in Figure 6. Different data mining techniques are carried out taking into account the big data criteria [13].

      Figure 6 Overview of clustering

      Having introduced the two clustering algorithms and their implementation, we now turn to the practical study. It involves applying these algorithms to, and testing them on, a social media (YouTube) dataset.
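Of the two algorithms compared in this paper, hierarchical clustering is the one built from successive merges. A minimal agglomerative sketch in Python is given below; it uses single linkage (distance between the closest pair of members), which is an assumption made for illustration since the paper does not state a linkage rule, and the points are hypothetical. `math.dist` requires Python 3.8 or later.

```python
import math

def single_linkage(c1, c2):
    """Distance between two clusters = distance of their closest members."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerative(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > k:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]
    return clusters

# Hypothetical 2-dimensional points.
points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9)]
result = agglomerative(points, k=2)
print(result)  # [[(0, 0), (0, 1), (1, 0)], [(8, 8), (8, 9)]]
```

This naive version recomputes all pairwise cluster distances at every merge, which is one reason hierarchical clustering becomes expensive on large datasets compared with K-means.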

      Figure 7 Flow chart of proposed model

      Our proposed approach for comparing two clustering algorithms, partition-based and hierarchical, is shown in Figure 7. In our approach we first load the unstructured data from a local directory or from a specified web storage location. We then preprocess the loaded data using our algorithm, find the frequent words in the unstructured data, and remove them. Finally, we create clusters using two well-known algorithms, K-means clustering and hierarchical clustering, using SparkR on Hadoop with Cloudera under the Windows operating system. For the implementation we used a YouTube social media dataset collected from the UCI Machine Learning Repository.
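The preprocessing steps of this flow (load text, clean it, find frequent words, remove them, and build vectors for clustering) can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' SparkR implementation: the four comments are made-up stand-ins for the YouTube data, the function names are hypothetical, and "frequent words" is taken to mean the `top_n` most common words across all documents.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep only alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def frequent_words(docs_tokens, top_n):
    """The top_n most common words over all documents."""
    counts = Counter(w for toks in docs_tokens for w in toks)
    return {w for w, _ in counts.most_common(top_n)}

def vectorize(docs_tokens, drop):
    """Bag-of-words count vectors over the vocabulary minus `drop`."""
    vocab = sorted({w for toks in docs_tokens for w in toks} - drop)
    vectors = []
    for toks in docs_tokens:
        counts = Counter(toks)
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

# Hypothetical comments standing in for the unstructured input.
comments = [
    "Great video, love the song!",
    "The song is great",
    "Worst video ever, the audio is bad",
    "Bad audio in the video",
]

tokens = [tokenize(c) for c in comments]
drop = frequent_words(tokens, top_n=2)
vocab, vectors = vectorize(tokens, drop)
print(drop)
print(vocab)
print(vectors[0])
```

The resulting numeric vectors are the kind of input that K-means or hierarchical clustering can then group by distance.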


From the above study it is concluded that in the last few years big data has spread into every field, and finding useful information in this huge heap of data is becoming more difficult day by day. Clustering is therefore used to group similar information together, and for this we work with clustering algorithms such as K-means and hierarchical clustering. As a future enhancement, the accuracy of the hierarchical and K-means algorithms can be measured, including a detailed analysis of correctly and incorrectly clustered instances for heterogeneous data. A better algorithm design may also give the best results when structured, semi-structured, unstructured and heterogeneous data are merged together in big data. The most efficient algorithm still has to be developed for clustering all types of documents.


  1. S. Vikram Phaneendra & E. Madhusudhan Reddy, "Big Data: Solutions for RDBMS Problems - A Survey," in 12th IEEE/IFIP Network Operations & Management Symposium (NOMS 2010) (Osaka, Japan, Apr 19, 2013).

  2. Kiran kumara Reddi & Dnvsl Indira, "Different Technique to Transfer Big Data: Survey," IEEE Transactions on 52(8) (Aug. 2013).

  3. Vaithiyanathan, V., Rajeswari, K., Tajane, K., & Pitale, R., "Comparison of Different Classification Techniques," International Journal of Advances in Engineering & Technology, May 2013. ISSN: 2231-1963, 6(2), 764-768.

  4. Albert Bifet, "Mining Big Data in Real Time," Informatica, 37 (2013) 15-20, Dec. 2012.

  5. Bernice Purcell, "The Emergence of Big Data Technology and Analytics," Journal of Technology Research, 2013.

  6. Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, Ion Stoica, "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data," ACM, 978-1-4503-1994-2/13/04.

  7. K. A. Abdul Nazeer & M. P. Sebastian, "Improving the Accuracy and Efficiency of the K-Means Clustering Algorithm," Proceedings of the World Congress on Engineering 2009, Vol. I, WCE 2009, London, U.K., July 1-3.

  8. Niranjan Lal & Bhagyashree Pathak, "Mining of Unstructured Data with Clustering Approach," International Journal of Engineering Research & Management Technology, 2016.

  9. Yingyi Bu, Bill Howe, Magdalena Balazinska & Michael D. Ernst, "The HaLoop Approach to Large-Scale Iterative Data Analysis," VLDB, 2010.

  10. Osama Abu Abbas, "Comparison between Data Clustering Algorithms," The International Arab Journal of Information Technology, Volume 5, July 2008.

  11. Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," OSDI, 2004.

  12. D. Napoleon & P. Ganga lakshmi, "An Efficient K-Means Clustering Algorithm for Reducing Time Complexity using Uniform Distribution Data Points," IEEE, 2010.

  13. Monika kalra, Niranjan lal & samimul Qamar (2017), "K-mean Clustering Algorithm for Data Mining of Heterogeneous Data," International and Communication Technology for Sustainable Development, pp. 61-70.

  14. Bao Rong Chang, Yun-Da Lee, and Po-Hao Liao, "Development of Multiple Big Data Analytics Platforms with Rapid Response," Scientific Programming, Volume 2017, Article ID 6972461.

  15. Simon Mulwa Kiio, Elisha O. Abade, "Apache Spark based Big Data Analytics for Social Network Cybercrime Forensics," International Journal of Computer Applications (0975-8887), Volume 179, No. 8, December 2017.

  16. Agnivesh, Rajiv Pandey, Amarjeet Singh, "Enhancing K-means for Multidimensional Big Data Clustering using R on Cloud," International Journal of Innovative Technology and Exploring Engineering (IJITEE), ISSN: 2278-3075, Volume-8, Issue-7, May 2019.

  17. Kaur N., Lal N. (2018) Clustering of Social Networking Data Using SparkR in Big Data. In: Singh M., Gupta P., Tyagi V., Flusser J., Ören T. (eds) Advances in Computing and Data Sciences. ICACDS 2018. Communications in Computer and Information Science, vol 906. Springer, Singapore.

  18. N. Lal, M. Singh, S. Pandey and A. Solanki, "A Proposed Ranked Clustering Approach for Unstructured Data from Dataspace using VSM," 2020 20th International Conference on Computational Science and Its Applications (ICCSA), Cagliari, Italy, 2020, pp. 80-86, doi: 10.1109/ICCSA50381.2020.00024.
