Data Mining with Big Data

Greeshma G S, PG Scholar in CSE, Younus College of Engineering and Technology, Vadakkevila, Kollam, Kerala, India, PIN: 691010

Mr. Arun Raj S, Asst. Professor, IT, Younus College of Engineering and Technology, Vadakkevila, Kollam, Kerala, India, PIN: 691010

Abstract: Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and in the Big Data revolution.


Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences.

Every day, 2.5 quintillion bytes of data are created, and 90 percent of the data in the world today were produced within the past two years. Our capability for data generation has never been so powerful and enormous since the invention of information technology in the early 19th century. As another example, on 4 October 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours. Among all these tweets, the specific moments that generated the most discussion revealed the public's interests, such as the discussions about Medicare and vouchers. Such online discussions provide a new means to sense public interest and generate feedback in real time, and they are more appealing than generic media such as radio or TV broadcasting. Another example is Flickr, a public picture-sharing site, which received 1.8 million photos per day, on average, from February to March 2012. Assuming the size of each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB) of storage every single day. Indeed, as an old saying states, "a picture is worth a thousand words"; the billions of pictures on Flickr are a treasure tank for us to explore human society, social events, public affairs, disasters, and so on, but only if we have the power to harness the enormous amount of data.

With the rise of Big Data applications, data collection has grown tremendously and is now beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions. In many situations, the knowledge extraction process has to be very efficient and close to real time because storing all observed data is nearly infeasible. For example, the Square Kilometer Array (SKA) in radio astronomy consists of 1,000 to 1,500 15-meter dishes in a central 5-km area. It provides vision 100 times more sensitive than any existing radio telescope, answering fundamental questions about the Universe. However, with a data volume of 40 gigabytes (GB) per second, the data generated by the SKA are exceptionally large. Although researchers have confirmed that interesting patterns, such as transient radio anomalies, can be discovered from the SKA data, existing methods can only work in an offline fashion and are incapable of handling this Big Data scenario in real time. As a result, the unprecedented data volumes require an effective data analysis and prediction platform to achieve fast response and real-time classification for such Big Data.
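To put the SKA figure in perspective, the quoted rate of 40 GB per second can be converted into a daily volume. The sketch below is plain arithmetic; it assumes decimal units (1 PB = 1,000,000 GB) and a constant data rate, neither of which is stated in the text, and the function name is only illustrative.

```python
# Back-of-the-envelope conversion of the quoted SKA data rate.
# Assumptions (not from the text): decimal units and a constant rate.
SECONDS_PER_DAY = 24 * 60 * 60
GB_PER_PB = 1_000_000


def daily_volume_pb(rate_gb_per_s: float) -> float:
    """Return the data volume produced in one day, in petabytes."""
    return rate_gb_per_s * SECONDS_PER_DAY / GB_PER_PB


ska_pb_per_day = daily_volume_pb(40)
print(f"{ska_pb_per_day:.3f} PB per day")
```

Under these assumptions the telescope would produce several petabytes per day, which is why storing everything for later offline analysis is described as nearly infeasible.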


We can often describe Big Data using three Vs: Volume, Velocity, and Variety.

  1. Volume

    Volume refers to the vast amounts of data generated every second. We are no longer talking about terabytes but zettabytes or brontobytes. If we take all the data generated in the world between the beginning of time and 2008, the same amount of data will soon be generated every minute. New Big Data tools use distributed systems so that we can store and analyze data across databases that are dotted around anywhere in the world.

  2. Velocity

    Velocity refers to the speed at which new data is generated and the speed at which data moves around. Just think of social media messages going viral in seconds. Technology now allows us to analyze the data while it is being generated, without ever putting it into databases.

  3. Variety

    Variety refers to the different types of data we can now use. In the past we focused only on structured data that fitted neatly into tables or relational databases, such as financial data. In fact, 80% of the world's data is unstructured, such as text, images, video, and voice. With Big Data technology we can now analyze and bring together data of different types, such as messages, social media conversations, photos, sensor data, and video or voice recordings.

    These characteristics make it an extreme challenge to discover useful knowledge from Big Data. In a naive sense, we can imagine that a number of blind men are trying to size up a giant elephant (see Figure 2.1), which will be the Big Data in this context. The goal of each blind man is to draw a picture (or conclusion) of the elephant according to the part of the information he collects during the process. Because each person's view is limited to his local region, it is not surprising that the blind men will each conclude independently that the elephant feels like a rope, a hose, or a wall, depending on the region each of them is limited to. To make the problem even more complicated, let us assume that the elephant is growing rapidly and its pose changes constantly, and that each blind man may have his own information sources that give him biased knowledge about the elephant.

    Figure 2.1: The blind men and the giant elephant

    Exploring Big Data in this scenario is equivalent to aggregating heterogeneous information from different sources to help draw the best possible picture and reveal the genuine gesture of the elephant in a real-time fashion. Indeed, this task is not as simple as asking each blind man to describe his feelings about the elephant and then getting an expert to draw one single picture with a combined view, given that each individual may speak a different language (heterogeneous and diverse information sources) and may even have privacy concerns about the messages they share in the information exchange process.


      Data mining refers to extracting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Some people treat data mining as synonymous with knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. The steps involved in the knowledge discovery process are listed below.

      Figure 3.1: Knowledge discovery process

      Data Cleaning – In this step, noise and inconsistent data are removed.

      Data Integration – In this step, multiple data sources are combined.

      Data Selection – In this step, data relevant to the analysis task are retrieved from the database.

      Data Transformation – In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

      Data Mining – In this step, intelligent methods are applied in order to extract data patterns.

      Pattern Evaluation – In this step, data patterns are evaluated.

      Knowledge Presentation – In this step, knowledge is represented.
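The seven steps above can be read as a pipeline in which each stage consumes the output of the previous one. The following is a minimal sketch on toy sales records; the record layout, the function names, and the thresholds are all assumptions made for illustration, and the "mining" stage is a deliberately trivial threshold rule rather than a real mining algorithm.

```python
from collections import Counter


def clean(records):
    """Data cleaning: drop records with missing or negative amounts."""
    return [r for r in records if r.get("amount") is not None and r["amount"] >= 0]


def integrate(*sources):
    """Data integration: combine several sources into one list."""
    return [r for source in sources for r in source]


def select(records, fields):
    """Data selection: keep only the fields relevant to the analysis."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: aggregate the total amount per item."""
    totals = Counter()
    for r in records:
        totals[r["item"]] += r["amount"]
    return totals


def mine(totals, min_total):
    """Data mining (toy rule): keep items whose total reaches a threshold."""
    return {item: total for item, total in totals.items() if total >= min_total}


def evaluate(patterns, top_n):
    """Pattern evaluation: rank patterns and keep the strongest ones."""
    return sorted(patterns.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


def present(ranked):
    """Knowledge presentation: render the result for a human reader."""
    return [f"{item}: {total}" for item, total in ranked]


# Hypothetical input: two stores with noisy records.
store_a = [{"item": "milk", "amount": 3, "clerk": "x"},
           {"item": "bread", "amount": None, "clerk": "y"}]
store_b = [{"item": "milk", "amount": 2, "clerk": "z"},
           {"item": "eggs", "amount": 1, "clerk": "z"},
           {"item": "bread", "amount": -1, "clerk": "q"}]

combined = integrate(clean(store_a), clean(store_b))
selected = select(combined, ["item", "amount"])
report = present(evaluate(mine(transform(selected), min_total=2), top_n=3))
print(report)
```

Keeping each step as a separate function mirrors the list: any stage can be replaced (for example, a real pattern miner in place of `mine`) without touching the others.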


      For an intelligent learning database system to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and provide treatments for the characteristics of Big Data. Figure 4.1 shows a conceptual view of the Big Data processing framework, which includes three tiers from the inside out, with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).

      Figure 4.1: A Big Data processing framework

        1. Tier I: Big Data Mining Platform

          The challenges at Tier I focus on data accessing and arithmetic computing procedures. Because Big Data are often stored at different locations and data volumes may grow continuously, an effective computing platform has to take distributed large-scale data storage into consideration. For example, typical data mining algorithms require all data to be loaded into main memory. This, however, is becoming a clear technical barrier for Big Data, because moving data across different locations is expensive even if we do have a super large main memory to hold all data for computing.

          In typical data mining systems, the mining procedures require computationally intensive computing units for data analysis and comparisons. A computing platform therefore needs efficient access to at least two types of resources: data and computing processors. For small-scale data mining tasks, a single desktop computer, which contains a hard disk and CPU processors, is sufficient to fulfill the data mining goals. Indeed, many data mining algorithms are designed for this type of problem setting. For medium-scale data mining tasks, data are typically large (and possibly distributed) and cannot fit into main memory. Common solutions are to rely on parallel computing or collective mining to sample and aggregate data from different sources and then use parallel programming to carry out the mining process.

          For Big Data mining, because the data scale is far beyond the capacity that a single personal computer (PC) can handle, a typical Big Data processing framework will rely on cluster computers with a high-performance computing platform, with a data mining task being deployed by running some parallel programming tools, such as MapReduce or Enterprise Control Language (ECL), on a large number of computing nodes (i.e., clusters). The role of the software component is to make sure that a single data mining task, such as finding the best match of a query from a database with billions of records, is split into many small tasks, each of which runs on one or multiple computing nodes.
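The idea of splitting one best-match query into many small tasks can be sketched on a single machine. This is a toy stand-in, not MapReduce or ECL: it uses threads from Python's standard library in place of cluster nodes, integer records with absolute difference as the (assumed) matching criterion, and hypothetical function names.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_chunks):
    """Partition the records into roughly equal chunks (one per worker)."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def local_best(chunk, query):
    """Small task: best match within one chunk (smallest absolute distance)."""
    return min(chunk, key=lambda r: abs(r - query))


def best_match(records, query, n_workers=4):
    """Split the query into small tasks, then merge the local answers."""
    chunks = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        candidates = list(pool.map(lambda c: local_best(c, query), chunks))
    # Merge step: the global best is the best of the local bests.
    return min(candidates, key=lambda r: abs(r - query))


records = list(range(0, 1000, 7))  # toy stand-in for "billions of records"
result = best_match(records, query=500)
print(result)
```

The merge step works because the best match over the whole data set must also be the best match within its own chunk, so only one candidate per chunk has to be moved between nodes.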

        2. Tier II: Big Data Semantics and Application Knowledge

          The challenges at Tier II center on semantics and domain knowledge for different Big Data applications. Semantics and application knowledge in Big Data refer to numerous aspects related to regulations, policies, user knowledge, and domain information. Such information can provide additional benefits to the mining process, as well as add technical barriers to Big Data access (Tier I) and mining algorithms (Tier III). For example, depending on the application domain, the data privacy and information sharing mechanisms between data producers and data consumers can be significantly different. The two most important issues at this tier are data sharing and privacy, and domain and application knowledge.

          1. Information Sharing and Data Privacy

            Information sharing is an ultimate goal for all systems involving multiple parties. While the motivation for sharing is clear, a real-world concern is that Big Data applications involve sensitive information, such as banking transactions and medical records. Simple data exchanges or transmissions do not resolve privacy concerns. For example, knowing people's locations and their preferences, one can enable a variety of useful location-based services, but public disclosure of an individual's locations or movements over time can have serious consequences for privacy. To protect privacy, two common approaches are to restrict access to the data, such as by adding certification or access control to the data entries so that sensitive information is accessible to a limited group of users only, and to anonymize data fields such that sensitive information cannot be pinpointed to an individual record. For the first approach, the common challenge is to design secure certification or access control mechanisms such that no sensitive information can be misused by unauthorized individuals. For data anonymization, the main objective is to inject randomness into the data to ensure a number of privacy goals. For example, the most common privacy measure, k-anonymity, ensures that each individual in the database is indistinguishable from k - 1 others. Common anonymization approaches use suppression, generalization, perturbation, and permutation to generate an altered version of the data, which is, in fact, uncertain data.
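As an illustration of the k-anonymity measure and of generalization, the sketch below checks whether every combination of quasi-identifier values occurs at least k times, before and after coarsening the values. The patient records, the choice of age and ZIP code as quasi-identifiers, and the coarsening rules are all hypothetical.

```python
from collections import Counter


def generalize(record):
    """Generalization: coarsen quasi-identifiers (age to decade, ZIP to prefix)."""
    decade = record["age"] // 10 * 10
    return {
        "age": f"{decade}-{decade + 9}",
        "zip": record["zip"][:3] + "**",
        "diagnosis": record["diagnosis"],  # sensitive value, left untouched
    }


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())


# Hypothetical records; none of these values come from the text.
patients = [
    {"age": 34, "zip": "69101", "diagnosis": "flu"},
    {"age": 37, "zip": "69102", "diagnosis": "asthma"},
    {"age": 52, "zip": "69110", "diagnosis": "flu"},
    {"age": 58, "zip": "69115", "diagnosis": "diabetes"},
]
qi = ["age", "zip"]

raw_ok = is_k_anonymous(patients, qi, k=2)
released = [generalize(p) for p in patients]
released_ok = is_k_anonymous(released, qi, k=2)
print(raw_ok, released_ok)
```

The raw table fails the check because every record is unique on age and ZIP code, while the generalized table passes for k = 2. The price is the loss of precision that the text describes as uncertain data.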

          2. Domain and Application Knowledge

            Domain and application knowledge provides essential information for designing Big Data mining algorithms and systems. In a simple case, domain knowledge can help identify the right features for modeling the underlying data; e.g., blood glucose level is clearly a better feature than body mass in diagnosing Type II diabetes. Domain and application knowledge can also help design achievable business objectives by using Big Data analytical techniques.

        3. Tier III: Big Data Mining Algorithms

      At Tier III, the data mining challenges concentrate on algorithm designs for tackling the difficulties raised by Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The circle at Tier III contains three stages. First, sparse, heterogeneous, uncertain, incomplete, and multisource data are preprocessed by data fusion techniques. Second, complex and dynamic data are mined after preprocessing. Third, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the preprocessing stage. Then, the model and parameters are adjusted according to the feedback. In the whole process, information sharing is not only a promise of smooth development of each stage, but also a purpose of Big Data processing.


        1. Big Data Mining Platforms (Tier I)

          Due to the multisource, massive, heterogeneous, and dynamic characteristics of application data involved in a distributed environment, one of the most important characteristics of Big Data is to carry out computing on petabyte (PB)-level, and even exabyte (EB)-level, data with a complex computing process. Therefore, utilizing a parallel computing infrastructure, its corresponding programming language support, and software models to efficiently analyze and mine the distributed data are the critical goals for Big Data processing to change from "quantity" to "quality."

          Currently, Big Data processing mainly depends on parallel programming models like MapReduce, as well as on cloud computing platforms that provide Big Data services to the public. MapReduce is a batch-oriented parallel computing model, and there is still a certain gap in performance compared with relational databases. Improving the performance of MapReduce and enhancing the real-time nature of large-scale data processing have received a significant amount of attention, with MapReduce parallel programming being applied to many machine learning and data mining algorithms. Data mining algorithms usually need to scan through the training data to obtain the statistics needed to solve or optimize model parameters, which calls for intensive computing to access the large-scale data frequently.

          To improve the efficiency of algorithms, Chu proposed a general-purpose parallel programming method, applicable to a large number of machine learning algorithms, based on the simple MapReduce programming model on multicore processors. Ten classical data mining algorithms are realized in the framework, including locally weighted linear regression, k-Means, logistic regression, naive Bayes, linear support vector machines, independent component analysis, Gaussian discriminant analysis, expectation maximization, and back-propagation neural networks. With the analysis of these classical machine learning algorithms, we argue that the computational operations in the algorithm learning process could be transformed into a summation operation on a number of training data sets. Summation operations can be performed on different subsets independently, so parallelization is easily achieved on the MapReduce programming platform. Therefore, a large-scale data set could be divided into several subsets and assigned to multiple Mapper nodes. Then, various summation operations could be performed on the Mapper nodes to collect intermediate results. Finally, learning algorithms are executed in parallel by merging the summations on Reduce nodes.

          Ranger proposed Phoenix, a MapReduce-based application programming interface that supports parallel programming in the environment of multicore and multiprocessor systems, and realized three data mining algorithms: k-Means, principal component analysis, and linear regression. Gillick improved the MapReduce implementation mechanism in Hadoop; evaluated the performance of single-pass learning, iterative learning, and query-based learning algorithms in the MapReduce framework; studied data sharing between the computing nodes involved in parallel learning algorithms and distributed data storage; and then showed that the MapReduce mechanism is suitable for large-scale data mining by testing a series of standard data mining tasks on medium-size clusters. Papadimitriou and Sun proposed a distributed collaborative aggregation (DisCo) framework using practical distributed data preprocessing and collaborative aggregation techniques. The implementation on Hadoop, an open source MapReduce project, showed that DisCo has perfect scalability and can process and analyze massive data sets (with hundreds of GB). To improve the weak scalability of traditional analysis software and the poor analysis capabilities of Hadoop systems, Das conducted a study of the integration of R (open source statistical analysis software) and Hadoop. The in-depth integration pushes data computation to parallel processing, which enables powerful deep analysis capabilities for Hadoop. Wegener achieved the integration of Weka (an open-source machine learning and data mining software tool) and MapReduce. Standard Weka tools can only run on a single machine, with a limitation of 1 GB of memory. After algorithm parallelization, Weka breaks through this limitation and improves performance by taking advantage of parallel computing to handle more than 100 GB of data on MapReduce clusters. Ghoting proposed Hadoop-ML, on which developers can easily build task-parallel or data-parallel machine learning and data mining algorithms on program blocks under the language runtime environment.
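The summation form mentioned above can be illustrated with ordinary least-squares fitting of a line: the required statistics are sums over the training data, so each subset can be summarized independently and the partial sums merged. The sketch below is a single-process illustration under that assumption, not the cited framework; `mapper`, `reducer`, and `fit_line` are hypothetical names, and simple one-variable regression stands in for the algorithms listed in the text.

```python
from functools import reduce


def mapper(chunk):
    """Per-subset partial sums: n, sum(x), sum(y), sum(x*x), sum(x*y)."""
    n = len(chunk)
    sx = sum(x for x, _ in chunk)
    sy = sum(y for _, y in chunk)
    sxx = sum(x * x for x, _ in chunk)
    sxy = sum(x * y for x, y in chunk)
    return (n, sx, sy, sxx, sxy)


def reducer(a, b):
    """Merge two partial results by element-wise addition."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(stats):
    """Solve least squares for y = intercept + slope * x from the merged sums."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Toy data on the line y = 2x + 1, divided into three subsets.
data = [(x, 2 * x + 1) for x in range(10)]
subsets = [data[0:3], data[3:7], data[7:10]]

stats = reduce(reducer, map(mapper, subsets))
slope, intercept = fit_line(stats)
print(slope, intercept)
```

Because addition is associative and commutative, the merged statistics do not depend on how the data are partitioned, which is what makes the per-subset work safe to distribute.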

        2. Big Data Semantics and Application Knowledge (Tier II)

          In privacy protection of massive data, Ye proposed a multilayer rough set model, which can accurately describe the granularity change produced by different levels of generalization and provide a theoretical foundation for measuring the data effectiveness criteria in the anonymization process, and designed a dynamic mechanism for balancing privacy and data utility to solve the optimal generalization/refinement order for classification. A recent paper on confidentiality protection in Big Data summarizes a number of methods for protecting public release data, including aggregation, suppression, data swapping, adding random noise, or simply replacing the original data values that are at a high risk of disclosure with values synthetically generated from simulated distributions. For applications involving Big Data and tremendous data volumes, it is often the case that data are physically distributed at different locations, which means that users no longer physically possess the storage of their data. To carry out Big Data mining, having an efficient and effective data access mechanism is vital, especially for users who intend to hire a third party to process their data. Under such circumstances, users' privacy restrictions may include, among many others, that no local data copies or downloads are allowed and that all analysis must be deployed on the existing data storage systems without violating existing privacy settings. Wang proposed a privacy-preserving public auditing mechanism for large-scale data storage (such as cloud computing systems). The public key-based mechanism is used to enable third-party auditing (TPA), so users can safely allow a third party to analyze their data without breaching the security settings or compromising data privacy.

          For most Big Data applications, privacy concerns focus on excluding third parties (such as data miners) from directly accessing the original data. Common solutions are to rely on privacy-preserving approaches or encryption mechanisms to protect the data. A recent effort by Lorch indicates that users' data access patterns can also raise severe data privacy issues and lead to disclosures of geographically co-located users or users with common interests. In their system, named Shroud, users' data access patterns are hidden from the servers by using virtual disks. As a result, it can support a variety of Big Data applications, such as microblog search and social network queries, without compromising user privacy.

        3. Big Data Mining Algorithms (Tier III)

      To adapt to multisource, massive, dynamic Big Data, researchers have expanded existing data mining methods in many ways, including improving the efficiency of single-source knowledge discovery methods, designing data mining mechanisms from a multisource perspective, and studying dynamic data mining methods and the analysis of stream data. The main motivation for discovering knowledge from massive data is improving the efficiency of single-source mining methods.

      Wu proposed and established the theory of local pattern analysis, which has laid a foundation for global knowledge discovery in multisource data mining. This theory provides a solution not only for the problem of full search, but also for finding global models that traditional mining methods cannot find. Local pattern analysis of data processing can avoid putting different data sources together to carry out centralized computing.
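The general idea of combining locally discovered patterns without centralizing the raw data can be sketched as follows. This is a loose illustration only, not the cited theory: it assumes that a "pattern" is a single frequent item and that global patterns are chosen by a vote across sources, and every name and threshold is hypothetical.

```python
from collections import Counter


def local_patterns(transactions, min_support):
    """Items whose support within one source reaches min_support."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item for item, c in counts.items() if c / n >= min_support}


def global_patterns(per_source_patterns, min_vote):
    """Keep patterns reported by at least min_vote (a fraction) of the sources."""
    votes = Counter(p for patterns in per_source_patterns for p in patterns)
    n_sources = len(per_source_patterns)
    return {p for p, v in votes.items() if v / n_sources >= min_vote}


# Hypothetical transaction data held separately by three branches.
branches = [
    [["milk", "bread"], ["milk"], ["milk", "eggs"]],
    [["milk", "tea"], ["tea"], ["milk", "tea"]],
    [["bread"], ["milk", "bread"], ["milk"]],
]

# Each branch mines its own data; only the pattern sets are shared.
local = [local_patterns(b, min_support=0.6) for b in branches]
shared = global_patterns(local, min_vote=1.0)
print(local, shared)
```

Only the small per-source pattern sets cross source boundaries, so the raw transactions never need to be pooled for centralized computing.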

      Data streams are widely used in financial analysis, online trading, medical testing, and so on. Static knowledge discovery methods cannot adapt to the characteristics of dynamic data streams, such as continuity, variability, rapidity, and infinity, and can easily lead to the loss of useful information. Therefore, effective theoretical and technical frameworks are needed to support data stream mining. Knowledge evolution is a common phenomenon in real-world systems. For example, a clinician's treatment program will constantly adjust to the conditions of the patient, such as family economic status, health insurance, the course of treatment, treatment effects, and the distribution of cardiovascular and other chronic epidemiological changes with the passage of time. In the knowledge discovery process, concept drifting aims to analyze the phenomenon of implicit target concept changes, or even fundamental changes, triggered by dynamics and context in data streams. According to different types of concept drift, knowledge evolution can take the forms of mutation drift, progressive drift, and data distribution drift, based on single features, multiple features, and streaming features.
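A very simple way to notice an abrupt change in a stream is to compare a recent window of values with a reference window. The detector below is a minimal sketch under that assumption; it is not one of the drift types or methods named in the text, and the class name, window size, and threshold are hypothetical choices.

```python
from collections import deque


class WindowDriftDetector:
    """Flags drift when the recent-window mean departs from the reference mean."""

    def __init__(self, window, threshold):
        self.window = window
        self.threshold = threshold
        self.reference = []                 # first `window` values seen
        self.recent = deque(maxlen=window)  # most recent `window` values

    def update(self, value):
        """Consume one stream value; return True if drift is detected."""
        if len(self.reference) < self.window:
            self.reference.append(value)
            return False
        self.recent.append(value)
        if len(self.recent) < self.window:
            return False
        ref_mean = sum(self.reference) / self.window
        recent_mean = sum(self.recent) / self.window
        return abs(recent_mean - ref_mean) > self.threshold


# Synthetic stream with an abrupt change at index 20.
stream = [1.0] * 20 + [3.0] * 20
detector = WindowDriftDetector(window=5, threshold=1.0)
first_alarm = next(i for i, v in enumerate(stream) if detector.update(v))
print(first_alarm)
```

The detector keeps only two small windows in memory, which suits the unbounded nature of a stream. The alarm necessarily lags the change by a few values, and a gradual drift smaller than the threshold would go unnoticed.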


Driven by real-world applications and key industrial stakeholders and initialized by national funding agencies, managing and mining Big Data have shown to be a challenging yet very compelling task. While the term Big Data literally concerns data volumes, the key characteristics of Big Data are that they are huge, with heterogeneous and diverse data sources; autonomous, with distributed and decentralized control; and complex and evolving in data and knowledge associations. Such combined characteristics suggest that Big Data require a "big mind" to consolidate data for maximum value.

To explore Big Data, we have analyzed several challenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of Big Data. At the data level, the autonomous information sources and the variety of data collection environments often result in data with complicated conditions, such as missing or uncertain values. In other situations, privacy concerns, noise, and errors can be introduced into the data, producing altered data copies. Developing a safe and sound information sharing protocol is a major challenge. At the model level, the key challenge is to generate global models by combining locally discovered patterns to form a unifying view. This requires carefully designed algorithms to analyze model correlations between distributed sites and to fuse decisions from multiple sources to gain the best model out of the Big Data. At the system level, the essential challenge is that a Big Data mining framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes with time and other possible factors. A system needs to be carefully designed so that unstructured data can be linked through their complex relationships to form useful patterns, and so that the growth of data volumes and item relationships helps form legitimate patterns to predict the trend and future.

We regard Big Data as an emerging trend, and the need for Big Data mining is arising in all science and engineering domains. With Big Data technologies, we will hopefully be able to provide the most relevant and most accurate social sensing feedback to better understand our society in real time. We can further stimulate the participation of public audiences in the data production circle for societal and economic events. The era of Big Data has arrived.


  1. IBM, What Is Big Data: Bring Big Data to the Enterprise, http://www01.ibm/software/data/bigdata/, IBM, 2012.

  2. F. Michel, How Many Photos Are Uploaded to Flickr Every Day and Month? http://www.flicm/photos/franckmichel/6855169886/, 2012.

  3. A. Rajaraman and J. Ullman, Mining of Massive Data Sets. Cambridge Univ. Press, 2011.

  4. C. Reed, D. Thompson, W. Majid, and K. Wagstaff, Real Time Machine Learning to Find Fast Transient Radio Anomalies: A Semi-Supervised Approach Combining Detection and RFI Excision, Proc. Int'l Astronomical Union Symp. Time Domain Astronomy, Sept. 2011.

  5. X. Wu, Building Intelligent Learning Database Systems, AI Magazine, vol. 21, no. 3, pp. 61-67, 2000.

  6. D. Luo, C. Ding, and H. Huang, Parallelization with Multiplicative Algorithms for Big Data Mining, Proc. IEEE 12th Int'l Conf. Data Mining, pp. 489-498, 2012.

  7. J. Shafer, R. Agrawal, and M. Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. 22nd VLDB Conf., 1996.

  8. Ghoting and E. Pednault, Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics, Proc. Large-Scale Machine Learning: Parallelism and Massive Data Sets Workshop (NIPS '09), 2009.

  9. C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, Map-Reduce for Machine Learning on Multicore, Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS '06), pp. 281-288, 2006.

  10. C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, Evaluating MapReduce for Multi-Core and Multiprocessor Systems, Proc. IEEE 13th Int'l Symp. High Performance Computer Architecture (HPCA '07), pp. 13-24, 2007.

  11. D. Gillick, A. Faria, and J. DeNero, MapReduce: Distributed Computing for Machine Learning, Berkeley, Dec. 2006.

  12. S. Das, Y. Sismanis, K.S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson, Ricardo: Integrating R and Hadoop, Proc. ACM SIGMOD Int'l Conf. Management Data (SIGMOD '10), pp. 987-998, 2010.

  13. D. Wegener, M. Mock, D. Adranale, and S. Wrobel, Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters, Proc. Int'l Conf. Data Mining Workshops (ICDMW '09), pp. 296-301, 2009.

  14. X. Wu, C. Zhang, and S. Zhang, Database Classification for Multi-Database Mining, Information Systems, vol. 30, no. 1, pp. 71-88, 2005.

  15. P. Domingos and G. Hulten, Mining High-Speed Data Streams, Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 71-80, 2000.
