Data Mining with Big Data

Greeshma G S; Mr. Arun Raj S

doi:10.17577/IJERTCONV3IS13003

NCICN - 2015 (Volume 3 - Issue 13)


Open Access
Authors : Greeshma G S, Mr. Arun Raj S
Paper ID : IJERTCONV3IS13003
Volume & Issue : NCICN – 2015 (Volume 3 – Issue 13)
Published (First Online): 30-07-2018
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License


Greeshma G S, PG Scholar in CSE, Younus College of Engineering and Technology, Vadakkevila, Kollam, Kerala, India, PIN: 691010

Mr. Arun Raj S, Asst. Professor, IT, Younus College of Engineering and Technology, Vadakkevila, Kollam, Kerala, India, PIN: 691010

Abstract: Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences. This paper presents a HACE theorem that characterizes the features of the Big Data revolution and proposes a Big Data processing model from the data mining perspective. This data-driven model involves demand-driven aggregation of information sources, mining and analysis, user interest modeling, and security and privacy considerations. We analyze the challenging issues in the data-driven model and also in the Big Data revolution.

I. INTRODUCTION

Big Data concern large-volume, complex, growing data sets with multiple, autonomous sources. With the fast development of networking, data storage, and data collection capacity, Big Data are now rapidly expanding in all science and engineering domains, including the physical, biological, and biomedical sciences.

Every day, 2.5 quintillion bytes of data are created, and 90 percent of the data in the world today were produced within the past two years. Our capability for data generation has never been so powerful and enormous since the invention of information technology in the early 19th century. For example, on 4 October 2012, the first presidential debate between President Barack Obama and Governor Mitt Romney triggered more than 10 million tweets within 2 hours. Among all these tweets, the specific moments that generated the most discussion revealed the public's interests, such as the discussions about Medicare and vouchers. Such online discussions provide a new means to sense public interests and generate feedback in real time, and they are more appealing than generic media such as radio or TV broadcasting. Another example is Flickr, a public picture-sharing site, which received 1.8 million photos per day, on average, from February to March 2012. Assuming the size of each photo is 2 megabytes (MB), this requires 3.6 terabytes (TB) of storage every single day. Indeed, as the old saying goes, "a picture is worth a thousand words": the billions of pictures on Flickr are a treasure trove for exploring human society, social events, public affairs, disasters, and so on, but only if we have the power to harness the enormous amount of data.
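The storage estimate for Flickr can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text; the decimal unit convention (1 TB = 10^6 MB) and the function name are assumptions made for illustration.

```python
# Figures quoted in the text.
PHOTOS_PER_DAY = 1_800_000
MB_PER_PHOTO = 2

# Assumption: decimal units, i.e. 1 TB = 10^6 MB.
MB_PER_TB = 1_000_000


def daily_storage_tb(photos_per_day, mb_per_photo):
    """Return the storage needed per day, in terabytes."""
    return photos_per_day * mb_per_photo / MB_PER_TB


flickr_tb_per_day = daily_storage_tb(PHOTOS_PER_DAY, MB_PER_PHOTO)
print(flickr_tb_per_day)  # 3.6
```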

These examples reflect the rise of Big Data applications, in which data collection has grown tremendously and is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. The most fundamental challenge for Big Data applications is to explore the large volumes of data and extract useful information or knowledge for future actions. In many situations, the knowledge extraction process has to be very efficient and close to real time, because storing all observed data is nearly infeasible. For example, the Square Kilometer Array (SKA) in radio astronomy consists of 1,000 to 1,500 15-meter dishes in a central 5-km area. It provides 100 times more sensitive vision than any existing radio telescope, answering fundamental questions about the Universe. However, with a data volume of 40 gigabytes (GB) per second, the data generated by the SKA are exceptionally large. Although researchers have confirmed that interesting patterns, such as transient radio anomalies, can be discovered from the SKA data, existing methods can only work in an offline fashion and are incapable of handling this Big Data scenario in real time. As a result, the unprecedented data volumes require an effective data analysis and prediction platform to achieve fast response and real-time classification for such Big Data.
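To make the contrast between offline and real-time processing concrete, the following minimal sketch flags unusual samples in a single pass over a stream while keeping only constant-size state. It is not the method used for SKA data; the running-statistics approach (Welford's algorithm), the threshold, the warm-up length, and the toy signal are all assumptions chosen for illustration.

```python
import math


class RunningStats:
    """Welford's one-pass mean and variance, using O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


def flag_transients(stream, threshold=4.0, warmup=10):
    """Yield (index, value) for samples far from the running mean."""
    stats = RunningStats()
    for i, x in enumerate(stream):
        std = stats.std()
        if stats.n >= warmup and std > 0 and abs(x - stats.mean) > threshold * std:
            yield i, x  # flagged samples are not folded into the baseline
        else:
            stats.update(x)


# Toy signal: a steady baseline with one spike at index 20.
signal = [1.0, 1.1, 0.9, 1.05, 0.95] * 4 + [9.0] + [1.0, 1.1]
anomalies = list(flag_transients(signal))
print(anomalies)  # [(20, 9.0)]
```

Because each sample is examined once and then discarded, the raw stream never needs to be stored, which is the property the text identifies as essential at this data rate.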

II.VECTORS OF BIG DATA

Big Data can often be described using three Vs: Volume, Velocity, and Variety.

Volume

Volume refers to the vast amounts of data generated every second. We are not talking about terabytes but zettabytes or brontobytes. If we take all the data generated in the world up to 2008, the same amount of data will soon be generated every minute. New Big Data tools use distributed systems so that we can store and analyze data across databases that are dotted around the world.
Velocity

Velocity refers to the speed at which new data are generated and the speed at which data move around. Just think of social media messages going viral in seconds. Technology now allows us to analyze the data while they are being generated, without ever putting them into databases.
Variety

Variety refers to the different types of data we can now use. In the past we focused only on structured data that fitted neatly into tables or relational databases, such as financial data. In fact, 80% of the world's data is unstructured, such as text, images, video, and voice. With Big Data technology we can now analyze and bring together data of different types, such as messages, social media conversations, photos, sensor data, and video or voice recordings.

These characteristics make it an extreme challenge to discover useful knowledge from Big Data. In a naive sense, we can imagine that a number of blind men are trying to size up a giant elephant (see Figure 2.1), which represents the Big Data in this context. The goal of each blind man is to draw a picture (or conclusion) of the elephant according to the part of the information he collects during the process. Because each person's view is limited to his local region, it is not surprising that the blind men will each conclude independently that the elephant feels like a rope, a hose, or a wall, depending on the region each of them is limited to. To make the problem even more complicated, let us assume that the elephant is growing rapidly and its pose changes constantly, and that each blind man may have his own information sources that give him biased knowledge about the elephant.

Figure 2.1: The blind men and the giant elephant

Exploring Big Data in this scenario is equivalent to aggregating heterogeneous information from different sources to help draw the best possible picture and reveal the genuine gesture of the elephant in real time. Indeed, this task is not as simple as asking each blind man to describe his feelings about the elephant and then getting an expert to draw one single picture with a combined view, given that each individual may speak a different language (heterogeneous and diverse information sources) and they may even have privacy concerns about the messages they exchange in the information exchange process.

III. DATA MINING
  
Data mining refers to extracting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories. Some people treat data mining as synonymous with knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. The steps involved in the knowledge discovery process are listed below.
  
  Figure 3.1: Knowledge discovery process
  
Data Cleaning – In this step, noise and inconsistent data are removed.

Data Integration – In this step, multiple data sources are combined.

Data Selection – In this step, data relevant to the analysis task are retrieved from the database.

Data Transformation – In this step, data are transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

Data Mining – In this step, intelligent methods are applied in order to extract data patterns.

Pattern Evaluation – In this step, data patterns are evaluated.

Knowledge Presentation – In this step, the discovered knowledge is presented to the user.
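The steps above can be sketched as a chain of small functions. This is only an illustration under stated assumptions: the two sources, their fields, the discretization cut-offs, and the support threshold are hypothetical, and the mining step is reduced to simple pattern counting.

```python
from collections import Counter

# Two hypothetical sources with noisy records.
source_a = [{"id": 1, "age": 34, "spend": 120.0},
            {"id": 2, "age": None, "spend": 80.0},   # missing value
            {"id": 3, "age": 51, "spend": -5.0}]     # inconsistent value
source_b = [{"id": 4, "age": 29, "spend": 200.0},
            {"id": 5, "age": 62, "spend": 40.0},
            {"id": 6, "age": 45, "spend": 150.0}]


def clean(records):
    """Data cleaning: drop records with missing or inconsistent fields."""
    return [r for r in records if r["age"] is not None and r["spend"] >= 0]


def integrate(*sources):
    """Data integration: combine multiple sources into one collection."""
    return [r for src in sources for r in src]


def select(records, fields):
    """Data selection: keep only the fields relevant to the task."""
    return [{f: r[f] for f in fields} for r in records]


def transform(records):
    """Data transformation: discretize numeric fields into categories."""
    return [("young" if r["age"] < 40 else "older",
             "high" if r["spend"] >= 100 else "low") for r in records]


def mine(pairs):
    """Data mining: count co-occurring (age group, spend level) patterns."""
    return Counter(pairs)


def evaluate(patterns, min_support):
    """Pattern evaluation: keep patterns seen at least min_support times."""
    return {p: c for p, c in patterns.items() if c >= min_support}


def present(patterns):
    """Knowledge presentation: render patterns as readable rules."""
    return [f"{age} customers -> {spend} spend (support={c})"
            for (age, spend), c in sorted(patterns.items())]


data = integrate(clean(source_a), clean(source_b))
patterns = evaluate(mine(transform(select(data, ["age", "spend"]))), min_support=2)
report = present(patterns)
print(report)  # ['young customers -> high spend (support=2)']
```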
IV. DATA MINING CHALLENGES WITH BIG DATA
  
For an intelligent learning database system to handle Big Data, the essential key is to scale up to the exceptionally large volume of data and provide treatments for the characteristics of Big Data. Figure 4.1 shows a conceptual view of the Big Data processing framework, which includes three tiers from the inside out, with considerations on data accessing and computing (Tier I), data privacy and domain knowledge (Tier II), and Big Data mining algorithms (Tier III).
  
  Figure 4.1: A Big Data processing framework
At Tier III, the data mining challenges concentrate on algorithm designs that tackle the difficulties raised by the Big Data volumes, distributed data distributions, and complex and dynamic data characteristics. The circle at Tier III contains three stages. First, sparse, heterogeneous, uncertain, incomplete, and multisource data are preprocessed by data fusion techniques. Second, complex and dynamic data are mined after preprocessing. Third, the global knowledge obtained by local learning and model fusion is tested, and relevant information is fed back to the preprocessing stage. Then, the model and parameters are adjusted according to the feedback. In the whole process, information sharing is not only a guarantee of the smooth development of each stage, but also a purpose of Big Data processing.
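A minimal sketch of the three Tier III stages follows. It is not the framework's actual algorithm: dropping incomplete records stands in for data fusion, each local model is a one-dimensional threshold, and fusion is a size-weighted average. All names and data are hypothetical.

```python
from statistics import mean


def preprocess(records):
    """Stage 1: drop incomplete records (a stand-in for data fusion)."""
    return [(x, y) for x, y in records if x is not None]


def learn_local(records):
    """Stage 2: local model = midpoint between the two class means."""
    pos = [x for x, y in records if y == 1]
    neg = [x for x, y in records if y == 0]
    return (mean(pos) + mean(neg)) / 2


def fuse(thresholds, sizes):
    """Stage 3a: fuse local models by a size-weighted average."""
    return sum(t * n for t, n in zip(thresholds, sizes)) / sum(sizes)


def accuracy(threshold, records):
    """Stage 3b: test the global model; the score is the feedback signal."""
    return sum((x > threshold) == bool(y) for x, y in records) / len(records)


# Each site holds (feature, label) pairs; None marks a missing feature.
sites = [
    [(1.0, 0), (2.0, 0), (None, 1), (6.0, 1), (7.0, 1)],
    [(0.0, 0), (3.0, 0), (8.0, 1), (9.0, 1)],
]
cleaned = [preprocess(s) for s in sites]
local = [learn_local(s) for s in cleaned]
global_threshold = fuse(local, [len(s) for s in cleaned])

holdout = [(1.5, 0), (2.5, 0), (6.5, 1), (8.5, 1)]
score = accuracy(global_threshold, holdout)
print(global_threshold, score)  # 4.5 1.0
```

In a fuller system, a low score would be fed back to adjust the preprocessing and model parameters before the next round, as the text describes.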
V. RELATED WORKS

To adapt to multisource, massive, dynamic Big Data, researchers have expanded existing data mining methods in many ways, including improving the efficiency of single-source knowledge discovery methods, designing data mining mechanisms from a multisource perspective, studying dynamic data mining methods, and analyzing stream data. The main motivation for discovering knowledge from massive data is improving the efficiency of single-source mining methods.
  
Wu proposed and established the theory of local pattern analysis, which has laid a foundation for global knowledge discovery in multisource data mining. This theory provides a solution not only for the problem of full search, but also for finding global models that traditional mining methods cannot find. Local pattern analysis of data processing can avoid putting different data sources together to carry out centralized computing.
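The idea of combining locally discovered patterns without centralizing the raw data can be illustrated as follows. This is a simplified sketch, not the theory of local pattern analysis itself: each source mines frequent item pairs on its own data and shares only those patterns, and a simple vote across sources (an assumption made here) selects the global patterns. The branch data and thresholds are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_patterns(transactions, min_support):
    """Mine item pairs that are frequent within one source."""
    counts = Counter()
    for t in transactions:
        counts.update(combinations(sorted(set(t)), 2))
    n = len(transactions)
    return {pair for pair, c in counts.items() if c / n >= min_support}


def global_patterns(per_source, min_vote):
    """Keep patterns reported by at least a min_vote fraction of sources."""
    votes = Counter(p for patterns in per_source for p in patterns)
    return {p for p, v in votes.items() if v / len(per_source) >= min_vote}


# Three hypothetical branches; raw transactions never leave a branch.
branches = [
    [["milk", "bread"], ["milk", "bread", "eggs"], ["bread", "eggs"]],
    [["milk", "bread"], ["milk", "bread"], ["tea"]],
    [["tea", "eggs"], ["tea", "eggs"], ["milk", "bread"]],
]
shared = [local_patterns(b, min_support=0.5) for b in branches]
result = global_patterns(shared, min_vote=0.6)
print(result)  # {('bread', 'milk')}
```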
  
Data streams are widely used in financial analysis, online trading, medical testing, and so on. Static knowledge discovery methods cannot adapt to the characteristics of dynamic data streams, such as continuity, variability, rapidity, and infinity, and can easily lead to the loss of useful information. Therefore, effective theoretical and technical frameworks are needed to support data stream mining. Knowledge evolution is a common phenomenon in real-world systems. For example, a clinician's treatment program will be constantly adjusted to the conditions of the patient, such as family economic status, health insurance, the course of treatment, treatment effects, and the distribution of cardiovascular and other chronic epidemiological changes over time. In the knowledge discovery process, concept drifting aims to analyze the phenomenon of implicit target concept changes, or even fundamental changes, triggered by dynamics and context in data streams. According to the different types of concept drift, knowledge evolution can take the form of mutation drift, progressive drift, and data distribution drift, based on single features, multiple features, and streaming features.
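As a rough illustration of how a change in the target concept might be noticed in a stream, the sketch below compares a model's error rate over the most recent window of predictions with the error rate over the window before it. The two-window scheme, the window size, and the threshold are assumptions made for this example and do not correspond to a specific published detector.

```python
from collections import deque


def detect_drift(errors, window=20, threshold=0.25):
    """Return indices where the recent error rate exceeds the
    reference error rate by more than `threshold`.

    `errors` is a stream of 0/1 flags (1 = the model was wrong).
    """
    reference = deque(maxlen=window)  # the window before the recent one
    recent = deque(maxlen=window)     # the newest predictions
    alarms = []
    for i, e in enumerate(errors):
        if len(recent) == window:
            reference.append(recent[0])  # oldest recent sample ages out
        recent.append(e)
        if len(reference) == window:
            gap = sum(recent) / window - sum(reference) / window
            if gap > threshold:
                alarms.append(i)
                reference.clear()  # restart both windows after an alarm
                recent.clear()
    return alarms


# The model is always right for 60 steps, then the concept changes abruptly.
stream = [0] * 60 + [1] * 40
alarms = detect_drift(stream)
print(alarms)  # [65]
```

An abrupt change like this one corresponds to what the text calls mutation drift; a progressive drift would raise the error rate more slowly and would need a longer window or a lower threshold to detect.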
VI. CONCLUSION

Driven by real-world applications and key industrial stakeholders, and initiated by national funding agencies, managing and mining Big Data have been shown to be a challenging yet very compelling task. While the term Big Data literally concerns data volumes, the key characteristics of Big Data are that they are huge, with heterogeneous and diverse data sources; autonomous, with distributed and decentralized control; and complex and evolving in data and knowledge associations. Such combined characteristics suggest that Big Data require a "big mind" to consolidate data for maximum value.

To explore Big Data, we have analyzed several challenges at the data, model, and system levels. To support Big Data mining, high-performance computing platforms are required, which impose systematic designs to unleash the full power of Big Data. At the data level, the autonomous information sources and the variety of data collection environments often result in data with complicated conditions, such as missing or uncertain values. In other situations, privacy concerns, noise, and errors can be introduced into the data to produce altered data copies. Developing a safe and sound information sharing protocol is a major challenge. At the model level, the key challenge is to generate global models by combining locally discovered patterns to form a unifying view. This requires carefully designed algorithms to analyze model correlations between distributed sites and to fuse decisions from multiple sources to obtain the best model out of the Big Data. At the system level, the essential challenge is that a Big Data mining framework needs to consider complex relationships between samples, models, and data sources, along with their evolving changes with time and other possible factors. A system needs to be carefully designed so that unstructured data can be linked through their complex relationships to form useful patterns, and so that the growth of data volumes and item relationships helps form legitimate patterns to predict trends and the future.

We regard Big Data as an emerging trend, and the need for Big Data mining is arising in all science and engineering domains. With Big Data technologies, we will hopefully be able to provide the most relevant and most accurate social sensing feedback to better understand our society in real time. We can further stimulate the participation of public audiences in the data production circle for societal and economic events. The era of Big Data has arrived.

REFERENCES

"IBM What Is Big Data: Bring Big Data to the Enterprise," http://www01.ibm.com/software/data/bigdata/, IBM, 2012.
F. Michel, "How Many Photos Are Uploaded to Flickr Every Day and Month?" http://www.flickr.com/photos/franckmichel/6855169886/, 2012.
A. Rajaraman and J. Ullman, Mining of Massive Data Sets. Cambridge Univ. Press, 2011.
C. Reed, D. Thompson, W. Majid, and K. Wagstaff, "Real Time Machine Learning to Find Fast Transient Radio Anomalies: A Semi-Supervised Approach Combining Detection and RFI Excision," Proc. Int'l Astronomical Union Symp. Time Domain Astronomy, Sept. 2011.
X. Wu, "Building Intelligent Learning Database Systems," AI Magazine, vol. 21, no. 3, pp. 61-67, 2000.
D. Luo, C. Ding, and H. Huang, "Parallelization with Multiplicative Algorithms for Big Data Mining," Proc. IEEE 12th Int'l Conf. Data Mining, pp. 489-498, 2012.
J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: A Scalable Parallel Classifier for Data Mining," Proc. 22nd VLDB Conf., 1996.
Ghoting and E. Pednault, "Hadoop-ML: An Infrastructure for the Rapid Implementation of Parallel Reusable Analytics," Proc. Large-Scale Machine Learning: Parallelism and Massive Data Sets Workshop (NIPS '09), 2009.
C.T. Chu, S.K. Kim, Y.A. Lin, Y. Yu, G.R. Bradski, A.Y. Ng, and K. Olukotun, "Map-Reduce for Machine Learning on Multicore," Proc. 20th Ann. Conf. Neural Information Processing Systems (NIPS '06), pp. 281-288, 2006.
C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for Multi-Core and Multiprocessor Systems," Proc. IEEE 13th Int'l Symp. High Performance Computer Architecture (HPCA '07), pp. 13-24, 2007.
D. Gillick, A. Faria, and J. DeNero, "MapReduce: Distributed Computing for Machine Learning," Berkeley, Dec. 2006.
S. Das, Y. Sismanis, K.S. Beyer, R. Gemulla, P.J. Haas, and J. McPherson, "Ricardo: Integrating R and Hadoop," Proc. ACM SIGMOD Int'l Conf. Management Data (SIGMOD '10), pp. 987-998, 2010.
D. Wegener, M. Mock, D. Adranale, and S. Wrobel, "Toolkit-Based High-Performance Data Mining of Large Data on MapReduce Clusters," Proc. Int'l Conf. Data Mining Workshops (ICDMW '09), pp. 296-301, 2009.
X. Wu, C. Zhang, and S. Zhang, "Database Classification for Multi-Database Mining," Information Systems, vol. 30, no. 1, pp. 71-88, 2005.
P. Domingos and G. Hulten, "Mining High-Speed Data Streams," Proc. Sixth ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '00), pp. 71-80, 2000.
