Machine Learning in Big Data Analytics


Saikat Das1, Sampita Mallick2, Shyamapriya Chatterjee3, Sujata Kundu4

1,2,3,4 B.Tech (Information Technology), Narula Institute of Technology, Kolkata, India

      Abstract—The term Big Data refers to data that is so large, fast or complex that it is difficult or impossible to process using traditional methods. The concept of big data gained momentum in the early 2000s due to the enormous growth in the use of data. Earlier, big data was characterized by 3 Vs, but now the concept of 5 Vs is used: Volume, Velocity, Variety, Veracity and Value. The potential of these massive data sets is undoubtedly significant. Machine learning is the artificial intelligence method of discovering knowledge for making intelligent decisions. This paper introduces methods in machine learning, main technologies in big data (with case studies) and some applications of machine learning in big data.

      Keywords—Machine Learning, Big Data, 5 Vs

      1. INTRODUCTION:

        The term Machine Learning was coined in 1959 by Arthur Samuel, an American IBMer and pioneer in the field of computer gaming and artificial intelligence. A representative book of the machine learning research during the 1960s was Nilsson's book on Learning Machines, dealing mostly with machine learning for pattern classification. Interest related to pattern recognition continued into the 1970s, as described by Duda and Hart in 1973. In 1981 a report was given on using teaching strategies so that a neural network learns to recognize 40 characters (26 letters, 10 digits, and 4 special symbols) from a computer terminal. Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or infeasible to develop conventional algorithms to perform the needed tasks. The term Big Data has been in use since the 1990s, with some giving credit to John Mashey for popularizing the term. Big data usually includes data sets with sizes beyond the ability of commonly used software tools to capture, curate, manage, and process data within a tolerable elapsed time.

        The big data philosophy encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data.

        1. Brief Analysis of Big Data

          The term Big Data refers to a huge volume of data that cannot be stored or processed by any traditional data storage or processing unit. Data is generated at a very large scale, and many multinational companies process and analyze it in order to uncover insights and improve their business.

          1. Big Data is generally categorized into three different varieties. They are as shown below:

            1. Structured Data: Structured data is data that adheres to a pre-defined data model and is therefore straightforward to analyze. Structured data conforms to a tabular format with relationships between the different rows and columns. Common examples of structured data are Excel files or SQL databases; each of these has structured rows and columns that can be sorted. Structured data depends on the existence of a data model: a model of how data can be stored, processed and accessed. Because of a data model, each field is discrete and can be accessed separately or jointly along with data from other fields. This makes structured data extremely powerful: it is possible to quickly aggregate data from various locations in the database.

            2. Semi-Structured Data: Semi-structured data is a form of structured data that does not conform to the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. Therefore, it is also known as a self-describing structure. Examples of semi-structured data include JSON and XML. The reason this third category exists (between structured and unstructured data) is that semi-structured data is considerably easier to analyze than unstructured data; many Big Data solutions and tools have the ability to read and process either JSON or XML. Other examples include Comma-Separated Values (CSV) files.

            3. Unstructured Data: Unstructured data is information that either does not have a predefined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well. This results in irregularities and ambiguities that make it difficult to understand using traditional programs, as compared to data stored in structured databases. Common examples of unstructured data include audio files, video files, images and No-SQL databases.
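To make the three categories concrete, here is a small illustrative Python sketch (the file contents and field names are invented for the example). The CSV rows all share one fixed schema, while the JSON records carry their own tags and may differ in shape, so fields must be probed rather than assumed:

```python
import csv
import io
import json

# Structured: every record fits the same fixed schema (rows x columns).
csv_text = "id,name,city\n1,Asha,Kolkata\n2,Ravi,Delhi\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0]["city"])  # every row is guaranteed to have a "city" field

# Semi-structured: JSON carries its own tags; records may differ in shape.
json_text = """
[
  {"id": 1, "name": "Asha", "contact": {"email": "asha@example.com"}},
  {"id": 2, "name": "Ravi"}
]
"""
records = json.loads(json_text)
for rec in records:
    # Fields must be probed, not assumed: record 2 has no "contact" tag.
    email = rec.get("contact", {}).get("email", "unknown")
    print(rec["name"], email)
```

A NoSQL document store applies the same idea at scale: each document describes its own structure.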

        2. 5Vs of Big Data:

        1. Volume:

          • The name Big Data itself is related to a size which is enormous.

          • Volume is a huge amount of data.

          • To determine the value of data, the size of the data plays a very crucial role: if the volume of data is very large, it is actually considered Big Data. Whether a particular data set can be considered Big Data or not therefore depends on its volume.

          • Hence while dealing with Big Data it is necessary to consider a characteristic Volume.

          • Example: In 2016, the estimated global mobile traffic was 6.2 exabytes (6.2 billion GB) per month, and by 2020 the world held almost 40,000 exabytes of data.

        2. Velocity:

          Velocity refers to the high speed of accumulation of data. In Big Data, data flows in at high velocity from sources like machines, networks, social media, mobile phones, etc.

          • There is a massive and continuous flow of data. This determines the potential of the data, that is, how fast the data is generated and processed to meet demands.

          • Sampling the data can help in dealing with issues like velocity.

          • Example: More than 3.5 billion searches are made on Google per day. Also, the number of Facebook users increases by approximately 22% year over year.

        3. Variety:

          • It refers to the nature of data: structured, semi-structured and unstructured.

          • It also refers to heterogeneous sources.

          • Variety is basically the arrival of data from new sources that are both inside and outside of an enterprise. It can be structured, semi-structured and unstructured.

          • Structured data: This data is basically an organized data. It generally refers to data that has defined the length and format of data.

          • Semi-structured data: This data is basically semi-organized data. It is generally a form of data that does not conform to the formal structure of data models. Log files are an example of this type of data.

          • Unstructured data: This data basically refers to unorganized data. It generally refers to data that doesn't fit neatly into the traditional row-and-column structure of a relational database. Texts, pictures, videos, etc. are examples of unstructured data, which can't be stored in the form of rows and columns.

        4. Veracity:

          • It refers to inconsistencies and uncertainty in data: the available data can sometimes get messy, and its quality and accuracy are difficult to control.

          • Big Data is also variable because of the multitude of data dimensions resulting from multiple disparate data types and sources.

          • Example: Data in bulk can create confusion, whereas a smaller amount of data can convey half or incomplete information.

        5. Value:

          • After taking the other four Vs into account, there comes one more V, which stands for Value. Bulk data having no value is of no good to the company unless it is turned into something useful.

          • Data in itself is of no use or importance; it needs to be converted into something valuable in order to extract information. Hence, you can state that Value is the most important of the 5 Vs.

      2. BRIEF ANALYSIS OF MACHINE LEARNING (ML):

          1. History:

            • 1642 – Blaise Pascal invents a mechanical machine that can add and subtract.

            • 1679 – Gottfried Wilhelm Leibniz devises the system of binary code.

            • 1834 – Charles Babbage conceives the idea for a general all-purpose device that could be programmed with punched cards.

            • 1842 – Ada Lovelace describes a sequence of operations for solving mathematical problems using Charles Babbage's theoretical punch-card machine and becomes the first programmer.

            • 1847 – George Boole creates Boolean logic, a form of algebra in which all values can be reduced to the binary values of true or false.

            • 1936 – English logician and cryptanalyst Alan Turing proposes a universal machine that could decipher and execute a set of instructions. His published proof is considered the basis of computer science.

            • 1952 – Arthur Samuel creates a program to help an IBM computer get better at checkers the more it plays.

            • 1959 – MADALINE becomes the first artificial neural network applied to a real-world problem: removing echoes from phone lines.

            • 1985 – Terry Sejnowski and Charles Rosenberg's artificial neural network taught itself how to correctly pronounce 20,000 words in one week.

            • 1997 – IBM's Deep Blue beat chess grandmaster Garry Kasparov.

            • 1999 – A CAD prototype intelligent workstation reviewed 22,000 mammograms and detected cancer 52% more accurately than radiologists did.

            • 2006 – Computer scientist Geoffrey Hinton invents the term deep learning to describe neural net research.

            • 2012 – An unsupervised neural network created by Google learned to recognize cats in YouTube videos with 74.8% accuracy.

            • 2014 – A chatbot passes the Turing Test by convincing 33% of human judges that it was a Ukrainian teen named Eugene Goostman.

            • 2016 – Google DeepMind's AlphaGo defeats a human champion in Go, one of the most difficult board games in the world.

            • 2016 – LipNet, DeepMind's artificial-intelligence system, identifies lip-read words in video with an accuracy of 93.4%.

            • 2019 – Amazon controls 70% of the market share for virtual assistants in the U.S.

          2. Types of Machine Learning:

            Classical machine learning is often categorized by how an algorithm learns to become more accurate in its predictions. There are four basic approaches:

            • Supervised Learning

            • Unsupervised Learning

            • Semi- Supervised Learning

            • Reinforcement Learning

          3. What is it really?

            Machine Learning is a subfield of Artificial Intelligence which evolved from Pattern Recognition and Computational Learning Theory. Arthur Lee Samuel defines Machine Learning as the "field of study that gives computers the ability to learn without being explicitly programmed". So, basically, it is the field of Computer Science and Artificial Intelligence that learns from data without human intervention. But this view has a flaw. As a result of this perception, whenever the words Machine Learning are thrown around, people usually think of A.I. and Neural Networks that can mimic human brains (as of now, that is not possible), Self-Driving Cars and what not. But Machine Learning is far beyond that. Below we uncover some expected and some generally not expected facets of modern computing where Machine Learning is in action.

          4. The Expected Part From ML:

            We'll start with some places where you might expect Machine Learning to play a part.

            1. Speech Recognition (Natural Language Processing in more technical terms) – You talk to Cortana on Windows devices, but how does it understand what you say? Along comes the field of Natural Language Processing, or NLP. It deals with the study of interactions between machines and humans via linguistics. At the heart of NLP are Machine Learning algorithms and systems (Hidden Markov Models being one).

            2. Computer Vision – Computer Vision is a subfield of AI which deals with a machine's (probable) interpretation of the real world. In other words, all Facial Recognition, Pattern Recognition and Character Recognition techniques belong to Computer Vision. And Machine Learning, once again, with its wide range of algorithms, is at the heart of Computer Vision.

            3. Google's Self-Driving Car – Well, you can imagine what actually drives it: more Machine Learning goodness.

          5. The Unexpected Part From ML:

            Let's visit some places normal folks would not readily associate with Machine Learning:

            1. Amazon's Product Recommendations – Ever wondered how Amazon always has a recommendation that just tempts you to lighten your wallet? That's a class of Machine Learning algorithms called Recommender Systems working in the backdrop. They learn every user's personal preferences and make recommendations accordingly.

            2. YouTube/Netflix – They work just as above!

            3. Data Mining / Big Data – This might not be so much of a shock to many. But Data Mining and Big Data are just manifestations of studying and learning from data at a larger scale. And wherever there's the objective of extracting information from data, you'll find Machine Learning lurking nearby.

            4. Stock Market/Housing Finance/Real Estate – All of these fields incorporate a lot of Machine Learning systems in order to better assess the market, namely regression techniques, for things as mundane as predicting the price of a house to predicting and analyzing stock market trends.
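As a small illustration of the idea behind such Recommender Systems, the sketch below implements a minimal user-based collaborative filter on invented ratings. Real systems like Amazon's are far more sophisticated, but the principle of similarity-weighted prediction is the same:

```python
from math import sqrt

# Toy user -> item ratings (hypothetical data; a real system would
# learn from millions of interactions).
ratings = {
    "alice": {"book": 5, "lamp": 1},
    "bob":   {"book": 4, "lamp": 1, "phone": 5},
    "carol": {"book": 1, "lamp": 5, "phone": 2},
}

def cosine(u, v):
    """Cosine similarity between two users over their common items."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    nu = sqrt(sum(u[i] ** 2 for i in common))
    nv = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(target, item):
    """Predict target's rating for item as a similarity-weighted
    average of the ratings given by other users."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == target or item not in theirs:
            continue
        s = cosine(ratings[target], theirs)
        num += s * theirs[item]
        den += s
    return num / den if den else 0.0

# Alice's tastes resemble Bob's, so her predicted "phone" rating
# leans toward Bob's 5 rather than Carol's 2.
score = predict("alice", "phone")
```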

      3. BIG DATA MEETS MACHINE LEARNING: Machine learning algorithms are described as learning a target function (f) that best maps input variables (X) to an output variable (Y).

        Y = f(X)

        This is a general learning task where we would like to make predictions in the future (Y) given new examples of input variables (X). We don't know what the function (f) looks like or its form; if we did, we would use it directly and would not need to learn it from data using machine learning algorithms. The most common type of machine learning is to learn the mapping Y = f(X) to make predictions of Y for new X. This is called predictive modeling or predictive analytics, and the main aim is to make the most accurate predictions possible. Using big data to store and manipulate bulk amounts of information is one thing; extracting useful information from it becomes possible through machine learning, with which we can extract efficient patterns. Machine-learning algorithms become more effective as the size of training datasets grows. So when combining big data with machine learning, we benefit twice: the algorithms help us keep up with the continuous influx of data, while the volume and variety of the same data feeds the algorithms and helps them grow. Let's look at how this integration process might work: by feeding big data to a machine-learning algorithm, we might expect to see defined and analyzed results, like hidden patterns and analytics that can assist in predictive modelling. For some companies, these algorithms might automate processes that were previously human-centered. But more often than not, a company will review the algorithm's findings and search them for valuable insights that might guide business operations. Here is where people come back into the picture. While AI and data analytics run on computers that outperform humans by a vast margin, they lack certain decision-making abilities. Computers have yet to replicate many characteristics inherent to humans, such as critical thinking, intention and the ability to use holistic approaches. Without an expert to provide the right data, the value of algorithm-generated results diminishes, and without an expert to interpret its output, suggestions made by an algorithm may compromise company decisions.
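A minimal sketch of this predictive-modeling loop, using NumPy least squares on synthetic data. The "true" function is only known here because we generate it ourselves; in practice f must be estimated from observations alone:

```python
import numpy as np

# Synthetic training data: we secretly generate y = 3x + 2 plus noise,
# playing the role of the unknown target function f.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3.0 * X + 2.0 + rng.normal(0, 0.5, size=100)

# Learn an approximation of f from (X, y) pairs by least squares:
# find w, b minimizing sum((w*x + b - y)^2).
A = np.column_stack([X, np.ones_like(X)])
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)

# Use the learned mapping to predict Y for a new X.
x_new = 4.0
y_pred = w * x_new + b
```

With more (and more varied) training data, the estimates of w and b tighten, which is exactly the "algorithms grow with the data" effect described above.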

      4. HOW TO APPLY MACHINE LEARNING IN BIG DATA?

        Machine Learning provides efficient and automated tools for data gathering, analysis, and assimilation. Combined with the scalability of cloud computing, machine learning brings agility to processing and integrating large amounts of data, regardless of its source.

        Machine learning algorithms can be applied to every element of Big Data operation including:

        • Data Segmentation

        • Data Analytics

        • Simulation

        All these stages are integrated to create the big picture out of Big Data, with insights and patterns which later get categorized and packaged into an understandable format. The fusion of Machine Learning and Big Data is a never-ending loop: the algorithms created for certain purposes are monitored and perfected over time as information flows into and out of the system.
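As an illustrative sketch of the data-segmentation stage, the following is a bare-bones k-means clustering routine run on synthetic "customer" records (all names and features are invented). A production Big Data pipeline would use a distributed implementation, but the idea is the same:

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Plain k-means: alternate assigning points to the nearest
    centroid and recomputing each centroid as its cluster's mean."""
    # Simple deterministic init: spread initial centroids across the data.
    idx = np.linspace(0, len(points) - 1, k).astype(int)
    centroids = points[idx].copy()
    for _ in range(iters):
        # Distance of every point to every centroid, shape (n, k).
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs of hypothetical 2-feature customer records.
rng = np.random.default_rng(1)
blob_a = rng.normal([0, 0], 0.5, size=(50, 2))
blob_b = rng.normal([5, 5], 0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])
labels, centroids = kmeans(points, k=2)  # segments the customers into 2 groups
```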

        | Basis For Comparison | Big Data | Machine Learning |
        | --- | --- | --- |
        | Data Use | Big data can be used for a variety of purposes, including financial research, collecting sales data, etc. | Machine learning is the technology behind self-driving cars and advanced recommendation engines. |
        | Foundation For Learning | Big data analytics pulls from existing information to look for emerging patterns that can help shape our decision-making processes. | Machine learning, on the other hand, can learn from existing data and provide the foundation required for a machine to teach itself. |
        | Pattern Recognition | Big data analytics can reveal some patterns through classifications and sequence analysis. | Machine learning takes this concept one step ahead, using the same algorithms that big data analytics uses to automatically learn from the collected data. |
        | Data Volume | Big data, as the name suggests, tends to be interested in large-scale datasets, where the problem is dealing with the large volume of data. | ML tends to be more interested in small datasets, where over-fitting is the problem. |
        | Purpose | The purpose of big data is to store large volumes of data and find patterns in the data. | The purpose of machine learning is to learn from training data and predict or estimate future results. |

      5. CASE STUDY OF MACHINE LEARNING IN BIG DATA:

        Big Data at Walmart

        Walmart is the largest retailer in the world and the world's largest company by revenue, with more than 2 million employees and 20,000 stores in 28 countries. It started making use of big data analytics well before the term Big Data came into the picture. Walmart uses data mining to discover patterns that can be used to provide product recommendations to the user, based on which products were bought together. By applying effective data mining, Walmart has increased its customer conversion rate. It has been speeding along big data analysis to provide best-in-class e-commerce technologies with the motive of delivering a superior customer experience. The main objective of big data at Walmart is to optimize the shopping experience of customers when they are in a Walmart store. Big data solutions at Walmart are developed with the intent of redesigning global websites and building innovative applications to customize the shopping experience for customers while increasing logistics efficiency. Hadoop and No-SQL technologies are used to provide internal customers with access to real-time data collected from different sources and centralized for effective use.

      6. MACHINE LEARNING APPLICATIONS FOR BIG DATA:

        Today, machine learning is used in a wide range of applications. Perhaps one of the most well-known examples of machine learning in action is the recommendation engine that powers Facebook's News Feed. Facebook uses machine learning to personalize how each member's feed is delivered. If a member frequently stops to read a particular group's posts, the recommendation engine will start to show more of that group's activity earlier in the feed. Behind the scenes, the engine is attempting to reinforce known patterns in the member's online behavior. Should the member change patterns and fail to read posts from that group in the coming weeks, the News Feed will adjust accordingly. In addition to recommendation engines, other uses for machine learning include the following:

        Customer relationship management – CRM software can use machine learning models to analyze email and prompt sales team members to respond to the most important messages first. More advanced systems can even recommend potentially effective responses.

        Business intelligence – BI and analytics vendors use machine learning in their software to identify potentially important data points, patterns of data points and anomalies.

        Human resource information systems – HRIS systems can use machine learning models to filter through applications and identify the best candidates for an open position.

        Self-driving cars – Machine learning algorithms can even make it possible for a semi-autonomous car to recognize a partially visible object and alert the driver.

        Virtual assistants – Smart assistants typically combine supervised and unsupervised machine learning models to interpret natural speech and supply context.

      7. BIG DATA AND MACHINE LEARNING COMPARISON TABLE:

        The table in Section 4 above contrasts Big Data and Machine Learning on data use, foundation for learning, pattern recognition, data volume and purpose.

8. CONCLUSION:

Machine Learning (ML) is fundamental to addressing the difficulties posed by big data and to uncovering the concealed patterns, knowledge and insights hidden in enormous data sets, with the end goal of transforming their potential into genuine value for business decision-making and scientific investigation. A future direction for machine learning analytics is making ML more declarative, so that it is easier for non-experts to specify and interact with different types of data in different streams. In the future, we will enhance and assess the performance of machine learning techniques on different types of problems. One promising direction is to extend machine learning approaches towards big data so that they are efficient and highly scalable in the way they process high-dimensional data.

REFERENCES

  1. https://www.edureka.co/blog/big-data-characteristics/

  2. https://www.geeksforgeeks.org/5-vs-of-big-data/

  3. https://www.educba.com/big-data-vs-machine-learning/

  4. https://searchenterpriseai.techtarget.com/definition/machine-learning-ML

  5. https://www.geeksforgeeks.org/ml-introduction-data-machine-learning/

  6. https://www.geeksforgeeks.org/demystifying-machine-learning/

  7. https://www.edureka.co/blog/machine-learning-and-big-data/

  8. https://towardsdatascience.com/machine-learning-and-big-data-real-world-applications-3ba3a3345cf5
