Data Analysis using Python

DOI : 10.17577/IJERTV10IS070241


Kiranbala Nongthombam

University Institute of Sciences (Mathematics Department) Chandigarh University, Punjab, India

Deepika Sharma

University Institute of Sciences (Mathematics Department) Chandigarh University, Punjab, India

Abstract- This paper studies the analysis of data using the Python programming language. The basic processes of data analysis, such as cleaning, transforming, and modeling data, are briefly explained, and the main focus is on exploratory data analysis of an existing dataset to find insights. Some graphical analysis of the dataset is shown using different Python libraries and functions. The World Happiness Report 2021 dataset is used to analyze and extract various information in both numerical and pictorial form.

Keywords:- Data analysis; python; data visualization; pandas; seaborn; exploratory data analysis

  1. INTRODUCTION

    Data are raw facts and figures that carry little meaning on their own and therefore need to be processed to yield the desired information. Information, in turn, is the result obtained after processing the raw data at different levels, or the conclusions extracted from a given dataset, through a process called data analysis.

    Data analysis means cleaning the data, transforming it into an understandable form, and then modeling it to extract useful information for business or organizational use. It is mainly used in making business decisions. Many Python libraries are available for this analysis, for example NumPy, Pandas, Seaborn, Matplotlib, and Sklearn [7].

    • NumPy: NumPy is a Python library used for numerical analysis. It stores data in the form of nd-arrays (n-dimensional arrays).

    • Pandas: Pandas is mainly used for working with data in tabular form, which makes the data more structured and easier to read.

    • Matplotlib: Matplotlib is a data visualisation and graphical plotting package for Python and its numerical extension NumPy that runs on all platforms.

    • Seaborn: Seaborn is a Python data visualisation package based on matplotlib that is tightly connected with pandas data structures. The core component of Seaborn is visualisation, which aids in data exploration and comprehension.

    • Sklearn: Scikit-learn is a widely used library for machine learning in Python. It includes numerous useful tools for classification, regression, clustering, and dimensionality reduction.
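    As a rough illustration of how two of these libraries relate, the sketch below builds a small NumPy nd-array and wraps it in a Pandas dataframe. The numbers and the column names `score` and `support` are invented for this example and do not come from the paper's dataset.

```python
import numpy as np
import pandas as pd

# A 2-D NumPy array (nd-array): 3 rows and 2 columns of made-up numbers.
scores = np.array([[7.1, 0.9], [6.4, 0.8], [5.2, 0.6]])

# The same values as a Pandas dataframe with labelled columns.
# The column names are placeholders chosen for this sketch.
table = pd.DataFrame(scores, columns=["score", "support"])

print(scores.ndim, scores.shape)  # number of dimensions and shape of the array
print(table)
```

    The array only knows positions, while the dataframe attaches names to the columns, which is what makes tabular data easier to read.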

    Data visualization makes data analysis more understandable and interactive by plotting or displaying the data in pictorial form. Pandas, an open-source Python package that deals with three different data structures (series, data frames, and panels; the panel structure has been removed from recent Pandas releases), meets that need for analyzing and visualizing data [2].

    Data analysis using Python makes the task easier because Python has several advantages for this kind of work. It is a high-level programming language (the code is in human-readable form), so it is easy for any programmer or user to understand and use. Many libraries and functions for statistical and numerical analysis are available in Python. Moreover, the source code is freely available to anyone (free and open source).

    This paper covers the basic terms and functions that a beginner needs in order to understand what data analysis is. The paper is divided broadly into four sections. Section 2 discusses the main steps in data analysis. Section 3 studies data analysis using Python, covering the basics needed for data analysis, with data visualization aiding the analysis by representing results in pictorial form. Section 4 gives the conclusion of the paper.

  2. MAIN PHASES IN DATA ANALYSIS

    1. Data requirements

      Data are the most important unit in any study. Data must be provided as inputs to the analysis based on the analysis requirements. The term experimental unit refers to the type of entity on which data will be collected (e.g., a person or population of people). Specific variables regarding a population (such as height, weight, age, and salary) may be specified and obtained. The data may be numerical or categorical.

    2. Data Collecting:

      Data collecting is the gathering of data. Data is gathered from a variety of sources, including relational databases, cloud databases, and other sources, depending on the needs of the study. Field sensors, such as traffic cameras, satellites, and monitoring systems, can also be used as data sources.

    3. Data processing

      Data that are collected must be processed or organized for analysis. For instance, this may involve arranging data into rows and columns in a table format (known as structured data) for further analysis, often through the use of spreadsheet or statistical software.

    4. Data cleaning:

      Data cleaning is the process of cleaning data after it has been processed and organized. It scans for data inconsistencies, duplicates, and errors, and then removes them. The data cleaning process includes tasks such as record matching, identifying inaccurate data, sorting data, identifying outliers, spell-checking textual data, and maintaining data quality. As a consequence, it guards against unexpected outcomes and helps deliver high-quality data, which is essential for a successful outcome.
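      Two of the cleaning tasks named above, removing duplicates and identifying outliers, can be sketched as follows. The records are invented, and the 1.5 × IQR rule used to flag outliers is an assumption of this sketch (a common convention), not a method stated in the paper.

```python
import pandas as pd

# Made-up records: one exact duplicate row and one extreme value.
raw = pd.DataFrame({
    "name": ["A", "B", "B", "C", "D", "E"],
    "value": [10.0, 12.0, 12.0, 11.0, 13.0, 95.0],
})

# Remove exact duplicate rows.
deduped = raw.drop_duplicates().reset_index(drop=True)

# Flag outliers with the 1.5 * IQR rule (assumed convention for this sketch).
q1 = deduped["value"].quantile(0.25)
q3 = deduped["value"].quantile(0.75)
iqr = q3 - q1
is_outlier = (deduped["value"] < q1 - 1.5 * iqr) | (deduped["value"] > q3 + 1.5 * iqr)

# Keep only the rows that were not flagged.
cleaned = deduped[~is_outlier]
print(cleaned)
```

      Whether a flagged value should really be dropped depends on the study; an extreme value can be a genuine observation rather than an error.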

    5. Exploratory data analysis:

      Once the datasets are cleaned and free of errors, they can be analyzed. A variety of techniques can be applied, such as exploratory data analysis (understanding the messages contained within the obtained data) and descriptive statistics (finding the average, median, etc.). Data visualization is another technique, in which the data is represented in graphical form to obtain additional insights into the information within the data [4].
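      The descriptive statistics mentioned here can be computed directly with Pandas. The following minimal sketch uses an invented sample in which one large value pulls the mean above the median.

```python
import pandas as pd

# Made-up sample with one large value that pulls the mean upward.
values = pd.Series([4.0, 5.0, 5.0, 6.0, 10.0])

mean_value = values.mean()                 # arithmetic average
median_value = values.median()             # middle value of the sorted sample
value_range = values.max() - values.min()  # simple measure of spread

print(mean_value, median_value, value_range)
```

      Comparing the mean with the median is a quick first check for skew in a variable.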

    6. Modeling and algorithms:

      Mathematical formulas or models (known as algorithms) may be applied to the data in order to identify relationships among the variables, for example using correlation or causation.

    7. Data product

      A data product is a computer application that takes data inputs and generates outputs, feeding them back into the environment. It may be based on a model or algorithm.

  3. DATA ANALYSIS USING PYTHON

    In this section, data analysis using Python is studied. It explains why Python is used for data analysis and shows how anyone can start using it. The important libraries, the platform, and the dataset used to carry out the analysis are introduced. The use of various Python functions for numerical analysis is presented, along with various methods of plotting graphs and charts.

    1. Why use Python?

      Python is a high-level, interpreted, multi-purpose programming language. It supports several programming paradigms, such as procedural and object-oriented programming. It can be used for many applications, including statistical computing with various packages and functions. Moreover, it is easy to learn and can be picked up by anyone, including those with little programming experience [9].

      Some features of Python are as listed below:

      • Open source and free

      • Interpreted language

      • Dynamic typing

      • Portable

      • Numerous IDEs

    2. Packages used:

      • NumPy

      • Pandas

      • Seaborn

      • Matplotlib

    3. Platform used:

      • Anaconda (Jupyter Notebook)

    4. Dataset used:

      • World Happiness Report 2021

        Fig. 1. A view of the dataset (World Happiness Report 2021)

    5. Working with dataset

      • Importing libraries:

        Libraries that are used in the analysis must be imported first. The code to import the libraries is:

        import pandas as pd
        import numpy as np
        import matplotlib.pyplot as plt
        import seaborn as sns

        Fig. 2. Importing libraries

      • Importing dataset

        Here, the dataset (World Happiness Report 2021) is imported into the Jupyter notebook.

        mydata = pd.read_csv('World Happiness report 2021.csv')
        mydata

        Fig. 3. Importing dataset

      • Cleaning Data

        Data cleaning involves removing unwanted data or null values. So, first we need to check whether the dataset contains any null values or empty cells [6].

        # isnull() returns True for every entry that has no value (NA).
        # sum() is used together with isnull() to find the total number of null values in each column.
        mydata.isnull().sum()

        Fig. 4. Checking null values in the dataset
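        The paper only checks for null values. If any were found, two common ways to handle them are dropping the affected rows or filling the gaps. The sketch below shows both on an invented three-row dataframe; filling with the column mean is an assumption made for illustration, not a step from the paper.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing Ladder score.
df = pd.DataFrame({
    "Country name": ["A", "B", "C"],
    "Ladder score": [7.0, np.nan, 5.0],
})

# Count the null values in each column.
null_counts = df.isnull().sum()

# Option 1: drop every row that contains a null value.
dropped = df.dropna()

# Option 2: fill the missing score with the column mean (assumed strategy).
filled = df.fillna({"Ladder score": df["Ladder score"].mean()})

print(null_counts)
print(filled)
```

        Dropping rows loses records, while filling keeps them at the cost of introducing estimated values, so the choice depends on the analysis.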

        According to the needs of the analysis, particular rows or records can be extracted from the dataset. Here is an example that extracts the first and last rows of the dataset.

        # head() extracts the first rows of the dataset; its default is 5 rows. Here, the top 10 rows are taken.
        headdata = mydata.head(10)
        headdata

        Fig. 5. Top 10 rows of the dataset

        # tail() extracts the last rows of the dataset; its default is 5 rows. Here, the last 10 rows are taken.
        taildata = mydata.tail(10)
        taildata

        Fig. 6. Last 10 rows of the dataset

    6. Exploratory Data Analysis

      In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis was promoted by John Tukey to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments [4][8].

      • Data types: The data type refers to the type of data stored in a column; int, object, and float are the basic data types encountered in a Pandas dataframe. The data types of all the columns in the dataset can be printed using dtypes:

        mydata.dtypes

        Fig. 7. Data types of all the columns in the dataset
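        Knowing the column types matters later, because statistics such as correlation only apply to numeric columns. As a small sketch on invented data, select_dtypes() can be used to keep only the numeric columns; the `Rank` column is hypothetical and is not part of the paper's dataset.

```python
import pandas as pd

# Made-up frame with one text column and two numeric columns.
df = pd.DataFrame({
    "Country name": ["A", "B"],
    "Ladder score": [7.0, 6.0],
    "Rank": [1, 2],  # hypothetical integer column for illustration
})

# Data type of every column.
print(df.dtypes)

# Keep only the numeric (int and float) columns.
numeric = df.select_dtypes(include="number")
print(numeric.columns.tolist())
```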

      • Describing the dataset: Describing a dataset means extracting a summary of the given dataframe, such as the mean, count, min, and max. It can be done using the describe() function:

        # For the whole dataset
        mydata.describe()

        Fig. 8. Summary of the whole dataset

        # For some selected rows
        taildata.describe()

        Fig. 9. Summary of some selected entries (last 10 rows)

      • Correlations: Correlation measures the strength of the linear relation between two variables in the dataset. The correlation of various attributes can be printed using corr() [1].

        # For the whole dataset. numeric_only=True (available in pandas 1.5 and later) skips text columns;
        # pandas 2.0 and later raise an error on text columns without it.
        mydata.corr(numeric_only=True)

        Fig. 10. Correlation of the whole dataset

        # For some selected columns or attributes. Country name and Regional indicator are text columns,
        # so they do not appear in the result.
        mydata[['Country name', 'Regional indicator', 'Ladder score', 'Standard error of ladder score', 'Logged GDP per capita', 'Social support', 'Healthy life expectancy', 'Generosity', 'Perceptions of corruption']].corr(numeric_only=True)

        Fig. 11. Correlation of some attributes in the dataset
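        To make the meaning of these numbers concrete, the sketch below computes the Pearson correlation coefficient by hand and compares it with the value from Pandas, whose corr() uses the Pearson method by default. The columns `x` and `y` hold invented values and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up, nearly linear data.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.0, 4.1, 5.9, 8.2, 9.8],
})

# Pearson r = sum((x - mean_x) * (y - mean_y))
#             / sqrt(sum((x - mean_x)^2) * sum((y - mean_y)^2))
dx = df["x"] - df["x"].mean()
dy = df["y"] - df["y"].mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity from Pandas.
r_pandas = df["x"].corr(df["y"])

print(r_manual, r_pandas)
```

        A value close to 1 or -1 indicates a strong linear relation, while a value near 0 indicates a weak one; it does not by itself show that one variable causes the other.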

    7. Graphical EDA

    Fundamentally, graphical exploratory data analysis (GEDA) is the graphical equivalent of conventional non-graphical exploratory data analysis. It examines data sets in order to summarise their statistical characteristics by focusing on the same four main features: measures of central tendency, measures of spread, the form of the distribution, and the presence of outliers. GEDA is divided into three categories: univariate GEDA, bivariate GEDA, and multivariate GEDA. These varieties and aspects of GEDA are covered in more detail in the following paragraphs [5].

    First, a subset of the dataframe is taken for analysis and visualization.

    Fig. 12. A subset of the dataframe
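    The paper shows the subset only as a figure. One way such a subset could be taken is sketched below on an invented stand-in for mydata; the chosen columns and the score threshold are assumptions, since the actual selection is not given in the text.

```python
import pandas as pd

# Made-up stand-in for the dataset, using column names that appear in the paper.
mydata = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Regional indicator": ["Region 1", "Region 1", "Region 2", "Region 2"],
    "Ladder score": [7.5, 6.9, 5.1, 4.8],
    "Social support": [0.95, 0.90, 0.80, 0.70],
})

# Hypothetical selection: three columns, rows with a Ladder score above 5.
columns = ["Country name", "Regional indicator", "Ladder score"]
subdata = mydata.loc[mydata["Ladder score"] > 5.0, columns]

print(subdata)
```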

    1. Univariate GEDA

      • Histogram: A histogram is a data representation that looks like a bar graph and buckets a range of outcomes into columns along the x-axis. The y-axis represents the numerical count or percentage of occurrences in each column, which illustrates the data distribution. A histogram can be drawn in Python using matplotlib.pyplot.hist().

        Fig. 13. Histogram
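        The plotting code for this figure is not included in the text. A minimal sketch of a histogram with Matplotlib is given below; it uses synthetic random values in place of the real Ladder score column, and the choice of 10 bins is an assumption.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in for a numeric column such as Ladder score.
rng = np.random.default_rng(0)
ladder_scores = rng.normal(loc=5.5, scale=1.0, size=200)

fig, ax = plt.subplots()
# hist() returns the count in each bin and the bin edges.
counts, bin_edges, _ = ax.hist(ladder_scores, bins=10, edgecolor="black")
ax.set_xlabel("Ladder score (synthetic)")
ax.set_ylabel("Count")
ax.set_title("Histogram")
```

        With the real data, the synthetic array would be replaced by the corresponding dataframe column.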

      • Stem Plot: A stem plot draws vertical lines from a baseline to the y value at each x position and places a marker there. The x-positions are optional. The formats can be specified as keyword arguments or as positional arguments. A stem plot can be drawn in Python using matplotlib.pyplot.stem().

        Fig. 14. Stem plot
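        As a hedged sketch of the call described above, the example below passes only the y values to stem(), so the x-positions default to 0, 1, 2, and so on. The values are invented.

```python
import matplotlib.pyplot as plt

# Made-up y values; x-positions are omitted and default to 0, 1, 2, ...
values = [7.5, 6.9, 5.1, 4.8, 3.2]

fig, ax = plt.subplots()
# stem() returns a container holding the markers, the stems, and the baseline.
container = ax.stem(values)
ax.set_xlabel("Index")
ax.set_ylabel("Value")
ax.set_title("Stem plot")
```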

      • Box Plot: A box plot is a visual representation and comparison of groups of data. The box plot depicts the level, spread, and symmetry of a data distribution by using the median, approximate quartiles, outliers, and the lowest and highest data points (extreme values) [10].

        Fig. 15. Boxplot
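        The sketch below draws a box plot with Matplotlib on invented values, one of which is far from the rest, and reads back the median and the point that is drawn as an outlier. Using Matplotlib's boxplot() here is an assumption; the paper does not state which function produced its figure.

```python
import matplotlib.pyplot as plt

# Made-up values with one point far above the others.
values = [1, 2, 3, 4, 5, 6, 7, 20]

fig, ax = plt.subplots()
# boxplot() returns a dict of the drawn artists (boxes, medians, whiskers, fliers).
result = ax.boxplot(values)
ax.set_ylabel("Value")
ax.set_title("Box plot")

# Read the median line and the outlier points back from the artists.
median = float(result["medians"][0].get_ydata()[0])
outliers = [float(v) for v in result["fliers"][0].get_ydata()]
print(median, outliers)
```

        By default, points beyond 1.5 times the interquartile range from the box are drawn individually as outliers.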

    2. Multivariate GEDA

      • Scatter plot: Dots are used to indicate values for two different numeric variables in a scatter plot. The values for each data point are indicated by the position of each dot on the horizontal and vertical axes. Scatter plots are used to see how variables relate to one another. Here, a scatter plot of Ladder score against Standard error of ladder score is shown in Fig. 16.

        Fig. 16. Scatter Plot
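        A scatter plot of this kind could be produced as sketched below. The column names follow the paper, but the five rows of values are invented, so the picture will not match Fig. 16.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values under the paper's column names.
mydata = pd.DataFrame({
    "Ladder score": [7.5, 6.9, 6.1, 5.1, 4.8],
    "Standard error of ladder score": [0.03, 0.04, 0.05, 0.06, 0.08],
})

fig, ax = plt.subplots()
# Ladder score on the vertical axis, its standard error on the horizontal axis.
points = ax.scatter(mydata["Standard error of ladder score"], mydata["Ladder score"])
ax.set_xlabel("Standard error of ladder score")
ax.set_ylabel("Ladder score")
ax.set_title("Scatter plot")
```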

      • Heat Maps: A heatmap is a graphical depiction of data that uses a color-coding method to represent various values. It presents a two-dimensional table of color shades. This plotting technique is popularly used in biology to represent gene expression and other multivariate data [3].

        A heatmap example is shown in Fig. 17.

        Fig. 17. Heatmap
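        One common use of a heatmap in exploratory analysis is to display a correlation matrix. The sketch below assumes that use and draws it with Seaborn's heatmap() on invented values; the paper does not say what data its heatmap shows.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up numeric columns under names taken from the paper.
df = pd.DataFrame({
    "Ladder score": [7.5, 6.9, 6.1, 5.1, 4.8],
    "Social support": [0.95, 0.93, 0.85, 0.80, 0.70],
    "Generosity": [0.10, -0.05, 0.20, 0.00, -0.10],
})

# Correlation matrix of the numeric columns.
corr = df.corr()

fig, ax = plt.subplots()
# annot=True writes each value in its cell; vmin and vmax fix the colour scale to [-1, 1].
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Heatmap of correlations")
```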

      • Count Plot: A Seaborn count plot is a graphical representation that uses bars to show the number of occurrences, or frequency, of each category. The countplot() function is used to visualize the number of observations in each category as bars. Here, a count plot is plotted for the subdata dataframe.

    Fig. 18. Countplot
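    A count plot along these lines could be drawn as in the sketch below. The contents of subdata here are invented, and counting by Regional indicator is an assumption, since the text does not name the column that was counted.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the subdata dataframe.
subdata = pd.DataFrame({
    "Regional indicator": [
        "Region 1", "Region 1", "Region 1",
        "Region 2", "Region 2",
        "Region 3",
    ],
})

# The same counts as a table, for comparison with the bars.
counts = subdata["Regional indicator"].value_counts()

fig, ax = plt.subplots()
# One bar per category; the bar height is the number of rows in that category.
sns.countplot(data=subdata, x="Regional indicator", ax=ax)
ax.set_title("Count plot")
print(counts)
```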

  4. CONCLUSION

    In this paper, various phases of data analysis, including data collection, cleaning, and analysis, are discussed briefly. Exploratory data analysis is the main focus. Python is used for the implementation, and Jupyter Notebook is used for the detailed work. Different Python libraries and packages are introduced. Using various analysis and visualization methods, numerous results are extracted. The World Happiness Report 2021 dataset is used to extract important information, such as the difference in happiness scores between countries, the dependence of the score on individual attributes, and how one variable affects another. Various graphs are plotted using different attributes of the dataset, which makes it easy to draw conclusions.

  5. ACKNOWLEDGMENT

    I express my heartfelt gratitude towards my mentor, Ms. Deepika Sharma, for guiding me in accomplishing this work. I offer my sincere appreciation to the Head of Department, University Institute of Sciences (Mathematics Department), Chandigarh University, for giving me the chance to gain a wider view of knowledge.

  6. REFERENCES

  1. Viv Bewick, Liz Cheek, and Jonathan Ball. Statistics review 7: Correlation and regression. Critical care, 2003.

  2. Ossama Embarak. Data analysis and visualization using Python. Springer, 2018.

  3. Nils Gehlenborg and Bang Wong. Heat maps. Nature Methods, 2012.

  4. Michel Jambu. Exploratory and multivariate data analysis. Elsevier, 1991.

  5. Matthieu Komorowski, Dominic C Marshall, Justin D Salciccioli, and Yves Crutain. Exploratory data analysis. Secondary analysis of electronic health records, 2016.

  6. Wes McKinney. Python for data analysis: Data wrangling with Pandas, NumPy, and IPython. O'Reilly Media, Inc., 2012.

  7. Fabio Nelli. Python data analytics: Data analysis and science using pandas, Matplotlib and the Python Programming Language. Apress, 2015.

  8. Kabita Sahoo, Abhaya Kumar Samal, Jitendra Pramanik, and Subhendu Kumar Pani. Exploratory data analysis using python. International Journal of Innovative Technology and Exploring Engineering (IJITEE), 2019.

  9. Guido Van Rossum et al. Python programming language. In USENIX annual technical conference, 2007.

  10. David F Williamson, Robert A Parker, and Juliette S Kendrick. The box plot: a simple visual method to interpret data. Annals of internal medicine, 1989.
