Data Exploration
Exploring data using NumPy and Pandas
The main activity of a data scientist is exploring and analyzing data. Even though the results of an analysis take the form of reports, graphs or predictive models, the source is always data. In fact, the objective of a data analysis is to establish connections among the data features (e.g., through a predictive model) or to test a hypothesis.
Take the example of a mathematics professor who collects data about students (e.g., the number of lectures attended, the number of studying hours, and the grade on the final exam). One could analyze this sample to determine the relationship between the number of lectures attended, the number of studying hours and the final grade. One could also use the sample to test a hypothesis, such as: only students with a minimum number of studying hours can obtain a final grade above the passing grade.
Exploring Data
A data scientist spends a significant amount of time exploring, analyzing and visualizing data. Among the most widely used tools and programming languages for this activity are Jupyter notebooks and Python.
Python is very popular in data science and machine learning thanks to the variety of packages that help with data analysis and data visualization. This post presents some ways to carry out data analysis using Python.
Exploring Data with NumPy
Suppose a mathematics professor takes a sample of 15 student grades (in France, for example, grades range from 0 to 20).
grades = [8, 10, 15, 11, 12, 16, 10, 18, 14, 12, 10, 19, 9, 20, 7]
print(grades)
A Python list can hold and manipulate this data, but for numeric work the NumPy package is better suited.
import numpy as np
grades = np.array([8, 10, 15, 11, 12, 16, 10, 18, 14, 12, 10, 19, 9, 20, 7])
print(grades)
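To illustrate why an array is better suited to numeric work than a list, the sketch below applies the `+` operator to both. The `bonus` values are hypothetical and only serve the illustration: with lists, `+` concatenates, whereas with NumPy arrays it adds element by element.

```python
import numpy as np

# Hypothetical data: three grades and a bonus point for each student
grades_list = [8, 10, 15]
bonus_list = [1, 1, 1]

# With plain lists, + joins the two lists end to end
joined = grades_list + bonus_list

# With NumPy arrays, + adds the values element by element
summed = np.array(grades_list) + np.array(bonus_list)

print(joined)
print(summed)
```

The list version returns six values, while the array version returns three updated grades, which is usually what a numeric analysis expects.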
Multiplying a NumPy array by a scalar is an element-wise operation, so the array behaves like a mathematical vector. In the example below, the result is an array of the same size in which each element has been multiplied by 2.
twice_grades = grades * 2
print(twice_grades)
print(type(twice_grades))
The class of a NumPy array is numpy.ndarray, a structure with n dimensions. To obtain the shape of an array, run the code below:
print(twice_grades.shape)
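Besides `shape`, an ndarray exposes a few other attributes that describe its structure. The short sketch below rebuilds the same grades array so that it runs on its own, and reads the number of dimensions and the total number of elements.

```python
import numpy as np

grades = np.array([8, 10, 15, 11, 12, 16, 10, 18, 14, 12, 10, 19, 9, 20, 7])

# shape is a tuple with one entry per dimension
print(grades.shape)   # (15,)

# ndim is the number of dimensions, size the total number of elements
print(grades.ndim)    # 1
print(grades.size)    # 15
```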
Let us perform some analysis on the grades data. Using NumPy, we can apply functions over all elements in the array to obtain some statistics such as the average grade.
print("Average grade: {m}".format(m=grades.mean()))
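The mean is only one of the statistics available. As a small sketch on the same grades, the lowest grade, the highest grade, the median and the standard deviation can be obtained in the same way. The variable names are chosen here for illustration.

```python
import numpy as np

grades = np.array([8, 10, 15, 11, 12, 16, 10, 18, 14, 12, 10, 19, 9, 20, 7])

lowest = grades.min()         # smallest grade in the sample
highest = grades.max()        # largest grade in the sample
median = np.median(grades)    # middle value of the sorted grades
spread = grades.std()         # population standard deviation

print("Lowest grade: {v}".format(v=lowest))
print("Highest grade: {v}".format(v=highest))
print("Median grade: {v}".format(v=median))
print("Standard deviation: {v:.2f}".format(v=spread))
```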
Now, let us add another piece of data about the same students: the average number of studying hours per week.
# Array of the average number of studying hours for each student
hours = np.array([10.0, 11.5, 16.0, 9.5, 9.0, 8.25, 6.5, 15.5, 13.0, 9.0, 11.75, 13.5, 12.0, 15.5, 12.0])
# 2D array containing the number of studying hours and the final grade for each student
data = np.array([hours, grades])
# Display
print(data)
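Two details of this 2D array are worth checking. The sketch below, which rebuilds the arrays so that it runs on its own, inspects the shape and the element type: because an array holds a single type, the integer grades are stored as floats alongside the hours. Transposing the array is one possible way to get one row per student.

```python
import numpy as np

hours = np.array([10.0, 11.5, 16.0, 9.5, 9.0, 8.25, 6.5, 15.5, 13.0, 9.0, 11.75, 13.5, 12.0, 15.5, 12.0])
grades = np.array([8, 10, 15, 11, 12, 16, 10, 18, 14, 12, 10, 19, 9, 20, 7])
data = np.array([hours, grades])

# Two rows (hours, grades) and 15 columns (one per student)
print(data.shape)

# A single element type for the whole array: the grades become floats
print(data.dtype)

# Transposed view: one row per student, columns are (hours, grade)
per_student = data.T
print(per_student[0])
```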
The data array contains two elements, each of which is an array of 15 elements. This two-dimensional array, which holds the number of studying hours and the final grade for each student, can be used to compare the average number of studying hours with the average grade.
print("Average of hours: {mh:.2f}".format(mh=data[0].mean()))
print("Average of grades: {mg:.2f}".format(mg=data[1].mean()))
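Both averages can also be computed in one call by taking the mean along an axis. The sketch below goes one step further, as an illustration rather than a formal analysis: it uses a boolean mask to compare the average studying hours of students above the average grade with those of the other students.

```python
import numpy as np

hours = np.array([10.0, 11.5, 16.0, 9.5, 9.0, 8.25, 6.5, 15.5, 13.0, 9.0, 11.75, 13.5, 12.0, 15.5, 12.0])
grades = np.array([8, 10, 15, 11, 12, 16, 10, 18, 14, 12, 10, 19, 9, 20, 7])
data = np.array([hours, grades])

# axis=1 averages across each row: first the hours, then the grades
mean_hours, mean_grade = data.mean(axis=1)

# Boolean mask: True for students whose grade is above the average grade
above = data[1] > mean_grade

# Average studying hours in each group
hours_above = data[0][above].mean()
hours_below = data[0][~above].mean()

print("Average hours, grade above average: {h:.2f}".format(h=hours_above))
print("Average hours, grade at or below average: {h:.2f}".format(h=hours_below))
```

A difference between the two groups suggests a link between studying hours and grades, but with only 15 students it does not establish one.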
Exploring Data with Pandas
The NumPy package provides a lot of functionality for dealing with numeric data. When we are dealing with tabular data, the Pandas package is more convenient than NumPy.
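As a minimal sketch of that convenience, the same two arrays can be loaded into a Pandas DataFrame. The column names `StudyHours` and `Grade` are assumptions made for this illustration. Unlike the 2D NumPy array, each column keeps its own type and can be addressed by name.

```python
import numpy as np
import pandas as pd

hours = np.array([10.0, 11.5, 16.0, 9.5, 9.0, 8.25, 6.5, 15.5, 13.0, 9.0, 11.75, 13.5, 12.0, 15.5, 12.0])
grades = np.array([8, 10, 15, 11, 12, 16, 10, 18, 14, 12, 10, 19, 9, 20, 7])

# One row per student, one named column per feature (names are illustrative)
df = pd.DataFrame({"StudyHours": hours, "Grade": grades})

# First rows of the table
print(df.head())

# Column-wise averages, labelled by column name
print(df.mean())
```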
Further Reading
To learn more about Python packages, see the following documentation:
To view the notebooks from which this post was inspired, see the following repository: