Pandas Quick Notes (Part 1/2) | Snehit Vaddi

Snehit Vaddi
6 min read · Mar 9, 2021

If you are even remotely interested in data science, this blog post should help you. In this post, we are going to talk about Pandas. Not the cute animal: the name is derived from "panel data" and is also a play on "Python data analysis". Pandas is an open-source Python library that is built on top of NumPy. As the name suggests, it provides various methods for fast analysis as well as data cleaning and preparation.

Pandas is fast, flexible, and is designed to make working with ‘relational’ or ‘labeled’ data both easy and intuitive.

When to use Pandas?

Pandas is used when one is working with tabular data, such as data stored in spreadsheets, databases, and other tables. Pandas helps one explore, clean, and process data.

Installation and Importing Pandas:

Pandas can be installed from PyPI with the pip command (the leading ! runs the command from a Jupyter notebook cell):

!pip install pandas

Often, Pandas is imported as pd. Usually, along with Pandas, NumPy is also imported.
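
The conventional import lines look like the sketch below. The version printout is only an optional sanity check added here and is not part of the original post.

```python
import numpy as np
import pandas as pd

# Optional check that both libraries are importable
print("pandas", pd.__version__)
print("numpy", np.__version__)
```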

Pandas Series Object:

Pandas Series is a one-dimensional labeled array capable of holding any data type. Unlike a Python list, a Series has a single dtype shared by all of its elements; values of mixed types are stored under the generic object dtype.

Pandas Series can be created in different ways, like:

  • Using Python lists
  • Using NumPy arrays
  • Using Python dictionaries

Series creation using Python lists:

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list as follows:
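
A minimal sketch of creating a Series from a list. The values are illustrative and the variable name data is an assumption, since the post's own snippet is not shown here.

```python
import pandas as pd

# Illustrative values, not taken from the original post
data = pd.Series([0.25, 0.5, 0.75, 1.0])

print(data)         # index labels on the left, values on the right
print(data.index)   # default integer index: 0, 1, 2, 3
print(data.values)  # the underlying NumPy array
print(data[1])      # access a value through its index label
```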

Note: A Series pairs a sequence of index labels with a sequence of values, and the values can be accessed through those index labels.

Series creation using Dictionaries:

A Pandas Series can also be created using Python dictionaries with or without explicit indices.

Note: If explicit indices are not specified, Pandas uses the keys of the given dictionary as the index.
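
A small sketch of both cases, using a made-up dictionary named scores (a hypothetical name). Passing index selects only the listed keys, in the listed order.

```python
import pandas as pd

# Hypothetical data for illustration
scores = {"a": 10, "b": 20, "c": 30}

# Without explicit indices: the dictionary keys become the index
from_keys = pd.Series(scores)

# With explicit indices: only the listed keys are kept, in that order
subset = pd.Series(scores, index=["c", "a"])

print(from_keys)
print(subset)
```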

Series creation using NumPy arrays:

A Pandas Series can also be created using NumPy arrays with explicit indices as follows:
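
One way this might look, with illustrative values and hypothetical names (values, s, evenly_spaced, random_values):

```python
import numpy as np
import pandas as pd

# Illustrative array and labels
values = np.array([10, 20, 30, 40])
s = pd.Series(values, index=["w", "x", "y", "z"])

# NumPy generator functions can supply the values directly
evenly_spaced = pd.Series(np.linspace(0, 1, 5))   # 0.0, 0.25, 0.5, 0.75, 1.0
random_values = pd.Series(np.random.randn(3))     # three standard-normal samples

print(s)
print(evenly_spaced)
print(random_values)
```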

Note: A Series backed by a NumPy array is generally more efficient than a Python list for numerical work. NumPy functions such as numpy.linspace() and numpy.random.randn() are often used to generate the values.

Ways of accessing Pandas Series elements:

Series elements can be accessed element-wise or by a range of elements as specified below:
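
A short sketch of both styles of access, using a made-up Series named s. Note that a label slice includes its end label, while a position slice excludes its end position.

```python
import pandas as pd

# Illustrative Series with string labels
s = pd.Series([10, 20, 30, 40], index=["a", "b", "c", "d"])

first = s["a"]                # single element by label
by_position = s.iloc[0]       # single element by integer position
label_range = s["a":"c"]      # label slice: "a" through "c", end included
position_range = s.iloc[0:2]  # position slice: positions 0 and 1, end excluded

print(first, by_position)
print(label_range)
print(position_range)
```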

Pandas DataFrame Object:🔠

Pandas DataFrame is a two-dimensional labeled data structure with columns. It is the most commonly used data structure in Pandas.

If a Series is a one-dimensional array with flexible indices, a DataFrame is a two-dimensional array with both flexible row indices and flexible column names.

A DataFrame can be constructed in different ways, such as:

  • Using Series
  • Using List of Lists
  • Using NumPy Arrays
  • Using Python Dictionaries

DataFrame creation using Pandas Series:

To create a DataFrame using series, let’s first construct two series named area and population.

Next, we create a DataFrame from the two Series, population and area.
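
A sketch of these two steps. The region names and figures are made up for illustration and are not the data used in the original post.

```python
import pandas as pd

# Made-up figures for three hypothetical regions
area = pd.Series({"North": 100, "South": 250, "East": 175})
population = pd.Series({"North": 5000, "South": 7500, "East": 3500})

# Each Series becomes a column; rows are aligned on the shared index labels
df = pd.DataFrame({"population": population, "area": area})
print(df)
```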

To get the index labels and column names, we can use the df.index and df.columns attributes (they are attributes, not methods, so no parentheses are needed). Here, df is the DataFrame that we created.
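
For instance, on a small stand-in DataFrame (values are illustrative), index and columns are read as plain attributes:

```python
import pandas as pd

# Small stand-in DataFrame with made-up values
df = pd.DataFrame(
    {"population": [5000, 7500], "area": [100, 250]},
    index=["North", "South"],
)

print(df.index)    # row labels
print(df.columns)  # column names
```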

DataFrame creation from a single Series object:

A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series or a single DataFrame object.
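
One possible way to build a single-column DataFrame from a Series is Series.to_frame(); the original post may have used the DataFrame constructor instead. The data and the name single are illustrative.

```python
import pandas as pd

# Made-up figures
population = pd.Series({"North": 5000, "South": 7500})

# Turn the Series into a DataFrame with one named column
single = population.to_frame(name="population")
print(single)
```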

DataFrame creation using Python List of Lists:

A DataFrame can also be constructed using a list of lists with or without additional parameters like index or columns.

When the index and columns parameters are not specified, Pandas labels both the rows and the columns with a default sequence of integers starting at 0.

Let’s consider the following example, which creates a DataFrame from a list of lists with the columns parameter:
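
A sketch with made-up rows, shown with and without the columns parameter (the names rows, default_labels, and named are hypothetical):

```python
import pandas as pd

# Each inner list is one row (illustrative data)
rows = [[1, "a"], [2, "b"], [3, "c"]]

# No labels given: rows and columns are numbered 0, 1, 2, ...
default_labels = pd.DataFrame(rows)

# Column names supplied through the columns parameter
named = pd.DataFrame(rows, columns=["number", "letter"])

print(default_labels)
print(named)
```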

DataFrame creation using Python Dictionaries:

Similar to Series creation, a DataFrame can also be constructed using dictionaries. Here, Pandas treats the keys as column labels:
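
A minimal sketch, assuming a dictionary of equal-length lists (the names and values are made up):

```python
import pandas as pd

# Keys become column labels; each list holds that column's values
data = {"name": ["Ann", "Bob"], "age": [30, 25]}
df = pd.DataFrame(data)

print(df)
```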

DataFrame creation using NumPy arrays:

Let’s consider a two-dimensional NumPy array (a DataFrame holds two-dimensional data). We can create a DataFrame from it with any specified column and index names.
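
A sketch assuming a two-dimensional array with three rows and two columns; the labels foo, bar, a, b, and c are placeholders chosen for illustration.

```python
import numpy as np
import pandas as pd

# 3 rows x 2 columns: [[0, 1], [2, 3], [4, 5]]
values = np.arange(6).reshape(3, 2)

df = pd.DataFrame(values, columns=["foo", "bar"], index=["a", "b", "c"])
print(df)
```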

Pandas Indexers: loc, iloc, ix

In Pandas, there are a lot of ways to pull the elements, rows, and columns from a DataFrame. These slicing and indexing conventions can, however, become a source of confusion.

If you’re already confused about this, consider the following example to get some clarity:

Consider a Series with explicit integer indices: an indexing operation such as data[1] will use the explicit indices, while a slicing operation such as data[1:3] will use the implicit, position-based index.
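
The situation can be reproduced with a small made-up Series whose labels are the integers 1, 3, and 5. The slice behaviour shown is the one described in the text; because it is easy to misread, the explicit indexers are usually the safer choice.

```python
import pandas as pd

# Integer labels that do not match the positions 0, 1, 2
data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

by_label = data[1]       # plain indexing looks up the label 1
by_position = data[1:3]  # plain slicing is described above as position-based

print(by_label)
print(by_position)
```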

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes, namely:

  • DataFrame.loc[]: This allows indexing and slicing based on the explicit index (labels)
  • DataFrame.iloc[]: This allows indexing and slicing based on the implicit Python-style index (integer positions)
  • DataFrame.ix[]: This older hybrid indexer accepted both labels and integer positions. It was deprecated in pandas 0.20 and removed in pandas 1.0, so use loc or iloc instead

Using the loc indexer, we can index data based on explicit indexes. Consider a series named data:
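
A sketch with an illustrative Series named data. With loc, both the single lookup and the slice refer to labels, and the end label of the slice is included.

```python
import pandas as pd

data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

one = data.loc[1]       # label 1 -> 'a'
some = data.loc[1:3]    # labels 1 through 3 inclusive -> 'a', 'b'

print(one)
print(some)
```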

Using the iloc indexer, we can index data based on the implicit index:
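
The same illustrative Series, this time addressed by position. With iloc, the end position of a slice is excluded, as in ordinary Python slicing.

```python
import pandas as pd

data = pd.Series(["a", "b", "c"], index=[1, 3, 5])

one = data.iloc[1]      # position 1 -> 'b'
some = data.iloc[1:3]   # positions 1 and 2 -> 'b', 'c'

print(one)
print(some)
```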

Some important operations on DataFrames:

To understand this better, let’s consider the objects that we created earlier, i.e., the population and area Series.

Division of two Series:

Dividing two Series objects divides the elements that share the same index label. Pandas assigns NaN (Not a Number) wherever a label is present in only one of the two Series.
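
A sketch with made-up figures in which each Series has one label the other lacks, so that the NaN behaviour is visible. The name density is an assumption.

```python
import pandas as pd

# Made-up figures; "East" has no population and "West" has no area
area = pd.Series({"North": 100, "South": 250, "East": 175})
population = pd.Series({"North": 5000, "South": 7500, "West": 4000})

# Division is aligned on index labels, not on position
density = population / area
print(density)
```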

Union of indices:

The index of the result is the union of the indices of the two inputs, that is, every label that appears in either Series.
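
The union can also be computed directly with Index.union(). The Series below are illustrative and all_labels is a hypothetical name.

```python
import pandas as pd

area = pd.Series({"North": 100, "South": 250, "East": 175})
population = pd.Series({"North": 5000, "South": 7500, "West": 4000})

# Every label that appears in either index
all_labels = area.index.union(population.index)
print(all_labels)
```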

Addition of DataFrames:
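
A sketch of adding two DataFrames, using made-up frames A and B. Both rows and columns are aligned on their labels; DataFrame.add() with fill_value is one option when a missing entry on one side should be treated as 0.

```python
import pandas as pd

# Illustrative frames that only partly overlap
A = pd.DataFrame({"x": [1, 2], "y": [3, 4]})
B = pd.DataFrame({"y": [10, 20, 30], "z": [5, 6, 7]})

# Plain addition: NaN wherever either frame lacks the entry
total = A + B

# add() with fill_value: an entry missing from one frame counts as 0
filled = A.add(B, fill_value=0)

print(total)
print(filled)
```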

Handling Missing Data:

In the real world, data is rarely clean and homogeneous. That would be too good to be true! Missing data can occur when no information is provided for one or more items. To make matters more complicated, different data sources may indicate missing data in different ways.

But don’t worry: Pandas provides various methods and tools for handling missing data. Here, we will refer to missing data in general as null, NaN (short for Not a Number), or NA values.

Pandas treats None and NaN as essentially interchangeable for indicating missing or empty values.
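
A quick illustration with made-up numeric values: in a numeric Series, None is converted to NaN, and both are reported as null.

```python
import numpy as np
import pandas as pd

# One np.nan and one None among numeric values
values = pd.Series([1, np.nan, 2, None])

print(values)           # both missing entries are shown as NaN
print(values.dtype)     # float64
print(values.isnull())  # True at positions 1 and 3
```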

Operating on Null Values:

To deal with null or missing values, Pandas provides several useful methods for detecting, removing, and replacing null values. These are:

  • isnull(): It checks for null values and returns a Boolean mask that is True wherever a value is NaN
  • notnull(): It is the opposite of isnull() and returns a Boolean mask that is False wherever a value is NaN
  • dropna(): It drops the rows or columns of a DataFrame that contain null values
  • fillna(): It fills the null values in a DataFrame with a specified value
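
The detection and removal methods listed above can be tried on a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

# Illustrative frame with one missing value in each column
df = pd.DataFrame({"a": [1, np.nan, 3], "b": [4, 5, np.nan]})

null_mask = df.isnull()                      # True where a value is missing
valid_mask = df.notnull()                    # True where a value is present
complete_rows = df.dropna()                  # keeps only rows without NaN
complete_cols = df.dropna(axis="columns")    # keeps only columns without NaN

print(null_mask)
print(complete_rows)
print(complete_cols)
```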

Filling NA values with 0:
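
A minimal sketch with an illustrative Series named data:

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3])

# Replace every missing value with 0
filled_zero = data.fillna(0)
print(filled_zero)
```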

Filling missing values with the next valid value (backward fill):
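
One way to do this is Series.bfill(), which copies the next valid value backward into each gap; the original post may have used a different call. The data is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3])

# Each missing value takes the next valid value that follows it
back_filled = data.bfill()
print(back_filled)
```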

Interpolating missing values:
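
A sketch using Series.interpolate() with its default linear method, which fills each gap with a value between its neighbours, treating the entries as equally spaced. The data is illustrative.

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3])

# Linear interpolation between the surrounding valid values
interpolated = data.interpolate()
print(interpolated)
```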

Importing and Uploading Jupyter notebook to Jovian.ml:

Conclusion:🤩

With this, we reach the end of the first part of the Basics of Pandas series. To sum up, we have touched upon various basic concepts, like the creation of Series and DataFrames in different ways, indexing, indexers, ufuncs, and the handling of missing data. We cannot stress this enough: Pandas can be extremely powerful when it comes to data analysis. We hope that this blog will make you want to take a data set and play around with it using Pandas. Next time, we will learn a few advanced topics like working with .csv files, data manipulation, and many more! Stay tuned!

Author:🤠

- Snehit Vaddi

I am a Machine Learning enthusiast. I teach machines how to see, listen, and learn.

Linkedin: https://www.linkedin.com/in/snehit-vaddi/

Github: https://github.com/snehitvaddi
