How the f*ck do I prepare my data for analysis?

Preparing data involves loading, cleaning, transforming, and rearranging it before analyzing it.


During data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take up 80% or more of an analyst’s time. Today, we will learn how to clean your data easily using built-in Python and pandas methods.

01 . Missing Data 🤔

Missing data occurs commonly in many data analysis applications. We can easily handle missing values using the Pandas Library. One of the goals of pandas is to make working with missing data as painless as possible. For example, all of the descriptive statistics on Pandas objects (like mean, median, and standard deviation) exclude missing data by default.
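
For instance, here is a quick illustration (the values are arbitrary); the statistics are computed over the non-missing values only:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
print(s.mean())  # 2.0 -- the NaN is skipped, not treated as zero
print(s.sum())   # 4.0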

The way that missing data is represented in Pandas objects is not perfect, but is functional.

For numeric data, Pandas uses the floating-point value NaN (Not a Number) to represent missing data. We call this a sentinel value that can be easily detected. To know more about how these values can be detected, refer to this blog post: https://blog.dhruvbadaya.in/what-the-fck-is-pandas-a-complete-tutorial#heading-0305-isna

Pandas has adopted a convention from the R programming language, referring to missing data as NA, which stands for "not available". In statistics applications, NA data may either be data that does not exist or data that exists but was not observed. When cleaning up data for analysis, it is often important to analyze the missing data itself to identify data collection problems or potential biases caused by missing data.

The built-in Python None value is also treated as NA in object arrays. Let's understand this with a code example.

import pandas as pd
import numpy as np

a = pd.Series([np.nan,3,None,4])

print(a.isna())

Alternatively, we can also use a.isnull( ). The output of this would be:

0     True
1    False
2     True
3    False
dtype: bool

There are four important methods in Pandas to handle missing data, which one must be aware of.

dropna: Filter axis labels based on whether values for each label have missing data, with varying thresholds for how much missing data to tolerate.
fillna: Fill in missing data with some value or using an interpolation method such as 'ffill' or 'bfill'.
isnull: Return boolean values indicating which values are missing/NA.
notnull: Negation of isnull.

I will explain all of this in a moment. First of all, let us look at an example.

01.01. Filtering out Missing Data

Suppose we have the following Series:

import pandas as pd
import numpy as np

a = pd.Series([1,np.nan,3.5,np.nan,7])

Now, we have to filter out the missing values. This means that we have to get all the values, except those that are null. For this, we can simply have:

a[a.notnull()]

This will give us the following output:

0 1.0
2 3.5
4 7.0
dtype: float64

However, we also have a built-in function to do this. We can use dropna( ) and it will drop all the null values in that Series or DataFrame.

a.dropna( )

This is the same as writing a[a.notnull( )]. We just saw how dropna( ) works on Series objects. But how does it work on DataFrames?

With DataFrame objects, things are a bit more complex. You may want to drop rows or columns that are all NA or only those containing any NAs. dropna by default drops any row containing a missing value:

Suppose we have the following DataFrame:

from numpy import nan as NA
data = pd.DataFrame([[1., 6.5, 3.], [NA, NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])

Let's draw the structure of the DataFrame to easily visualize it. The DataFrame looks something like this:

     0    1    2
0  1.0  6.5  3.0
1  NaN  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0

Now, let's apply .dropna( ) here.

data.dropna( )

The output of this is:

     0    1    2
0  1.0  6.5  3.0

As we can see, this dropped every row that had even a single NaN value. However, what if we want to drop only those rows that are entirely NaN? For that, we have a parameter called how, which we have to set to 'all'. Once we do that, only rows in which every value is NaN will be eliminated. Hence, if I run data.dropna(how='all'), rows 1 and 2 will be eliminated and the output will be:

     0    1    2
0  1.0  6.5  3.0
3  NaN  6.5  3.0

To drop columns in the same way, pass axis=1:

data.dropna(axis=1)

The output will be an empty DataFrame because every column contains at least one NaN value. If we want to drop only those columns in which every value is NaN, we can again set the how parameter to 'all' (with the data above, no column is entirely NaN, so nothing would be dropped).
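
As a quick sketch, suppose we add a hypothetical column 3 to data that is entirely NaN, just to have something for how='all' to drop (NA is the alias imported above):

data[3] = NA  # hypothetical extra column, all NaN
data.dropna(axis=1, how='all')  # drops only column 3; columns 0, 1 and 2 survive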

A related way to filter out DataFrame rows tends to concern time series data. Suppose you want to keep only rows containing a certain number of observations. You can indicate this with the thresh argument. Let us see an example.

Let's consider the following DataFrame, df:

          0         1         2
0  0.111752       NaN       NaN
1 -1.497898       NaN       NaN
2 -0.897230       NaN -1.297108
3  0.783673       NaN -0.638079
4  0.274705 -1.322938  0.449272
5 -0.169539 -0.898968  0.310462
6  1.129882 -1.082405 -0.005681

If you are following along, use the following commands to create the DataFrame:

import numpy as np
import pandas as pd
from numpy import nan as NA
df = pd.DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA

Now, what does the thresh parameter do? It works by counting the number of non-NA values in each row or column. If the number of non-NA values is greater than or equal to the thresh value, the row or column is kept; otherwise, it is dropped. The thresh parameter can take any integer value. For example, if thresh is 2, only those rows where the number of non-NaN values is 2 or more will be kept. The rest of the rows will be dropped.

df.dropna(thresh=2)

The output of this would be:

          0         1         2
2 -0.897230       NaN -1.297108
3  0.783673       NaN -0.638079
4  0.274705 -1.322938  0.449272
5 -0.169539 -0.898968  0.310462
6  1.129882 -1.082405 -0.005681

01.02. Filling in Missing Data

Rather than filtering out missing data (and potentially discarding other data along with it), you may want to fill in the “holes” in any number of ways. For most purposes, the fillna method is the workhorse function to use. Calling fillna with a constant replaces missing values with that value.

For example, consider the following DataFrame, df. (Since df holds random values, the numbers below are a fresh draw; they are the ones used for the rest of this section.)

          0         1         2
0  0.277356       NaN       NaN
1 -0.894346       NaN       NaN
2  1.450615       NaN  0.167842
3  0.661788       NaN -1.884698
4  0.870281 -1.876592 -1.103075
5  0.605526  0.080819  3.067988
6 -0.439021 -0.143373 -0.952685

Suppose I want to replace my NaN values with 0, I can just run the following command:

df.fillna(0)

The DataFrame will now look like this:

          0         1         2
0  0.277356  0.000000  0.000000
1 -0.894346  0.000000  0.000000
2  1.450615  0.000000  0.167842
3  0.661788  0.000000 -1.884698
4  0.870281 -1.876592 -1.103075
5  0.605526  0.080819  3.067988
6 -0.439021 -0.143373 -0.952685

As we can see 0.0 has replaced all NaN values.

Suppose in column 1, I want to replace NaN values with 5, and in column 2, I want to replace NaN values with 7. I can just pass a dictionary to the fillna( ) method, mapping each column to its fill value.

df.fillna({1:5,2:7})

This will replace the NaN values in column 1 with 5 and in column 2 with 7.

          0         1         2
0  0.277356  5.000000  7.000000
1 -0.894346  5.000000  7.000000
2  1.450615  5.000000  0.167842
3  0.661788  5.000000 -1.884698
4  0.870281 -1.876592 -1.103075
5  0.605526  0.080819  3.067988
6 -0.439021 -0.143373 -0.952685

Before we move forward, I want to remind you that these changes are not reflected in the original DataFrame. If we want these changes to be reflected in the original DataFrame, we will need to use the inplace parameter and set it to True. I have already mentioned this in my original Pandas blog post (https://blog.dhruvbadaya.in/what-the-fck-is-pandas-a-complete-tutorial#heading-0407-remove-duplicates).
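
For instance, here is a minimal sketch of the in-place variant (using a copy so that df keeps its NaN values for the examples that follow):

df_filled = df.copy()
df_filled.fillna(0, inplace=True)  # modifies df_filled directly and returns None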

Now, with fillna( ) we can use another parameter called method. It can take the following values: 'backfill', 'bfill', 'pad', 'ffill', and None. The default is None. If we specify the method parameter in fillna( ), then we do not need to provide a value parameter.

(Please note that at the time of writing this article, the method parameter is in the deprecation process. In the future, instead of df.fillna(method='bfill'), we have to write df.bfill( ) directly. Since at the time of writing, both methods are valid, I shall use both methods. The output of both of them will be the same).

Also note that 'pad' and 'ffill' mean the same thing, as do 'bfill' and 'backfill'. First of all, let us try the bfill or backfill method. You can run any one of the following commands:

df.fillna(method='bfill') # TO BE DEPRECATED
df.fillna(method='backfill') #TO BE DEPRECATED
df.backfill( ) #TO BE DEPRECATED
df.bfill( ) #USE THIS

What these methods do is use the next valid observation to fill each gap. In other words, for each column they find the next valid observation and use it to fill the NaN values that come before it. The output DataFrame will be:

          0         1         2
0  0.277356 -1.876592  0.167842
1 -0.894346 -1.876592  0.167842
2  1.450615 -1.876592  0.167842
3  0.661788 -1.876592 -1.884698
4  0.870281 -1.876592 -1.103075
5  0.605526  0.080819  3.067988
6 -0.439021 -0.143373 -0.952685

As we can see, it took the next valid non-NaN value and used it to replace the NaN values before it. But what if there is no valid value after a particular NaN? Then bfill leaves those NaN values untouched. For instance, let's apply the bfill method along the rows. In rows 0 and 1, there is no valid value to the right of the NaN entries, so nothing will happen there.

df.bfill(axis = 1)

The output DataFrame will be:

          0         1         2
0  0.277356       NaN       NaN
1 -0.894346       NaN       NaN
2  1.450615  0.167842  0.167842
3  0.661788 -1.884698 -1.884698
4  0.870281 -1.876592 -1.103075
5  0.605526  0.080819  3.067988
6 -0.439021 -0.143373 -0.952685

Now, let's talk about ffill or pad. What these do is carry the previous valid value forward to fill the NaN values. So, we can run any of the following commands:

df.fillna(method='ffill') #TO BE DEPRECATED
df.fillna(method='pad') #TO BE DEPRECATED
df.ffill() #USE THIS
df.pad() #TO BE DEPRECATED

The output of this is:

          0         1         2
0  0.277356       NaN       NaN
1 -0.894346       NaN       NaN
2  1.450615       NaN  0.167842
3  0.661788       NaN -1.884698
4  0.870281 -1.876592 -1.103075
5  0.605526  0.080819  3.067988
6 -0.439021 -0.143373 -0.952685

As we can see, there were no changes in the DataFrame. This is because, in columns 1 and 2, the NaN values appear at the top, so there are no previous valid values to carry forward. However, if we go by row, then we definitely have valid non-NaN values before the NaN values.

df.ffill(axis = 1)

The output of this would be:

          0         1         2
0  0.277356  0.277356  0.277356
1 -0.894346 -0.894346 -0.894346
2  1.450615  1.450615  0.167842
3  0.661788  0.661788 -1.884698
4  0.870281 -1.876592 -1.103075
5  0.605526  0.080819  3.067988
6 -0.439021 -0.143373 -0.952685

02 . Duplicates 🍰🍰

Duplicate rows may be found in a DataFrame for any number of reasons. Here is an example DataFrame, data:

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   1

If you are following along, you can use the following command to create the DataFrame:

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'], 'k2': [1, 1, 2, 3, 3, 4, 1]})

The DataFrame method duplicated returns a boolean Series indicating whether each row is a duplicate (has been observed in a previous row) or not. By duplicate, I mean that every value in the row matches, not just one value. So, for a row to be a duplicate of another, the entire row must be a replica of a previous one.

So, as we can see, only row 6 is a duplicate of row 1. To find this out programmatically, we can use the .duplicated( ) method.

data.duplicated( )

This will return a boolean array, where True implies that the row has been duplicated.

data.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

Please note that this function does not have an axis parameter and works only on rows. As we have already studied in the original Pandas blog post, we can use the .drop_duplicates( ) method to drop the rows that are duplicates. (https://blog.dhruvbadaya.in/what-the-fck-is-pandas-a-complete-tutorial#heading-0407-remove-duplicates)

data.drop_duplicates( )

The output of this would be:

    k1  k2
0  one   1
1  two   1
2  one   2
3  two   3
4  one   3
5  two   4

Now, what if I want to drop the duplicates while considering only one column? Well... we can pass the columns we want to consider as a list to the drop_duplicates( ) method. For example, if I want to consider only the column k1, we can write:

data.drop_duplicates(['k1'])

This will drop all the rows that have duplicated values in k1.

    k1  k2
0  one   1
1  two   1

Now, if you have observed, drop_duplicates( ) keeps the first occurrence and drops the rest. However, by passing keep='last', we can tell drop_duplicates to keep the last occurrence and drop the rest instead.

 data.drop_duplicates(keep='last')

Now here, instead of the 6th row, the 1st row will be dropped.

    k1  k2
0  one   1
2  one   2
3  two   3
4  one   3
5  two   4
6  two   1

If we want to keep neither the first nor the last occurrence, we can set keep to False.

data.drop_duplicates(['k1'],keep=False)

This will result in an empty DataFrame, since every value in k1 occurs more than once.
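
For comparison, here is a small sketch of keep=False applied to full-row duplicates (no column subset). Only rows 1 and 6, which are exact copies of each other, get removed:

data.drop_duplicates(keep=False)

This should leave:

    k1  k2
0  one   1
2  one   2
3  two   3
4  one   3
5  two   4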

03 . Mapping 🗺️

For many datasets, you may wish to perform some transformation based on the values in an array, Series, or column in a DataFrame. Consider the following DataFrame 'data' collected about various kinds of meat:

          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

If you are following along, you can use the following command to create the DataFrame:

data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami', 'corned beef', 'Bacon', 'pastrami', 'honey ham', 'nova lox'], 'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

Suppose you wanted to add a column indicating the type of animal that each food came from. Let’s write down a mapping (dictionary) of each distinct meat type to the kind of animal:

meat_to_animal = { 'bacon': 'pig', 'pulled pork': 'pig', 'pastrami': 'cow', 'corned beef': 'cow', 'honey ham': 'pig', 'nova lox': 'salmon' }

The map method on a Series accepts a function or dictionary object containing a mapping, but here we have a small problem in that some of the meats are capitalized and others are not. Thus, we need to convert each value to lowercase using the str.lower Series method:

lowercased = data['food'].str.lower()

Now, this has produced a lowercase copy of the food column, without modifying the actual DataFrame.

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

Now, let's map the values of our dictionary onto the lowercased Series.

lowercased = lowercased.map(meat_to_animal)

What this does is look up each value of the Series as a key in the dictionary and replace it with the corresponding dictionary value. If this is not clear yet, it will be once you see the output. The output of this is:

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object

...and now, we can set a new column called Animal in our original DataFrame.

data['Animal'] = lowercased

The output of this would be:

          food  ounces  Animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

We could also have passed a function that does all the work:

data['food'].map(lambda x: meat_to_animal[x.lower()])

This is a lambda function. You can read about lambda functions in this blog post: [COMING UP]
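
Putting it together, the lambda version lets us create the Animal column in a single step, equivalent to the lowercased approach above:

data['Animal'] = data['food'].map(lambda x: meat_to_animal[x.lower()])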

04 . Replacing Values 🔁

We have already seen fillna( ), which is used to replace NaN values with some specific value. That is really just a special case of more general value replacement. But what about other values? Suppose I have to replace -999 in my data with NaN. What do I do in this situation?

Let's start by creating a Series.

data = pd.Series([1., -999., 2., -999., -1000., 3.])

The -999 values might be sentinel values for missing data. In data analysis, a sentinel value is a specific value, deliberately distinct from typical data values, used as a marker or indicator: most commonly to flag missing data, but also to terminate loops or to denote the end of a sequence or series of records.

To replace these with NA values that pandas understands, we can use replace, producing a new Series (unless you pass inplace=True):

data.replace(-999,np.nan)

This will produce the following output:

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

If you want to replace multiple values at once, you instead pass a list and then the substitute value:

data.replace([-999, -1000], np.nan)

The output of this will be as follows:

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

data.replace([-999, -1000], [np.nan, 0])

The output of this will be as follows:

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The argument passed can also be a dictionary:

data.replace({-999: np.nan, -1000: 0})

This will produce the same output. Please note that the data.replace method is distinct from data.str.replace, which performs string substitution element-wise. We'll look at this method later on in this blog post.
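
Here is a quick sketch of the difference, with made-up strings purely for illustration: replace swaps whole values, while str.replace substitutes a substring inside each string.

s = pd.Series(['cat food', 'dog food'])
s.replace('cat food', 'tuna')       # replaces the entire value 'cat food'
s.str.replace('food', 'treats')     # replaces the substring 'food' inside every element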

05 . Discretizing and Binning 🗑️

Continuous data is often discretized or otherwise separated into “bins” for analysis. Suppose you have data about a group of people in a study, and you want to group them into discrete age buckets:

 ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

Let’s divide these into bins of 18 to 25, 26 to 35, 36 to 60, and finally 61 and older. To do so, you have to use cut, a function in pandas. First, we create a list of the bin edges we want to use.

bins = [18, 25, 35, 60, 100]

The first bin will be (18, 25]. This is interval notation, commonly used in set theory. It means that 18 is excluded from the range and 25 is included in it. So, the lower value is excluded, but the higher value is included in the bin. After creating this list, we use the pd.cut( ) function.

cats = pd.cut(ages,bins)

This will create a list (not really a list, but let's call it one for now) as follows:

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64, right]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]

This list gives the bin corresponding to each value of the ages list.

Now above, I mentioned that it's not an actual list. So, what is it? The object pandas returns is a special Categorical object. The output you see describes the bins computed by pandas.cut. You can treat it like an array of strings indicating the bin names; internally it contains a categories array specifying the distinct category names, along with a labeling of the ages data in the codes attribute.

cats.codes

The output of this would be:

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

Let's look at the explanation of this output. Every bin is assigned an index: the first bin is assigned 0, the second is assigned 1, and so on. These numbers are the indices of the corresponding bins: 0 implies that the corresponding value lies in the first bin, 1 implies that it lies in the second bin, and so on.
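
The distinct bins themselves are exposed through the categories attribute (a small sketch; it returns an IntervalIndex):

cats.categories
# roughly: IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]], dtype='interval[int64, right]')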

In Pandas, we can count unique values of an array using pd.value_counts( ).

pd.value_counts(cats) #TO BE DEPRECATED
cats.value_counts( ) #USE THIS IN THE FUTURE

The output of this is:

(18, 25]     5
(25, 35]     3
(35, 60]     3
(60, 100]    1
Name: count, dtype: int64

Consistent with mathematical notation for intervals, a parenthesis means that the side is open, while the square bracket means it is closed (inclusive). You can change which side is closed by passing right=False:

pd.cut(ages, [18, 26, 36, 61, 100], right=False)

The output of this will be:

[[18, 26), [18, 26), [18, 26), [26, 36), [18, 26), ..., [26, 36), [61, 100), [36, 61), [36, 61), [26, 36)]
Length: 12
Categories (4, interval[int64, left]): [[18, 26) < [26, 36) < [36, 61) < [61, 100)]

Now, as you can see, the left value is included in each interval and the right value is not.

You can also pass your own bin names by passing a list or array to the labels option:

group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages, bins, labels=group_names, right=False)

The output of this will be:

['Youth', 'Youth', 'YoungAdult', 'YoungAdult', 'Youth', ..., 'YoungAdult', 'Senior', 'MiddleAged', 'MiddleAged', 'YoungAdult']
Length: 12
Categories (4, object): ['Youth' < 'YoungAdult' < 'MiddleAged' < 'Senior']

If you pass an integer number of bins to cut instead of explicit bin edges, it will compute equal-length bins based on the minimum and maximum values in the data. Consider the data Series we created in the Replacing Values section, chopped into fourths:

pd.cut(data, 4, precision=2)

The output of this will be:

0        (-247.75, 3.0]
1    (-1001.0, -749.25]
2        (-247.75, 3.0]
3    (-1001.0, -749.25]
4    (-1001.0, -749.25]
5        (-247.75, 3.0]
dtype: category
Categories (4, interval[float64, right]): [(-1001.0, -749.25] < (-749.25, -498.5] < (-498.5, -247.75] <
                                           (-247.75, 3.0]]

The precision=2 option limits the decimal precision to two digits. If I do not want decimals in my bin intervals, I can set precision to 0.

A closely related function, qcut, bins the data based on sample quantiles. Depending on the distribution of the data, using cut will not usually result in each bin having the same number of data points. Since qcut uses sample quantiles instead, you will obtain roughly equal-sized bins. Let's see an example.

data = np.random.randn(1000) # Normally distributed
qcats = pd.qcut(data, 4)
cats = pd.cut(data, 4)

The output of qcats will be:

[(-0.669, 0.0697], (-0.669, 0.0697], (-0.669, 0.0697], (-3.064, -0.669], (-3.064, -0.669], ..., (0.778, 2.96], (-3.064, -0.669], (-0.669, 0.0697], (-0.669, 0.0697], (0.0697, 0.778]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.064, -0.669] < (-0.669, 0.0697] < (0.0697, 0.778] <
                                           (0.778, 2.96]]

The output of cats will be:

[(-1.557, -0.0516], (-1.557, -0.0516], (-1.557, -0.0516], (-1.557, -0.0516], (-1.557, -0.0516], ..., (-0.0516, 1.454], (-1.557, -0.0516], (-1.557, -0.0516], (-1.557, -0.0516], (-0.0516, 1.454]]
Length: 1000
Categories (4, interval[float64, right]): [(-3.069, -1.557] < (-1.557, -0.0516] < (-0.0516, 1.454] <
                                           (1.454, 2.96]]

Let's count the values in bins for cats and qcats.

cats.value_counts( )

The output of this will be:

(-3.069, -1.557]      69
(-1.557, -0.0516]    392
(-0.0516, 1.454]     450
(1.454, 2.96]         89
Name: count, dtype: int64

While for qcats,

 qcats.value_counts( )

The output of this will be:

(-3.064, -0.669]    250
(-0.669, 0.0697]    250
(0.0697, 0.778]     250
(0.778, 2.96]       250
Name: count, dtype: int64

As we can see, every bin in qcats contains (roughly) the same number of values, unlike the bins from cats. On the other hand, the intervals in cats all have the same width, unlike those in qcats.

06 . Detection of Outliers 🔳

Outliers in data refer to observations or data points that significantly differ from other observations in a dataset. These data points are notably different from the majority of the data and can potentially skew statistical analyses, leading to misleading interpretations if not handled appropriately. Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:

data = pd.DataFrame(np.random.randn(1000, 4))
data.describe( )

The output of this will be something like this:

                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.035560     0.044672     0.028896    -0.038226
std       1.042140     1.006944     0.981323     1.009795
min      -3.096313    -3.223508    -2.681701    -3.059708
25%      -0.729944    -0.567849    -0.620859    -0.771260
50%      -0.022396     0.050605     0.011061    -0.053772
75%       0.670204     0.710862     0.709288     0.661503
max       3.054431     3.079592     3.155822     2.895367

Suppose you wanted to find values in one of the columns exceeding 3 in absolute value:

col = data[2]
col[np.abs(col) > 3]

The output of this will be:

598    3.075123
882    3.155822
Name: 2, dtype: float64
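
To go one small step further (a common pattern, shown here only as a sketch), you can select every row that contains at least one value exceeding 3 in absolute terms, or cap such values at -3/3:

data[(np.abs(data) > 3).any(axis=1)]        # rows containing at least one outlier
data[np.abs(data) > 3] = np.sign(data) * 3  # cap outlying values at -3 or 3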

Detecting outliers in depth is a vast and complex topic that deserves a separate blog post, so I won't say much more about it here.

07 . Renaming Axis Indexes

Like values in a Series, axis labels can be similarly transformed by a function or mapping of some form to produce new, differently labeled objects. You can also modify the axes in place without creating a new data structure.

Let's consider the following DataFrame:

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

If you are following along, you can create the DataFrame using the following command:

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Ohio', 'Colorado', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

Like a Series, the axis indexes have a map method:

data.index.map(lambda x: x[:4].upper())

The output of this will be the first four characters of each index label, uppercased:

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

You can assign the result back to data.index, modifying the DataFrame in place:

data.index = data.index.map(lambda x: x[:4].upper())

The DataFrame will now look like this:

      one  two  three  four
OHIO    0    1      2     3
COLO    4    5      6     7
NEW     8    9     10    11

If you want to create a transformed version of a dataset without modifying the original, a useful method is rename:

data.rename(index=str.title, columns=str.upper)

This will output the following:

      ONE  TWO  THREE  FOUR
Ohio    0    1      2     3
Colo    4    5      6     7
New     8    9     10    11

With rename, we can apply string operations to the row and column indices. Notably, rename can be used in conjunction with a dictionary object providing new values for a subset of the axis labels:

data.rename(index={'OHIO': 'INDIANA'}, columns={'three': 'peekaboo'})

Now, Ohio will become Indiana and three will become peekaboo.

         one  two  peekaboo  four
INDIANA    0    1         2     3
COLO       4    5         6     7
NEW        8    9        10    11

rename saves you from the chore of copying the DataFrame manually and assigning new values to its index and columns attributes. Should you wish to modify a dataset in place, pass inplace=True.
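
For example, a minimal sketch of the in-place variant:

data.rename(index={'OHIO': 'INDIANA'}, inplace=True)  # modifies data directly and returns None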

08 . String Manipulation 🃏

String manipulation is as important in data analysis as DataFrame manipulation, because a lot of the work happens on strings. Let us look at some string manipulation techniques.

08.01. split( )

str.split() is a Python method that splits a string into a list of substrings based on a specified separator. By default, it splits the string using whitespace (spaces, tabs, newlines) as the separator, but you can specify a custom separator as an argument.

my_string = "Hello, world!"
result = my_string.split()  # Splits by default whitespace

The output of this will be:

['Hello,', 'world!']

You can also specify a custom separator:

my_string = "apple,orange,banana,grape"
result = my_string.split(',')

The output of this will be:

['apple', 'orange', 'banana', 'grape']

This method is helpful for breaking down a string into its individual components or tokens, which can be useful for various data manipulation tasks.

08.02. strip( )

str.strip() is a Python string method used to remove leading and trailing characters (whitespace by default) from a string. It doesn't modify the original string; instead, it returns a new string with the specified characters removed.

my_string = "   Hello, world!   "
result = my_string.strip()

The output of this will be:

'Hello, world!'

We can also use rstrip( ) if we only want to strip the whitespace on the right, or lstrip( ) if we wish to strip the whitespace on the left. strip( ) is commonly used to clean up input strings, particularly user input, where extra spaces or specific characters might be unintentionally added.

08.03. replace( )

replace( ) is a Python string method used to create a new string where specified occurrences of a substring within the original string are replaced with another substring. It does not modify the original string; instead, it returns a new string with the replacements.

my_string = "I like cats and dogs."
new_string = my_string.replace('cats', 'rabbits')

The output of this will be:

'I like rabbits and dogs.'

You can also specify the number of replacements:

my_string = "one, one, two, three, one"
new_string = my_string.replace('one', 'four', 2)

The output of this will be:

'four, four, two, three, one'

Here, only the first two occurrences of 'one' were replaced with 'four'. The third occurrence was left unchanged. This method is handy for replacing specific substrings within a string, which can be useful for data cleaning, formatting, or manipulation tasks.

08.04. startswith( ) and endswith( )

str.startswith() and str.endswith() are Python string methods used to check whether a string starts or ends with a specified substring, respectively. They return a Boolean value (True or False) based on whether the string meets the condition.

str.startswith(): This method checks if the string starts with a specific prefix. It takes a substring argument and returns True if the string starts with that substring; otherwise, it returns False.

For example,

my_string = "Hello, world!"
starts_with_hello = my_string.startswith("Hello")
print(starts_with_hello)
# Output: True

starts_with_hi = my_string.startswith("Hi")
print(starts_with_hi)
# Output: False

str.endswith(): Similar to startswith(), str.endswith() checks if the string ends with a specified suffix. It takes a substring argument and returns True if the string ends with that substring; otherwise, it returns False. For example,

my_string = "Hello, world!"
ends_with_world = my_string.endswith("world!")
print(ends_with_world)
# Output: True

ends_with_universe = my_string.endswith("universe")
print(ends_with_universe)
# Output: False

These methods are commonly used in conditional statements to check whether a string has a particular starting or ending pattern.

There are several other String methods and I suggest that you check the Pandas documentation to learn more about String methods: https://pandas.pydata.org/docs/user_guide/text.html
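
Most of these methods also have vectorized counterparts on pandas Series through the .str accessor, which is what you will typically reach for when cleaning an entire column at once. A quick sketch, with made-up values purely for illustration:

s = pd.Series(['  bacon ', 'Pulled Pork', 'honey ham'])
s.str.strip()                         # remove leading/trailing whitespace element-wise
s.str.lower().str.replace(' ', '_')   # lowercase, then substitute spaces
s.str.startswith('honey')             # boolean Series: does each value start with 'honey'?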