Pandas Handling Wrong Format

In a real world scenario, data are taken from various sources which causes inconsistencies in format of the data. For example, a column can have data of integer and string type as the data is copied from different sources.

Such inconsistencies can create challenges, making data analysis difficult or even impossible.

Let's look at an example.

import pandas as pd

# create dataframe
data = {
    'Country': ['USA', 'Canada', 'Australia', 'Germany', 'Japan'],
    'Date': ['2023-07-20', '2023-07-21', '2023-07-22', '2023-07-23', '2023-07-24'],
'Temperature': [25.5, '28.0', 30.2, 22.8, 26.3]
} df = pd.DataFrame(data) # calculate the mean temperature mean_temperature = df['Temperature'].mean() print(mean_temperature)

Output

TypeError: unsupported operand type(s) for +: 'float' and 'str'

Here, the Temperature column contains data in an inconsistent format, with a mixture of float and string types, which is causing a TypeError.

With Pandas, we can handle such issues by converting all the values in a column to a specific format.


Convert Data to Correct Format

We can remove inconsistencies in data by converting a column with inconsistencies to a specific format. For example,

import pandas as pd

# create dataframe
data = {
    'Country': ['USA', 'Canada', 'Australia', 'Germany', 'Japan'],
    'Date': ['2023-07-20', '2023-07-21', '2023-07-22', '2023-07-23', '2023-07-24'],
'Temperature': [25.5, '28.0', 30.2, 22.8, 26.3]
} df = pd.DataFrame(data)
# convert temperature column to float df['Temperature'] = df['Temperature'].astype(float)
# calculate the mean temperature mean_temperature = df['Temperature'].mean() print(mean_temperature)

Output

26.560000000000002

In this example, we converted all the values of Temperature column to float using astype(). This solves the problem of columns with mixed data.


Handling Mixed Date Formats

Another common example of inconsistency in data format that you often encounter in real life is mixed date formats.

Dates can be represented in various formats such as mm-dd-yyyy, dd-mm-yyyy, yyyy-mm-dd etc. Also, different separators such as /, -, . etc can be used.

We can handle this issue by converting the column containing dates to the DateTime format.

Let's look at an example.

import pandas as pd

# create a sample dataframe with mixed date formats
df = pd.DataFrame({'date': ['2022-12-01', '01/02/2022', '2022-03-23', '03/02/2022', '3 4 2023', '2023.9.30']})

# convert the date column to datetime format df['date'] = pd.to_datetime(df['date'], format='mixed', dayfirst=True)
print(df)

Output


   date
0 2022-12-01
1 2022-02-01
2 2022-03-23
3 2022-02-03
4 2023-04-03
5 2023-09-30

In the above example, we converted the mixed date formats to a uniform yyyy-mm-dd format. Here,

  • format='mixed': specifies that the format of each given date can be different
  • dayfirst=True: specifies that the day should be considered before the month when interpreting dates