Data Cleaning
Ali Atiyab Husain
Data cleaning is the process of fixing or removing corrupted, mislabelled, or incorrectly formatted data. It also involves handling missing values and removing duplicate or irrelevant data.
Types Of Problems In Data Cleaning
-
Removing Null Values
- This involves dealing with missing values, which are represented by null or NaN (not a number) values. These values can affect how data analysis algorithms process the data and may cause unwanted results in our analysis.
- In this situation, we can clean our data either by removing these NaN/null values or by replacing them with the mean, the median, or some other value. However, both methods have drawbacks: the first reduces the amount of data available, and the second can be time consuming and tedious.
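Both options described above can be sketched with pandas. This is a minimal illustration, not a definitive recipe: the DataFrame and its column names (`age`, `city`) are made-up assumptions, and filling with the mean is only one of several possible replacement strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Mumbai"],
})

# Count the missing values in each column before deciding what to do.
missing_counts = df.isna().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: keep all rows and fill the missing ages with the column mean.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
```

Dropping keeps only the complete rows, while filling keeps every row but replaces the missing age with an estimate, which is the trade-off described above.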
-
Removing Unnecessary Values or Columns
- Sometimes our data contains unnecessary values, or even whole columns, that are completely redundant for our machine learning model.
- In this situation, we can clean our data by ‘dropping’ these values, either through data slicing or through the pandas functions created specifically for this purpose.
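As a rough sketch of dropping unneeded data, the example below uses `DataFrame.drop` for a column, `DataFrame.drop_duplicates` for repeated rows, and boolean indexing as one form of slicing. The data, the column names, and the `spend >= 10.0` threshold are all hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data; "internal_note" stands in for an irrelevant column.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "spend": [10.0, 20.0, 20.0, 5.0],
    "internal_note": ["a", "b", "b", "c"],
})

# Remove a whole column that the model does not need.
without_note = df.drop(columns=["internal_note"])

# Remove rows that are exact duplicates of an earlier row.
deduplicated = without_note.drop_duplicates()

# Slice with a boolean mask to keep only the rows we care about
# (the threshold here is an arbitrary example).
filtered = deduplicated[deduplicated["spend"] >= 10.0]
```

Each step returns a new DataFrame and leaves the original untouched, so the intermediate results can be inspected before committing to a change.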
-
Converting DataTypes
- This involves converting the datatype of specific columns according to our requirements.
- For example, we may have to work with a customer database that has a time column. Here we can convert the time column to the DateTime data type, which makes it convenient to change date formats and to apply many other date-related functions.
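The conversion described above might look like the following in pandas. This is a sketch under stated assumptions: the customer data is invented, and the timestamp strings are assumed to follow the `YYYY-MM-DD HH:MM:SS` layout, so a real dataset may need a different `format` string.

```python
import pandas as pd

# Hypothetical customer data with the time stored as plain text.
df = pd.DataFrame({
    "customer": ["A", "B", "C"],
    "time": [
        "2023-01-15 09:30:00",
        "2023-02-01 18:05:00",
        "2023-03-20 12:00:00",
    ],
})

# Convert the text column to a datetime column.
df["time"] = pd.to_datetime(df["time"], format="%Y-%m-%d %H:%M:%S")

# Once converted, the .dt accessor exposes date-specific operations.
df["month"] = df["time"].dt.month
df["formatted"] = df["time"].dt.strftime("%d/%m/%Y")
```

After the conversion, extracting parts of the date or changing its display format is a single call through the `.dt` accessor rather than manual string handling.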