What is Data Cleaning?

Data cleaning is a crucial process in data mining and plays an important part in building a model. It is a necessary step, yet it is often neglected.

The quality of the data matters: real-world data is often incomplete, noisy, and inconsistent, and those defects carry through to any result built on it.

Data cleaning in data mining is the process of identifying and then removing or correcting data that is incomplete, noisy, or inconsistent in a database.

There are several data cleaning methods that data is commonly run through. The methods are described below:

  1. Ignore the tuples: This method is rarely feasible; it is appropriate only when a tuple is missing values for several attributes, so little information is lost by discarding it.
  2. Fill in the missing value: This approach is not always effective and can be time-consuming, since the missing value is usually filled in manually. It can also be filled automatically, for example with the attribute mean or the most probable value.
  3. Binning method: This approach is simple to understand. The data is sorted and divided into segments (bins) of equal size, and each value is then smoothed using the values around it, for example by replacing it with the bin mean, the bin median, or the nearest bin boundary (see the sketch after this list).
  4. Regression: The data is smoothed by fitting a regression function to it. The regression can be linear or multiple: linear regression has one independent variable, while multiple regression has more than one.
  5. Clustering: This method operates on groups. Similar values are arranged into a “group” or a “cluster”, and values that fall outside every cluster can then be detected as outliers.
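
As a minimal sketch of smoothing by bin means with NumPy; the price values and the bin size below are invented for illustration:

```python
import numpy as np

# Toy sorted data; the values and the bin size are invented.
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34])
bin_size = 3

# Split the sorted data into equal-size bins, then replace every
# value in a bin with that bin's mean (smoothing by bin means).
smoothed = np.concatenate([
    np.full(bin_size, chunk.mean())
    for chunk in np.split(prices, len(prices) // bin_size)
])
print(smoothed)  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]
```

Smoothing by bin medians or bin boundaries works the same way; only the replacement value chosen per bin changes.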

It is said that “better data beats fancier algorithms”. Different types of data require different types of cleaning, but the steps below are a basic starting point:

  1. Removal of unwanted observations
  2. Fixing structural errors
  3. Managing unwanted outliers
  4. Handling missing data

Removal of unwanted observations:

The first step in the process of data cleaning is to remove unwanted observations from the dataset. Two types of observations are usually removed:

  1. Duplicate observations: These usually arise during data collection, for example when datasets from multiple places are combined, when data is scraped, or when data is received from a client or another department.
  2. Irrelevant observations: Observations of any kind that have no use for the problem at hand. They can be removed directly, as in the sketch after this list.
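
A minimal pandas sketch of both removals; the DataFrame, the column names, and the “US-only study” filter are hypothetical:

```python
import pandas as pd

# Hypothetical customer data: one duplicate row, one irrelevant row.
df = pd.DataFrame({
    "name":    ["Ann", "Ann", "Bob", "Cara"],
    "country": ["US",  "US",  "US",  "FR"],
    "spend":   [120,   120,   80,    95],
})

df = df.drop_duplicates()       # 1. remove duplicate observations
df = df[df["country"] == "US"]  # 2. remove irrelevant observations
                                #    (assuming the study covers US only)
print(df)
```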

Fixing structural errors:

The next step in the process is fixing structural errors: errors that arise during measurement, data transfer, or similar situations, typically typos in feature names or mislabeled classes. For example, a model treats john and John as different classes or values even though they represent the same value.
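
A minimal pandas sketch of fixing such errors by normalizing labels; the column name and the typo mapping are invented:

```python
import pandas as pd

# Hypothetical labels with inconsistent casing, whitespace, and a typo.
df = pd.DataFrame({"name": ["john", "John ", "JOHN", "jhon"]})

# Normalize whitespace and case so "john" and "John" collapse into
# one class, then map known typos to the canonical spelling.
df["name"] = df["name"].str.strip().str.lower()
df["name"] = df["name"].replace({"jhon": "john"})

print(df["name"].unique())  # ['john']
```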

Managing unwanted outliers:

Outliers can be very problematic: linear regression models, for example, are less robust to outliers than decision tree models. Outliers should not be removed unless there is a convincing reason to remove them; sometimes, however, removing them does improve a model’s performance. Outliers carry important information, but they can also do harm.
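
One common convention for flagging outliers (an assumption here, not something this article prescribes) is the 1.5 × IQR rule; the values below are invented:

```python
import pandas as pd

# Invented measurements; the last value is an obvious outlier.
s = pd.Series([10, 12, 11, 13, 12, 11, 95])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)

print(s[mask])  # inspect the flagged points first...
s = s[~mask]    # ...and drop them only with a convincing reason
```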

Handling missing data:

Missing data has to be handled very carefully; it cannot simply be ignored, and every way of dealing with it has a cost. There are two common approaches:

Dropping observations with missing values:

This method is simple, but when observations are dropped, the valid information they contain is dropped along with them, as the sketch below shows.
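
A minimal pandas sketch; the DataFrame is invented, and dropping every row with a missing value shrinks three observations down to one:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":   [34,  np.nan, 29],
    "spend": [120, 80,     np.nan],
})

# Dropping each row with a missing value also discards the valid
# values in those rows: 3 observations become 1.
print(df.dropna())
```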

Imputing missing values from previous observations:

In this method, the missing value is filled in with another value, such as the previous observed value, but the fact that the value was originally missing is itself information, so something is still lost.
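
A minimal pandas sketch of filling from previous observations (forward fill), plus the attribute-mean alternative mentioned earlier; the series is invented:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 14.0])

# Forward fill: each gap takes the last observed value. The filled
# numbers are guesses, so information about the gaps is lost.
print(s.ffill())           # 10.0, 10.0, 10.0, 14.0

# Alternative: impute with the attribute mean of the observed values.
print(s.fillna(s.mean()))  # mean of [10, 14] = 12.0
```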
