What is Data Cleaning?
Data cleaning is the process of preparing data for analysis. It takes "messy data" and refines it into exactly what you need by normalizing values, handling blank (null) values, and re-organizing the data.
How do you know if your data needs cleaning?
Ask yourself, very generally: is the data correctly formatted, and does it provide what you need? More specifically:
- Did you collect the data yourself or is it from somewhere else? If you’re re-using data, it’s likely that it’s not already formatted in the best way for your research and the tools you want to use.
- Do you know what all the columns or variables are?
- Do you know what kinds of data you should include in your analysis, and how they are useful?
- Do you know if there are any missing values or possible errors?
- Have you looked for outliers? If outliers are present, you will need to decide how to handle them.
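The last two questions can be checked with a short script. The following is a minimal sketch in Python, not a definitive method: the sample rows, the set of missing-value markers, and the helper names (`is_missing`, `count_missing`, `iqr_outliers`) are all assumptions made for illustration. It counts missing values per column and flags outliers with the common 1.5 × IQR rule, which is only one of several reasonable conventions.

```python
import statistics

# Hypothetical sample rows; in practice these might come from csv.DictReader.
rows = [
    {"country": "USA", "age": "34"},
    {"country": "United States", "age": "N/A"},
    {"country": "", "age": "29"},
    {"country": "Canada", "age": "31"},
    {"country": "USA", "age": "30"},
    {"country": "Canada", "age": "33"},
    {"country": "USA", "age": "240"},
]

# Assumed markers that mean "missing"; adjust this set to your own data.
MISSING_MARKERS = {"", "n/a", "na", "null", "none"}


def is_missing(value):
    """Return True if the value is None or one of the assumed markers."""
    return value is None or value.strip().lower() in MISSING_MARKERS


def count_missing(rows):
    """Count missing values for each column across all rows."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if is_missing(value) else 0)
    return counts


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Only convert values that are actually present.
ages = [float(r["age"]) for r in rows if not is_missing(r["age"])]
missing_counts = count_missing(rows)
age_outliers = iqr_outliers(ages)

print(missing_counts)
print(age_outliers)
```

A flagged value is not automatically an error. Whether to correct, keep, or exclude it remains a decision for you to make based on what you know about the data.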
Common data errors or "messy data"
- Non-normalized values (e.g., some values are "USA" and some are "United States")
- Inconsistent handling of null values (e.g., N/A, 0, blanks). A null just means "we don't know," but it is indicated in many different ways depending on who entered the data.
- Differences in character encoding
- Structure of data does not match what you need
- Misspellings (caused by OCR or manually entered text)
- Punctuation/special characters
- Inconsistencies in abbreviations or capitalization
- Extra spaces
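Several of these problems can be handled with a few small functions. The sketch below uses only the Python standard library; the alias table, the set of null markers, and the function names (`clean_text`, `to_null`, `normalize_country`) are hypothetical and would need to be adapted to your own data. It collapses extra spaces, maps null markers to a single `None`, and normalizes variant spellings to one label. The Unicode normalization step only unifies characters that look alike but are stored differently; it does not repair a file that was read with the wrong encoding.

```python
import re
import unicodedata

# Assumed markers for "we don't know". Zero is left out on purpose,
# since 0 can also be a real measurement in many datasets.
NULL_MARKERS = {"", "n/a", "na", "null", "none", "-"}

# Hypothetical alias table mapping known variants to one canonical label.
COUNTRY_ALIASES = {
    "usa": "United States",
    "u.s.a.": "United States",
    "us": "United States",
    "united states": "United States",
}


def clean_text(value):
    """Normalize the Unicode form and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", value)
    return re.sub(r"\s+", " ", text).strip()


def to_null(value):
    """Return None for any assumed null marker, otherwise the cleaned text."""
    text = clean_text(value)
    return None if text.lower() in NULL_MARKERS else text


def normalize_country(value):
    """Map known variants to a canonical name; leave unknown values as-is."""
    text = to_null(value)
    if text is None:
        return None
    return COUNTRY_ALIASES.get(text.lower(), text)


raw = ["USA", "  united   states ", "U.S.A.", "N/A", "", "Canada\u00a0"]
cleaned = [normalize_country(v) for v in raw]
print(cleaned)
```

Values not found in the alias table pass through unchanged, so the table can grow as you discover new variants rather than needing to be complete up front.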
Adapted from University of Illinois Library Data Cleaning guide