What does "Data Massage" mean?

39,068

Solution 1

Manipulate, process, alter, recalculate. In short, if you are just moving the data in raw then no need to use internalStore, but if you're doing anything to it prior to storage, then you might want an internalStore.

Solution 2

Sometimes the whole process of moving data is referred to as "ETL" meaning "Extract, Transform, Load". Massaging the data is the "transform" step, but it implies ad-hoc fixes that you have to do to smooth out problems that you have encountered (like a massage does to your muscles) rather than transformations between well-known formats.

Thinks that you might do to "massage" data include:

  • Change formats from what the source system emits to what the target system expects, e.g. change date format from d/m/y to m/d/y.
  • replace missing values with defaults, e.g. Supply "0" when a quantity is not given.
  • Filter out records that not needed in the target system.
  • Check validity of records, and ignore or report on rows that would cause an error if you tried to insert them.
  • Normalise data to remove variations that should be the same, e.g. replace upper case with lower case, replace "01" with "1".

Solution 3

Clean up, normalization, filtering, ... Just changing the data somehow from the original input into a form that is better suited to your use.

Solution 4

And finally there is the less savory practice of massaging the data by throwing out data (or adjusting the numbers) when they don't give you the answer you want. Unfortunately people doing statistical analysis often massage the data to get rid of those pesky outliers which disprove their theory. Because of this practice referring to data cleaning as massaging the data is inappropriate. Cleaning the data to make it something that can go into your system (getting rid of meaningless dates like 02/30/2009 because someone else stored them in varchar instead of as dates, separating first and last names into separate fields, fixing all uppercase data, adding default values for fields that require data when the supplied data isn't given, etc.) is one thing - massaging the data implies a practice of adjusting the data inappropriately.

Also to comment on the idea that it is bad to have an internal store if you are not changing any data, I strongly disagree with this (and I have have loaded thousands of files from hundreds of sources through the years. In the first place, there is virtually no data that doesn't need to at least be examined for for cleaning. And if it was ok in the first run doesn't guarantee that a year later it won't be putting garbage into your system. Loading any file without first putting it into a staging table and cleaning it is simply irresponsible.

Also we find it easier to research issues with data if we can see easily the contents of the file we loaded in a staging table. Then we can pinpoint exactly which file/source gave us the data in question and that resolves many issues where the customer thinks we loading bad information that they actually sent us to load. In fact we always use two staging tables, one for the raw data as it came in from the file and one for the data after cleaning but before loading to the production tables. As a result I can resolve issues in seconds or minutes that would take hours if I had to go back and search through the original files. Because one thing you can guarantee is that if you are importing data, there will be times when the content of that data will be questioned.

Share:
39,068

Related videos on Youtube

MrM
Author by

MrM

Updated on July 09, 2022

Comments

  • MrM
    MrM almost 2 years

    I am doing some reading, and came across avoiding an internalStore if my application does not need to massage the data before being sent to SQL. What is a data massage?