Saturday, April 2, 2016

This IS normal(ized) in my world

I've mentioned once or twice (I think) that I'm not a professional DBA; I do this stuff for fun! (??? ORLY?)

Yup, 'tis true: it's something of a hobby, or repast, or possibly a fairly intense interest. In any event, this interest has made me money before, and hopefully will do so again. This particular post is especially ironic from a few angles: for starters, I've also mentioned that I try to do as much normalization of a database right from the start, and if you read the last post AND/OR you're a dba or involved with databases, you'll note that I've pretty much thrown out all of the data normalization conventions for this flat database. The reason is pretty simple: I know my client (me!), and I know what data WILL be needed, and what data MIGHT be needed. Also, in the end, this Excel workbook is still data.

So, to recap, I have a .txt file that I have converted into something resembling a flat database. Three of the current four columns have repetitive (if not identical) data. Why?

The answer is simple in my context, but if this were someone trying to apply a fix to an existing problem for, say, a paying client, and trying to explain that you had to create repetitive data to make the resulting database entries more efficient... well, not necessarily so simple.

I was talking with one of my coworkers earlier this week, who is a data guy at heart. A month or so ago, he had taken a stab at trying to extract those target part numbers from the text file and had dissatisfying results. The other day, I showed him the new and improved .xlsx data file, and warned him that this was something akin to anti-normalization. When he looked at the file, he cringed in agreement. However, once I explained exactly why I had created columns with duplicate or near duplicate data, he agreed with my process. And therein lies the proverbial "rub": even though I've formatted this .xlsx file as a flat database (which came to me as a .txt file, which I'm guessing was extracted from some other format), it isn't a database- even a flat one.

It's a dataset. Period. And as it is a dataset, it needs cleansing, not normalizing.

Now, having said that, the casual reader is probably wondering how this dataset will be utilized. Well, THIS is where it gets interesting (ORLY??). Before I can import ANY data into the database, it needs to be usable. In my world, this means I need to be able to count pieces by size, color, family and type; in other words, I need to make this usable by Forty-Two (the master database). So, once I clean up the data, my goal is to import all of it into Access as a part of the inventory.

That day is still far away.

As always, I am hochspeyer, blogging data analysis and management so you don't have to.

No comments:

Post a Comment