Wednesday, March 30, 2016

Data cleansing, or Purge #4

I had about 50% of what passes for a normal length post written when I decided to scrap it and start afresh. I think I've mentioned that I truly hate rewriting anything, but the post just wasn't going anywhere. More importantly, I had no logical transition to a data breakthrough.

For the past few weeks I've been working on my Peeron-based database (Working with someone else's data and Meanwhile, back in the Secret Underground Lair...). It had been slow going, and I asked a coworker about it. My shy coworker suggested that my solution might be found in Excel.

I thought about that for a bit. I'm not bad with Excel (and that statement alone could be the subject of SEVERAL blog posts!), so I did a bit of exploring. In Excel 2007, on the Data tab, resides "text to columns". As I recall, this feature has been available for some time, but this has been the first time that I've used it. Recapping my challenge: I've got a text file (the Peeron data) which has the part numbers I need, plus much other ancillary data which I may or may not need.  Important part: my source data is a .txt file.

So, I copied the master file to create a new working file. In the old master file, I executed text to columns; in the new (working), and then copied column A to the working file. This accomplished a few things: it preserved Peeron's numbering, as well as their descriptions.

Skipping a few steps, what I have now are four columns: Column D is unmodified Peeron data, column C is a size column, column B is called "Base w/mods" and column A is the "Base" column. I plan to add at least one more column at some point which will be a custom "internal" part number.

But... for now, I'm happy with my data, which is more than I've honestly been able to say about it for a few years! Seriously, this is one of those rubber meets the road blogs,where the plan actually comes together and progress is made! As promised, I've added another sheet to the workbook, and have begun extracting the extraneous data entries from the main spreadsheet, such as stickers, obsolete and moved part numbers.

Most all of the remainder of the process is going to be manual, but the text to columns wizard probably saved me upwards of sixty+  hours of work, assuming 300 rows per hour with the old cut and paste method I had been using. And I say upwards, because I think I only achieved 300 rows per hour once; generally, I've been at 200 or fewer rows per hour.

I'm desperately trying to get this published. I've said on more than one occasion that sometimes blogs just seem to write themselves. Well, this is not one of those occasions.

I'm going to close with a few speeds and feeds. As best I can tell, there were 18,512 rows of data in the original Peeron file. With the new and improved format, I've culled out close to 900 rows of data which are of no use to me. I still have quite a way to go, but for the first time in some while I'm seeing that there is a pot of gold at the end of this particular rainbow.

As always, I am hochspeyer, blogging data analysis and management so you don't have to.


No comments:

Post a Comment