Wednesday, March 30, 2016

Data cleansing, or Purge #4

I had about 50% of what passes for a normal length post written when I decided to scrap it and start afresh. I think I've mentioned that I truly hate rewriting anything, but the post just wasn't going anywhere. More importantly, I had no logical transition to a data breakthrough.

For the past few weeks I've been working on my Peeron-based database (Working with someone else's data and Meanwhile, back in the Secret Underground Lair...). It had been slow going, so I mentioned it to a coworker, who shyly suggested that my solution might be found in Excel.

I thought about that for a bit. I'm not bad with Excel (and that statement alone could be the subject of SEVERAL blog posts!), so I did a bit of exploring. In Excel 2007, on the Data tab, resides "Text to Columns". As I recall, this feature has been available for some time, but this is the first time I've used it. Recapping my challenge: I've got a text file (the Peeron data) which has the part numbers I need, plus a lot of ancillary data which I may or may not need. Important part: my source data is a .txt file.
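
For anyone following along outside of Excel, the same split can be sketched in a few lines of Python (the function name is my own invention; the idea is just "break each row at the first run of whitespace", which is what Text to Columns did for me):

```python
import re

def split_part_row(row):
    """Split a raw Peeron-style row into (part number, description)
    at the first run of spaces/tabs -- the same idea as Excel's
    Text to Columns with whitespace delimiters."""
    pieces = re.split(r"\s+", row.strip(), maxsplit=1)
    part = pieces[0]
    description = pieces[1] if len(pieces) > 1 else ""
    return part, description

print(split_part_row("3001px20\tBrick 2 x 4 with Red Danger Stripes"))
# ('3001px20', 'Brick 2 x 4 with Red Danger Stripes')
```

Not that I'm abandoning the spreadsheet- the wizard already did the heavy lifting- but it's nice to know the manual step had a five-line answer all along.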

So, I copied the master file to create a new working file. I ran Text to Columns in the working file, and then copied the unmodified column A from the master in alongside the results. This accomplished a few things: it preserved Peeron's original numbering, as well as their descriptions.

Skipping a few steps, what I have now are four columns: column D is unmodified Peeron data, column C is a size column, column B is called "Base w/mods" and column A is the "Base" column. I plan to add at least one more column at some point, which will be a custom "internal" part number.

But... for now, I'm happy with my data, which is more than I've honestly been able to say about it for a few years! Seriously, this is one of those rubber-meets-the-road posts, where the plan actually comes together and progress is made! As promised, I've added another sheet to the workbook, and have begun pulling the extraneous entries out of the main spreadsheet, such as stickers and obsolete or moved part numbers.

Most of the remainder of the process is going to be manual, but the Text to Columns wizard probably saved me upwards of sixty hours of work, assuming 300 rows per hour with the old cut-and-paste method I had been using. And I say upwards, because I think I only achieved 300 rows per hour once; generally, I've been at 200 or fewer rows per hour.

I'm desperately trying to get this published. I've said on more than one occasion that sometimes blogs just seem to write themselves. Well, this is not one of those occasions.

I'm going to close with a few speeds and feeds. As best I can tell, there were 18,512 rows of data in the original Peeron file. With the new and improved format, I've culled out close to 900 rows of data which are of no use to me. I still have quite a way to go, but for the first time in some while I'm seeing that there is a pot of gold at the end of this particular rainbow.

As always, I am hochspeyer, blogging data analysis and management so you don't have to.


Monday, March 14, 2016

Meanwhile, back in the Secret Underground Lair....

Those kind folk who have been following this blog for a bit may be aware that our home office, the Secret Underground Lair (SUL), is undergoing a bit of a reconfiguration. That's actually a "bit" of an understatement.

Before going to work last Monday evening (3/7 or 7/3 depending on your calendar), Mr. T and I spent about an hour of quality time grunting, straining and sweating as we resituated eighteen cartons of books. While moving the books, I discovered a few things: not all of the boxes or plastic storage containers were properly or efficiently packed, and more importantly, there are MANY books in storage that I need access to, so we will be swapping some books out of storage and replacing them with books which really should be in storage. It was worth it, though: for the moment, we have TWO walkways in the office.

Mr. T tried some scenarios in Microsoft Visio, however, and is currently of the opinion that we will not be able to maintain this corridor if we want the best use of our limited space. I hope to do a bit of work in the Dungeon to free up some additional space. As we are on the cusp of Daylight Saving Time (ugh!), it's nearly midnight, so any more SUL work will take place tomorrow (Sunday 3/13 or 13/3!). And if anyone is doing the math, it has taken me seven days to get this far with this latest blog entry... overtime is great for our bottom line, but it definitely puts a damper on other things- like blogging!

Data!

In my previous post Working with someone else's data, I mentioned that I'm using peeron.com's data as the basis for my Lego data. I also indicated that although it's a great list, I have a good amount of work to do before it is useful to me. The following two lines are a sampling of consecutive rows copied from the .txt file, and illustrate quite well my "issues" with their data:

3001px2 Brick 2 x 4 with Eyes and Wavey Mouth Pattern   nose, nostril, snout
3001px20        Brick 2 x 4 with Red Danger Stripes and Two Horizontal Stripes Pattern

For starters, peeron.com has done a nice job of making a list of Lego parts. Their website is excellent, and their search engine is very nice. However, I'm not interested in their frontend, or even their instruction scans.

I'm cataloging and categorizing MY data, and their part number list is what I need. And although I've also mentioned before that this list is at LEAST four years out of date, up to date is not what I initially need. I need the basics first. And although Peeron's 18,511 rows of data may seem impressive to someone who isn't an Adult Fan of Lego (AFOL), for me the list is horribly incomplete, as I have purchased sets that were manufactured after 2012. It also has a lot of data that is extraneous for me, primarily part numbers for sticker sheets and parts which have been redesignated, renumbered or otherwise made obsolete. So, I need to do a bit of rework on their data.

The red bolded text above is two consecutive rows of data copied from their .txt file. Based on these two lines, I hope to show how much work I have cut out for me.

For starters, I took their .txt file and copied it into an Excel spreadsheet. My first task is to make two columns out of this data- one for the part number, and one for the description. I'm using 3001- this is the Lego designation for the ubiquitous 2x4 brick which nearly every human being has stepped on at least three times. The next few characters describe the variation/description of the brick. So far, so good.

First problem: I don't know of an easy way to separate the part numbers from the descriptions. I'm certain there's some software that can parse the data, find the first empty space, and move the next set of characters into a new column; until I find a software solution, I'm doing CTRL+X, CTRL+V to manipulate things.

Second problem: once they are separated, I can't sort these, because although px2 comes before px20, px20-23 come before px3 in Excel. And some of these numbers run into the hundreds. So, nearly every suffix will need to be reformatted to actually display in numeric order. To do this, I'm going to create a new part number that my database will use; if I ever create my own Lego website, only the original Lego part numbers will be used.
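
For what it's worth, the sorting problem also has a software answer: a "natural" sort key that treats runs of digits as numbers rather than text. A minimal Python sketch (the function name is mine; this is an alternative to the reformatted internal part numbers, not what I'm doing in Excel):

```python
import re

def natural_key(part_number):
    """Build a sort key that compares digit runs as numbers,
    so px2 < px3 < px20 instead of the plain text order
    px2 < px20 < px3."""
    return [int(chunk) if chunk.isdigit() else chunk
            for chunk in re.split(r"(\d+)", part_number)]

parts = ["3001px20", "3001px3", "3001px2", "3001px23"]
print(sorted(parts, key=natural_key))
# ['3001px2', '3001px3', '3001px20', '3001px23']
```

The same idea is behind the zero-padding I'm planning for the internal numbers: px002 sorts correctly as text precisely because the digits are forced to the same width.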

After the columns are created (and before I start tweaking the part numbers), I'm going to pull out the extraneous part numbers and put them in a separate worksheet. After that, the remaining data descriptions will get trimmed down and standardized with a few hits with Find and Replace.
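
That pull-out step can be sketched too. Assuming the rows are already split into (part, description) pairs, and assuming keywords like "sticker" and "obsolete" reliably mark the entries I don't want (the marker words below are my guesses, not an exhaustive list):

```python
# Assumed marker words; the real Peeron descriptions vary.
EXTRANEOUS = ("sticker", "obsolete", "moved to")

def partition_rows(rows):
    """Split (part, description) pairs into keepers and pull-outs,
    the way I'm moving stickers and dead part numbers to a
    second worksheet."""
    keep, pulled = [], []
    for part, desc in rows:
        if any(word in desc.lower() for word in EXTRANEOUS):
            pulled.append((part, desc))
        else:
            keep.append((part, desc))
    return keep, pulled

rows = [("3001px2", "Brick 2 x 4 with Eyes Pattern"),
        ("4210001", "Sticker Sheet for Set 8094"),
        ("3062old", "Brick 1 x 1 Round (obsolete)")]
keep, pulled = partition_rows(rows)
print(len(keep), len(pulled))
# 1 2
```

In the spreadsheet itself, a filter on those same words does the equivalent job.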

I've currently got 2,022 rows separated- roughly 11% of the first stage is complete. That's it for now.

As always, I am hochspeyer, blogging data analysis and management so you don't have to.


Saturday, March 5, 2016

Working with someone else's data

At a certain point in one's data career (plain English: pretty near every day, and with nearly every piece of data one normally touches!), you will have the "opportunity" to work with data someone else has prepared, or at least accumulated or aggregated or in some way modified. As I create Forty-Two, "most" of the data I will be using is data which I have entered (created) into a table. There is one notable exception to this, though, and it is for me a moderately to severely painful one: Lego parts.

As mentioned in a previous post, my experience tells me that the best place for my Lego data is in a spreadsheet- at least initially. And although calculations can be done in Access, Excel is much easier for me. And, they can be linked at a later date.

However, there is a HUGE caveat to the Lego data. The title refers to someone else's data. In my case, I'm using peeron.com's parts list. It's the best one that I've found, and many other AFOL (Adult Fan of Lego) sites use it as a parts reference. The problem with the list and site is that they're horribly out of date. The list has a very standard naming convention, but the last update was nearly four years ago at the time of this writing. However, as it is the best, easiest to use and most complete list currently available, I've decided to use it. I can update newer parts as I find them on other websites.

The other issue with the Peeron data is that it is a .txt file. Not bad when importing to Excel- just copy and paste. However, to make it usable, I have to manually edit the 18K+ rows of data. As I'm in no great hurry to finish this phase of the project, it's not too much of an issue. Still... I know a guy (as they say) who might be able to help. More on that later.
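
If the import ever gets automated, I suspect it will look something like this: read the .txt directly and let the rows fall apart into fields on their own. A Python sketch, assuming the file is tab-delimited (sample data is inlined here so the sketch is self-contained; the real file would be read with open()):

```python
import csv
import io

# Two rows in the Peeron .txt style, assumed tab-delimited.
sample = (
    "3001px2\tBrick 2 x 4 with Eyes Pattern\tnose, nostril, snout\n"
    "3001px20\tBrick 2 x 4 with Red Danger Stripes Pattern\n"
)

# csv.reader handles the delimiter splitting; skip any blank rows.
rows = [row for row in csv.reader(io.StringIO(sample), delimiter="\t") if row]
print(rows[0][0], rows[1][0])
# 3001px2 3001px20
```

If the file turns out to be space-padded rather than tab-delimited, the delimiter trick won't work and the "find the first empty space" approach is needed instead- which is exactly the manual editing I'm facing.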

The final issue with Peeron is that its creator strove to make it a very complete parts list. As such, there's a great deal of data which I actually don't need- stickers and superseded part numbers are two types which come to mind immediately. So, if my guy can fix my primary issue, the other issues will be much easier to deal with. If not, then I still have a great deal of work ahead of me. UPDATE: I'm not really surprised- but also not horribly disappointed- that we were not able to fix the data.

Before I forget, I wanted to post a brief update on the blog itself. I'm not quite sure when I did this last, but I think it was around the time the blog hit 10K viewers. As of today, the blog has over 15K viewers in fifty-five countries on six continents (c'mon, Antarctica!). Africa is represented by three countries, Asia by seventeen, Australia by one, Europe by twenty-seven (I'm counting Russia in the Europe column rather than Asia), North America by six and South America by one. What's most amazing to me is that of these fifty-five countries, I could only count seven where English was either the official language, a dominant language, or one of a group of commonly accepted languages.

To each and every reader- THANK YOU!

Last: a small compilation of my blogs dealing with data (for those who are interested in how a small-time operator handles data)

http://hochspeyer.blogspot.com/2016/02/you-said-this-was-about-data-analysis.html
http://hochspeyer.blogspot.com/2015/06/data-science-pt-1.html
http://hochspeyer.blogspot.com/2016/02/a-database-against-rules.html
http://hochspeyer.blogspot.com/2015/05/data-defined.html
http://hochspeyer.blogspot.com/2015/02/forty-two-v7-or-so.html

As always, I am hochspeyer, blogging data analysis and management so you don't have to.