Monday, March 14, 2016

Meanwhile, back in the Secret Underground Lair....

Those kind folk who have been following this blog for a bit may be aware that our home office, the Secret Underground Lair (SUL), is undergoing a bit of a reconfiguration. That's actually a "bit" of an understatement.

Before going to work last Monday evening (3/7 or 7/3 depending on your calendar), Mr. T and I spent about a hour of quality time grunting, straining and sweating as we resituated eighteen cartons of books. While moving the books, I discovered a few things: not all of the boxes or plastic storage containers were properly or efficiently packed, and more importantly, there are MANY books in storage that I need access to, so we will be swapping some books out.of storage and replacing them with books which really should be in storage.  It was worth it, though: for the moment, we have TWO walkways in the office. Mr. T tried some scenarios in Microsoft Visio, though, and is currently of the opinion that we will not be able to maintain this corridor if we want the best use of our limited space. I hope to do a bit of work in the Dungeon to free up some additional space. As we are on the cusp of Daylight Saving Time (ugh!), it's nearly midnight, so any more SUL work will take place tomorrow (Sunday 3/13 or 13/3!). And if anyone is doing the math, it has taken me seven days to get this far with this latest blog entry... overtime is great for our bottom line, but it definitely puts a damper on other things- like blogging!

Data!

In my previous post Working with someone else's data, I mentioned that I'm using peeron.com's data as the basis for my Lego data. I also indicated that although it's a great list, I have a good amount of work to do before it is useful to me. The following two lines are a sampling of consecutive rows copied from the .txt file, and illustrate quite well my "issues" with their data:

3001px2 Brick 2 x 4 with Eyes and Wavey Mouth Pattern   nose, nostril, snout
3001px20        Brick 2 x 4 with Red Danger Stripes and Two Horizontal Stripes Pattern

For starters, peeron.com has done a nice job of making a list of Lego parts. Their website is excellent, and their search engine is very nice. However, I'm not interested in their frontend, or even their instruction scans.

I'm cataloging and categorizing MY data, and their part number list is what I need. And although I've also mentioned before that this list is at LEAST four years out of date, up to date is not what I initially need. I need the basics first. And although peeron's 18,511 rows of data may seem impressive to someone who isn't an Adult Fan of Lego (AFOL), for me the list is horribly incomplete as I have purchased sets that were manufactured after 2012. Also, it has a lot of data that is extraneous for me, primarily part numbers for sticker sheets and parts which have somehow been redesignated, renumbered or somehow made obsolete. So, I need to do a bit of rework on their data.

The red bolded text above is two consecutive rows of data copied from their .txt file. Based on these two lines, I hope to show how much work I have cut out for me.

For starters, I took their .txt file and copied it into an Excel spreadsheet. My first task is to make two columns out of this data- one for the part number, and one for the description. I'm using 3001- this is the Lego designation for the ubiquitous 2x4 brick which nearly every human being has stepped on at least three times. The next few characters describe the variation/description of the brick. So far, so good. First problem: I don't know of an easy way to separate the part numbers from the descriptions- I'm certain there's some software that can parse the data, find the first empty space, and move the next set of characters into a new column. So, until I find a software solution, I'm doing CRTL+X, CRTL+V to manipulate things. Second problem: Once they are separated, I can't sort these, because although px2 comes before px20, px20-23 come before px3 in Excel. And some of these numbers run into the hundreds. So, nearly every suffix will need to be reformatted to actually display in numeric order. To do this, I'm going to create a new part number that my database will use; if I ever create my own Lego website, only the original Lego part numbers will be used.

After the columns are created (and before I start tweaking the part numbers), I'm going to pull out the extraneous part numbers and put them in a separate worksheet. After that, the remaining data descriptions will get trimmed down and standardized with a few hits with Find and Replace.

I've currently got 2,022 rows separated- roughly 11% of the 1st stage is complete. That's it for now.

As always, I am hochspeyer, blogging data analysis and management so you don't have to.






No comments:

Post a Comment