Saturday, March 5, 2016

Working with someone else's data

At a certain point in your data career (plain English: pretty near every day, and with nearly every piece of data you normally touch!), you will have the "opportunity" to work with data someone else has prepared, or at least accumulated, aggregated or in some way modified. As I create Forty-Two, most of the data I will be using is data which I have entered into (or created in) a table myself. There is one notable exception to this, though, and it is for me a moderately to severely painful one: Lego parts.

As mentioned in a previous post, my experience tells me that the best place for my Lego data is in a spreadsheet- at least initially. And although calculations can be done in Access, Excel is much easier for me. And, they can be linked at a later date.

However, there is a HUGE caveat to the Lego data. The title refers to someone else's data. In my case, I'm using peeron.com's parts list. It's the best one that I've found, and many other AFOL (Adult Fan of Lego) sites use it as a parts reference. The problem with the list and site is that they're horribly out of date. The list has a very standard naming convention, and the last update was nearly four years ago at the time of this writing. However, as it is the best, easiest to use and most complete list currently available, I've decided to use it. I can add newer parts as I find them on other websites.

The other issue with the Peeron data is that it is a .txt file. That's not bad when importing to Excel- just copy and paste. However, to make it usable, I have to manually edit the 18K+ rows of data. As I'm in no great hurry to finish this phase of the project, it's not too much of an issue. Still... I know a guy (as they say) who might be able to help. More on that later.

The final issue with Peeron is that its creator strove to make it a very complete parts list. As such, there's a great deal of data which I actually don't need- stickers and superseded part numbers are two types which come to mind immediately. So, if my guy can fix my primary issue, the other issues will be much easier to deal with. If not, then I still have a great deal of work ahead of me. UPDATE: I'm not really surprised- but also not horribly disappointed- that we were not able to fix the data.
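For anyone curious what a scripted version of this cleanup could look like, here is a minimal sketch in Python. It is not what was actually tried, and it rests on assumptions that are labeled in the code: that each line of the text file holds a part number and a description separated by a tab, and that unwanted rows can be recognized by the words "sticker" or "superseded" in the description. The function names and the sample rows are made up for illustration; the real Peeron file may be laid out quite differently.

```python
import csv
import io

# ASSUMPTION (hypothetical layout): one part per line, "part_number<TAB>description".
# ASSUMPTION (hypothetical markers): unwanted rows mention one of these words.
UNWANTED_MARKERS = ("sticker", "superseded")


def clean_parts(text):
    """Return (part_number, description) pairs, dropping unwanted rows."""
    kept = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a description survive.
        part_number, sep, description = line.partition("\t")
        if not sep:
            continue  # no tab found: not a part row under the assumed layout
        lowered = description.lower()
        if any(marker in lowered for marker in UNWANTED_MARKERS):
            continue  # drop stickers and superseded part numbers
        kept.append((part_number.strip(), description.strip()))
    return kept


def to_csv(rows):
    """Return the kept rows as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["part_number", "description"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample rows, only to show the shape of the input.
    sample = (
        "P001\tBrick 2 x 4\n"
        "P002\tSticker Sheet for Set X\n"
        "P003\tPlate 1 x 2 (superseded by P004)\n"
        "P004\tPlate 1 x 2\n"
    )
    print(to_csv(clean_parts(sample)))
```

Matching on keywords is crude: it would also throw away any part that merely mentions a sticker in its description, so the dropped rows would be worth a look before trusting the result.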

Before I forget, I wanted to post a brief update on the blog itself. I'm not quite sure when I did this last, but I think it was around the time the blog hit 10K viewers. As of today, the blog has over 15K viewers in fifty-five countries on six continents (c'mon, Antarctica!). Africa is represented by three countries, Asia by seventeen, Australia by one, Europe by twenty-seven (I'm counting Russia in the Europe column rather than Asia), North America by six and South America by one. What's most amazing to me is that of these fifty-five countries, I could only count seven where English was either the official language, a dominant language, or one of a group of commonly accepted languages.

To each and every reader- THANK YOU!

Last: a small compilation of my blog posts dealing with data (for those who are interested in how a small-time operator handles data):

http://hochspeyer.blogspot.com/2016/02/you-said-this-was-about-data-analysis.html
http://hochspeyer.blogspot.com/2015/06/data-science-pt-1.html
http://hochspeyer.blogspot.com/2016/02/a-database-against-rules.html
http://hochspeyer.blogspot.com/2015/05/data-defined.html
http://hochspeyer.blogspot.com/2015/02/forty-two-v7-or-so.html

As always, I am hochspeyer, blogging data analysis and management so you don't have to.


