Wednesday, May 13, 2015

Data, defined

Although it is not my intent, I am certain that this post has the potential to step on a few toes, possibly bruise an ego or two, or ruffle some feathers. I may even get someone mad. Really e-mad.

For starters, I do not have any letters, diplomas or certifications, and I am not currently professionally employed in whatever one might consider the "data community". Whatever that might be. I do not claim to be an expert or have any special expertise or training in the areas of Big Data, the Internet of Things/Everything, Statistics, Analytics or The Cloud. I was once employed as a data analyst working with Small Data for a short time.

Whew!

So, who and what exactly am I?

I'm a guy who tweets (and retweets) primarily on the subjects of Big Data, IoT, programming and related topics. As far back as high school- maybe even earlier- I've been interested in data. It was either my music collection or Fletcher Pratt's Naval Wargame that gave me my start in classifying and quantifying. I remember even attempting to do a few music surveys way back when, and some of the respondents were unhappy because the polls were not simple popularity contests: the answers were weighted based upon their position on the poll. Fast forward to today. I'm currently building a flat database of my Lego collection in Excel 2007 (why 2007? Because that's what I have on the computer nearest to the Legos!). This, in turn, will be added to my master database Forty-Two- so named because it answers the question of Life, the Universe and Everything.
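
For the curious, a position-weighted poll of the kind mentioned above might be tallied something like this. It is only a sketch: the point scheme (first place earns the most points, much like a Borda count), the function name and the made-up ballots are all assumptions for illustration, not the rules of the original surveys.

```python
from collections import defaultdict


def tally_weighted_poll(ballots, slots=3):
    """Score ranked ballots so that higher positions earn more points.

    Assumed (hypothetical) scheme: with `slots` positions per ballot,
    position 1 earns `slots` points, position 2 earns `slots - 1`, etc.
    """
    scores = defaultdict(int)
    for ballot in ballots:
        for position, choice in enumerate(ballot[:slots]):
            scores[choice] += slots - position
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Made-up ballots: each list is one respondent's ranking, favorite first.
ballots = [
    ["Album A", "Album B", "Album C"],
    ["Album B", "Album A", "Album C"],
    ["Album C", "Album B", "Album A"],
]

results = tally_weighted_poll(ballots)
for choice, score in results:
    print(choice, score)
```

Each album gets exactly one first-place vote here, so a simple popularity contest would call it a three-way tie. Once position counts, Album B comes out ahead because nobody ranked it last.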

Having said ALL of that, I'd like to start off by saying that the term "data" may not be as concrete as we are led to believe. In my world, data comes in the following flavors: Big Data, Not-So-Big Data, Small Data, Micro Data, and Statistics. Depending upon the size of the dataset(s) and one's perspective, most- if not all- data can fit into more than one classification. Really? Sure. Case: say there's a hypothetical high school senior who is one of the stars of his basketball team. He's a good defender, doesn't commit many fouls (below the league average), and is about average in scoring- except he leads the league in free throw percentage. Several colleges and universities are interested in him- they've got data on this fellow going back to 6th grade. That's data- to them. To me, a person who couldn't care less about basketball, it's nothing more than a bunch of irrelevant stats. On the other hand, these same scouts would not be impressed by the number of PhDs that follow me on Twitter.

So, how big is a Big Data dataset? I asked a coworker. He wasn't sure, but thought a mail list might qualify. Don't laugh too soon- some of the mail lists I've seen have more than 10 million names. To me, though, I'd put that in the Not-So-Big Data or Small Data categories. The IoT, Amazon, Google, YouTube and Wikipedia definitely fit into the Big Data category, but to the average person, these can be tough to visualize. So, for what I think might be a decent, understandable Big Data dataset, I propose the 2010 U.S. Census. It was a 10-item questionnaire (with a few extra answers possible) that was mailed to 135,000,000 addresses representing approximately 309,000,000 persons.

Small Data could be a database, a website or the phone directory of a small- to medium-sized city- the lines are pretty fuzzy here.

Lastly, there's Micro Data. I'm not sure if this term is used anywhere else, but I find it to be a convenient term for personal data- data generated and maintained by one person or one family for their own use and not often formally shared. A cataloged collection of coins, stamps, recipes, exercise/workout logs or Legos- all of these are Micro Data in my worldview.

Thanks for your patience- I hope you enjoyed this. I generally write a lot less... I'm not a fan of writing or reading walls of words!

As always, I am hochspeyer, blogging data analysis and management so you don't have to.
