Social Media Data Quality, Sampling and Coverage: Imagine a Box of Marbles…
Imagine a box of marbles, filled to the brim with all different kinds, from aggies to micas, steelies to cat’s eyes. Imagine you want to ask a question like, “What is the proportion of steel marbles to glass marbles?” or “How many clay marbles have blue stripes?” To answer these questions, you don’t necessarily have to count every marble. In fact, you could randomly pull marbles out of the box, essentially sampling that world, and get to answers with pretty high confidence in relatively short order.
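To make the random-pull idea concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the box contents (30% steel, 70% glass), the sample size, and the helper name `estimate_proportion` are all made up. It estimates the share of steel marbles from a simple random sample and attaches a rough 95% margin of error using the normal approximation.

```python
import math
import random

def estimate_proportion(box, kind, sample_size, rng):
    """Estimate the share of `kind` in `box` from a simple random sample."""
    sample = rng.sample(box, sample_size)  # draw without replacement
    p_hat = sum(1 for marble in sample if marble == kind) / sample_size
    # Normal-approximation 95% margin of error
    # (ignores the finite-population correction, so it is slightly conservative)
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / sample_size)
    return p_hat, margin

rng = random.Random(42)  # fixed seed so the run is repeatable
box = ["steel"] * 3_000 + ["glass"] * 7_000  # hypothetical box: 30% steel
p_hat, margin = estimate_proportion(box, "steel", 1_000, rng)
print(f"estimated steel share: {p_hat:.3f} +/- {margin:.3f}")
```

Counting 1,000 marbles instead of 10,000 should land within a few percentage points of the true share, which is the "pretty high confidence in relatively short order" trade-off. The catch is that it only works because every marble had the same chance of being picked.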
Now imagine the box is 200 feet high, with layers of 10,000 quarter-inch marbles stacked on top of each other, and every day the box grows while a machine constantly fills it to match its new size. Now ask the same questions. It’s a little harder, because what’s being added may not have the same distribution of marbles as what’s already there. And if the box grows by 200 feet a day, within a month it could be more than 6,000 feet high and still growing.
And now imagine you only get to measure a 4-foot storage box of marbles grabbed from that marble flood and, again, ask the same questions. Are you able to get the same answers? How sure are you? How did that storage box get filled? If you ask for the proportion of blue to red, the answer will likely change depending on whether the marbles were pulled out using a magnet, a scooper off the top, a spigot at the bottom or even a magical random marble collector. Unless you understand how your own storage box was filled, a hidden bias may skew your answers.
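The effect of the collection method can be simulated with a short Python sketch. The setup is entirely hypothetical: the box is built bottom to top with newer marbles skewing blue, and red marbles are assumed to be steel more often than blue ones, so a magnet favors red. None of these numbers come from real data; they only illustrate how the same question gets different answers depending on how the storage box was filled.

```python
import random

rng = random.Random(7)  # fixed seed so the run is repeatable

def make_box(n=20_000):
    """Build a hypothetical box, bottom to top; newer marbles skew blue."""
    box = []
    for i in range(n):
        p_blue = 0.2 + 0.6 * i / (n - 1)  # 20% blue at the bottom, 80% at the top
        color = "blue" if rng.random() < p_blue else "red"
        # Assumption: red marbles are steel more often, so a magnet favors red
        p_steel = 0.1 if color == "blue" else 0.5
        material = "steel" if rng.random() < p_steel else "glass"
        box.append((material, color))
    return box

def blue_share(marbles):
    """Fraction of marbles in the collection that are blue."""
    return sum(1 for _, color in marbles if color == "blue") / len(marbles)

box = make_box()
k = 1_000  # size of the storage box
steel_only = [m for m in box if m[0] == "steel"]
samples = {
    "whole box": box,
    "random": rng.sample(box, k),
    "scoop off top": box[-k:],      # newest marbles only
    "spigot at bottom": box[:k],    # oldest marbles only
    "magnet": rng.sample(steel_only, k),
}
shares = {name: blue_share(marbles) for name, marbles in samples.items()}
for name, share in shares.items():
    print(f"{name:>16}: {share:.2f} blue")
```

Under these assumptions, only the random collector tracks the whole box. The scoop off the top overstates blue, while the spigot and the magnet both understate it, each for a different reason.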
When it comes to your social media program and using social data to inform business decisions, asking these questions about your data and understanding how it’s collected is critical. For example, what if you’re asking whether Pepsi owns a larger share of voice than Coke in Twitter mentions during lunch hour? Well, it’s a lot like counting marbles. When analyzing social media data, you need to ask yourself what you’re measuring and how you got the data that you’re measuring. We’ve been spending so much time talking about sentiment, influence, share of voice, ROI and all the interesting metrics, and not enough time talking about data sampling, quality and coverage. We’re making decisions based on the answers we get, but have we spent enough time thinking about the biases inherent in how we get this data? It’s a difficult problem that has a tendency to make you lose your marbles. 🙂