Data Mining & Engineering, Part 1
Recently, I was invited by Wendy Parker and Tia Lerud to do a presentation for their Data Mining class, part of a University of Washington Professional and Continuing Education certification program in Business Intelligence. Students in this class are learning to use a variety of analytical tools to do hardcore analysis of business data. I am not a professional orator, so it was a slightly harrowing experience for me, but the class seemed engaged with the material, and it was a good opportunity to inject a little real-world experience into a practicum on data mining. I’ll summarize a portion of the presentation here, but you can also check out the slides that go with this post.
Articulating the Problem
Being able to articulate a problem is essential in any discipline. When experimentation and refinement are easy and fast, it’s simple enough to start with a nebulous idea and work inward from there. However, when data sets are enormous and a single experiment can take hours, days or weeks, it is important to reason thoroughly about a problem before going in. People can articulate problems with large data sets in a variety of ways. I’m fond of food, so I’ll use a food analogy to describe a couple of extreme hypothetical examples:
“Marketing analysts need to understand the impact of their campaigns and we can provide them an avenue to do so.”
This is the surf-and-turf of problem statements. One assumes that there is a specific goal in mind, and we think we know how to approach solving it. On the other hand, there’s this:
“We should totally Hadoop something!”
Where the first statement was steak and lobster, this is more like a knuckle sandwich.
When building out a feature set for analysis with predictive models, it is important to remember that while more features might yield a better-fitting model, those features must be scalable. It is entirely possible and sometimes desirable to fit a model against a 600- or 6000- (or 6 million-) dimension feature space, but how does that affect performance in a production system? How much is “good enough?”
If you want to be buddies with us engineers, you’ll need to understand what the scale issues are with the inputs to your model. Provide some alternatives. Communicate clearly about real and opportunity costs. Be aware that every complicated transformation, set of dummy variables, discretization or other computation has a performance impact on a running system. These are all fine things to try when building a research model, but if you hand it to me as a prototype and I’ve got to support tens or hundreds of millions of records per day, then we might have to find a compromise. You might be surprised at how much benefit can be had in a predictive model for relatively low cost.
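To make the cost conversation concrete, here is a minimal back-of-envelope sketch in Python. It counts how many columns dummy-variable encoding would add for a few categorical fields, then estimates how many single-threaded hours it would take to score a day of traffic. The field names, the sample records and the per-record scoring time are all hypothetical; only the "hundreds of millions of records per day" figure comes from the discussion above.

```python
from collections import defaultdict


def one_hot_width(records, categorical_fields):
    """Count the columns that dummy-variable encoding would produce per field."""
    distinct = defaultdict(set)
    for record in records:
        for field in categorical_fields:
            distinct[field].add(record[field])
    return {field: len(values) for field, values in distinct.items()}


def daily_scoring_hours(records_per_day, seconds_per_record):
    """Single-threaded hours needed to score one day of records."""
    return records_per_day * seconds_per_record / 3600.0


# Hypothetical sample data, for illustration only.
sample = [
    {"country": "US", "browser": "firefox"},
    {"country": "CA", "browser": "chrome"},
    {"country": "US", "browser": "safari"},
]

widths = one_hot_width(sample, ["country", "browser"])
total_columns = sum(widths.values())

# Assumed cost: 0.1 ms of feature computation and scoring per record.
hours = daily_scoring_hours(100_000_000, 0.0001)

print(widths, total_columns, round(hours, 2))
```

Even this crude arithmetic gives the modeler and the engineer a shared number to negotiate over: doubling the per-record cost doubles the hours, and every high-cardinality field widens the feature space by its count of distinct values.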
Third-party data can be a challenge. There are a lot of vendors who provide data for a variety of purposes and these can be incredibly useful. Indeed, if furnishing that data is not a core competency of your endeavor, vendors are the way to go. But be aware of the practical realities of business. Vendors can disappear. Data might become unavailable due to licensing issues. Terms might change. A vendor might fail to maintain its source or its APIs. At its best, a relationship with a data vendor can be painless and dead simple. At its worst, entrenchment in a vendor can impede your own growth.
There is a principle in modeling called Bonini’s Paradox, named for the Stanford professor Charles Bonini, that states, in a nutshell, “The more complete a model is, the less comprehensible, and therefore useful, it becomes.” For predictive systems, that might be taken to mean that the more replete the feature space and the more novel the mechanism used to fit the model, the less likely you are to be able to explain one of its predictions. It’s relatively easy to explain the classification outcome of a simple decision tree, but try explaining something labeled by your totally awesome neural net/SVM ensemble to an end-user. In fact, even the construction of the feature space can yield opaque outcomes, as is often the case in statistical text mining.
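To illustrate why a simple decision tree is easy to explain, here is a small Python sketch. The tree is hand-built and entirely hypothetical (the feature names, thresholds and labels are made up for this example), but it shows the point: every prediction comes with the exact list of tests that produced it.

```python
# A hypothetical, hand-built tree; feature names and thresholds are invented.
TREE = {
    "feature": "mentions_per_day", "threshold": 50,
    "left": {"label": "low interest"},
    "right": {
        "feature": "positive_share", "threshold": 0.6,
        "left": {"label": "negative buzz"},
        "right": {"label": "positive buzz"},
    },
}


def classify_with_reasons(node, record):
    """Walk the tree and collect the test applied at each split."""
    reasons = []
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        value = record[feature]
        if value <= threshold:
            reasons.append(f"{feature} = {value} <= {threshold}")
            node = node["left"]
        else:
            reasons.append(f"{feature} = {value} > {threshold}")
            node = node["right"]
    return node["label"], reasons


label, reasons = classify_with_reasons(
    TREE, {"mentions_per_day": 120, "positive_share": 0.4}
)
print(label)
for reason in reasons:
    print("  because", reason)
```

The explanation handed to an end-user is just the path from root to leaf. An ensemble of neural nets and SVMs has no comparably short story to tell, which is the trade-off Bonini’s Paradox describes.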
I don’t mean to suggest that there is no value in such systems. But the perception of value is often subjective. Knowing your audience can mean knowing where to draw the line when it comes to sophistication and nuance. Some customers and users are more ready than others to cope with the vagaries of complexity. As in any technical discipline, it is important to start out small and take diverse and calculated risks.
We’re just getting our feet wet with this post. Check back next Wednesday for more on production systems, text mining and sentiment.
In the meantime, you can read about how social media segmentation and monitoring can uncover clues to help your business.