Data mining & engineering, part 2
In last week’s post, I covered a bit on articulating a problem in data mining and feature engineering. This week, we’ll go further into my UW presentation and talk about production systems, with examples in text mining and sentiment. Again, all of this can be seen visually in my SlideShare presentation or, even better, in person next Wednesday at the Seattle Meet-Up.
Production Systems: Prototypes, Expansion and Maintenance
“Productionalize” (or “productionize”) is a fancy verb that means “to bring something into production.” The word brings to mind a factory for automobiles, plastic bottles or ball-point pens. In the less tangible case of software systems, it’s a word you might hear someone use to talk about taking a prototype and building it out into a fully capable, scalable production solution. In the more specific case of analytics systems, the product is information, extracted from raw data and made usable.
Regardless of how little affection you might have for fifty-cent jargon like “productionalize,” you will need to be ready to attend to the problems found in production systems. In the paragraphs about feature engineering from last week’s post, I emphasized the importance of compromise, concision and apparent value. The merit of those qualities may not be apparent when doing initial research, but as the demand upon a system grows, every source of inertia becomes a point of strain in the running system.
Striking an appropriate balance is challenging and often ongoing. The temptation to build an awesome system may result in greater maintenance challenges three months from now. But approaching a problem too simplistically is hazardous for its lack of novelty, and thus value. Where the bar is set comes down to what your organization – and your customers – can support in terms of infrastructure and expertise.
- Data drift: How do patterns in your source data change over time? In text mining, examples of this occur every time a new slang word, meme, emoticon or idiom is introduced. Models fitted on corpora that do not include these features will drift into obsolescence.
- Bit rot: Does anybody even remember how to refit the model using the scripts or software that were used for the one you’ve got running right now? Did that research infrastructure survive when your company moved its data center? Are there missing third-party libraries that haven’t been available online in years? Will any of it even compile?
- Split maintenance: How do you keep the models used for research and prototyping in sync with those used in production?
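Data drift, in particular, lends itself to simple monitoring. As a minimal sketch (the function names, toy vocabulary and the 0.2 threshold are all illustrative assumptions, not from any real system), you can watch the rate at which incoming documents use words your model has never seen:

```python
def oov_rate(document, vocabulary):
    """Fraction of a document's tokens absent from the training vocabulary."""
    tokens = document.lower().split()
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in vocabulary)
    return unknown / len(tokens)

def is_drifting(documents, vocabulary, threshold=0.2):
    """True when the average out-of-vocabulary rate suggests the corpus has drifted."""
    rates = [oov_rate(d, vocabulary) for d in documents]
    return sum(rates) / len(rates) > threshold

# Toy vocabulary and documents: new slang pushes the OOV rate up.
training_vocab = {"love", "hate", "new", "truck", "bank", "lines"}
fresh_docs = ["love new truck", "smol doggo heckin bamboozled"]
```

A rising out-of-vocabulary rate is a cheap early-warning signal that the corpus your model was fitted on no longer resembles the data flowing through production.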
Over time, a production system will face demands that change its original behavior: new business rules that change the shape of the data already coming in, or new streams of data that are different but serve the same purpose. For instance, a model that guesses the language of written documents might work very well for most inputs but need to be overridden in certain circumstances, because edge cases create bad customer experiences. Those special cases, exception lists and other surrounding logic tend to accrete and very seldom go away, which means that your model depreciates as soon as it hits the real world, much like a car depreciates in value as soon as it leaves the dealer’s lot.
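The accretion pattern looks something like this in code. This is a hypothetical sketch: `guess_language`, the override entries and the example documents are invented for illustration, standing in for a real model and a real exception list.

```python
def guess_language(text):
    """Stand-in for a real language-identification model."""
    return "es" if "gracias" in text.lower() else "en"

# The exception list that grows over time as edge cases surface in
# production -- each entry is a document the model got wrong.
OVERRIDES = {
    "gracias amigo, great service!": "en",  # mixed-language post a customer flagged
}

def classify_language(text):
    """Check accumulated overrides before falling back to the model."""
    return OVERRIDES.get(text.lower(), guess_language(text))
```

Each entry in that dictionary is a small, permanent tax on the system: it must be carried forward through every refit and every migration, which is exactly the depreciation described above.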
Example: Text Mining and Sentiment (or Tone)
In social media, opinions are expressed constantly. Most of these opinions may be said to have some kind of tone associated with them: “Jane loves her new pickup truck,” “Joseph dislikes the lines at his local bank branch,” “Jeremiah is ambivalent about a new offering at a fast food chain.” Detecting tone between the lines of text is a challenge that may be approached in different ways, making it an ideal example of when to evaluate compromise for a production system.
Natural Language Processing, or NLP, techniques are a common and well-known set of approaches to analyzing text for tone and sentiment. These techniques tend to rely upon what are called “probabilistic grammars” – a set of rules that determine the nature of a sentence by constructions of the various parts of speech that comprise the written language. NLP can be very powerful, but it is also costly and slow. Grammars for NLP engines may be available from vendors (though you’re more likely to buy the NLP engine itself and receive the grammars as part of the vendor’s solution). While powerful, the difficulty of working with grammars in social media might be prohibitive due to the frequency with which online written language drifts.
If you need to run tens of millions of documents per day through a text mining system and want to do it in a way that brings value while keeping costs down, vector space modeling might be more advantageous. We certainly make use of this at Visible. These models are sometimes called “statistical,” although this terminology is more of a branding convenience: any system that processes written language may be called a “natural language” system, and any system that deals in interpreting results based upon empirical evidence (including grammars) may be called “statistical”.
Vector space modeling diverges from traditional NLP modeling in that rather than attempting to understand language from the bottom up, it approaches classification problems from the top down. Typically, you start with a set of documents that have been labeled by humans. Fitting a model for these labels entails breaking the documents apart into individual words (or portions of words) and computing their participation in a label.
For instance, the word “hate” probably appears more often in documents that have been labeled as “Negative,” whereas “love” appears more in documents labeled “Positive”. After the model is fitted, new documents that have these words in them will tend to be classified with the same label as the original hand-scored documents. Sometimes the results are as simple as these two extremes, but it is much more common for a word to fall somewhere in the middle. When the scores of all words in a document are tallied up, the system makes a prediction about how it thinks a human might classify the document. This prediction becomes the document’s classification (in Visible’s model, this is one of Positive, Negative, Neutral or Mixed).
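The fit-then-tally idea above can be sketched in a few lines. This is a toy illustration, not Visible’s model: the corpus is invented, and smoothed log-odds is just one reasonable way to turn word counts into per-word scores.

```python
import math
from collections import Counter

def fit(labeled_docs):
    """Count how often each word participates in each label."""
    counts = {"Positive": Counter(), "Negative": Counter()}
    for text, label in labeled_docs:
        counts[label].update(text.lower().split())
    return counts

def classify(text, counts):
    """Tally each word's smoothed log-odds; the sign of the sum decides the label."""
    score = 0.0
    for word in text.lower().split():
        pos = counts["Positive"][word] + 1  # add-one smoothing for unseen words
        neg = counts["Negative"][word] + 1
        score += math.log(pos / neg)
    return "Positive" if score > 0 else "Negative"

# Hand-labeled toy corpus, standing in for human-scored documents.
corpus = [
    ("i love this truck", "Positive"),
    ("love the new branch", "Positive"),
    ("i hate waiting in lines", "Negative"),
    ("hate the long lines", "Negative"),
]
model = fit(corpus)
```

Words seen only in one label pull the score strongly toward it, words split across labels contribute little, and unseen words contribute nothing, which mirrors the “somewhere in the middle” behavior described above.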
The outcomes of these two types of text analysis can differ widely on individual documents, but remain quite similar in the aggregate. The advantage of vector space modeling over grammatical modeling is that it is tremendously cheaper for an engineering organization to implement and maintain. It gets labeled documents into the hands of customers faster and with similar aggregate results. Grammatical methods are often more precise, but have high costs both for entry and maintenance.
Whatever problems you are trying to solve as business analysts, be mindful of the long-term cost of your approach. Remember that your model needs to pay for itself someday, which might mean prioritizing low cost over precision. You and your customers might be surprised at how much value can be had for relatively little investment. Inexpensive predictive systems, easy to build and maintain, can go a long way toward narrowing an enormous search space, getting relevant information into the hands of your customers quickly enough to give them the edge.