Humans vs. AI in a Sentiment Bout. And the Winner is…
Let’s Get Ready to Rummmmble…
There is a mismatch between claims and perception around automated sentiment analysis. Like prizefighters, trash-talking vendors make boastful, escalating, and ultimately ridiculous claims about their accuracy. At the same time, according to Forrester research, sentiment accuracy is the least satisfying feature of social media listening platforms, with only a 45% satisfaction rate.
Sentiment evaluation is more complicated than it might first appear. [See our whitepaper on tricks vendors play to claim just about any accuracy level they choose, along with some best practices for your own evaluation.] Despite the challenge, let’s see if we can set up a “bout” that gives us some level of insight. (I’m hiding the methodology used in double-blind clinical trials behind a boxing metaphor to have a bit of fun with this post – I’m guessing you’ll let me know if it holds up.)
And in this Corner…
In the blue trunks we have our “Human” contestant, represented by Visible’s professional labeling practice. Visible has a very mature human labeling practice with large, dedicated teams of people who have been labeling social media posts for sentiment and other attributes for half a decade. We have the processes in place to do this with high quality (formal training for each project, multiple tiers of quality assurance, etc.) and have done this for thousands and thousands of projects, many for the top brands in the world.
And in the gold trunks we have our “AI,” represented by the automated sentiment analysis in our Visible Intelligence platform. In addition to our labeling practice, Visible also has state-of-the-art automated sentiment technology (you can read more about it here). We have a dedicated team of scientists on staff with deep experience in text analytics, natural language processing, machine learning, and information retrieval. Because of Visible’s extensive human labeling practice, our scientists have access to what is probably the deepest and broadest labeled corpus of sentiment data available anywhere, and they know how to leverage it.
Our fighters compete by labeling tens of thousands of social media posts (Twitter, Facebook, blogs, forums, news, etc.) with sentiment toward the reputation of a single major financial institution during a single month. Our Human fighter got to train in the venue beforehand (received formal training on the customer-specific problem), but our AI fighter didn’t (we used our production model, which was not trained on this data).
Scoring the bout is a panel of about 20 social media professionals. Together, these judges audit about one thousand posts from the fight. To keep things fair and unbiased, the referee ensures that each post is audited by two independent judges and that each judge audits an even mix of Human- and AI-labeled posts. Really stretching the metaphor, neither the judges nor the referee knows which posts are labeled by our Human and which by our AI (double-blind).
To the Scorecards…
After a 12-round slugfest… it’s a draw. There was no statistically significant difference in performance between the Human and AI labels (at the 1% significance level).
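For readers curious how a “draw” like this gets called, a two-proportion z-test is one standard way to check whether two agreement rates differ significantly. The counts below are purely illustrative (the post doesn’t break out the audit sizes per fighter); I’ve assumed roughly 500 audited posts per side at agreement rates near the reported 74%:

```python
import math

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical audit counts: judges agreed with 370/500 Human labels
# and 368/500 AI labels (both ~74%, mirroring the reported rates).
z, p = two_proportion_ztest(370, 500, 368, 500)
print(f"z = {z:.3f}, p = {p:.3f}")
```

With counts like these the p-value lands far above 0.01, so the difference wouldn’t be significant at the 1% level — consistent with the draw above.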
A closer look reveals that Sentiment is, well… subjective. Judges agreed with Human labels 74% of the time. They also agreed with our AI labels 74% of the time. Judges agreed with each other (remember, each post was labeled by two auditors) 73% of the time.
It’s interesting to look at the judge agreement in more depth. At least one of the two judges agreed with the Human label 91% of the time. I call this the optimistic accuracy (a customer that wants to give your label the benefit of the doubt might agree at this rate… “I guess I could see that”). Both of the two judges agreed with the Human label only 58% of the time. I call this the pessimistic accuracy (a customer who is looking for an excuse to complain about your label might agree at this rate… “see I told you it’s not much better than a coin flip”). Judge agreement with our AI was statistically the same as with our Humans, 91%/57%.
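The optimistic/pessimistic split falls out of simple counting once each post carries two judge verdicts. A minimal sketch (the per-post verdicts here are invented for illustration, not real audit data):

```python
# Each tuple: (judge_1_agrees, judge_2_agrees) for one audited post.
# These verdicts are made up for illustration only.
audits = [
    (True, True), (True, False), (False, True),
    (True, True), (False, False), (True, True),
]

n = len(audits)
per_judge = sum(a + b for a, b in audits) / (2 * n)  # overall judge agreement
optimistic = sum(a or b for a, b in audits) / n      # at least one judge agrees
pessimistic = sum(a and b for a, b in audits) / n    # both judges agree

print(f"per-judge: {per_judge:.0%}, "
      f"optimistic: {optimistic:.0%}, pessimistic: {pessimistic:.0%}")
```

A tidy side effect of the counting: the per-judge rate is always the average of the optimistic and pessimistic rates, which is why the 74% figure sits almost exactly midway between 91% and 58%.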
Announcer: Human, Human, can we get a moment? Were you happy with your performance in the fight?
Human: I’m a little disappointed. I thought I was the smarter fighter. I knew going in that AI’s speed and stamina were untouchable. I’m almost dead on my feet and I think he could go another hundred rounds. I have to give AI credit… I was surprised how often he effectively dodged humor, irony, and informal language.
Announcer: Do you think you could win in a rematch?
Human: I might try a James T. Kirk approach… “Everything I say is a lie… now listen closely: I am lying”. I also think I could beat him on opposite day.
Announcer: AI, AI, your colleague Watson got a lot of recognition on Jeopardy. He has a cooler name, a fancy avatar, and a lot more street cred. Are you jealous of Watson?
Announcer: Um, okay, care to elaborate?
AI: 69% Positive
Announcer: Moving on, despite winning Jeopardy handily and easily answering some very “human” questions (the “church lady” from SNL), Watson occasionally gave some really strange answers: putting Toronto in the US, or answering ‘Kosher’ to ‘What do grasshoppers eat?’. Despite the props Human gave you for dealing with tricky human sentiment, you gave occasional weird answers too. Do you think you could pass a Turing test?
Announcer: Okay, let me guess. You feel “Mixed” about a rematch? Do you have anything to say to your fans that isn’t a sentiment label?
AI: Binary solo: 0-0-0-0-0-0-1…