Gender Inference in Practice
Recently I wrote about MITRE Corporation’s research on gender inference from Twitter and the associated media coverage. To recap: while you can find signals about an author’s gender in what they write, a person’s name is the strongest single indicator of their gender, and analyzing their tweets provides only modest (although still useful) improvements.
I also mentioned we’re inferring author gender here at Visible. What I didn’t mention is that we’re doing this across all types of social media (e.g. Facebook, blogs, forums, reviews, comments) in addition to Twitter.
One place the actual text an author writes really helps is in situations where you don’t know the author’s name, which is quite common in forums (e.g. FlyerTalk, HowardForums, TechNet, or TheNest). I thought I’d share a little about our research on forums like these.
In forums you typically don’t know a person’s real name, but you do have their screen name (see the example table below). We set up an experiment on Mechanical Turk to see how well people could guess the gender of a person based solely on their forum screen name. Turkers (using the consensus of 5 workers) thought they could tell an author’s gender from the screen name alone about 60% of the time, and were correct in their assessment about 77% of the time. A little math tells you people could guess gender with about 66% accuracy using just the screen name: 77% on the 60% of names they felt confident about, plus a 50% coin flip on the remaining 40%. (Unlike MITRE, we’re using a balanced dataset, so guessing gets you 50% accuracy.)
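For anyone who wants to check the 66% figure, here is a minimal sketch of the arithmetic. It assumes that when Turkers could not tell, the guess is a coin flip at the 50% baseline of the balanced dataset; the function and parameter names are mine, purely for illustration.

```python
def blended_accuracy(coverage, accuracy_when_confident, chance=0.5):
    """Overall accuracy when confident guesses cover `coverage` of the cases
    and the remainder is resolved by guessing at the `chance` rate.

    Assumption: the unconfident cases are treated as random guesses.
    """
    return coverage * accuracy_when_confident + (1 - coverage) * chance


# 60% of screen names judged guessable, 77% of those guesses correct.
overall = blended_accuracy(coverage=0.60, accuracy_when_confident=0.77)
print(f"{overall:.3f}")  # 0.662
```

The confident guesses contribute 0.60 × 0.77 = 0.462 and the coin flips contribute 0.40 × 0.50 = 0.200, which sums to roughly 66%.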
Example screen names:
We also built models (using a machine learning approach similar to that of the MITRE researchers) by training on about 700,000 authors who had self-reported their gender in their forum profiles (about 3% of authors do this). By using n-grams, phonemes, and other features derived from the author name, we achieved significantly better accuracy than people could obtain with the same data.
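To give a feel for what "n-grams derived from the author name" can look like, here is a rough sketch of character n-gram extraction. It is not our production feature pipeline: the function name, the `^`/`$` boundary markers, the n-gram lengths, and the screen name `SkiMom42` are all hypothetical choices for illustration, and phoneme features are not covered.

```python
from collections import Counter


def char_ngrams(name, n_min=2, n_max=3):
    """Count character n-grams in a screen name.

    The name is lowercased and wrapped in '^' and '$' so that n-grams at the
    start and end of the name are distinguishable from those in the middle.
    """
    padded = f"^{name.lower()}$"
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


# Hypothetical screen name, not taken from our data.
features = char_ngrams("SkiMom42")
print(features["mom"], features["^s"], features["2$"])  # 1 1 1
```

Counts like these could be fed to a classifier as sparse features, which is where a substring such as "mom" gets a chance to carry signal that a human skimming the name may or may not pick up on.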
Next we added all posts written by an author (in the last year) to our models. This improved accuracy significantly over just using the screen name. Below are a few example word n-grams (the more understandable model features) that differentiate between males and females.
The Fast Company and other articles seemed fascinated by the gender signal in the text of tweets even though the marginal value (value above using the author’s name) was relatively low for Twitter. Hopefully I’ve demonstrated how analyzing an author’s writing in other contexts like forums could make a big impact on accuracy (in addition to being “intriguing”).
The MITRE report details future work in applying their research to other genres like forum comments and other demographic features like age. I’m happy to report we have pursued both of these with some success and they are both components of Visible’s current software. Feel free to hit me up if you’d like to know more about our research.