For a while now I've wanted to work on a project to display information about the MBTA (Massachusetts Bay Transportation Authority) trains and buses by combining data from the MBTA themselves with tweets from the general public. This project would combine my love of Twitter, general bemusement with the adventure that is the MBTA, and interests in learning more about natural language processing and machine learning.
Browsing through recent tweets that mention the MBTA, I found that some seem potentially useful and straightforward in terms of the information they're trying to convey:
@MBTA is the blue line down again? We haven't moved from airport in 7 min— cannae (@BattleOfCannae) May 20, 2015
@MBTA_CR Outbound Fitchburg train is 10 min late getting to Porter Sq. Any update?— Keith Rubin (@krubinator) May 20, 2015
Some convey what could be relevant information, but are too vague to be of any use:
@MBTA no 66 bus at Harvard for 45 min during rush hour. Thanks! Hint: running 2 buses in a row won't help the dozens who are late— Chris Markle (@chrismneu) May 20, 2015
Some are relatively neutral; they share information about MBTA reform, links to news articles, or check-ins at train stations from other apps such as Swarm.
After scrolling for a bit, I popped the "MBTA" search query into a tweet sentiment analyzer and found something surprising:
The tweets were mostly positive.
As my examples above may indicate, almost all tweets about the MBTA are negative. Some are fairly neutral, as mentioned above in regard to links and the like, but positive tweets are few and far between. Why were they showing up as positive?
Luckily, this analyzer let me dig deeper and find out specifically which point corresponded to which tweet(s). Here's the annotated version:
- Shout out to @MBTA_CR for consistently being unreliable, delays even when it's 64 degrees out #pathetic (link)
- #mbta Route 42 experiencing minor delays due to disabled bus 09:24 PM (link)
- On the subway is not where I want to find out my deodorant is failing! #commutingwoes #mbta #greenline (link)
- @mbta, Any reason Ashmont line is moving so slowly tonight? No reports on the muffled speakers. (link)
- @MBTA That's for the awesome communication about the problem with the train. Oh. Wait. No. All we got was an announcement of "HOP OFF!"(link)
- Why did the #MBTA report need to be a "rapid diagnosis" when dozens of similar reports already existed? http://www.bostonglobe.com/metro/2015/05/18/signature-findings-report-flawed/WjJksRzY2MFrraEFRdl1xL/story.html (link)
- @Chernickator: @mbta hammered guy on car 01742 red line towards Braintree fell on top of me. Nearly took my head off.
- @MBTA countdown display not working at Brookline village outbound
- @GlobeMetro I wouldn't give the @MBTA a dime of my money. Corrupt hacks. (link)
- I love when baseball season starts and the green line fills up with people who have literally never been on an mbta train in their life. (link)
- So why isn't the MBTA doing something smart like this???? http://t.co/5jMXhn8zT (link)
- don't know why I give my money to the @MBTA (link)
- @EricaMouraNEWS Hello Erica. Could you pls forward us a car number, specific location, & direction of travel? We would like to investigate. (link)
- Love the #Converse kitted out Orange Line cars, they should always be this bright inside! #MBTA (link)
After looking through them, I realized that a large number of the tweets displayed in green on the sentiment analysis graph (examples 5–12) were not actually intended to be positive. Some seemed to be simply misclassified (for example, 8, 9, and 12), but many seemed to be sarcastic.
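One plausible explanation, assuming the analyzer does something like simple lexicon-based word counting (an assumption on my part; I don't know its internals): sarcastic tweets are often built entirely out of positive words. A toy sketch, with illustrative word lists rather than a real sentiment lexicon:

```python
# A toy lexicon-based polarity scorer -- a sketch of how a naive
# sentiment analyzer *might* work, not the actual tool I used.
# The word lists here are illustrative, not a real lexicon.
POSITIVE = {"awesome", "love", "great", "thanks", "bright"}
NEGATIVE = {"delay", "delays", "late", "unreliable", "corrupt", "pathetic"}

def polarity(tweet: str) -> int:
    # Crude tokenization: split on whitespace, strip punctuation, lowercase.
    words = [w.strip("#@.,!?\"'").lower() for w in tweet.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Example 5 above is sarcastic, but its only sentiment-bearing word
# ("awesome") is positive, so a word-counting scorer rates it positive:
print(polarity("That's for the awesome communication about the problem with the train. Oh. Wait. No."))  # 1

# A straightforwardly negative tweet scores negative, as expected:
print(polarity("Outbound Fitchburg train is 10 min late"))  # -1
```

Nothing in a scorer like this can tell that "awesome communication" is meant as the opposite of praise, which would explain the green dots.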
I wound up reading a lot about attempts to identify sarcasm in tweets, and found some really interesting articles. My favorites out of the papers I read today were "Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis" by Diana Maynard and Mark A. Greenwood, and "Identifying Sarcasm in Twitter: A Closer Look" by Roberto González-Ibáñez, Smaranda Muresan, and Nina Wacholder.
I'm definitely looking forward to playing around with this, but it's daunting. The tendency towards sarcasm and the ambiguous nature of many of the tweets may make this project extremely difficult. I'm hoping that starting it will at least give me a little more insight into natural language processing, machine learning, and all of the problems that are common in those two fields.
I'm also curious to see how other sentiment analyzers perform. Are there some that incorporate the possibility of sarcasm? On Twitter in particular, hashtags could be valuable signals (as mentioned in both articles above). Others may just be more accurate.
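If I end up building something myself, a cheap first signal, borrowing the hashtag-labeling idea from the papers above, would be to flag tweets whose authors self-label sarcasm with hashtags. A sketch; the particular tag list is my own guess, not one taken from either paper:

```python
import re

# Hashtags that sometimes self-label sarcasm. This list is an
# assumption for illustration, not drawn from either paper.
SARCASM_TAGS = {"sarcasm", "sarcastic", "not", "irony"}

def flags_sarcasm(tweet: str) -> bool:
    # Pull out hashtag words and check for any self-labeling tag.
    tags = {t.lower() for t in re.findall(r"#(\w+)", tweet)}
    return bool(tags & SARCASM_TAGS)

print(flags_sarcasm("Love waiting 45 min for the 66 bus #mbta #sarcasm"))  # True
print(flags_sarcasm("#mbta Route 42 experiencing minor delays"))           # False
```

Of course, the hard cases are exactly the tweets above: sarcastic but unlabeled. Self-labeled tweets are probably most useful as training data, which is how the González-Ibáñez et al. paper uses them.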