Language and Big Data: Mapping social media with words
Location data may allow smartphone apps to use cell towers and GPS to pinpoint where you are in the world, but something less obvious in your tweets and status updates can give away where you are from: your word choice.
Text communication might remove accents from the equation, but some of the words we use in everyday speech are as telling as a Southern twang or a car parked at Harvard Yard.
Social media records everyday speech in digital format in a volume never seen before, and this wealth of real-time data has been a boon for computational linguists like Jacob Eisenstein of the Georgia Institute of Technology in Atlanta.
“Sometimes people write that social media is really noisy or random, and that’s something that I push back on a lot,” Eisenstein said at a conference held by the Stanford Institute for Research in the Social Sciences. “There’s a system of rules and constraints that govern language really at all levels, and that’s true of social media writing just as it’s true in any other form of linguistic expression.”
For real, for real
Using social media and GPS data, Eisenstein created a map of the United States that illustrates where certain words tend to be used the most. For example, “yinz,” a word that means “you all,” is concentrated in the Northeast around Pittsburgh, Pennsylvania. “Frfr,” an abbreviation meaning “for real, for real,” is most prevalent in the Southeast, especially Georgia and South Carolina.
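To give a rough sense of how a word can be tied to a place, the following Python sketch counts how often each word appears per region in a handful of geotagged posts and reports where a given word is relatively most frequent. It is only an illustration and not Eisenstein's actual method: the sample posts, the region labels, and the function names `word_rates_by_region` and `top_region` are all hypothetical, and real work would start from GPS coordinates and far more data.

```python
from collections import Counter, defaultdict

# Hypothetical sample data: (region label, post text).
# Real posts would carry GPS coordinates that first need mapping to a region.
POSTS = [
    ("Pittsburgh", "yinz going to the game tonight"),
    ("Pittsburgh", "are yinz ready"),
    ("Atlanta", "that was good frfr"),
    ("Atlanta", "frfr I am tired"),
    ("Atlanta", "going to the game"),
]


def word_rates_by_region(posts):
    """Return {region: {word: share of that region's words}}."""
    counts = defaultdict(Counter)
    totals = Counter()
    for region, text in posts:
        words = text.lower().split()
        counts[region].update(words)
        totals[region] += len(words)
    return {
        region: {word: n / totals[region] for word, n in counter.items()}
        for region, counter in counts.items()
    }


def top_region(rates, word):
    """Return the region where `word` makes up the largest share of words."""
    return max(rates, key=lambda region: rates[region].get(word, 0.0))


rates = word_rates_by_region(POSTS)
print(top_region(rates, "yinz"))
print(top_region(rates, "frfr"))
```

Comparing a word's share of all words in a region, rather than its raw count, keeps heavily populated areas from dominating simply because they produce more posts.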
Eisenstein noted that online dialect extends to more than the words people use in real speech. For example, he found that emoticons are used four times as often in the areas around Los Angeles. Eisenstein also pointed out that people sometimes spell words differently according to how they pronounce them, such as using “goin” instead of “going” or “dat” instead of “that.”
Studies such as the ones conducted by Eisenstein can help produce better systems for natural language processing, which is used for everything from speech recognition to spell check. The data could also be used to create better targeted advertising, tailoring the language used in ads to the location of the intended audience.