How Regular Expressions Allowed Text to Become Data
Ben Naval created a game to test your regular expression prowess.
If your company's product is discussed on Twitter, how can you understand what millions of people are saying about it? If you are a politics wonk, how can you analyze the words politicians use in hundreds of speeches, such as every State of the Union address?
The term "data" brings to mind reams of numbers in a spreadsheet. For many, data connotes profit and loss statements, population numbers or baseball statistics. But to an increasing degree, the data that companies and researchers most want to analyze comes in the form of text.
To answer questions like those described above, researchers can now analyze text in completely novel ways. Along with the advent of modern computing, the development of regular expressions (or regex) has made it possible to work with text much like data in a spreadsheet. While researchers have long been interested in text mining (the processing and analysis of text), it used to be impossibly time-consuming. Programming languages were not well equipped to work with text.
The emergence of regular expressions has enabled analysts to efficiently search for patterns in text and make sense of that information. This allows data scientists to understand the sentiment of tweets about a topic or product, social scientists to analyze political rhetoric, and anyone to make greater sense of our world.
Stephen Cole Kleene developed the foundational theory for regular expressions in the 1950s. Kleene is a father of modern computer science, and perhaps the most influential thinker about what makes for a "language" in mathematical terms. Though his work was abstract and philosophical, his theory of regular expressions led to a practical method of searching through text.
In the most basic terms, Kleene's insights led to language being symbolized mathematically. This made the ability to search through text documents on a computer dramatically more flexible and powerful.
Unlike numbers, text is messy. The word "color", for example, can also be spelled "colour". With the symbols that grew out of Kleene's work, a question mark in a search pattern means "zero or one of the preceding character". So the pattern "colou?r" matches both "color" and "colour". Symbols like this made it dramatically easier to search for patterns in data.
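To make the question mark concrete, here is a minimal sketch in Python using the standard `re` module. The sample sentence is made up for illustration, and the `\b` word boundaries are an addition not mentioned above; they simply keep the pattern from matching inside longer words.

```python
import re

# "u?" means zero or one "u", so the pattern covers both spellings.
# \b marks a word boundary (an addition for this sketch).
pattern = re.compile(r"\bcolou?r\b")

# Made-up sample text for illustration.
text = "The color red and the colour blue are both primary."

matches = pattern.findall(text)
print(matches)  # ['color', 'colour']
```

Because "?" allows at most one "u", a typo such as "colouur" would not match.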
Stephen Cole Kleene is the father of regular expressions and a giant of computer science and mathematics.
A little more than a decade after Kleene elucidated his theory, legendary computer scientist Ken Thompson put regular expressions into practice. Thompson, then at Bell Labs, was the first to implement regular expressions within a text editor, which allowed data analysts and programmers to search for complex patterns within text. This proved incredibly useful, and support for regular expressions was soon built into most major programming languages.
These days, regular expressions are essential for working with text data. They are used by data scientists and programmers alike for a variety of tasks.
Regular expressions help data scientists determine whether a body of text contains a certain pattern, and fix data that has been entered incorrectly, like a misspelled name. Regular expressions are also commonly used for web scraping. The HTML that makes up a web page is quite complicated, and the ability to parse that text with regular expressions makes it possible to collect information from web pages in a structured manner. The field of data journalism owes quite a debt to regex.
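As a rough illustration of fixing incorrectly entered data, the sketch below normalizes a few spelling variants of a name with Python's `re.sub`. The variants ("udemy", "Udemmy", "Udemi") and the sample sentence are hypothetical, chosen only to show the idea; real data would need a pattern fitted to the errors that actually occur.

```python
import re

# Hypothetical misspellings: any run of "m", ending in "i" or "y",
# in any letter case.
NAME_RE = re.compile(r"\bUdem+[iy]\b", re.IGNORECASE)

def fix_name(text: str) -> str:
    """Replace every matched variant with the canonical spelling."""
    return NAME_RE.sub("Udemy", text)

# Made-up sample text for illustration.
raw = "I took a udemy class, then another Udemmy course on Udemi."
cleaned = fix_name(raw)
print(cleaned)
# I took a Udemy class, then another Udemy course on Udemy.
```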
For programmers, one of the most common uses of regular expressions is validating emails, usernames and passwords. Usernames and passwords often have to be a certain length or include certain types of characters. Servers use regular expressions to check whether the text entered by a user meets those rules.
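A minimal sketch of this kind of check, in Python, might look like the following. The specific rules (a 3 to 16 character username of letters, digits and underscores; a password of at least 8 characters with a lowercase letter, an uppercase letter and a digit) are assumptions made up for the example, as are the function names.

```python
import re

# Assumed rule: 3-16 characters, only letters, digits and underscores.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,16}")

# Assumed rule: at least 8 characters, with at least one lowercase
# letter, one uppercase letter and one digit (checked via lookaheads).
PASSWORD_RE = re.compile(r"(?=.*[a-z])(?=.*[A-Z])(?=.*\d).{8,}")

def is_valid_username(name: str) -> bool:
    # fullmatch requires the whole string to fit the pattern.
    return USERNAME_RE.fullmatch(name) is not None

def is_valid_password(password: str) -> bool:
    return PASSWORD_RE.fullmatch(password) is not None

print(is_valid_username("regex_fan42"))  # True
print(is_valid_username("bad name!"))    # False
print(is_valid_password("Passw0rdOK"))   # True
print(is_valid_password("password"))     # False
```

Using `fullmatch` rather than `search` matters here: a partial match would let invalid characters slip through at the start or end of the input.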
To demonstrate the power of regular expressions, we thought it would be illuminating to use them to analyze our own social media presence.
Every day, thousands of people tweet about their experience with Udemy. Though we can easily search for Udemy mentions on Twitter, it is more difficult to figure out how many of these tweets are retweets, and which users who tweet about us get the most retweets. This information might help us identify key people talking about Udemy. Using regular expressions allows us to do that.
Using the programming language R and the Twitter API, we collected 3,200 tweets mentioning Udemy on September 18th, 2015 (3,200 is the limit for any given search from the Twitter API). We then used a regular expression to identify which of these tweets are retweets.
Essentially, this regular expression looks for all tweets that contain the terms "RT" or "via" and then some text with an @ sign. These are the most common patterns for a retweet. Using this expression, we discovered, for example, that the user @bobijanson retweeted a tweet by @AngularJSFan about a Udemy course on, not surprisingly, AngularJS.
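The idea can be sketched as follows. This is not the expression used in the analysis above, which was written in R; it is a Python approximation built from the description ("RT" or "via", followed by an @ handle). The pattern, the sample tweets and the handles are all made up for illustration.

```python
import re
from collections import Counter

# Assumed pattern: the standalone word "RT" or "via", optional
# non-word characters, then an @handle (captured in group 1).
RETWEET_RE = re.compile(r"\b(?:RT|via)\b\W*@(\w+)")

# Hypothetical tweets; none of these are real data.
tweets = [
    "RT @alice_codes: Just finished a great Udemy course on Python!",
    "Loving this Udemy class on design",
    "New Udemy course on AngularJS is live (via @bob_dev)",
    "RT @alice_codes: Udemy sale this week",
    "ART @not_a_retweet mentions Udemy",
]

def retweeted_user(tweet: str):
    """Return the handle being retweeted, or None if not a retweet."""
    match = RETWEET_RE.search(tweet)
    return match.group(1) if match else None

# Count how often each user was retweeted and list the top 10.
counts = Counter(
    user for user in map(retweeted_user, tweets) if user is not None
)
top = counts.most_common(10)
print(top)  # [('alice_codes', 2), ('bob_dev', 1)]
```

The word boundaries keep "RT" from matching inside a longer word such as "ART", which is why the last sample tweet is not counted.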
The following table displays the 10 users whose Udemy-related tweets were retweeted most that day.
The most retweeted user that day was @codek_tv, the Twitter handle for Code Geek, a group that teaches an online class on Python. The third most retweeted user, @realKarenPrince, is a young adult fantasy writer who teaches a Udemy class on the writing software Scrivener.
We can also look at the users who retweeted the most posts about Udemy.
We can also plot retweets about us on a social graph. The following visualization displays the people who retweeted the five Twitter users whose mentions of Udemy were most retweeted. For example, in the top right, we can see the names of the nine different users who retweeted @UpStartup.
All of this business intelligence is easy to obtain only because of regular expressions. Without them, we'd spend all day counting tweets by hand.
Whether for business, academia, or just for fun, the pattern matching and analysis made possible by regular expressions have expanded the world for everyone who works with text. Many data analysts are excited about the possibilities.
Analysts are no longer stuck working with boring old numbers. Thanks to regular expressions, the rich information that is only available in text is ours to explore.