Text analytics – techniques,language and opportunity

Abstract

This article takes a look at the many ways organizations are benefiting from text analytics and tries to clarify the language surrounding the process of turning text into data. Real life examples gathered from interviews and desk research illustrate how important text has become to understanding what is happening inside organizations. Increasingly, text and data are being used together to identify and solve business problems. This article provides an overview of the technologies behind analyzing text, practical applications of the field, and a guide to language used to describe the processes to conduct and use text analytics.

Keywords

Data analytics natural language processing predictive analysis rule-based systems sentiment analysis text analytics text analysis text mining

Introduction

Text analytics is maturing to be an essential component of organizational armoury in many sectors. Understanding its current position and potential is now essential for anyone working in the information field. My investigations to assess its status have drawn on my extensive experience as a journalist and my own experience while at McClatchy-Tribune (MCT), where I created a large rule-based classification system for news articles. My research for this article relied heavily on interviews with text analytics professionals and experts. I also conducted an extensive review of articles, blogs, and social media posts about text analytics. Its results confirm the importance of text analytics and its potential for those prepared to acquire new skills.^1,2

Beware the questions you ask and the information you use to answer them

Allen Thompson, Senior Vice President Corporate & Commercial Analytics & Reporting, Bank of America needed to understand the root causes driving customer responses to a survey about local bank branches. First he looked for common themes among the survey responses and comments, but this yielded no significant findings. Next Thompson decided to segment the customers and look at data about them, including ATM visits, bank products consumed, and cheques cashed. ‘We were tying what people did at branches to what they were saying in surveys,’ Thompson said. What they discovered was that ‘a huge percentage’ of the people who gave negative feedback about their local branches were paying some sort of fee.³

Thompson’s use of text analytics to solve a business question highlights a trend in the field. The traditional definition of text analytics is the process of deriving high quality information from text by turning text into data through the use of machines and software. Increasingly, practitioners such as Thompson are looking at text-as-data with numerical data to answer questions.

Meaning can be hard to find, just ask Camus

At its core, text analytics is breaking a stream of text into meaningful words or phrases, but meaningful is a relative term – how does one decide or discover just what information is important or meaningful? Some say that one way is to use text mining, which counts and groups words in various ways and looks at the pattern of word use within documents. This can tell you something about the document, according to Tom Reamy,⁴ founder of the KAPS group and consultant on text analytics and information architecture, ‘but none of that has anything to do with the meaning of those words, and text analytics deals with the meaning of words.’

The use of text analytics is particularly difficult when language is so fluid – words may have many different meanings, and there are many ways to describe the same thing. Samantha Zhang at PayPal deals with this reality every day when handling customer feedback that comes from surveys and PayPal’s websites. ‘Sometimes, on our more important issues, our customers don’t know how to articulate their views to us, and they won’t always use the same language,’ she said. This word use becomes a problem when trying to track issues across channels. Zhang and her team address this by using a combination of computer help and human investigation. ‘The software gives us clues to show us how we can look deeper, but we still have to do searches to find information we need,’ Zhang said.⁵

For Thompson, the words in the survey about local banks became meaningful after being looked at in conjunction with data. In fact, Thompson said, the surveys alone were meaningful only in that they expressed an opinion about something. Since the surveys were limited to the single subject, people were able to respond to questions about the local branches – with no option to talk about fees. This connection became meaningful only when a person asked the right question and looked to text plus data to answer it.

Just how does it do that?

The text analytics industry seems to have stalled at this computer-plus-human equation. Reamy agrees:

There seems to be a trade-off between software that needs to be easier to use but that can’t ever get any better and software that takes a lot of start-up and development time that can achieve a higher accuracy.

Generally speaking, the software that is easiest to operate uses natural language processing, which, as its name would indicate, tries to understand and interpret human language for computers. Natural language processing is based on machine learning, in which the system works to achieve an end through an adaptive process. It looks at how well something is defined using a pre-set measure, then it takes the data from a trusted, verified source and uses it to ‘train’ itself to work better. Machine learning can be responsive to exactly the parameters that designers intended, but it is responsive only to that which the designers anticipated. In other words, it works until something unanticipated changes, and then the machine needs new input to start learning again. When it fails, it is not always easy to discover why.

On the other side are rule-based systems, which tend to have a higher degree of accuracy under certain conditions and require more work by users, who have to write rules and often develop a taxonomy or ontology. They tend to be simple to use. In rule-based systems, people define a world, or domain, and once defined, the system is able to identify concepts in text that are meaningful to the domain. It is often the case that rule-based systems are used to auto-classify content, and the output of that classification is metadata describing the core concepts in the text that can be stored for later use. Broad domains can be difficult to define, and many academics say that to get true semantic knowledge using rule-bases requires limiting the domain.

Rule-based systems also have a problem accounting for concepts that the user has not anticipated, but it is easier to understand why the rule-base failed, and it is easier to adjust the rule-bases than to retrain documents in natural language processing (see Figure 1).

Figure 1.

Text analytics techniques.

What do you call it?

Systems based on both natural language processing and rule-bases can be used to pull information out of documents of all types – emails, texts, Tweets, blog posts – but, increasingly, what you do with that information is defining what you call the process of extracting it.

Noted text analytics expert, Seth Grimes,^6,7,8 makes a distinction between text analytics and text analysis, which he calls the difference between finding the pertinent information in a piece of text versus using the text to give you answers about other things.

Bank of America’s Thompson said that he and his team are also using text analytics for customer retention. They look at all types of customer feedback, including tasks customers conduct at banks, emails, phone conversations, letters, and surveys to predict client behavior. Thompson said he looks for indicators that a bank customer might leave, what he calls a leading indicator, instead of looking at data after a customer has left the bank, a lagging indicator.

This introduces into the conversation another type of analytics. Thompson used Grime’s distinction of text analysis to look at survey results. He then combined those results with data in order to conduct some text analytics. Now he is using the results combined with even more data to try to predict customer behavior. He calls all of this text analytics, but others would say it has tinges of predictive analytics (see Figure 2).

Figure 2.

Analytics techniques.

Predictive analytics uses statistics and other mathematical techniques to look at past data to predict future events.⁹ Generally, analysts look at many known data sets when trying to predict future events. So how does predictive analytics tie in with text analytics? The modern information-rich world is awash with text. The role of text analytics is to make sense of this text, and at its heart, produces quantitative data from text. In this way, text analytics acts to produce the data points that are used as inputs for a predictive analytics model.

One of the many inputs into predictive analytics processes are metrics of customer satisfaction derived from text. The development and assessment of these metrics is sentiment analysis, which uses text analytics to determine mood expressed in text such as social media posts, movie reviews, survey responses, customer emails, and so on. Businesses use sentiment analysis for an array of reasons, including understanding what drives consumer behavior, tracking how clients respond to new products, creating marketing schemes, and driving agile product development.

Unfortunately for businesses, sentiment is difficult to determine. Irony and sarcasm can be hard to detect in the spoken language, and machines have an even more difficult time of it. IBM¹⁰ and other software vendors say that they are making strides in this area. However, machines still have a difficult time with neutral language. Grimes points out that some language with no sentiment connotations can still be seen as positive by people – something machines would miss. He tested this theory with a Twitter poll on the statement, ‘I purchased a Honda today,’ which 55 per cent of respondents rated as positive. Advances are being made in sentiment analysis, however, and much of it revolves around looking at the placement of words in a statement instead of simply counting negative and positive words.

In addition to sentiment analysis, Reamy says organizations are using text analytics to enhance search and to turn text into data so that it can be used in a variety of applications. Finding the critical information in unstructured text and tagging it for later use and recall allows organizations to use unstructured text in a variety of applications. For example, Reamy consulted on a text analytics project for a company that processes 8 million proposals a year. Each proposal is processed and facts, such as the amount of the bid, the date the bid is due and the location of the project, are noted. Along with this information extraction, the company is using text analytics to classify the kinds of proposals they receive, so that they can be distributed correctly.

Real world needs

Organizations are finding many ways to use text analytics – and discovering the ‘important’ information in the text is often the goal. UNC Health Care¹¹ is using natural language processing from IBM¹² to sift through unstructured data, such as doctors’ notes, registration forms, discharge summaries, and phone call notes, pairing it with structured data and using it to target high-risk patients and design prevention programs for them. The health potential is huge, Accident Fund Insurance Company of America¹³ is using text analytics to search claim adjusters’ notes to identify health issues that were aggravating workers who had suffered injuries on the job. The US government is using text analytics to examine some of NASA’s unstructured data¹⁴ to find potential problems with flights by looking at historical pilot reports and mechanics’ logs and to scan social media for evidence of terrorism and biological threats.

Looking forward

It is clear that organizations are using text analytics to help people find insights into and solutions for real world issues. Human analysts simply cannot process the amount of data that we produce today, and machine assistance helps us uncover information we did not know we have. But what of the future of text analytics?

Industry watchers say that we need more text analytics workers to create and program software, to understand how to turn text into data and to open doors for text analytics at businesses and organizations. As text analytics grows, some speculate, we will see more vendors specializing in industry verticals, such as healthcare, insurance, oil and gas and manufacturing. Along with that, or perhaps preceding it, we are likely to see more products and less technology that has to be manipulated by users. We will also see analytics spread to all additional types of content, including video, voice, and photos.¹⁵

In the near term, it seems sure that organizations will continue to use text analytics and data to fuel discovery and growth. To make sense of data and text, it seems increasingly likely that we need to search across data types stored in different places and formats to find the information that is so valuable. This will let us act on it for marketing and business decisions.

At Smartlogic, we are seeing an increase in the need to analyze text across data silos and formats so that organizations can find content quickly and discover new connections with data. For example, we recently helped a health insurer discover patterns in home health care workers’ notes that led to simple environmental changes to prevent elderly patients from falling. We combined information pulled from structured data, such as medication type, age, and known health conditions, with information pulled from unstructured data gathered through text analytics, such as loose carpets and the absence of handrails. This process discovered patterns of falls among patients who were taking at least one drug known to increase the risk of falls, and whose house visit notes indicated environmental risks for falls. From there, we were able to identify patients who could benefit from home repairs and improvements and increased social intervention to reduce the chance of falls and their associated costs.

It is easy to see what we want text analytics to become; all we have to do is look to the sci-fi genre to find computers that understand what we say and that can converse with us. The state of the art is not there, but as enterprises see the deep benefits of text analytics, it is possible that we will get closer.