
Analyze any Text Data in TIBCO Spotfire

Published: 7:33am May 17, 2022

Quick Links:

NLP Python Toolkit for TIBCO Spotfire (Exchange)

Spotfire Data Functions

Spotfire Mods

 

What is NLP and Why Should Businesses Care?

Natural Language Processing (NLP) is a branch of Artificial Intelligence that focuses on enabling machines to understand text. Text can refer to online documents, transcribed speech, real-time logs, and more. According to an article by Businesswire, the global Natural Language Processing (NLP) market size is expected to grow from $11.6 billion (USD) in 2020 to $35.1 billion by 2026, at a compound annual growth rate (CAGR) of 20.3%. Popular use cases for NLP include market intelligence, root cause analysis, automatic summarization, language translation, and chatbots. Moreover, there has been an influx of open-source libraries and research papers in NLP, especially in the last decade, which allows analysts and data scientists to easily use the latest advancements (image below from a John Snow Labs article).

Why Use the NLP Toolkit in TIBCO Spotfire?

The NLP Python Toolkit for TIBCO Spotfire® is a versatile toolkit that performs a range of exploratory text analytics on any text. It includes two Python data functions that use the popular NLP libraries NLTK and spaCy and their pre-trained language models. The data functions can be run from Spotfire's no-code/low-code data function flyout interface and/or by following the step-by-step instructions in the Spotfire DXP. The toolkit also shows how to visualize results to surface quick insights from the text. Below is a quick video of configuring one of the data functions from the data function flyout: you simply specify the text column and set a few parameters.

 

What functionalities are in the NLP Toolkit?

 

Version 1 of the toolkit includes two data functions, ‘NLP Python Toolkit - Features’ and ‘NLP Python Toolkit - Entities and Sentiment’.

‘NLP Python Toolkit - Features’ preprocesses, or cleans, the text and engineers it into N-grams. It includes options to remove numbers and special characters (defined with respect to the English language), removes stop words, and performs text normalization. Stop words are commonly used words in a language that carry little semantic meaning for a given task (e.g. the, not, we). Text normalization standardizes words using either stemming or lemmatization: stemming crops suffixes or prefixes from words, while lemmatization maps words to their base form. For example, stemming converts the word ‘studies’ into ‘studi’, while lemmatization converts ‘studies’ into ‘study’.
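The stemming/lemmatization difference can be sketched with a toy example (a naive suffix-stripping stemmer and a small lookup table; the toolkit itself uses NLTK and spaCy models, not these stand-ins):

```python
# Toy stemmer: crude suffix stripping. Real stemmers such as NLTK's
# PorterStemmer apply many ordered rules; this only illustrates the idea
# that stemming can produce non-words like 'studi'.
def naive_stem(word):
    for suffix in ("ies", "ing", "ed", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)] + ("i" if suffix == "ies" else "")
    return word

# Toy lemmatizer: dictionary lookup mapping inflected forms to base forms.
# Real lemmatizers use large vocabularies plus part-of-speech context.
LEMMAS = {"studies": "study", "ran": "run", "better": "good"}

def naive_lemma(word):
    return LEMMAS.get(word, word)

print(naive_stem("studies"))   # studi
print(naive_lemma("studies"))  # study
```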

The next step is to engineer the text into a numerical, structured format (i.e. text vectorization). This is done with an N-gram matrix. N-grams are contiguous sequences of words. In the phrase ‘The dog ran away’, the N-grams include ‘the’, ‘dog’, ‘ran’, ‘away’, ‘the dog’, ‘dog ran’, ‘ran away’, ‘the dog ran’, and so on. It is common to go up to four-word N-grams at most (i.e. unigrams, bigrams, trigrams, quadgrams). First, the top N-grams across all the text observations are retrieved. Then, each text observation gets a score for each N-gram feature, computed either by raw frequency or by TF-IDF (term frequency - inverse document frequency). Both are simple measures; we suggest the latter because it better quantifies N-gram importance. Altogether, this results in an N-gram matrix (rows are text observations, columns are N-gram features, and values are scores). From this matrix, the scores are used to identify the top N-gram keywords for each text observation.
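The vectorization step above can be sketched in plain Python (an illustration only; the data function relies on its NLP libraries rather than hand-rolled code, and the smoothed IDF variant here is one of several common formulations):

```python
import math
from collections import Counter

def ngrams(tokens, n_max=2):
    """All contiguous word sequences up to length n_max."""
    out = []
    for n in range(1, n_max + 1):
        out += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return out

docs = [["the", "dog", "ran", "away"], ["the", "cat", "ran", "home"]]
counts = [Counter(ngrams(d)) for d in docs]          # per-document n-gram counts
vocab = sorted(set().union(*counts))                 # all n-gram features

def tfidf(term, doc_counts):
    tf = doc_counts[term] / sum(doc_counts.values())  # term frequency in this doc
    df = sum(1 for c in counts if term in c)          # how many docs contain it
    idf = math.log(len(counts) / df) + 1              # rarer terms score higher
    return tf * idf

# N-gram matrix: one row per document, one column per n-gram feature
matrix = [[tfidf(t, c) for t in vocab] for c in counts]
```

Note how ‘dog’, which appears in only one document, outscores ‘ran’, which appears in both: that is the IDF part quantifying importance.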

‘NLP Python Toolkit - Entities and Sentiment’ performs Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and Sentiment Analysis. The term ‘tagging’ refers to a model picking apart a word or group of words and ascribing a label, or tag, to it. NER tags predefined entities in text (geographical regions, people's names, languages or nationalities, etc.). POS tags each word with its grammatical role (noun, adjective, etc.). Sentiment analysis gives each text observation an overall polarity score (what we refer to as sentiment, i.e. how negative or positive it is on a scale from -1 to 1) along with a subjectivity metric, and also tags any phrases the model identifies as polar. So each document has exactly one ‘document polarity score’ and can have zero or more phrases, each tagged with its own ‘phrase polarity score’. Sentiment analysis is performed through the TextBlob library, which integrates with spaCy. The underlying NER, POS, and Sentiment models work on raw text and are pre-trained models from the spaCy library. spaCy is easy to use and offers pre-trained models for a variety of languages and tasks (others include token vectors, word vectors, stemming/lemmatization, dependency parsing, etc.). In this data function, you can specify the type and size of language model to use, so it is not limited to English text, and it does not require preprocessing (in fact, we recommend against preprocessing here: stop-word removal would turn ‘not happy’ into ‘happy’, which distorts sentiment scoring).
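The shape of the sentiment output (one document score, zero or more tagged phrases) can be illustrated with a toy aggregation. The phrase lexicon and scores below are entirely made up; TextBlob derives polarities from a trained pattern model, not a lookup table:

```python
# Hypothetical polarity lexicon for illustration only -- real phrase
# polarities come from TextBlob's model, not a hand-written table.
PHRASE_POLARITY = {"great food": 0.8, "not happy": -0.65, "slow service": -0.3}

def score_document(text):
    """Return (document polarity, list of (phrase, polarity)) for every
    lexicon phrase found in the text. Averaging keeps the document score
    on the same -1 to 1 scale as the phrase scores."""
    hits = [(p, s) for p, s in PHRASE_POLARITY.items() if p in text.lower()]
    if not hits:
        return 0.0, []   # neutral document: no polar phrase was tagged
    doc_polarity = sum(s for _, s in hits) / len(hits)
    return doc_polarity, hits

doc_score, phrases = score_document("Great food but slow service")
# doc_score: 0.25, phrases: [('great food', 0.8), ('slow service', -0.3)]
```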

 

Sample Use Case - Yelp Dataset

Here is a quick overview of a sample NLP use case. The data is a subset of the Yelp dataset centered on businesses around New Jersey, Philadelphia, and Delaware. There are three text categories in this dataset: Business Categories, Reviews, and Tips. Interestingly, these categories differ in both length and semantic meaning.

Yelp - Search Functionality

First, we wanted a dashboard page that acts as a search engine and aggregates the text into organized hierarchies. We ran the ‘NLP Python Toolkit - Features’ data function on both the Tips and Categories text columns. The top N-grams from the Categories help filter the businesses into restaurants, services, etc., and clicking on any one further breaks them down into types (e.g. Italian restaurants, plumbing services). The N-gram keywords can be used in the filter panel for specific search terms relevant to businesses: ‘accept cash’, ‘vegetarian’. We ran ‘NLP Python Toolkit - Entities and Sentiment’ to retrieve the NER results on the Reviews text column. Looking at the ‘Nationalities or religious or political groups’ label, we see that the tagged entities and corresponding text reviews filter across food cuisines.

Yelp - Evaluate Businesses

The next page is about evaluating businesses. It shows the Sentiment and POS results on the Reviews text column. Overall, it is used to show which businesses have positive and negative reviews, and the corresponding phrases and POS tags help explain the model's scores. This can potentially be a less biased metric than each user's 1-5 star rating.

Yelp - Business Recommendations

Now that we have explored and evaluated our business data, we want to optimize recommendations for our customers. Each review corresponds to a single user and a single business. We ran the ‘NLP Python Toolkit - Features’ data function on the Reviews column. The N-gram matrix it calculates and stores is a numerical representation of all the reviews (each review is a vector, i.e. a row in the matrix). We used the K-Means Cluster Python Data Function for TIBCO Spotfire® to cluster the reviews. This groups ‘similar’ reviews together in a cluster (the term ‘similar’ is vague; here it is determined by distance between TF-IDF N-gram scores, meaning reviews in a cluster likely share high scores for the same N-gram features). We can then pick any cluster and inspect its reviews by looking at the text directly and at that cluster's N-gram keywords. Below, exploring a single cluster shows that its reviews discuss happy hours, bars, and drinks. Lastly, we can filter down to the reviews with positive sentiment scores and/or higher ratings (4 or 5 stars) in that cluster, and recommend the corresponding businesses to the users who wrote that group of reviews.
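The clustering step can be sketched with a minimal K-Means loop over review vectors. The vectors below are made-up TF-IDF scores for two imaginary n-gram features; in Spotfire this work is done by the K-Means Cluster Python Data Function, not by this code:

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Minimal K-Means: assign each vector to its nearest centroid by
    squared Euclidean distance, recompute centroids as cluster means,
    and repeat for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(
                range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(v, centroids[j])),
            )
            clusters[nearest].append(v)
        # Recompute each centroid; keep the old one if its cluster emptied
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return clusters

# Made-up TF-IDF vectors; features could be n-grams like 'happy hour', 'great pizza'
reviews = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
clusters = kmeans(reviews, k=2)
```

With these toy vectors, the two reviews weighted toward the first feature end up in one cluster and the other two in the second, mirroring how reviews sharing high scores on the same n-grams group together.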

 

Author:

Sweta Kotha is a Data Scientist at TIBCO. Her experience spans data science, natural language processing, and biostatistics. She likes trying out new technologies and methods to address analytics challenges and is interested in effectively communicating with data.