
Integration Mods for TIBCO® Data Science and TIBCO® Data Virtualization

21 new TIBCO Data Science - Team Studio operators that enable machine learning with Spark 2.4 for a wide range of data sources provided by TIBCO® Data Virtualization.

Compatible Products

  • TIBCO® Data Virtualization
  • TIBCO® Data Science - Team Studio

Provider

TIBCO Software

Compatible Versions

  • TIBCO® Data Science - Team Studio version 6.6
  • TIBCO® Data Virtualization (TIBCO® DV) versions 8.4 and 8.5
  • Spark version 2.4 

License

BSD

Overview

TIBCO® Data Science - Team Studio version 6.6 can exploit TIBCO® Data Virtualization (TIBCO® DV) to connect to a diverse set of data sources. To combine TIBCO DV's ability to seamlessly handle many data sources – including cloud data sources such as AWS S3 – with the machine learning capabilities of Spark 2.4, we provide 21 new operators designed to work with TIBCO DV data sources. This allows users to build a complete data science workflow while still being able to use the standard database operators provided by TIBCO Data Science - Team Studio. The SQL queries are pushed down to TIBCO DV, and the machine learning executes in Spark.

The new operators are provided in this TIBCO Exchange offering, and each operator is individually documented. The result is a set of scalable operators capable of processing large volumes of data.

This Exchange download includes the .jar files for the 21 Mods, along with an integration pack to set up the connectivity between TIBCO DV and TIBCO Data Science - Team Studio. Follow this Knowledge Base article for Mods installation guidelines.

Follow the provided installation instructions to configure TIBCO Data Science - Team Studio and TIBCO DV. 

 

Release(s)

Release 1.2.0

Published: July 2022

Release includes:

  • JAR file with 21 Team Studio operators extending the Team Studio functionality
  • Documentation for this integration pack
  • Documentation for all new operators
  • License information

What is new compared to the previous version:

  • New Import Excel operator.
  • Bug fix for TDV ModelStore detection.

Release 1.1.0

Release 1.0.0


Overview of Integration Mods for TIBCO® Data Science and TIBCO® Data Virtualization

 

Pre-requisites

  • TIBCO® Data Science - Team Studio version 6.6
  • TIBCO® Data Virtualization (TIBCO® DV) version 8.4 or 8.5
  • Spark version 2.4 

 

Overview

Overall view

Figure 1. Overall view

Figure 1 shows how the Integration Mods (custom operators) for TIBCO DV in TIBCO Data Science - Team Studio version 6.6 connect to a diverse set of data sources. The underlying data sources are accessed via TIBCO Data Virtualization. To combine TIBCO DV's ability to seamlessly handle many data sources – including cloud data sources such as AWS S3 – with the machine learning capabilities of Spark 2.4, we provide 21 new operators designed to work with TIBCO DV data sources. This allows users to build a complete data science workflow while still being able to use the standard database operators provided by TIBCO Data Science - Team Studio. The SQL queries are pushed down to TIBCO DV, and the machine learning executes in Spark. The new operators are provided in the TIBCO Exchange offering, and each operator is individually documented.

The result is a set of scalable operators capable of processing large volumes of data.

In the following sections we describe the new operators, how to use them and how to combine them in a data science workflow.

 

Using the TIBCO DV/Spark Mods

 

How to add them to TIBCO Data Science - Team Studio

  • The .jar files for the 21 Mods, along with an integration pack to set up the connectivity between TIBCO DV and TIBCO Data Science - Team Studio are available on TIBCO Exchange. Follow this Knowledge Base article for Mods installation guidelines.
  • Follow the provided installation instructions to configure TIBCO Data Science - Team Studio and TIBCO DV. 

 

Where to find them in a workflow

Within an open workflow, look for the TDV category on the DATA palette to the left of the canvas (note that a TDV data source must be connected to the individual workflow file to be able to view the TDV operators in the dropdown):

The workflow TIBCO DV palette

Figure 2. The workflow TIBCO DV palette

There are 21 operators created specifically to work with TIBCO DV; these are listed in Figure 3 below:

Figure 3. The 21 integration Mods

 

Categories

The 21 new operators include 10 modelling operators: three Classification and three Regression modelling operators

  • Elastic-Net 
  • Random Forest
  • Gradient-Boosted Tree

three model assessment operators 

  • Confusion Matrix
  • Goodness of Fit 
  • Regression Evaluator

and a Predictor.

Other operators that produce models to be used in conjunction with the Predictor operator are 

  • K-Means Clustering
  • PCA
  • Isolation Forest.

Random Sampling and Sample Selector can be used to split the data into Train and Test datasets. Summary Statistics performs exploratory analysis.

There are also five import/export operators:

  • Export Model and Load Model are available for re-using the generated model within the workspace.
  • Export to TIBCO ModelOps and Export to File Storage provide the ability to export either a model or a dataset.
  • Import Excel adds the ability to import an Excel file stored in the same workspace into TIBCO DV for use in TIBCO Data Science - Team Studio.

More details are available in the subsequent sections.

 

Modelling Operators

The modeling operators are similar to the equivalent native product operators. However, these new operators execute on Spark 2.4 and have additional features, such as the ability to carry out a grid hyper-parameter optimization that returns the best-performing model. An example of how this is expressed in the input parameters is shown below, where the parameters of a Random Forest Regression model are selected.

Calling Random Forest Regression

Figure 4. Calling Random Forest Regression

In Figure 5 below, the input parameters related to hyper-parameter exploration are shown for the Random Forest Regression operator of Figure 4 above. For most hyper-parameters, a choice of values can be supplied as a comma-separated list. The number of times the grid optimization is performed is set by the Number of Cross Validation Folds parameter at the bottom of Figure 5.

Subset of Random Forest Regression input parameters

Figure 5. Subset of Random Forest Regression input parameters
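The comma-separated lists combine multiplicatively into the grid that is evaluated. A minimal Python sketch of this expansion (the parameter names and values below are hypothetical, not taken from the operator dialog):

```python
from itertools import product

# Hypothetical comma-separated hyper-parameter lists, as a user might
# enter them in the operator's input fields.
params = {
    "Number of Trees": "10,50,100,200",
    "Maximum Depth": "3,5,7,9",
    "Minimum Instances per Node": "1,5",
    "Subsampling Rate": "0.5,1.0",
}

# Split each list, then take the Cartesian product to build the grid.
choices = {name: text.split(",") for name, text in params.items()}
grid = [dict(zip(choices, combo)) for combo in product(*choices.values())]

print(len(grid))  # 4 * 4 * 2 * 2 = 64 combinations, each evaluated per CV fold
```

With lists of length 4, 4, 2, and 2 the grid contains 64 combinations, which is consistent with the 64 combinations sampled 3 times (one run per cross-validation fold) described below.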

At the end of the run, the different parameters tested are displayed in the Results tab, together with the chosen metric (RMSE in this example; the choice of metric is itself an input parameter), and the best model is flagged. Figure 6 below shows a subset of the results: in this workflow, a total of 64 parameter combinations were sampled 3 times.

Subset of grid optimization results

Figure 6. Subset of grid optimization results

A small exception applies to the Elastic Net operators (both Linear and Logistic Regression), which have the option to specify the values of one of their hyper-parameters in a slightly different way. The Penalizing Parameter (ƛ) can be specified as usual by a comma-separated list, like the one in Figure 7.

Specifying ƛ with commas

Figure 7. Specifying ƛ with commas

This will generate the values 0.0, 0.5, and 1.0. An alternative is the form shown in Figure 8.

Specifying ƛ with colons

Figure 8. Specifying ƛ with colons

This will generate 4 values, starting from 0.001 and ending at 0.1, but these values are exponentially spaced. In this example, the values generated are 0.001, 0.004645, 0.02155, and 0.1. This can lead to better coverage of this parameter's space.

Sample grid optimization results with colon style ƛ

Figure 9. Sample grid optimization results with colon style ƛ
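The exponential spacing can be reproduced with a geometric progression. The sketch below assumes the colon form generates `count` values in geometric progression from start to stop; the product's exact rounding may differ slightly in the last digits:

```python
# Sketch of the colon-style ƛ expansion, assuming a geometric progression
# from `start` to `stop` with `count` values (an assumption, not the
# documented algorithm).
def lambda_values(start, stop, count):
    ratio = (stop / start) ** (1 / (count - 1))
    return [start * ratio**i for i in range(count)]

print(lambda_values(0.001, 0.1, 4))  # ≈ [0.001, 0.00464, 0.0215, 0.1]
```

Successive values differ by a constant factor (here 100^(1/3) ≈ 4.64), so small ƛ values are explored as densely, in relative terms, as large ones.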

Finally, the model assessment operators (Goodness of Fit and Confusion Matrix for classification tasks and Regression Evaluator for regression tasks), and the Predictor operator (for classification and regression, as well as clustering, anomaly detection and PCA) work very similarly to the native version 6.5 operators.

 

PCA, K-Means Clustering and Isolation Forest

Two of these three operators replace equivalent operators from TIBCO Data Science - Team Studio version 6.5.

PCA (Principal Components Analysis) has a simplified user interface. All the principal components (up to the number of selected columns in the input dataset) are generated, and the user can filter down to the required components, based on the desired amount of explained variance.

K-Means Clustering has the additional option of selecting the initialization method and, importantly, trying a range of possible K (the number of clusters to generate). The optimal K is calculated based on the Silhouette metric.

K-Means results with K–optimization metric

Figure 10. K-Means results with K–optimization metric

Isolation Forest is an entirely new operator that performs anomaly/outlier detection. A contamination level (expected number of anomalies) can be chosen in input. The output, once fed to a Predictor operator, contains columns to indicate the anomaly score and the resulting class (anomaly/not anomaly).

All three operators generate models that need to be applied to a dataset via the Predictor operator.

Example of generating and applying models

Figure 11. Example of generating and applying models

 

Random Sampling and Sample Selector

These operators are used to split a sample in two: a Train and a Test sample. Note that Random Sampling/Sample Selector must be step-run before running any downstream operators, as they do not inherit the schema (the column structure of the dataset that flows through). This is evidenced by the subsequent operators appearing in red.

 

Summary Statistics

Summary Statistics calculates descriptive statistics for the selected columns in a dataset. In addition to the capabilities of the native product operator, it can handle datetime columns, and outputs the coefficient of variation (the ratio between the standard deviation and the absolute value of the mean) of numeric variables, as shown in Figure 12 below.

Example Summary Statistics output


Figure 12. Example Summary Statistics output
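The coefficient of variation is a simple ratio and can be sketched in a few lines. Whether the operator uses the sample or population standard deviation is not stated, so the sample version below is an assumption:

```python
import statistics

def coefficient_of_variation(values):
    # Ratio between the standard deviation and the absolute value of the
    # mean, as Summary Statistics reports for numeric columns. Sample
    # standard deviation is assumed here.
    mean = statistics.fmean(values)
    return statistics.stdev(values) / abs(mean)

print(round(coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9]), 3))  # → 0.428
```

Being dimensionless, the coefficient of variation makes the spread of columns with very different scales directly comparable.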

 

Import/export operators

There are 5 operators in this category: two of them enable exporting/loading a model into/from the workspace. These are Export Model and Load Model. These operators store key information on the model in the form of a text file. The actual model persistence is provided by the ModelStore, a partitioned Parquet file located within TIBCO DV or, optionally, a table within a database accessed via TIBCO DV. For details on the ModelStore, please refer to the TIBCO Data Virtualization to TIBCO Data Science - Team Studio 6.6 Integration Documentation.

The Export to TIBCO ModelOps operator deploys trained models directly to TIBCO ModelOps.

The fourth operator, Export to File Storage, acts on a dataset (for instance, the output of a preceding operator) and writes it to the file storage specified in the output, for instance AWS S3. This operator circumvents a current limitation of the native product operators when used with TIBCO DV: they cannot write a Table directly to certain file-based data sources such as AWS S3, ADLS, or HDFS. When using regular operators against file storage, the View option for the Output Type parameter can be used, as shown in Figure 13. In this way results are not persisted to the file storage but instead calculated each time the operator is executed.

Storing a temporary result

Figure 13. Storing a temporary result

If it is necessary to write the results back to actual File Storage (for example for processing by other tools, or for performance reasons) the Export to File Storage Operator should be used.

The Export to File Storage operator allows the results set to be written (persisted) as a file to the file storage. Files can be written in either the csv or parquet file formats. The file storage can either be a file system associated with the Spark cluster (local) or outside of the Spark cluster (remote). In either case a data source for the file storage must have been created within the TIBCO DV instance being used by TIBCO Data Science - Team Studio.

NOTE: To write to Remote File Storage, the TIBCO DV instance must be at version 8.5 or greater.

The Export to File Storage operator has the input parameters shown in Figure 14. The filename must have either a .csv or a .parquet extension.

Figure 14. Storing a csv file
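The extension rule can be expressed as a small pre-flight check. The helper below is hypothetical, mirroring the documented rule rather than the operator's actual validation code:

```python
def check_export_filename(filename):
    # Hypothetical check mirroring the Export to File Storage rule that the
    # filename must end in .csv or .parquet.
    if not filename.lower().endswith((".csv", ".parquet")):
        raise ValueError(f"{filename!r} must have a .csv or .parquet extension")
    return filename

print(check_export_filename("adult_scored.parquet"))  # → adult_scored.parquet
```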

The Import Excel operator lets the user load an Excel file from the workspace into the designated TIBCO DV destination folder. It has parameters for choosing which sheet and which row and column blocks are to be imported, and can handle files with no column names, double quotes in column names, or new lines in cell values.

Figure 15. Importing an excel file, input parameters

 

Compatibility with existing operators

Most Explore and Transform database operators in the product will work with TIBCO DV sources – please see the product documentation for details. Since a TIBCO DV data source appears as a database source, the database flavor of the operators will be used. As mentioned in the previous section, for these operators the value of the Output Type input parameter needs to be specified as View if the underlying data source is file-based (such as S3, ADLS or HDFS).

The new operators can connect to native product operators. For instance, in a scenario where we want to do some exploratory data analysis, we might build a workflow like the one in Figure 16:

Example of connecting to native operators

Figure 16. Example of connecting to native operators

In this example we have used a dataset stored in AWS S3 and attached it to the TIBCO DV/Spark Summary Statistics, as well as to the native BoxPlot and Aggregation operators.

 

Putting it all together

Let’s now look at an example of how these new operators fit together in an end-to-end data science process. Figure 17 shows a screenshot of a workflow that generates a classification model for the UCI Adult income dataset:

Sample data science workflow

Figure 17. Sample data science workflow

The input dataset is stored in AWS S3. Above the input dataset we notice two Explore operators: the new Summary Statistics and the Box Plot. The first one gives us an idea of the distribution of the columns in the dataset.

Summary Statistics output detail

Figure 18. Summary Statistics output detail

There are 23 columns in the dataset. Two of these, education and education_num, look suspiciously similar. We can test the idea that there is a one-to-one relationship between these two variables using Box Plot. The result, shown below in Figure 19, confirms this hypothesis:

Box Plot of education vs education_num

Figure 19. Box Plot of education vs education_num

We can therefore safely exclude one of these columns from the modeling. We choose to keep education_num as it is numeric and ordered, so it does not need substantial pre-processing.
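The same one-to-one check can be sketched outside the product. The rows below are a hypothetical sample in the shape of the Adult dataset's education columns, not the actual data:

```python
# Toy check of the education ↔ education_num one-to-one hypothesis.
# The rows are illustrative samples, not the real UCI Adult data.
rows = [
    ("Bachelors", 13), ("HS-grad", 9), ("Bachelors", 13),
    ("11th", 7), ("HS-grad", 9), ("Masters", 14),
]

seen = {}
# setdefault records the first number seen for each label; any later
# mismatch means the mapping is not one-to-one.
one_to_one = all(seen.setdefault(edu, num) == num for edu, num in rows)

print(one_to_one)  # True: each education label maps to a single education_num
```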

Among the other columns, we also choose to ignore fnlwgt (Final Weight, a weighting factor that is usually discarded in analyzing this dataset) and native_country. The latter could be included at the price of increasing the memory allocation of the modeling operators, as any categorical column needs to be turned into a set of dummy variables within the modeling operators and this one has many more distinct values than the other categorical variables.

In order to simplify the setting of input parameters for the modeling operators, we use a Column Filter to remove these columns from the data stream.

 

At this point we are ready to split the dataset into Train and Test subsets, using the Random Sampling and Sample Selection TIBCO DV operators. We then apply both an Elastic-Net Logistic Regression and a Random Forest Classification TIBCO DV operator. As mentioned, these modeling operators have an in-built customizable grid search for the best model. They also have in-built encoding of categorical variables and, in the case of Elastic-Net, normalization of predictors. The modeling operators store a Spark model object, which records all the transformations applied within the modeling operators. Therefore any pre-processing calculated by the modeling operator on the Train set will be automatically applied to the Test set.
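A minimal, hypothetical sketch of the reproducible train/test split performed by the Random Sampling and Sample Selection pair (the operators themselves do this on the Spark side; the 80/20 fraction and seed here are illustrative):

```python
import random

def train_test_split(rows, train_fraction=0.8, seed=42):
    # Shuffle a copy with a fixed seed so the split is reproducible,
    # then cut at the requested fraction.
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

train, test = train_test_split(range(1000))
print(len(train), len(test))  # 800 200
```

Every row lands in exactly one of the two subsets, which is what lets the Test set serve as an unbiased check on the trained models.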

The Goodness of Fit TIBCO DV operator gives us an idea of how the two models performed:

Goodness of Fit result

Figure 20. Goodness of Fit result

The Random Forest appears to have performed overall slightly better, and we therefore choose to export it to the workspace, using the Export Model TIBCO DV operator. For demo purposes, we re-import it into this workflow (normally one would use a separate workflow) using the Load Model TIBCO DV operator, and re-apply it to the Adult dataset as a whole. As a final test of the results, we use a Frequency operator to compare the distribution of the actual versus the predicted values. We see from Figure 21 below that the model does a good job overall, although, because of the marked imbalance in the training dataset, it performs slightly less well at predicting high income.

Frequency distribution of actual income band

Frequency distribution of predicted income band

Figure 21. Frequency distribution of actual and predicted income band
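The Frequency comparison amounts to tallying class labels in both columns. The counts below are made up to mimic the Adult dataset's imbalance, not taken from Figure 21:

```python
from collections import Counter

# Hypothetical actual vs. predicted labels illustrating the Frequency
# check; the counts are invented to mimic the class imbalance.
actual    = ["<=50K"] * 76 + [">50K"] * 24
predicted = ["<=50K"] * 82 + [">50K"] * 18

print(Counter(actual))     # Counter({'<=50K': 76, '>50K': 24})
print(Counter(predicted))  # Counter({'<=50K': 82, '>50K': 18})
```

The predicted distribution under-represents the minority (>50K) class, which is the pattern a model trained on imbalanced data typically shows.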

New in release 1.2.0

  • New Import Excel operator.
  • Bug fix for TDV ModelStore detection.

New in release 1.1.0

  • Added support for the following Spark configurations:
    • Kerberos for Spark with Yarn cluster manager
    • Spark Standalone cluster manager
    • Spark running locally on the Team Studio server.
  • Added ability to export models to TIBCO ModelOps.
  • Added support for TIBCO DV 8.5.
  • ModelStore is now managed by TIBCO DV on the local file system or designated database.
  • Bug fixes for normalization in Elastic-Net Logistic Regression, Elastic-Net Linear Regression and K-Means Clustering.
  • Simplified ModelStore setup in the installation scripts.
  • New option in Export to File Storage to choose an external file system, for example a S3 bucket not associated with the Spark cluster.

Known Issues and Limitations

  • When using a Kerberized Spark environment, another database will be required to store the temporary tables created by Team Studio workflows.
  • When the scripts .jar file is loaded, the following warnings may appear - they can be safely ignored:

Warning messages

Figure 22. Warning messages

  • Because of a Spark requirement, a dataset with column names containing any of the characters " ,;{}()\n\t=" (this includes variable names with a space) cannot be written to data sources. Please replace those characters with characters acceptable to Spark, either by modifying the schema before uploading data to the database source, or by replacing the illegal column names with a Variable operator before further processing. 
  • For database operators, the value of the Output Type input parameter needs to be specified as View if the underlying data source is file based (such as S3, ADLS or HDFS).
  • Random Sampling/Sample selector operators need to be step-run before running any downstream operators, as they do not inherit the schema.
  • You need to make sure that the @default_schema workflow variable that is used in the Output Schema parameter in the integration Mods is pointing to the desired database schema. This can also be redirected while setting up the input parameter for the individual operator.
  • Certain operators create tables or views in the selected schema. These tables and views are not removed automatically when the nodes or workflows are deleted; they can accumulate over time, so a housekeeping process should be run periodically to review and remove them.
  • For further details, please see the Product Release Notes and the TIBCO Data Virtualization to TIBCO Data Science - Team Studio 6.6 Integration Documentation.
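The column-name restriction above can be worked around programmatically before upload. The helper below is hypothetical (the document recommends fixing names in the schema or with a Variable operator); it simply substitutes the disallowed characters with underscores:

```python
import re

# Hypothetical helper replacing the characters Spark disallows in column
# names (" ,;{}()\n\t=") with underscores before writing to a data source.
ILLEGAL = re.compile(r"[ ,;{}()\n\t=]")

def sanitize(column_name):
    return ILLEGAL.sub("_", column_name)

print(sanitize("native country (text)"))  # → native_country__text_
```

Note that such a substitution can collide distinct names (e.g. "a b" and "a=b" both become "a_b"), so renamed columns should still be checked for uniqueness.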

 
