Installing RapidMiner Radoop on RapidMiner Studio

RapidMiner Radoop is client software with an easy-to-use graphical interface for processing and analyzing big data on a Hadoop cluster. It can be installed on RapidMiner Studio and/or RapidMiner Server, and provides a platform for editing and running ETL, data analytics, and machine learning processes in a Hadoop environment. RapidMiner Radoop runs on any platform that supports Java.

Integrating RapidMiner Radoop into the RapidMiner advanced analytics suite is as easy as downloading the extension and making some configuration changes. The following instructions describe the process for installing the RapidMiner Radoop extension.

Prerequisites

The installation instructions assume that you have completed the following tasks. If any of these prerequisites have not yet been met, be sure to finish them before proceeding with the installation.

  • RapidMiner Studio must be installed; optionally, RapidMiner Server as well. If necessary, see the instructions for RapidMiner Studio installation or RapidMiner Server installation.
  • Contact us to purchase a RapidMiner Radoop license.
  • RapidMiner Radoop requires a connection to a properly configured Hadoop cluster. See Hadoop cluster requirements and supported Hadoop distributions.
  • RapidMiner Radoop requires a supported data warehouse system, Apache Hive or Impala, installed on the Hadoop cluster. See the supported data warehouse systems.
  • RapidMiner Radoop must be able to connect to your Hadoop cluster over the network. After installing RapidMiner Radoop and creating connections, refer to networking setup for more information.

Verifying port availability for RapidMiner Radoop

RapidMiner Radoop requires access to a variety of ports on the cluster. Make note of your port assignments for later use when configuring cluster connections and security settings. The table in the networking setup section lists the default port assignments for various components.
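As a quick sanity check, you can probe each service port from the client machine. The sketch below assumes a Linux client with bash and coreutils; the hostnames are placeholders and the ports are common defaults that may differ in your cluster:

```shell
#!/bin/bash
# Probe Hadoop service ports from the Radoop client machine.
# Hostnames are placeholders; ports are common defaults -- substitute your own.
check_port() {
    host=$1; port=$2; name=$3
    if timeout 3 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
        echo "OK       $name ($host:$port)"
    else
        echo "BLOCKED  $name ($host:$port)"
    fi
}

check_port namenode.example.com    8020  "HDFS NameNode"
check_port resourcemgr.example.com 8032  "YARN ResourceManager"
check_port hive.example.com        10000 "HiveServer2"
check_port impala.example.com      21050 "Impala (HiveServer2 protocol)"
```

A BLOCKED result usually points to a firewall rule or a non-default port assignment; check both against the networking setup table.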

Hadoop cluster requirements

RapidMiner Radoop requires a connection to a properly configured Hadoop cluster, where it executes all of its main data processing operations and stores the data related to these processes. The cluster must contain the following components:

  • a supported Hadoop distribution, which consists of an HDFS and MapReduce/YARN
  • a distributed data warehouse system (Hive or Impala)
  • Java 8 or newer on the cluster nodes (necessary for applying most RapidMiner models in-Hadoop)
  • optionally, Apache Spark (the Spark requirements on the cluster are described in detail below)

RapidMiner Radoop supports all Spark versions from 1.2.0. The MLlib machine learning operators are compatible with every Spark version. The other Spark operators need a Spark 1.5.0 or newer assembly on the cluster. Please note that cluster security is supported for Spark 1.5.0 starting from Radoop 2.7.

Using all Spark operators
Apache Spark 1.5.0 was released in September 2015 and is not yet included in all Hadoop distributions. If you want to use every Spark operator and your Hadoop cluster does not have Spark 1.5 or above, you need to install it on the cluster manually. You can download it from the Apache Spark download page. Make sure that the package type matches your cluster setup.

For Hadoop 2.6 or later (change the download link and the path for older Hadoop versions):

    hadoop fs -mkdir -p /tmp/spark
    wget -O /tmp/spark-1.5.2-bin-hadoop2.6.tgz http://d3kbcqa49mib13.cloudfront.net/spark-1.5.2-bin-hadoop2.6.tgz
    tar xzvf /tmp/spark-1.5.2-bin-hadoop2.6.tgz -C /tmp/
    hadoop fs -put /tmp/spark-1.5.2-bin-hadoop2.6/lib/spark-assembly-1.5.2-hadoop2.6.0.jar /tmp/spark/

To use the Spark Script operator, Python 2.6+ or 3.4+ (for PySpark scripts) and R 3.1+ (for SparkR scripts) must be installed on the cluster nodes. To use MLlib functions from Python, also install the numpy package. Because of PARQUET-136, Hive version 1.2.0 or later is recommended.
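The checks below are a sketch of how these prerequisites could be verified on a cluster node. Installation commands vary by distribution, so only version checks are shown:

```shell
# Run on each cluster node to verify Spark Script operator prerequisites.
# Prints the installed version of each tool, or a warning if it is missing.
for tool in python R; do
    if command -v "$tool" >/dev/null 2>&1; then
        "$tool" --version 2>&1 | head -n 1
    else
        echo "WARNING: $tool not found on this node"
    fi
done

# numpy is needed to use MLlib functions from Python:
if python -c "import numpy" 2>/dev/null; then
    echo "numpy OK"
else
    echo "WARNING: numpy missing (install it, e.g. with pip install numpy)"
fi
```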

Consider the following differences between using Hive and Impala as the query engine for RapidMiner Radoop.

The following list contains the features unsupported by the Impala 1.2.3 release.
  • Sort operator: Impala does not support the ORDER BY clause without a LIMIT specified. As a workaround, you can use the Hive Script operator to perform the sort with an explicit LIMIT clause. (The ORDER BY clause is supported in Impala 1.4.0 and later.)
  • Generate Rank operator: Impala does not support the RANK and DENSE_RANK functions.
  • Add Noise operator: Add Noise is not supported on Impala.
  • Nominal to Numerical operator: Unique integers method of Nominal to Numerical is not supported on Impala.
  • Pivot Table operator: Pivot Table is not supported on Impala.
  • Apply Model operator: Model application with Impala is not supported.
  • Update Model and Naive Bayes operators: On Impala, RapidMiner Radoop does not support Naive Bayes learning or model updating by operator.
  • Correlation Matrix, Covariance Matrix, and Principal Component Analysis operators: The CORR() function is not supported by Impala.
  • Performance operators: The Performance (Regression) operator is not supported on Impala. For the Performance (Classification) operator, only the following criteria are supported on Impala: Accuracy, Classification Error, and Kappa.
  • Aggregation functions: Some aggregation functions are not supported by Impala. This may affect Generate Attributes, Normalize, and Aggregate operators. For these limitations, RapidMiner Radoop provides design-time errors, even though Impala allows you to run them.
  • No advanced Hive settings: You cannot set advanced Hive parameters for an Impala connection.
  • Killing a process: Stopping a process does not kill the currently running job on the cluster (but no new job is started either).

Hadoop cluster considerations

Although RapidMiner Radoop easily connects to all supported platforms, special settings may be required if you encounter a problem with one of the listed distributions. Details can be found in the Distribution Specific Notes section. This section lists a few considerations to be aware of when choosing an HDFS or data warehousing platform:

A MapR Hadoop cluster requires the additional installation of MapR client software. See the MapR distribution notes for instructions on configuring RapidMiner Radoop so that it can gain access to the appropriate JAR files for the MapR client.
RapidMiner Radoop supports the DataStax Enterprise platform, but due to licensing issues cannot include any part of the otherwise freely available DataStax package in its installer. You must obtain the DataStax software and the accompanying dse.jar (or dse-<version>.jar) file, and copy it to a local directory on the client. To configure a RapidMiner Radoop connection to a DataStax cluster, refer to the DSE distribution notes.
Cloudera Impala is an open-source query engine over Apache Hadoop. It provides a low-latency interface to data stored in the HDFS for SQL queries, making RapidMiner Radoop usage closer to the experience of using it in a single host environment. While Cloudera Impala can provide much faster response time than Hive, it does not support all the features of HiveQL.

Evaluate the Impala limitations to determine whether it is an acceptable alternative for your organization. For example, if you need advanced features (like model scoring), you must use Hive. If you use both Hive and Impala, consult the Impala Documentation for information on sharing metadata between the two frameworks. If using both, metadata used in Impala must be reloaded to reflect any metadata changes (such as creating new tables) made in Hive. (This can be done by enabling the reload impala metadata parameter of the Radoop Nest.)
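Alternatively, the metadata reload can be issued manually from impala-shell; the hostname below is a placeholder for your Impala daemon:

```shell
# Make Impala pick up metadata changes (e.g. new tables) made through Hive.
# Hostname is a placeholder; -i selects the impalad, -q runs one statement.
impala-shell -i impala.example.com -q "INVALIDATE METADATA;"
```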

Installing RapidMiner Radoop on RapidMiner Studio

The RapidMiner Radoop client installation is straightforward, assuming the prerequisites are met and the appropriate ports are available. The extension can be easily installed from the RapidMiner Marketplace.

If you are using RapidMiner Radoop 2.5 or earlier, or if you want to install the extension manually, follow the steps below.

Manual extension install

In Step 3, you will move the files to one of the following locations, depending on which of the two installation options you choose:

To enable the plugin for all users on a machine (global install), move the files into the install folder at lib/plugins.

For RapidMiner Studio versions 6.4 and later, to enable the plugin only for a single user, move the files to .RapidMiner/extensions/ in the user home folder. If the extensions folder does not exist, create it.

For Mac users running RapidMiner Studio versions 6.4 and later, move the files into .RapidMiner/extensions/. If the extensions folder does not exist, create it. Note that RapidMiner Studio creates .RapidMiner as a hidden folder, so you must set your Mac to display hidden files and folders if you cannot see it.

For Mac users running RapidMiner Studio versions prior to 6.4, move the files into the install folder at lib/plugins.

The process is as follows:

    1. If necessary, quit RapidMiner Studio.

    2. Download the RapidMiner Radoop plugin, a JAR file, from the location specified in your confirmation email.

    3. Move the following files to the RapidMiner Studio directory on the host system:

    • the downloaded RapidMiner Radoop JAR file (rapidminer-Radoop-onsite-.jar);
    • if using RapidMiner Radoop 2.5 or earlier, your RapidMiner Radoop license file from the confirmation email (radoop.license). Note: this license file is not used starting from version 2.6.

    4. With the JAR file (and, for version 2.5 or earlier, the license file) in place, start RapidMiner Studio.
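For a per-user install on Linux or macOS (RapidMiner Studio 6.4 or later), steps 3 and 4 can be sketched as shell commands. The JAR path below is a placeholder; substitute the file you actually downloaded:

```shell
# Per-user manual install of the Radoop extension (Studio 6.4 or later).
# JAR path is a placeholder -- substitute the file from your confirmation email.
JAR="$HOME/Downloads/rapidminer-Radoop-onsite.jar"

mkdir -p "$HOME/.RapidMiner/extensions"
if [ -f "$JAR" ]; then
    cp "$JAR" "$HOME/.RapidMiner/extensions/"
else
    echo "JAR not found at $JAR -- adjust the path first"
fi
```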

If the extension has been successfully installed, Hadoop Data appears as a new view in the middle of the RapidMiner Studio startup window.

That’s it. Now that RapidMiner Radoop is installed, see the section on configuring connections to complete the installation.

Considering security

Consider the following measures to secure your HDFS and data warehouse infrastructure:

  • Apply the firewall settings for your data warehouse system (optional but recommended).
  • Use Kerberos or Apache Sentry for securing your cluster. See the Hadoop security section for security configuration suggestions.
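As an illustration of the firewall recommendation, the following sketch restricts HiveServer2 access (default port 10000) to a single client address. It assumes firewalld on the cluster node; the source address is a placeholder, and your deployment may use different tooling and ports:

```shell
# Allow HiveServer2 connections only from the Radoop client (placeholder IP),
# assuming firewalld on the cluster node; run as root.
firewall-cmd --permanent \
  --add-rich-rule='rule family="ipv4" source address="203.0.113.10" port port="10000" protocol="tcp" accept'
firewall-cmd --reload
```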