Apache Phoenix for Cloudera CDH

Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets.

What the above statement means for developers or data scientists is that you can “talk” SQL to your HBase cluster. Sounds good right? Setting up Phoenix on Cloudera CDH can be really frustrating and time-consuming. I wrapped-up references from across the web with my own findings to have both play nice.

Building Apache Phoenix

Because of dependency mismatch for the pre-built binaries, supporting Cloudera’s CDH requires to build Phoenix using the versions that match the CDH deployment. The CDH version I used is CDH4.7.0. This guide applies for any version of CDH4+.

Note: You can find CDH components version in the “CDH Packaging and Tarball Information” section for the “Cloudera Release Guide”. Current release information (CDH5.2.1) is available in this link.

Preparing Phoenix build environment

Phoenix can be built using maven or gradle. General instructions can be found in the “Building Phoenix Project” webpage.

Before building Phoenix you need to have:

  • JDK v6 (or v7 depending which CDH version are you willing to support)
  • Maven 3
  • git

Checkout correct Phoenix branch

Phoenix has two major release versions:

  • 3.x – supports HBase 0.94.x   (Available on CDH4 and previous versions)
  • 4.x – supports HBase 0.98.1+ (Available since CDH5)

Clone the Phoenix git repository

git clone https://github.com/apache/phoenix.git

Work with the correct branch

git fetch origin
git checkout 3.2

Modify dependencies to match CDH

Before building Phoenix, you will need to modify the dependencies to match the version of CDH you are trying to support. Edit phoenix/pom.xml and do the following changes:

Add Cloudera’s Maven repository

+    <repository>
+        <id>cloudera</id>
+        https://repository.cloudera.com/artifactory/cloudera-repos/
+    </repository>

Change component versions to match CDH’s.

     
-    <hadoop-one.version>1.0.4</hadoop-one.version>
-    <hadoop-two.version>2.0.4-alpha</hadoop-two.version>
+    <hadoop-one.version>2.0.0-mr1-cdh4.7.0</hadoop-one.version>
+    <hadoop-two.version>2.0.0-cdh4.7.0</hadoop-two.version>
     <!-- Dependency versions -->
-    <hbase.version>0.94.19
+    <hbase.version>0.94.15-cdh4.7.0
     <commons-cli.version>1.2</commons-cli.version>
-    <hadoop.version>1.0.4
+    <hadoop.version>2.0.0-cdh4.7.0
     <pig.version>0.12.0</pig.version>
     <jackson.version>1.8.8</jackson.version>
     <antlr.version>3.5</antlr.version>
     <log4j.version>1.2.16</log4j.version>
     <slf4j-api.version>1.4.3.jar</slf4j-api.version>
     <slf4j-log4j.version>1.4.3</slf4j-log4j.version>
-    <protobuf-java.version>2.4.0</protobuf-java.version>
+    <protobuf-java.version>2.4.0a</protobuf-java.version>
     <commons-configuration.version>1.6</commons-configuration.version>
     <commons-io.version>2.1</commons-io.version>
     <commons-lang.version>2.5</commons-lang.version>

Change target version, only if you are building for Java 6. CDH4 is built for JRE 6.

           <artifactId>maven-compiler-plugin</artifactId>
           <version>3.0</version>
           <configuration>
-            <source>1.7</source>
-            <target>1.7</target>
+            <source>1.6</source>
+            <target>1.6</target>
           </configuration>

Phoenix building

Once, you have made the changes you are set to build Phoenix. Our CDH4.7.0 cluster uses Hadoop 2, so make sure to activate the hadoop2 profile.

mvn package -DskipTests -Dhadoop.profile=2

If everything goes well, you should see the following result:

[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Phoenix .................................... SUCCESS [2.729s]
[INFO] Phoenix Hadoop Compatibility ...................... SUCCESS [0.882s]
[INFO] Phoenix Core ...................................... SUCCESS [24.040s]
[INFO] Phoenix - Flume ................................... SUCCESS [1.679s]
[INFO] Phoenix - Pig ..................................... SUCCESS [1.741s]
[INFO] Phoenix Hadoop2 Compatibility ..................... SUCCESS [0.200s]
[INFO] Phoenix Assembly .................................. SUCCESS [30.176s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 1:02.186s
[INFO] Finished at: Mon Dec 15 13:18:48 PET 2014
[INFO] Final Memory: 45M/1330M
[INFO] ------------------------------------------------------------------------

Phoenix Server component deployment

Since Phoenix is a JDBC layer on top of HBase a server component has to be deployed on every HBase node. The goal is to have Phoenix server component added to HBase classpath.

You can achieve this goal either by copying the server component directly to HBase’s lib directory, or copy the component to an alternative path then modify HBase classpath definition.

For the first approach, do:

cp phoenix-assembly/target/phoenix-3.2.3-SNAPSHOT-server.jar /opt/cloudera/parcels/CDH/lib/hbase/lib/

Note: In this case CDH is a synlink to the current active CDH version.

For the second approach, do:

cp phoenix-assembly/target/phoenix-3.2.3-SNAPSHOT-server.jar /opt/phoenix/

Then add the following line to /etc/hbase/conf/hbase-env.sh

/etc/hbase/conf/hbase-env.sh
export HBASE_CLASSPATH_PREFIX=/opt/phoenix/phoenix-3.2.3-SNAPSHOT-server.jar

Wether you’ve used any of the methods, you have to restart HBase. If you are using Cloudera Manager, restart the HBase service.

To validate that Phoenix is on HBase class path, do:

sudo -u hbase hbase classpath | tr ':' '\n' | grep phoenix

Phoenix server validation

Phoenix provides a set of client tools that you can use to validate the server component functioning. However, since we are supporting CDH4.7.0 we’ll need to make few changes to such utilities so they use the correct dependencies.

phoenix/bin/sqlline.py:

sqlline.py is a wrapper for the JDBC client, it provides a SQL console interface to HBase through Phoenix.

index f48e527..bf06148 100755
--- a/bin/sqlline.py
+++ b/bin/sqlline.py
@@ -53,7 +53,8 @@ colorSetting = "true"
 if os.name == 'nt':
     colorSetting = "false"
-java_cmd = 'java -cp "' + phoenix_utils.hbase_conf_path + os.pathsep + phoenix_utils.phoenix_client_jar + \
+extrajars="/opt/cloudera/parcels/CDH/lib/hadoop/lib/commons-collections-3.2.1.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-auth-2.0.0-cdh4.7.0.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common-2.0.0-cdh4.7.0.jar:/opt/cloudera/parcerls/CDH/lib/oozie/libserver/hbase-0.94.15-cdh4.7.0.jar"
+java_cmd = 'java -cp ".' + os.pathsep + extrajars + os.pathsep + phoenix_utils.hbase_conf_path + os.pathsep + phoenix_utils.phoenix_client_jar + \
     '" -Dlog4j.configuration=file:' + \
     os.path.join(phoenix_utils.current_dir, "log4j.properties") + \
     " sqlline.SqlLine -d org.apache.phoenix.jdbc.PhoenixDriver \

phoenix/bin/psql.py:

psql.py is a wrapper tool that can be used to create and populate HBase tables.

index 34a95df..b61fde4 100755
--- a/bin/psql.py
+++ b/bin/psql.py
@@ -34,7 +34,8 @@ else:
 # HBase configuration folder path (where hbase-site.xml reside) for
 # HBase/Phoenix client side property override
-java_cmd = 'java -cp "' + phoenix_utils.hbase_conf_path + os.pathsep + phoenix_utils.phoenix_client_jar + \
+extrajars="/opt/cloudera/parcels/CDH/lib/hadoop/lib/commons-collections-3.2.1.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-auth-2.0.0-cdh4.7.0.jar:/opt/cloudera/parcels/CDH/lib/hadoop/hadoop-common-2.0.0-cdh4.7.0.jar:/opt/cloudera/parcerls/CDH/lib/oozie/libserver/hbase-0.94.15-cdh4.7.0.jar"
+java_cmd = 'java -cp ".' + os.pathsep + extrajars + os.pathsep + phoenix_utils.hbase_conf_path + os.pathsep + phoenix_utils.phoenix_client_jar + \
     '" -Dlog4j.configuration=file:' + \
     os.path.join(phoenix_utils.current_dir, "log4j.properties") + \
     " org.apache.phoenix.util.PhoenixRuntime " + args

After you have done such changes you can test connectivity by issuing the following commands:

./bin/sqlline.py zookeeper.local
Setting property: [isolation, TRANSACTION_READ_COMMITTED]
issuing: !connect jdbc:phoenix:zookeeper.local none none org.apache.phoenix.jdbc.PhoenixDriver
Connecting to jdbc:phoenix:zookeeper.local
14/12/16 19:26:10 WARN conf.Configuration: dfs.df.interval is deprecated. Instead, use fs.df.interval
14/12/16 19:26:10 WARN conf.Configuration: hadoop.native.lib is deprecated. Instead, use io.native.lib.available
14/12/16 19:26:10 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
14/12/16 19:26:10 WARN conf.Configuration: topology.script.number.args is deprecated. Instead, use net.topology.script.number.args
14/12/16 19:26:10 WARN conf.Configuration: dfs.umaskmode is deprecated. Instead, use fs.permissions.umask-mode
14/12/16 19:26:10 WARN conf.Configuration: topology.node.switch.mapping.impl is deprecated. Instead, use net.topology.node.switch.mapping.impl
14/12/16 19:26:11 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
14/12/16 19:26:11 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
14/12/16 19:26:12 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
14/12/16 19:26:12 WARN conf.Configuration: fs.default.name is deprecated. Instead, use fs.defaultFS
Connected to: Phoenix (version 3.2)
Driver: PhoenixEmbeddedDriver (version 3.2)
Autocommit status: true
Transaction isolation: TRANSACTION_READ_COMMITTED
Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
77/77 (100%) Done
Done
sqlline version 1.1.2
0: jdbc:phoenix:zookeeper.local>

Then, you can either issue SQL-commands or Phoenix-commands.

0: jdbc:phoenix:zookeeper.local> !tables
+------------------------------------------+------------------------------------------+------------------------------------------+---------------------------+
|                TABLE_CAT                 |               TABLE_SCHEM                |                TABLE_NAME                |                TABLE_TYPE |
+------------------------------------------+------------------------------------------+------------------------------------------+---------------------------+
| null                                     | SYSTEM                                   | CATALOG                                  | SYSTEM TABLE              |
| null                                     | SYSTEM                                   | SEQUENCE                                 | SYSTEM TABLE              |
| null                                     | SYSTEM                                   | STATS                                    | SYSTEM TABLE              |
| null                                     | null                                     | STOCK_SYMBOL                             | TABLE                     |
| null                                     | null                                     | WEB_STAT                                 | TABLE                     |
+------------------------------------------+------------------------------------------+------------------------------------------+---------------------------+