21 Mar 2013

Hadoop, Hive, HBase installation on Mac OS X


This post walks through configuring Hadoop, Hive and HBase on a Mac as a single-node installation.


  • What is Hive?: Hive is a data warehousing infrastructure built on Hadoop.
  • What is HBase?: It's a distributed, versioned, column-oriented NoSQL data store modeled after Google's Bigtable, used to host very large tables: billions of rows times millions of columns.
  • What is Hadoop?: Hadoop provides massive scale-out and fault-tolerance capabilities for data storage and processing on commodity hardware, using the map-reduce programming paradigm.

Hadoop Installation

I found this excellent article on installing it on a Mac. I chose to install under the ~/applications directory, and I’ll be using the same directory for Hive and HBase as well.

   mkdir -p ~/Downloads/apache.org ~/applications
   cd ~/Downloads/apache.org
   wget http://apache.techartifact.com/mirror/hadoop/common/hadoop-1.1.2/hadoop-1.1.2-bin.tar.gz

   cd ~/applications; tar xzvf ~/Downloads/apache.org/hadoop-1.1.2-bin.tar.gz
   cd hadoop-1.1.2

configure conf/hadoop-env.sh to set the correct JAVA_HOME, set HADOOP_OPTS to avoid the “Unable to load realm info from SCDynamicStore” warning, and set the heap size..

# The java implementation to use.  Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
export HADOOP_OPTS="-Djava.security.krb5.realm=OX.AC.UK -Djava.security.krb5.kdc=kdc0.ox.ac.uk:kdc1.ox.ac.uk"
export JAVA_HOME=$(/usr/libexec/java_home)

# The maximum amount of heap to use, in MB. Default is 1000.

modify conf/core-site.xml

           <description>A base for other temporary directories.</description>
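For reference, here is a minimal sketch of a single-node conf/core-site.xml, written from the shell. The NameNode address hdfs://localhost:9000 and the /tmp/hadoop-${user.name} base directory are assumptions based on common Hadoop 1.x single-node setups, not values from this post, and CONF_DIR is just a helper variable for the sketch.

```shell
# Run from the hadoop-1.1.2 directory. CONF_DIR is a helper variable for this sketch.
CONF_DIR="${CONF_DIR:-conf}"
mkdir -p "$CONF_DIR"

# The quoted 'EOF' keeps ${user.name} literal so Hadoop expands it, not the shell.
cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <!-- assumed location -->
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <!-- assumed NameNode address for a single-node setup -->
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF
```

Note that this overwrites any existing core-site.xml in that directory.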

modify conf/hdfs-site.xml
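A minimal sketch of what conf/hdfs-site.xml usually needs on a single node: with only one DataNode, the replication factor is set to 1. The value is an assumption based on the standard single-node setup, and CONF_DIR is a helper variable for the sketch.

```shell
# Run from the hadoop-1.1.2 directory. CONF_DIR is a helper variable for this sketch.
CONF_DIR="${CONF_DIR:-conf}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <!-- one DataNode, so keep a single copy of each block (assumed value) -->
    <value>1</value>
  </property>
</configuration>
EOF
```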


modify conf/mapred-site.xml
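A minimal sketch of conf/mapred-site.xml for a single node, pointing MapReduce at a local JobTracker. The address localhost:9001 is an assumption based on the standard single-node setup, and CONF_DIR is a helper variable for the sketch.

```shell
# Run from the hadoop-1.1.2 directory. CONF_DIR is a helper variable for this sketch.
CONF_DIR="${CONF_DIR:-conf}"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <!-- assumed JobTracker address for a single-node setup -->
    <value>localhost:9001</value>
  </property>
</configuration>
EOF
```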


ensure that we can ssh into our host without a password. On OS X this requires Remote Login to be enabled under System Preferences > Sharing.

   $ ssh localhost

   # If you cannot ssh to localhost without a passphrase, execute the following commands:
   $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
   $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

time to start hadoop

   $ bin/hadoop namenode -format
   $ bin/start-all.sh

we should now be able to verify that hadoop is up using the NameNode (http://localhost:50070/) and JobTracker (http://localhost:50030/) web interfaces
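The same check can be done from the terminal. This is a rough sketch that assumes the Hadoop 1.x default web UI ports (50070 for the NameNode, 50030 for the JobTracker) and that curl is available; check_ui is a helper name made up for this example.

```shell
# Report whether a web UI answers on the given local port.
# check_ui is a hypothetical helper; the ports below are the Hadoop 1.x defaults.
check_ui() {
  name=$1
  port=$2
  if curl -s -o /dev/null --max-time 5 "http://localhost:${port}/"; then
    echo "${name}: up"
  else
    echo "${name}: down"
  fi
}

check_ui NameNode 50070
check_ui JobTracker 50030
```

If either line reports down shortly after start-all.sh, the files under the logs directory are the first place to look.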

some useful commands and tests

   $ bin/hadoop fs -put conf input
   $ bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
   $ bin/hadoop fs -get output output
   $ bin/hadoop fs -cat output/*

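some aliases for convenience, which can go into .bashrc (the PATH line makes the bare hadoop command available)

   export PATH="$PATH:$HOME/applications/hadoop-1.1.2/bin"
   alias hput="hadoop fs -put"
   alias hcat="hadoop fs -cat"
   alias hls="hadoop fs -ls"
   alias hrmr="hadoop fs -rmr"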

to stop hadoop (not that we want to stop it now), here is the command:

   $ bin/stop-all.sh

HBase Installation

Here are some useful links that helped in setting up the installation..

  • http://hbase.apache.org/book.html
  • http://hbase.apache.org/book.html#configuration
  • http://hbase.apache.org/book.html#standalone
  • http://hbase.apache.org/book.html#confirm
  • http://hbase.apache.org/book.html#quickstart

quick install..

  cd ~/Downloads/apache.org; 
  wget http://apache.techartifact.com/mirror/hbase/hbase-0.94.5/hbase-0.94.5.tar.gz

  cd ~/applications; 
  tar xzvf ~/Downloads/apache.org/hbase-0.94.5.tar.gz
  cd hbase-0.94.5;

modify conf/hbase-site.xml to point to hadoop and provide the zookeeper quorum
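A minimal sketch of such a conf/hbase-site.xml, written from the shell. The hbase.rootdir value is an assumption: it has to match whatever NameNode address is set as fs.default.name in Hadoop's conf/core-site.xml (hdfs://localhost:9000 is used here only as a common choice). The quorum of localhost matches the value used later from the hive shell. HBASE_CONF is a helper variable for the sketch.

```shell
# Run from the hbase-0.94.5 directory. HBASE_CONF is a helper variable for this sketch.
HBASE_CONF="${HBASE_CONF:-conf}"
mkdir -p "$HBASE_CONF"

cat > "$HBASE_CONF/hbase-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <!-- assumed: must match the NameNode address in Hadoop's core-site.xml -->
    <value>hdfs://localhost:9000/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>localhost</value>
  </property>
</configuration>
EOF
```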


start hbase by running bin/start-hbase.sh from the hbase directory


verify hbase and start hbase shell

bin/hbase shell
1.9.3-p194 :001 > help
1.9.3-p194 :002 > list


Hive Installation

Here are some useful links to understand more about Hive.

I would recommend opening a new terminal to install Hive. Here are quick instructions..

   cd ~/Downloads/apache.org/;
   wget http://apache.techartifact.com/mirror/hive/hive-0.10.0/hive-0.10.0-bin.tar.gz

   cd ~/applications/;
   tar xzvf ~/Downloads/apache.org/hive-0.10.0-bin.tar.gz

   export HADOOP_HOME=~/applications/hadoop-1.1.2;
   export HIVE_HOME=~/applications/hive-0.10.0-bin;

   $HADOOP_HOME/bin/hadoop fs -mkdir       /tmp
   $HADOOP_HOME/bin/hadoop fs -mkdir       /user/hive/warehouse
   $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
   $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse

   # this should start the hive shell, you should be able to create tables in hive..
   $HIVE_HOME/bin/hive

For hadoop to play well with hive & hbase table definitions, it needs to be aware of some additional hbase and zookeeper libraries. This mail talks about some options with auxlib and HADOOP_CLASSPATH. I could not get it to work with HADOOP_CLASSPATH and may need to spend a little more time on it, so I added the jars from the hive shell instead:

  cd $HIVE_HOME; bin/hive
  hive> add jar /Users/vineeln/applications/hbase-0.94.5/lib/protobuf-java-2.4.0a.jar;
  hive> add jar /Users/vineeln/applications/hbase-0.94.5/hbase-0.94.5.jar;
  hive> add jar lib/zookeeper-3.4.3.jar;
  hive> add jar lib/hive-hbase-handler-0.10.0.jar;
  # was able to create hbase related tables after this..
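Another option, instead of typing add jar every session, is to pass the same jars when launching hive through its --auxpath option, which takes a comma-separated list. The sketch below only builds and prints that command. It assumes the directory layout used in this post, and I have not verified it against this exact setup.

```shell
# Paths follow the layout used in this post; adjust to your own install.
HBASE_HOME="${HBASE_HOME:-$HOME/applications/hbase-0.94.5}"
HIVE_HOME="${HIVE_HOME:-$HOME/applications/hive-0.10.0-bin}"

# Join the four jars with commas, the format --auxpath expects.
AUX_JARS="$HBASE_HOME/lib/protobuf-java-2.4.0a.jar"
AUX_JARS="$AUX_JARS,$HBASE_HOME/hbase-0.94.5.jar"
AUX_JARS="$AUX_JARS,$HIVE_HOME/lib/zookeeper-3.4.3.jar"
AUX_JARS="$AUX_JARS,$HIVE_HOME/lib/hive-hbase-handler-0.10.0.jar"

# Print the command rather than running it, so it can be inspected first.
echo "$HIVE_HOME/bin/hive --auxpath $AUX_JARS"
```

Copy the printed line into the terminal once the paths look right.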

Creating External Tables in Hive

Let's create a sample text file and add it to hadoop

  $ for i in {1..100}; do   echo "$i,name-${i},${i}" >> /tmp/datafile.txt; done
  $ hadoop fs -mkdir /user/vineeln/datafiles
  # there should be a directory
  $ hadoop fs -ls /user/vineeln/datafiles  
  # put the generated file into hadoop
  $ hadoop fs -put /tmp/datafile.txt /user/vineeln/datafiles/
  # check that the file exists
  $ hadoop fs -ls /user/vineeln/datafiles/

time to create a table in hive pointing to the external files (the fields in the data file are comma-separated)

hive> CREATE EXTERNAL TABLE user(id INT, name STRING, age INT)
                   ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
                   LINES TERMINATED BY '\n'
                   STORED AS TEXTFILE
                   LOCATION '/user/vineeln/datafiles';
Time taken: 0.038 seconds

And view it using SQL commands..

hive> select id, name, age from user;
Job 0: Map: 1   HDFS Read: 1303 HDFS Write: 1092 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
1	name-1	1
2	name-2	2
3	name-3	3
4	name-4	4
5	name-5	5
94	name-94	94
95	name-95	95
96	name-96	96
97	name-97	97
98	name-98	98
99	name-99	99
100	name-100	100
Time taken: 8.343 seconds

Hive with HBase

Here is a good article about the integration

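let's first create a hive table backed by the hbase table user, mapping the name and age columns. Note that the mapping below names two different column families, cf1 (digit one) and cfl (letter l), so hbase creates both, as the describe output further down shows; use the same family name for both columns if you want a single family.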

hive> set hbase.zookeeper.quorum=localhost;
hive> CREATE TABLE hbase_user(key int, name string, age int)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:name,cfl:age")
    > TBLPROPERTIES ("hbase.table.name" = "user")
    > ;
Time taken: 89.281 seconds

verify in hbase shell

1.9.3-p194 :006 > list 'user'
TABLE
user
1 row(s) in 0.0260 seconds
1.9.3-p194 :009 > describe 'user'
DESCRIPTION                                                                        ENABLED                                      
 {NAME => 'user', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE', REPLICATION_ true                                         
 SCOPE => '0', VERSIONS => '3', COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL =>                                              
  '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},                                              
  {NAME => 'cfl', BLOOMFILTER => 'NONE', REPLICATION_SCOPE => '0', VERSIONS => '3'                                              
 , COMPRESSION => 'NONE', MIN_VERSIONS => '0', TTL => '2147483647', BLOCKSIZE => '                                              
 65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}                                                                          
1 row(s) in 0.0330 seconds

1.9.3-p194 :010 > 

let's insert some data into the hbase table

hive> INSERT OVERWRITE TABLE hbase_user select * from user where age is not null;
100 Rows loaded to hbase_user
MapReduce Jobs Launched: 
Job 0: Map: 1   HDFS Read: 2679 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
Time taken: 8.282 seconds

# select from hbase:user linked table is also possible..

hive> select key, name, age from hbase_user;
Job 0: Map: 1   HDFS Read: 245 HDFS Write: 1376 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
1	name-1	1
10	name-10	10
hive >

verify in hbase shell

1.9.3-p194 :007 > scan 'user'
ROW                               COLUMN+CELL                                                                                   
 1             column=cf1:name,timestamp=1363944503406,value=name-1                                        
 1             column=cfl:age, timestamp=1363944503406,value=1 
 99             column=cf1:name,timestamp=1363944503406,value=name-99
 99             column=cfl:age, timestamp=1363944503406,value=99

1.9.3-p194 :008 > get 'user', 99
COLUMN        CELL                                                                                          
 cf1:name     timestamp=1363944503406,value=name-99                                                        
 cfl:age      timestamp=1363944503406, value=99                                                             
2 row(s) in 0.0230 seconds

1.9.3-p194 :009 >

Reference Articles for good reading

  • http://blogs.msdn.com/b/brandonwerner/archive/2011/11/13/how-to-set-up-hadoop-on-os-x-lion-10-7.aspx
  • http://www.ibm.com/developerworks/data/library/techarticle/dm-1209hadoopbigdata/
  • http://hadoop.apache.org/docs/stable/single_node_setup.html
  • http://mail-archives.apache.org/mod_mbox/hive-user/201103.mbox/%3CAANLkTingqLGKnQmiZgoi+SZFNExgCaT8CAqTOvf8JmG7@mail.gmail.com%3E
  • https://issues.apache.org/jira/browse/HADOOP-7489
  • http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en//archive/bigtable-osdi06.pdf