Archive

Archive for the ‘Hadoop’ Category

Increasing disk size on Hadoop Cloudera’s VM

December 17, 2012

Cloudera offers a nice solution if you want to play with the Hadoop ecosystem (and other Cloudera add-ons): a virtualized single-node Hadoop cluster. The VM is available for VMware, KVM and VirtualBox and can be downloaded from the Cloudera download site.

Lately I faced the problem that the VM's predefined disk size (25GB) was not enough and I needed to increase it. Something that sounds trivial cost me several hours to figure out (especially when you are not a Linux admin, and when the graphical user interface of the virtualized guest OS is missing some system functionality).

So below are the instructions. I'll show how to increase the disk size from 25GB to 100GB on the VMware image (using VMware Player) of CDH 4.x.

  1. In VMware Player, with the Cloudera image shut down, go to the VM settings, Hardware tab, select the Hard Disk device, click the Utilities button and choose the Expand option. In the dialog set the new size (here 100GB) and press OK.
    [Screenshot: expanding the virtual disk in VMware Player]
    Once completed (this operation takes several minutes, depending on the disk size), a popup will inform you that the virtual disk size was increased but that you need to modify your guest OS to use the new space. To do so we will perform several admin operations in the guest OS, which in our case is a CentOS Linux distribution.
  2. The next step is to modify the boot options so that the guest OS starts without any services and without the graphical UI; we will in fact start the guest OS in runlevel 1 (single-user mode). Start the VM and, at the boot screen, press any key to enter the GRUB boot manager (you have 3 seconds to do so). When the GRUB menu shows, go to the entry for the Cloudera demo VM, press 'e' to edit the entry, go to the kernel entry and press 'e' again, then add '1' at the end of the line, press Enter and then 'b' to boot with the newly modified options.
    [Screenshots: the GRUB menu and the kernel line edited to boot into runlevel 1]
  3. Once booted, log in as root (the password is cloudera) and check the disk size using 'df -h'; you will see that the filesystem is still 25GB. 'fdisk -l' shows the physical disk and its allocated partitions; we can see that the physical disk (/dev/sda) already reflects the increased size but the partition (/dev/sda1) does not.
    [Screenshot: df -h and fdisk -l output before repartitioning]
    We will change this using the fdisk and resize2fs commands (the full command sequence is condensed in the sketch after this list).
  4. So at the prompt type 'fdisk /dev/sda'; we will delete and recreate the partition. Pay attention: the newly created partition needs to start at the same point as the one we delete, so to note its starting point press 'p' at the fdisk prompt. In the previous screenshot /dev/sda1 started at 1.
    • in the fdisk prompt press 'd' to delete the partition; since there is only one partition it will be selected automatically.
    • press 'n' to create a new partition, then 'p' for a primary partition, then '1' for the partition number; enter the previous starting number or press Enter to accept the default, then enter the end or press Enter to accept the default maximum size (here 100GB).
    • at this stage a new partition /dev/sda1 should have been created.
    • type 'w' to write the partition table changes to disk; fdisk exits after writing (use 'q' instead if you need to quit without saving).
  5. Reboot (you can use the 'reboot' command for that) and at the boot screen proceed as in step #2 to boot into runlevel 1, then log in as root again. 'df -h' will still show a 25GB filesystem, but 'fdisk -l' will show that the new partition has a size of 100GB, as shown below.
    [Screenshot: fdisk -l output showing the resized /dev/sda1 partition]

    To resolve this, at the command prompt type 'resize2fs /dev/sda1'. Once the resize finishes, the increased disk space is reflected by 'df -h', here about 99GB.
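For reference, here is the whole in-guest sequence condensed into a single sketch (the device names /dev/sda and /dev/sda1 match the VM above; the fdisk keystrokes are entered interactively and are shown here as comments):

# run as root in single-user mode (boot with '1' appended to the kernel line, as in step 2)
df -h                  # the filesystem still reports ~25GB
fdisk -l /dev/sda      # ...but the raw disk already shows the new 100GB size

fdisk /dev/sda         # interactive session:
                       #   p - print the partition table and note where /dev/sda1 starts
                       #   d - delete the single partition
                       #   n, p, 1 - recreate it with the same start and the default (maximum) end
                       #   w - write the new table to disk; fdisk exits after writing

reboot                 # boot into single-user mode again (step 2), then:
resize2fs /dev/sda1    # grow the filesystem to fill the enlarged partition
df -h                  # should now report ~99GB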

That’s it. You can now reboot as usual and enjoy your increased disk space.

 

Pig and Hbase integration

September 15, 2011

The Hadoop ecosystem contains a lot of sub-projects; HBase and Pig are just two of them.

HBase is the Hadoop database; it lets you manage your data as tables rather than as plain files.

Pig is a scripting language that generates MapReduce jobs on the fly to get the data you need. It is very compact compared to hand-written MapReduce jobs.

One of the nice things about Pig and HBase is that they can be integrated, thanks to a recently committed patch.

The documentation is not up to date yet (it currently relates almost entirely to the patch itself). Some information can be found in posts like here, but they all lack detailed explanations. Even the Cloudera CDH3 distribution indicates support for this integration, but no sample can be found.

Below I describe the installation and configuration steps to make the integration work, provide an example, and finally expose some of the limits of the current release (0.8).

  1. First, install the MapReduce components (JobTracker and TaskTracker): one JobTracker, and as many TaskTrackers as you have DataNodes. Each distribution may provide a different installation procedure; I'm using the Cloudera CDH3 distribution, whose MapReduce installation is well documented.
  2. Now proceed with the Pig installation; it is also easy as long as you are not trying the HBase integration. You only need to install Pig on the client side: not on each DataNode nor on the NameNode, just on the machine where you want to run the Pig program.
  3. Check your installation by entering the grunt shell (just type 'pig' from the shell).
  4. Now the tricky part: in order to use the Pig/HBase integration you need to make the MapReduce jobs aware of the HBase classes, otherwise you will get a "ClassNotFoundException" or, worse, a ZooKeeper exception like "org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase" during execution. The easy way to do this, without copying the HBase configuration into your Hadoop configuration directory, is to use hadoop-env.sh and let hbase print its own classpath.
    So add the following to your hadoop-env.sh file:

    #define the location of hbase
    export HBASE_HOME=/usr/lib/hbase
    #Customize the classpath to make map reduce job aware of Hbase
    export HADOOP_CLASSPATH="`${HBASE_HOME}/bin/hbase classpath`:$HADOOP_CLASSPATH"
  5. You will also need Pig to be aware of the HBase configuration; for this you can use the HBASE_CONF_DIR environment variable (for the CDH release), which by default points to /etc/hbase/conf. A quick sanity check is sketched just after this list.
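As a quick sanity check of steps 4 and 5, here is a minimal sketch using the CDH3 default paths mentioned above (adjust them if your layout differs):

# CDH3 defaults used in steps 4 and 5 -- adjust if HBase lives elsewhere
export HBASE_HOME=/usr/lib/hbase
export HBASE_CONF_DIR=/etc/hbase/conf

# this is the classpath that hadoop-env.sh splices into HADOOP_CLASSPATH;
# it should include the hbase and zookeeper jars
${HBASE_HOME}/bin/hbase classpath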

OK, your installation should be fine now, so let's do an example. Let's assume we have stored in HBase a table named TestTable with a column family named A; it holds several fields named field0, field1, ..., and we want to extract this information and store it into 'results/extract'. In this case the Pig script will look like:

my_data = LOAD 'hbase://TestTable' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:field0 A:field1', '-loadKey=true') as (id, field0, field1);

STORE my_data INTO 'results/extract' USING PigStorage(';');

The above script indicates that the my_data relation will contain the fields field0 and field1 plus the row ID (due to the -loadKey parameter). These fields will be stored as id, field0, field1 under the 'results/extract' folder, with values separated by a semicolon.
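To run it, save the script to a file (the name 'extract.pig' below is arbitrary) and launch it with pig; 'results/extract' ends up on HDFS. A minimal sketch:

pig extract.pig                            # runs the script above as a MapReduce job
hadoop fs -ls results/extract              # the output folder created by the STORE statement
hadoop fs -cat 'results/extract/part-*'    # the extracted id;field0;field1 lines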

You can also use comparison operators on the row key. The operators currently supported are lt, lte, gt and gte, for lower than, lower than or equal, greater than and greater than or equal.

Note: there is no support for logical operators, but you can use more than one comparison operator; they are chained as an AND.
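For instance, extracting only the rows whose key falls in a given range could look like the sketch below. The key values 'row0100' and 'row0500' are made-up examples, and the '-gt=.../-lt=...' option form simply mirrors the '-loadKey=true' form used above, so double-check it against the HBaseStorage options shipped with your Pig build:

# write a small Pig script that keeps only keys strictly between 'row0100' and 'row0500'
cat > key_range.pig <<'EOF'
slice = LOAD 'hbase://TestTable'
        USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('A:field0 A:field1', '-loadKey=true -gt=row0100 -lt=row0500')
        AS (id, field0, field1);
STORE slice INTO 'results/slice' USING PigStorage(';');
EOF
pig key_range.pig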

Limitations:

The current HBaseStorage does not allow the use of wildcards; that is, if you need all the fields of a row, you have to enumerate them. Wildcards are supported in version 0.9.

You can also use HBaseStorage to store records back into HBase; nevertheless, this usage is inconsistent and a bug has already been opened about it.